Open howudodat opened 3 weeks ago
OK, the above error went away after a refresh of the repos (cm repo pull).
However, three times in a row I get this same error: it just dies during the test.
command:
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
--model=gptj-99 \
--implementation=reference \
--framework=pytorch \
--category=edge \
--scenario=Offline \
--execution_mode=test \
--device=cpu \
--docker --quiet \
--test_query_count=50
error:
Constructing QSL
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████| 619/619 [00:00<00:00, 5.62MB/s]
vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████| 798k/798k [00:00<00:00, 3.41MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 2.07MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████████████| 4.04k/4.04k [00:00<00:00, 29.7MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████| 357/357 [00:00<00:00, 3.60MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████| 1.37M/1.37M [00:00<00:00, 5.25MB/s]
/home/cmuser/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
Encoding Samples
Finished constructing QSL.
Loading PyTorch model...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████| 3/3 [02:35<00:00, 51.99s/it]
/home/cmuser/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 285/285 [00:00<00:00, 477959.47it/s]
Running LoadGen test...
Number of Samples in query_samples : 50
12%|███████████▉ | 6/50 [39:35<4:02:43, 330.99s/it]
./run.sh: line 54: 477 Killed
/usr/bin/python3 main.py --model-path=/home/cmuser/CM/repos/local/cache/5de735f7d99448f8/checkpoint/checkpoint-final --dataset-path=/home/cmuser/CM/repos/local/cache/fdb93082bc3a466c/install/cnn_eval.json --scenario Offline --max_examples 50 --mlperf_conf '/home/cmuser/CM/repos/local/cache/861bf247a96946bd/inference/mlperf.conf' --dtype float32 --user_conf '/home/cmuser/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/72f174c5e1f5481ebf2e33a55b03f0d1.conf' 2>&1
./run.sh: line 59: 137: command not found
./run.sh: line 65: 137: command not found
Looks like an OS kill. Do you have sufficient RAM + swap space? A float32 run needs about 75 GB of memory. You can try --precision=bfloat16, which will need about 40 GB of memory. --beam_size=2 (the official requirement is --beam_size=4) can also be used to further reduce the memory requirement if official compliance is not needed.
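To make the diagnosis above easier to check locally, here is a minimal sketch (not part of CM or the MLPerf reference code). It decodes the `137` that shows up in the `run.sh` output and compares free RAM plus swap against the rough figures quoted in the reply. The helper names (`decode_exit_status`, `parse_meminfo`, `available_gib`, `fits`) are hypothetical. The 75 and 40 thresholds are the approximate numbers from the reply, not measured values. It assumes a Linux host with `/proc/meminfo`.

```python
import signal


def decode_exit_status(status: int) -> str:
    """Describe a shell exit status; values above 128 mean 'killed by signal (status - 128)'."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: integer value} (kB for memory fields)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gib(info: dict) -> float:
    """Free RAM plus free swap, converted from kB (1024-byte units) to GiB."""
    kb = info.get("MemAvailable", 0) + info.get("SwapFree", 0)
    return kb / (1024 ** 2)


def fits(info: dict, required_gib: float) -> bool:
    """True if free RAM + swap covers the requested amount."""
    return available_gib(info) >= required_gib


if __name__ == "__main__":
    # 137 = 128 + 9, i.e. the process received SIGKILL (typical of the kernel OOM killer).
    print(decode_exit_status(137))
    try:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
    except OSError:
        print("/proc/meminfo not readable; this check assumes Linux")
    else:
        print(f"available RAM + swap: {available_gib(info):.1f} GiB")
        # Approximate figures taken from the reply above, not measured here.
        for label, need in (("float32", 75), ("bfloat16", 40)):
            print(f"{label}: needs ~{need} GB ->", "ok" if fits(info, need) else "likely OOM")
```

When the run is launched with --docker, the check is more meaningful if run inside the container, since the container may be given less memory than the host.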