howudodat opened this issue 3 months ago
Long waiting is expected here, as the download is more than 300 GB. Can you please try without --docker so that the download happens on the host?
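If you want to confirm during the quiet period that the download is actually progressing rather than hung, something like the sketch below can be left running in another terminal. This is only a sketch: it assumes the default CM cache location, the same ~/CM/repos/local/cache path that shows up in the traceback later in this thread.

# Hedged sketch (not part of CM itself): poll the size of the local CM cache to
# see whether the 300 GB+ checkpoint is still coming in.
import os
import time

CM_CACHE = os.path.expanduser("~/CM/repos/local/cache")  # assumed default location

def dir_size_gb(root):
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total / 1e9

while True:
    print(f"CM cache size: {dir_size_gb(CM_CACHE):.1f} GB")
    time.sleep(600)  # re-check every 10 minutes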
The cm script needs to install pandas if not running in docker:
CM error: can't load Python module code (path=/home/peter/CM/repos/mlcommons@cm4mlops/script/app-mlperf-inference, name=customize, err=No module named 'pandas')!
pip install pandas
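For reference, a quick check that pandas is now importable from the environment CM actually uses (a sketch only; it assumes the venv at ~/cm seen in the traceback below is the one in play):

# Hedged sketch: verify pandas can be imported by the interpreter that runs the
# CM customize modules, so the "No module named 'pandas'" error does not recur.
import importlib.util
import sys

print("interpreter:", sys.executable)
if importlib.util.find_spec("pandas") is None:
    raise SystemExit("pandas is missing here; install it with: pip install pandas")
print("pandas is available")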
But after that, the same thing happens without the docker build: the git clone just gets killed.
Cloning into 'repo'...
Username for 'https://huggingface.co': howudodat
Password for 'https://howudodat@huggingface.co':
remote: Enumerating objects: 58, done.
remote: Counting objects: 100% (58/58), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 58 (delta 9), reused 40 (delta 1), pack-reused 0 (from 0)
Unpacking objects: 100% (58/58), 510.20 KiB | 3.21 MiB/s, done.
Username for 'https://huggingface.co': howudodat
Password for 'https://howudodat@huggingface.co':
/home/peter/CM/repos/mlcommons@cm4mlops/script/get-git-repo/run.sh: line 51: 67850 Killed ${CM_GIT_CLONE_CMD}
CM error: Portable CM script failed (name = get-git-repo, return code = 256)
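A bare "Killed" with no git error message usually points at the kernel OOM killer rather than a git problem. A quick sketch (Linux only, reads /proc/meminfo) to see how much headroom there was before retrying:

# Hedged sketch: report available RAM and free swap before re-attempting the clone
# of the 300 GB+ repository.
def meminfo_gb(field):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1]) / (1024 * 1024)  # kB -> GB
    return 0.0

print(f"MemAvailable: {meminfo_gb('MemAvailable'):.1f} GB")
print(f"SwapFree:     {meminfo_gb('SwapFree'):.1f} GB")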
And then re-running the script bombs:
Traceback (most recent call last):
File "/home/peter/CM/repos/local/cache/95ebaee7a6754167/inference/language/llama2-70b/main.py", line 97, in <module>
main()
File "/home/peter/CM/repos/local/cache/95ebaee7a6754167/inference/language/llama2-70b/main.py", line 69, in main
sut = sut_cls(
File "/home/peter/CM/repos/local/cache/95ebaee7a6754167/inference/language/llama2-70b/SUT.py", line 116, in __init__
self.data_object = Dataset(self.model_path,
File "/home/peter/CM/repos/local/cache/95ebaee7a6754167/inference/language/llama2-70b/dataset.py", line 30, in __init__
self.load_tokenizer()
File "/home/peter/CM/repos/local/cache/95ebaee7a6754167/inference/language/llama2-70b/dataset.py", line 38, in load_tokenizer
self.tokenizer = AutoTokenizer.from_pretrained(
File "/home/peter/cm/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 897, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/peter/cm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2271, in from_pretrained
return cls._from_pretrained(
File "/home/peter/cm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2505, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/peter/cm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 171, in __init__
self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
File "/home/peter/cm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 201, in get_spm_processor
with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
./run.sh: line 59: 1: command not found
./run.sh: line 65: 1: command not found
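The TypeError typically means the tokenizer's sentencepiece file never made it into the checkpoint directory, so vocab_file resolves to None; that fits the interrupted clone above. A quick sketch to check for the expected files (MODEL_PATH is a placeholder; substitute whatever cache directory CM downloaded the checkpoint into):

# Hedged sketch: list whether the key checkpoint files are present in the
# (possibly partial) model directory.
import os

MODEL_PATH = "/path/to/Llama-2-70b-chat-hf"  # hypothetical; use your CM cache path

for name in ("config.json", "tokenizer_config.json", "tokenizer.model"):
    path = os.path.join(MODEL_PATH, name)
    print(f"{name}: {'present' if os.path.exists(path) else 'MISSING'}")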
At least with the docker build I can restart it until it gets further along (it takes about 6 restarts before it completely bombs).
So far I am 2 for 6 in trying the different scripts and having them actually work: resnet and retinanet worked (docker builds), while dlrm and llama both bomb in docker and non-docker builds. We really need to profile this board / GPU combo on a decent language model for an edge device.
@howudodat we have fixed the pandas issue and also the issue with broken git clones. It should be fine if you do cm pull repo.
But it looks like the real problem is not this. What's the memory size of the machine? LLAMA2-70B is a really large model and it needs 300 GB+ of memory.
If memory is low, gptj-6b is a better model to try; bert is an even smaller model.
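For a rough sense of scale (weights only, ignoring activations and the KV cache; parameter counts are the published ones), the arithmetic alone explains the difference:

# Hedged back-of-the-envelope: bytes ~= parameter_count * bytes_per_parameter.
models = {"llama2-70b": 70e9, "gptj-6b": 6e9, "bert-large": 0.34e9}
for name, params in models.items():
    print(f"{name:12s} fp32 ~{params * 4 / 1e9:6.0f} GB   fp16 ~{params * 2 / 1e9:6.0f} GB")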
Running:
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 --model=llama2-70b-99 --implementation=reference --framework=pytorch --category=datacenter --scenario=Offline --execution_mode=test --device=cpu --docker --quiet --test_query_count=50
results in several hours of silence, after which this error is produced.
Any ideas?