mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

script aborts with 521 Killed #1824

Open · howudodat opened 3 months ago

howudodat commented 3 months ago

running:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 --model=llama2-70b-99 --implementation=reference --framework=pytorch --category=datacenter --scenario=Offline --execution_mode=test --device=cpu --docker --quiet --test_query_count=50

results in several hours of silence, after which this error is produced:

git clone  --recurse-submodules https://huggingface.co/meta-llama/Llama-2-70b-chat-hf --depth 5 repo

Cloning into 'repo'...
Username for 'https://huggingface.co': howudodat
Password for 'https://howudodat@huggingface.co': 
remote: Enumerating objects: 58, done.
remote: Counting objects: 100% (58/58), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 58 (delta 9), reused 42 (delta 2), pack-reused 0 (from 0)
Unpacking objects: 100% (58/58), 511.53 KiB | 5.12 MiB/s, done.
Username for 'https://huggingface.co': howudodat
Password for 'https://howudodat@huggingface.co': 
/home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-git-repo/run.sh: line 51:   521 Killed                  ${CM_GIT_CLONE_CMD}

CM error: Portable CM script failed (name = get-git-repo, return code = 256)
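
The 521 in that log line is just the PID of the killed process; "Killed" normally means the process received SIGKILL, most often from the kernel OOM killer. One way to confirm that (a guess on my part, but cheap to check) is the kernel log:

# check whether the kernel OOM killer fired
sudo dmesg | grep -iE 'out of memory|killed process'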

Any ideas?

arjunsuresh commented 3 months ago

A long wait is expected here, as the download is more than 300 GB. Can you please try without --docker so that the download happens on the host?
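
That is, the same command as above with only the --docker flag dropped:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 --model=llama2-70b-99 --implementation=reference --framework=pytorch --category=datacenter --scenario=Offline --execution_mode=test --device=cpu --quiet --test_query_count=50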

howudodat commented 3 months ago

The cm script needs to install pandas itself when not running in Docker; at the moment it fails with:

CM error: can't load Python module code (path=/home/peter/CM/repos/mlcommons@cm4mlops/script/app-mlperf-inference, name=customize, err=No module named 'pandas')!

Installing it manually gets past this:

pip install pandas

But after that, the same thing happens without the Docker build: the git clone just gets killed.

Cloning into 'repo'...
Username for 'https://huggingface.co': howudodat
Password for 'https://howudodat@huggingface.co': 
remote: Enumerating objects: 58, done.
remote: Counting objects: 100% (58/58), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 58 (delta 9), reused 40 (delta 1), pack-reused 0 (from 0)
Unpacking objects: 100% (58/58), 510.20 KiB | 3.21 MiB/s, done.
Username for 'https://huggingface.co': howudodat
Password for 'https://howudodat@huggingface.co': 
/home/peter/CM/repos/mlcommons@cm4mlops/script/get-git-repo/run.sh: line 51: 67850 Killed                  ${CM_GIT_CLONE_CMD}

CM error: Portable CM script failed (name = get-git-repo, return code = 256)

and then re-running the script bombs:

Traceback (most recent call last):
  File "/home/peter/CM/repos/local/cache/95ebaee7a6754167/inference/language/llama2-70b/main.py", line 97, in <module>
    main()
  File "/home/peter/CM/repos/local/cache/95ebaee7a6754167/inference/language/llama2-70b/main.py", line 69, in main
    sut = sut_cls(
  File "/home/peter/CM/repos/local/cache/95ebaee7a6754167/inference/language/llama2-70b/SUT.py", line 116, in __init__
    self.data_object = Dataset(self.model_path,
  File "/home/peter/CM/repos/local/cache/95ebaee7a6754167/inference/language/llama2-70b/dataset.py", line 30, in __init__
    self.load_tokenizer()
  File "/home/peter/CM/repos/local/cache/95ebaee7a6754167/inference/language/llama2-70b/dataset.py", line 38, in load_tokenizer
    self.tokenizer = AutoTokenizer.from_pretrained(
  File "/home/peter/cm/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 897, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/peter/cm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2271, in from_pretrained
    return cls._from_pretrained(
  File "/home/peter/cm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2505, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/peter/cm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 171, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/peter/cm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 201, in get_spm_processor
    with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
./run.sh: line 59: 1: command not found
./run.sh: line 65: 1: command not found
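
The TypeError means vocab_file is None: the tokenizer files never made it into the partially cloned checkout, so AutoTokenizer has nothing to open. A quick sanity check (the path below is illustrative; substitute wherever CM cached the model):

# expect tokenizer.model and tokenizer_config.json alongside the weights
ls /path/to/cm-cache/Llama-2-70b-chat-hf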

At least with the Docker build I can restart it and it gets a little further along each time (it takes about 6 restarts before it completely bombs).

So far I am 2 for 6 in getting the different scripts to actually work: resnet and retinanet worked (Docker builds), while dlrm and llama both bomb in Docker and non-Docker builds alike. We really need to profile this board/GPU combo on a decent language model as an edge device.

arjunsuresh commented 3 months ago

@howudodat we have fixed the pandas issue and also the issue with broken git clones. It should be fine after you do cm pull repo.
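
i.e.:

cm pull repo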

But it looks like the real problem is something else. What's the memory size of the machine? Llama-2-70B is a really large model and needs 300 GB+ of memory.
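
You can check quickly with standard tools; note the clone itself also needs 300 GB+ of free disk:

free -h     # total and available RAM (and swap)
df -h .     # free disk space on the current filesystem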

If memory is low, gptj-6b is a better model to try; bert is an even smaller model.
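
For example, reusing the command from above with only the model swapped (an untested sketch; I'm assuming the model tag gptj-99 mirrors the llama2-70b-99 tag used earlier):

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 --model=gptj-99 --implementation=reference --framework=pytorch --category=datacenter --scenario=Offline --execution_mode=test --device=cpu --quiet --test_query_count=50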