mlcommons / cm4mlops

A collection of portable, reusable and cross-platform automation recipes (CM scripts) with a human-friendly interface and minimal dependencies to make it easier to build, run, benchmark and optimize AI, ML and other applications and systems across diverse and continuously changing models, data sets, software and hardware (cloud/edge)
http://docs.mlcommons.org/cm4mlops/
Apache License 2.0

Getting error while running the MLPerf Reference Implementation in docker #116

Open jaiswackhv opened 1 month ago

jaiswackhv commented 1 month ago

I am getting the error below for the datacenter category, llama2-70b-99.9 model, Offline scenario, running in a docker container with this command:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --docker_os=ubuntu \
   --docker_os_version=22.04 \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=50

INFO:Llama-70B-MAIN:Starting Benchmark run
IssueQuery started with 50 samples
IssueQuery done
/hana/data/perfadmin/.local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:540: UserWarning: do_sample is set to False. However, temperature is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
  warnings.warn(
/hana/data/perfadmin/.local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:545: UserWarning: do_sample is set to False. However, top_p is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
  warnings.warn(
/hana/data/perfadmin/.local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:588: UserWarning: num_beams is set to 1. However, early_stopping is set to True -- this flag is only used in beam-based generation modes. You should set num_beams>1 or unset early_stopping.
  warnings.warn(
Saving outputs to run_outputs/q4118.pkl
Samples run: 1
        BatchMaker time: 0.0008885860443115234
        Inference time: 3116.8550243377686
        Postprocess time: 0.0014901161193847656
        ==== Total time: 3116.8574030399323
/usr/include/c++/11/bits/stl_vector.h:1045: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = long int; _Alloc = std::allocator<long int>; std::vector<_Tp, _Alloc>::reference = long int&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__n < this->size()' failed.
/hana/data/perfadmin/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run.sh: line 51: 707920 Aborted (core dumped) /usr/bin/python3 main.py --scenario Offline --dataset-path /hana/data/perfadmin/CM/repos/local/cache/07479f00e3d34030/processed-openorca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl --device cpu --mlperf-conf '/hana/data/perfadmin/CM/repos/local/cache/4dd15300ae384fd9/inference/mlperf.conf' --user-conf '/hana/data/perfadmin/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/30a4e38f3f8e4780aece96b0cda3763e.conf' --output-log-dir /hana/data/perfadmin/CM/repos/local/cache/a79a5065bda44b1c/test_results/ha820g3hcrhel.aselab.org-reference-cpu-pytorch-v2.3.1-default_config/llama2-70b-99.9/offline/performance/run_1 --dtype float32 --model-path /hana/data/perfadmin/CM/repos/local/cache/2ffcd3e8d0b84740/repo 2>&1
/hana/data/perfadmin/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run.sh: line 56: 134: command not found

CM error: Portable CM script failed (name = benchmark-program, return code = 256)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script wrapped and unified by this CM script (automation recipe). Please re-run this script with --repro flag and report this issue with the original command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts to make existing tools and native scripts more portable, interoperable and deterministic. Thank you!
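(For reference, the repro run the message asks for is just the original command with --repro appended; a sketch, with all other flags unchanged:)

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --docker_os=ubuntu \
   --docker_os_version=22.04 \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=50 \
   --repro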

arjunsuresh commented 1 month ago

The run is crashing. Are you running on a system with sufficient RAM?

jaiswackhv commented 1 month ago

Yes, the server has 1 TB of memory and two 4th Gen Intel CPU sockets with 60 cores each.

              total        used        free      shared  buff/cache   available
Mem:          1.0Ti        26Gi       623Gi       6.0Mi       362Gi       980Gi
Swap:          19Gi       4.0Mi        19Gi
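(A side note, not from the original thread: llama2-70b weights in float32, which is what --dtype float32 in the log implies, take roughly 70e9 parameters x 4 bytes, about 280 GB, so they should fit in 1 TiB of RAM. The 134 in the log is 128 + SIGABRT, i.e. the abort from the failed vector assertion rather than an out-of-memory kill; still, a quick look at the kernel log can rule OOM out entirely, for example:)

# Did the kernel OOM killer terminate anything during the run?
dmesg -T | grep -i -E "out of memory|oom-kill"

# Watch memory headroom every 60 seconds while the benchmark runs
free -h -s 60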

arjunsuresh commented 1 month ago

Thank you for confirming.

c>::size_type = long unsigned int]: Assertion '__n < this->size()' failed.
/hana/data/perfadmin/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run.sh: line 51: 707920 Aborted

This is the issue. Maybe try adding the --docker option to run the script inside a docker container?

jaiswackhv commented 1 month ago

I tried that as well, following the link below: the llama2-70b-99 model ran inside docker, and I then used llama2-70b-99.9 in the same environment, but the same error came up in both cases.

https://docs.mlcommons.org/inference/benchmarks/language/llama2-70b/#__tabbed_4_1

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker --quiet \
   --test_query_count=50

arjunsuresh commented 1 month ago

I'm not sure what could be the issue here. Does --precision=bfloat16 help?
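(For example, a sketch of the same command with the precision flag arjunsuresh mentions appended, everything else unchanged:)

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker --quiet \
   --precision=bfloat16 \
   --test_query_count=50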