mlcommons / cm4mlops

A collection of portable, reusable and cross-platform automation recipes (CM scripts) with a human-friendly interface and minimal dependencies to make it easier to build, run, benchmark and optimize AI, ML and other applications and systems across diverse and continuously changing models, data sets, software and hardware (cloud/edge)
http://docs.mlcommons.org/cm4mlops/
Apache License 2.0

Getting error while running the MLPerf Reference Implementation in docker #116

Open jaiswackhv opened 1 month ago

jaiswackhv commented 1 month ago

I am getting the error below for the datacenter category, llama2-70b-99.9 model, Offline scenario, running in a docker container with this command:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --docker_os=ubuntu \
   --docker_os_version=22.04 \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=50

INFO:Llama-70B-MAIN:Starting Benchmark run
IssueQuery started with 50 samples
IssueQuery done
/hana/data/perfadmin/.local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:540: UserWarning: do_sample is set to False. However, temperature is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
  warnings.warn(
/hana/data/perfadmin/.local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:545: UserWarning: do_sample is set to False. However, top_p is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
  warnings.warn(
/hana/data/perfadmin/.local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:588: UserWarning: num_beams is set to 1. However, early_stopping is set to True -- this flag is only used in beam-based generation modes. You should set num_beams>1 or unset early_stopping.
  warnings.warn(
Saving outputs to run_outputs/q4118.pkl
Samples run: 1
        BatchMaker time: 0.0008885860443115234
        Inference time: 3116.8550243377686
        Postprocess time: 0.0014901161193847656
        ==== Total time: 3116.8574030399323
/usr/include/c++/11/bits/stl_vector.h:1045: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = long int; _Alloc = std::allocator<long int>; std::vector<_Tp, _Alloc>::reference = long int&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__n < this->size()' failed.
/hana/data/perfadmin/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run.sh: line 51: 707920 Aborted (core dumped) /usr/bin/python3 main.py --scenario Offline --dataset-path /hana/data/perfadmin/CM/repos/local/cache/07479f00e3d34030/processed-openorca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl --device cpu --mlperf-conf '/hana/data/perfadmin/CM/repos/local/cache/4dd15300ae384fd9/inference/mlperf.conf' --user-conf '/hana/data/perfadmin/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/30a4e38f3f8e4780aece96b0cda3763e.conf' --output-log-dir /hana/data/perfadmin/CM/repos/local/cache/a79a5065bda44b1c/test_results/ha820g3hcrhel.aselab.org-reference-cpu-pytorch-v2.3.1-default_config/llama2-70b-99.9/offline/performance/run_1 --dtype float32 --model-path /hana/data/perfadmin/CM/repos/local/cache/2ffcd3e8d0b84740/repo 2>&1
/hana/data/perfadmin/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run.sh: line 56: 134: command not found

CM error: Portable CM script failed (name = benchmark-program, return code = 256)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script wrapped and unified by this CM script (automation recipe). Please re-run this script with --repro flag and report this issue with the original command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts to make existing tools and native scripts more portable, interoperable and deterministic. Thank you!
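(For reference, the repro run the message asks for is just the original command with --repro appended; a sketch, with all other flags unchanged:)

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --docker_os=ubuntu \
   --docker_os_version=22.04 \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=50 \
   --repro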

arjunsuresh commented 1 month ago

The run is crashing. Are you running on a system with sufficient RAM?

jaiswackhv commented 1 month ago

Yes, the server has 1 TB of memory and two 4th Gen Intel CPU sockets with 60 cores each.

              total        used        free      shared  buff/cache   available
Mem:          1.0Ti        26Gi       623Gi       6.0Mi       362Gi       980Gi
Swap:          19Gi       4.0Mi        19Gi
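(A side note, not from the original thread: llama2-70b weights in float32, which is what --dtype float32 in the log implies, take roughly 70e9 parameters x 4 bytes, about 280 GB, so they should fit in 1 TiB of RAM. The 134 in the log is 128 + SIGABRT, i.e. the abort from the failed vector assertion rather than an out-of-memory kill; still, a quick look at the kernel log can rule OOM out entirely, for example:)

# Did the kernel OOM killer terminate anything during the run?
dmesg -T | grep -i -E "out of memory|oom-kill"

# Watch memory headroom every 60 seconds while the benchmark runs
free -h -s 60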

arjunsuresh commented 1 month ago

Thank you for confirming.

c>::size_type = long unsigned int]: Assertion '__n < this->size()' failed.
/hana/data/perfadmin/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run.sh: line 51: 707920 Aborted

This is the issue. Maybe try adding the --docker option to run the script inside a docker container?

jaiswackhv commented 1 month ago

I tried that as well, following the link below: the llama2-70b-99 model ran inside docker, and I then used llama2-70b-99.9 in the same environment, but the same error came up in both cases.

https://docs.mlcommons.org/inference/benchmarks/language/llama2-70b/#__tabbed_4_1

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker --quiet \
   --test_query_count=50

arjunsuresh commented 1 month ago

I'm not sure what could be the issue here. Does --precision=bfloat16 help?
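(For example, a sketch of the same command with the precision flag arjunsuresh mentions appended, everything else unchanged:)

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker --quiet \
   --precision=bfloat16 \
   --test_query_count=50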