jaiswackhv opened this issue 1 month ago
The run is crashing. Are you running on a system with sufficient RAM?
Yes, the server has 1 TB of memory and two 4th Gen Intel CPU sockets with 60 cores each.

```
               total        used        free      shared  buff/cache   available
Mem:           1.0Ti        26Gi       623Gi       6.0Mi       362Gi       980Gi
Swap:           19Gi       4.0Mi        19Gi
```
Thank you for confirming.
```
...c>::size_type = long unsigned int]: Assertion '__n < this->size()' failed.
/hana/data/perfadmin/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run.sh: line 51: 707920 Aborted
```
This is the issue. Maybe try adding the --docker option to run the script inside a Docker container?
I tried that as well, following the link below. With the llama2-70b-99 model the run was already inside Docker, and I then ran llama2-70b-99.9 in the same environment. In both cases the same error occurred.
https://docs.mlcommons.org/inference/benchmarks/language/llama2-70b/#__tabbed_4_1
```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker --quiet \
   --test_query_count=50
```
I'm not sure what the issue could be here. Does adding --precision=bfloat16 help?
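That is, something like the following (the same command as before with the flag added; an untested sketch, and the flag position is arbitrary):

```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker --quiet \
   --precision=bfloat16 \
   --test_query_count=50
```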
Below is the error for the datacenter category, llama2-70b-99.9 model, Offline scenario, run inside a Docker container with this command:
```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --docker_os=ubuntu \
   --docker_os_version=22.04 \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=50
```
```
INFO:Llama-70B-MAIN:Starting Benchmark run
IssueQuery started with 50 samples
IssueQuery done
/hana/data/perfadmin/.local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:540: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/hana/data/perfadmin/.local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:545: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
/hana/data/perfadmin/.local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:588: UserWarning: `num_beams` is set to 1. However, `early_stopping` is set to `True` -- this flag is only used in beam-based generation modes. You should set `num_beams>1` or unset `early_stopping`.
  warnings.warn(
Saving outputs to run_outputs/q4118.pkl
Samples run: 1
        BatchMaker time: 0.0008885860443115234
        Inference time: 3116.8550243377686
        Postprocess time: 0.0014901161193847656
        ==== Total time: 3116.8574030399323
/usr/include/c++/11/bits/stl_vector.h:1045: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = long int; _Alloc = std::allocator<long int>; std::vector<_Tp, _Alloc>::reference = long int&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__n < this->size()' failed.
/hana/data/perfadmin/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run.sh: line 51: 707920 Aborted                 (core dumped) /usr/bin/python3 main.py --scenario Offline --dataset-path /hana/data/perfadmin/CM/repos/local/cache/07479f00e3d34030/processed-openorca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl --device cpu --mlperf-conf '/hana/data/perfadmin/CM/repos/local/cache/4dd15300ae384fd9/inference/mlperf.conf' --user-conf '/hana/data/perfadmin/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/30a4e38f3f8e4780aece96b0cda3763e.conf' --output-log-dir /hana/data/perfadmin/CM/repos/local/cache/a79a5065bda44b1c/test_results/ha820g3hcrhel.aselab.org-reference-cpu-pytorch-v2.3.1-default_config/llama2-70b-99.9/offline/performance/run_1 --dtype float32 --model-path /hana/data/perfadmin/CM/repos/local/cache/2ffcd3e8d0b84740/repo 2>&1
/hana/data/perfadmin/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run.sh: line 56: 134: command not found

CM error: Portable CM script failed (name = benchmark-program, return code = 256)
```
Note that it is often a portability issue of a third-party tool or a native script wrapped and unified by this CM script (automation recipe). Please re-run this script with the --repro flag and report this issue with the original command line, the cm-repro directory and the full log here:
https://github.com/mlcommons/cm4mlops/issues
The CM concept is to collaboratively fix such issues inside portable CM scripts to make existing tools and native scripts more portable, interoperable and deterministic. Thank you!
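For completeness, re-running with --repro as the message suggests would just mean adding that flag to the failing command, e.g. (an untested sketch, flag position arbitrary):

```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=llama2-70b-99.9 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --docker_os=ubuntu \
   --docker_os_version=22.04 \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --repro \
   --test_query_count=50
```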