mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

Automated command for llama2-70b: Changing Batch Size fails #1806

Open philross opened 1 month ago

philross commented 1 month ago

Hello mlcommons team,

I want to run the "Automated command to run the benchmark via MLCommons CM" (from the example: https://github.com/mlcommons/inference/tree/master/language/llama2-70b) with a different batch size, but I am getting the following error:

Run Directory: /root/CM/repos/local/cache/19f3466c31404fb9/inference/language/llama2-70b

CMD: /root/CM/repos/local/cache/9798222eb1384e65/mlperf/bin/python3 main.py  --scenario Offline --dataset-path /root/CM/repos/local/cache/cf3f035c15414140/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl.gz --device cuda:0   --max-batchsize 100 --batch-size 8 --mlperf-conf '/root/CM/repos/local/cache/19f3466c31404fb9/inference/mlperf.conf' --user-conf '/root/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/7fd03acd2fb849d3ad4d7b62d8bbbed1.conf' --output-log-dir /root/CM/repos/local/cache/e905aba5cd2047cf/test_results/969a89282f1c-reference-gpu-pytorch-v2.4.0-default_config/llama2-70b-99/offline/performance/run_1 --dtype float16 --model-path /Llama-2-70b-chat-hf 2>&1 ; echo \$? > exitstatus | tee '/root/CM/repos/local/cache/e905aba5cd2047cf/test_results/969a89282f1c-reference-gpu-pytorch-v2.4.0-default_config/llama2-70b-99/offline/performance/run_1/console.out'

INFO:root:         ! cd /root/CM/repos/local/cache/05e8741e40c349bf
INFO:root:         ! call /root/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run-ubuntu.sh from tmp-run.sh
/root/CM/repos/local/cache/9798222eb1384e65/mlperf/bin/python3 main.py  --scenario Offline --dataset-path /root/CM/repos/local/cache/cf3f035c15414140/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl.gz --device cuda:0   --max-batchsize 100 --batch-size 8 --mlperf-conf '/root/CM/repos/local/cache/19f3466c31404fb9/inference/mlperf.conf' --user-conf '/root/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/7fd03acd2fb849d3ad4d7b62d8bbbed1.conf' --output-log-dir /root/CM/repos/local/cache/e905aba5cd2047cf/test_results/969a89282f1c-reference-gpu-pytorch-v2.4.0-default_config/llama2-70b-99/offline/performance/run_1 --dtype float16 --model-path /Llama-2-70b-chat-hf 2>&1 ; echo $? > exitstatus | tee '/root/CM/repos/local/cache/e905aba5cd2047cf/test_results/969a89282f1c-reference-gpu-pytorch-v2.4.0-default_config/llama2-70b-99/offline/performance/run_1/console.out'
usage: main.py [-h] [--scenario {Offline,Server}] [--model-path MODEL_PATH] [--dataset-path DATASET_PATH] [--accuracy] [--dtype DTYPE] [--device {cpu,cuda:0}] [--audit-conf AUDIT_CONF] [--mlperf-conf MLPERF_CONF]
               [--user-conf USER_CONF] [--total-sample-count TOTAL_SAMPLE_COUNT] [--batch-size BATCH_SIZE] [--output-log-dir OUTPUT_LOG_DIR] [--enable-log-trace] [--num-workers NUM_WORKERS]
main.py: error: unrecognized arguments: --max-batchsize 100
./run.sh: line 56: 2: command not found

CM error: Portable CM script failed (name = benchmark-program, return code = 256)

It seems that CM still passes --max-batchsize even though I specified --batch-size=100.

I am running the following command:

cm run script --tags=run-mlperf,_full,inference,_r4.1 \
   --model=llama2-70b-99 \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --device=cuda  \
   --quiet \
   --test_query_count=1000 --batch-size=100 --execution_mode=test \
   --precision=float16 --env.LLAMA2_CHECKPOINT_PATH=/Llama-2-70b-chat-hf
arjunsuresh commented 1 month ago

Can you please do cm pull repo and retry the command?
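
For reference, that would be something like the following (repo alias taken from the script paths in the logs above, so adjust if your setup differs):

cm pull repo mlcommons@cm4mlops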

philross commented 1 month ago

Thanks for the swift reply @arjunsuresh. However, I am still getting the same error. Is it possible that I need to check out a specific branch, or are the changes already in main?

arjunsuresh commented 1 month ago

mlperf-inference is the default branch for MLPerf inference. If you're on the main branch, please switch to the mlperf-inference branch.
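
If the cached repo is still on main, switching it over manually should look roughly like this (path taken from the logs above; just a sketch, not verified on your setup):

cd /root/CM/repos/mlcommons@cm4mlops
git checkout mlperf-inference
git pull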

philross commented 1 month ago

Ah okay @arjunsuresh.

Now the behavior is different, but I see two --batch-size arguments now:

INFO:root:         ! call /root/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run-ubuntu.sh from tmp-run.sh
/root/CM/repos/local/cache/13c4acd3a9ce4f47/mlperf/bin/python3 main.py  --scenario Offline --dataset-path /root/CM/repos/local/cache/742e52c1c7fb499b/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl.gz --device cuda:0   --batch-size 100 --batch-size 8 --mlperf-conf '/root/CM/repos/local/cache/e2331993ea3b4b0e/inference/mlperf.conf' --user-conf '/root/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/2770ec2ee30a466ca2390923ab647c33.conf' --output-log-dir /root/CM/repos/local/cache/7df07c822ee14316/test_results/fa15046c2056-reference-gpu-pytorch-v2.4.0-default_config/llama2-70b-99/offline/performance/run_1 --dtype float16 --model-path /Llama-2-70b-chat-hf 2>&1 ; echo $? > exitstatus | tee '/root/CM/repos/local/cache/7df07c822ee14316/test_results/fa15046c2056-reference-gpu-pytorch-v2.4.0-default_config/llama2-70b-99/offline/performance/run_1/console.out'
--batch-size 100 --batch-size 8

And it seems to respect the latter:

Samples run: 8
    BatchMaker time: 0.004299163818359375
    Inference time: 43.85818552970886
    Postprocess time: 0.0008432865142822266
    ==== Total time: 43.863327980041504
arjunsuresh commented 1 month ago

Sorry for that. Can you please try now?

But the reference implementation is not tested for different batch sizes and the behaviour can be unpredictable. The reference implementations are only meant for showcasing the benchmark requirements.

philross commented 1 month ago

Now it seems to ignore the setting completely and is using batch size 1:


INFO:root:         ! call /root/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run-ubuntu.sh from tmp-run.sh
/root/CM/repos/local/cache/13c4acd3a9ce4f47/mlperf/bin/python3 main.py  --scenario Offline --dataset-path /root/CM/repos/local/cache/742e52c1c7fb499b/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl.gz --device cuda:0   --mlperf-conf '/root/CM/repos/local/cache/e2331993ea3b4b0e/inference/mlperf.conf' --user-conf '/root/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/afe7f8cc6680467da2cb0190f82adc63.conf' --output-log-dir /root/CM/repos/local/cache/7df07c822ee14316/test_results/fa15046c2056-reference-gpu-pytorch-v2.4.0-default_config/llama2-70b-99/offline/performance/run_1 --dtype float16 --model-path /Llama-2-70b-chat-hf 2>&1 ; echo $? > exitstatus | tee '/root/CM/repos/local/cache/7df07c822ee14316/test_results/fa15046c2056-reference-gpu-pytorch-v2.4.0-default_config/llama2-70b-99/offline/performance/run_1/console.out'
Loading dataset...
Finished loading dataset.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:27<00:00,  1.83s/it]
Loaded model
Loaded tokenizer
INFO:Llama-70B-MAIN:Starting Benchmark run
IssueQuery started with 1000 samples
IssueQuery done
Saving outputs to run_outputs/q4118.pkl
Samples run: 1
    BatchMaker time: 0.0007042884826660156
    Inference time: 31.035701990127563
    Postprocess time: 0.0006494522094726562
    ==== Total time: 31.037055730819702
arjunsuresh commented 1 month ago

Should work now - final time.

philross commented 1 month ago

Now it is set correctly! Thanks @arjunsuresh.

Now I am running into this error, but as you said the behavior is unpredictable:


Exception in thread Thread-2:
Traceback (most recent call last):
  File "/root/mambaforge/envs/myenv/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/root/mambaforge/envs/myenv/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/root/CM/repos/local/cache/e2331993ea3b4b0e/inference/language/llama2-70b/SUT.py", line 205, in process_queries
    processed_output = self.data_object.postProcess(pred_output_tokens,
  File "/root/CM/repos/local/cache/e2331993ea3b4b0e/inference/language/llama2-70b/dataset.py", line 91, in postProcess
    with open(fname, mode='wb') as f:
OSError: [Errno 36] File name too long: 'run_outputs/q4118_3770_22617_4660_21859_18301_9778_6453_8562_18955_7100_13731_4407_19701_7475_10848_18702_4084_193_23006_3177_20890_10818_18814_200_2737_5746_8701_9518_4396_326_11445_10594_15303_17265_16644_9158_19060_316_21895_5409_1781_4312_13741_9429_17230_9285_20113_2111_12954_4186_1925_14560_9625_17576_2828_20126_13464_20755_14847_1833_10862_14586_6106_11723_7494_16183_22833_10566_23337_4972_7147_15613_16167_22714_18997_20852_21283_6598_22134_22665_4125_20359_5553_1787_22688_16035_10054_6777_18189_16263_14135_4653_13306_20705_20630_12614_17635_12577_3587.pkl'
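
Looking at the filename, it appears to be built by joining all the sample indices of the batch, so with --batch-size 100 it grows to several hundred characters and exceeds the usual 255-character per-filename limit. A quick way to check that limit (assuming a Linux filesystem, as in the logs above):

getconf NAME_MAX .   # typically 255 on ext4/xfs; the q4118_3770_..._3587.pkl name above is far longer than that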
arjunsuresh commented 1 month ago

Yes, I think we can reduce the batch size but not increase it for the reference implementation.

Are you trying for v4.1 submissions?

philross commented 1 month ago

No, I am not going to submit for v4.1; my goal is to evaluate different hardware options. In that case, the best option is probably to use the v4.1 implementations from the different vendors, correct?

arjunsuresh commented 1 month ago

@philross yes, that's correct. Also, using vllm is a good option - we'll add this support in the documentation next week. By August 28 we'll also have the v4.1 implementations public.
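
In the meantime, serving the model with vLLM's OpenAI-compatible server should look roughly like this (the FP8 model name and port 8000 are just examples matching the setup used later in this thread; exact flags vary by vLLM version):

python3 -m vllm.entrypoints.openai.api_server --model nm-testing/Llama-2-70b-chat-hf-FP8 --port 8000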

philross commented 3 weeks ago

@arjunsuresh Thanks again. Is the vllm option already released?

arjunsuresh commented 3 weeks ago

You're welcome @philross. The docs website is being updated by @anandhu-eng, including the vllm changes. Hopefully the changes will be merged by EOD today.

arjunsuresh commented 3 weeks ago

Hi @philross, we couldn't merge the PR during the last inference WG meeting, but you can see the instructions here.

philross commented 5 days ago

Thanks @arjunsuresh @anandhu-eng! If I follow the instructions, the "Run the Inference Server" step works, but for "Performance Estimation for Offline Scenario" I am getting this error:


hf-FP8 --api-model-name nm-testing/Llama-2-70b-chat-hf-FP8 --vllm  2>&1 ; echo $? > exitstatus | tee '/root/CM/repos/local/cache/8095c510167e4472/test_results/613bbaa74484-reference-gpu-pytorch-v2.4.0-default_config/nm-testing_Llama-2-70b-chat-hf-FP8/offline/performance/run_1/console.out'
usage: main.py [-h] [--scenario {Offline,Server}] [--model-path MODEL_PATH] [--dataset-path DATASET_PATH] [--accuracy] [--dtype DTYPE] [--device {cpu,cuda:0}] [--audit-conf AUDIT_CONF] [--mlperf-conf MLPERF_CONF] [--user-conf USER_CONF]
               [--total-sample-count TOTAL_SAMPLE_COUNT] [--batch-size BATCH_SIZE] [--output-log-dir OUTPUT_LOG_DIR] [--enable-log-trace] [--num-workers NUM_WORKERS]
main.py: error: unrecognized arguments: --api-server http://localhost:8000 --api-model-name nm-testing/Llama-2-70b-chat-hf-FP8 --vllm
./run.sh: line 59: 2: command not found
./run.sh: line 65: 2: command not found

Do I maybe need to check out a specific branch for this?

Edit:

I think I made it work by running:

 cm pull repo mlcommons@cm4mlops --checkout=6d87fa795ef001d7d76fe10217d2e2fa5e9b9742

cm run script \
    --tags=run-mlperf,inference,_full,_compliance \
    --model=llama2-70b-99 \
    --implementation=reference \
    --device=cpu \
    --quiet \
    --api_server=http://localhost:8000 \
    --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm \
    --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
    --test_query_count=1 \
    --server_target_qps=1 \
    --num_workers=1 \
    --scenario=Server \
    --max_test_duration=2000 \
    --execution_mode=valid \
    --offline_target_qps=3 \
    --division=closed \
    --category=datacenter
arjunsuresh commented 5 days ago

Sorry @philross. We need --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm for vllm.

@anandhu-eng Can you please add this in the docs?

anandhu-eng commented 5 days ago

Thank you @philross for pointing out the issue. The additional required tag has been added here.