philross opened this issue 1 month ago
Can you please do `cm pull repo` and retry the command?
Thanks for the swift reply @arjunsuresh. However, I am still getting the same error. Is it possible that I need to check out a special branch, or are the changes already in main?
`mlperf-inference` is the default branch for MLPerf inference. If you're on the main branch, please switch to it.
Ah okay @arjunsuresh.
The behavior is different now, but I see two `--batch-size` arguments:
INFO:root: ! call /root/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run-ubuntu.sh from tmp-run.sh
/root/CM/repos/local/cache/13c4acd3a9ce4f47/mlperf/bin/python3 main.py --scenario Offline --dataset-path /root/CM/repos/local/cache/742e52c1c7fb499b/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl.gz --device cuda:0 --batch-size 100 --batch-size 8 --mlperf-conf '/root/CM/repos/local/cache/e2331993ea3b4b0e/inference/mlperf.conf' --user-conf '/root/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/2770ec2ee30a466ca2390923ab647c33.conf' --output-log-dir /root/CM/repos/local/cache/7df07c822ee14316/test_results/fa15046c2056-reference-gpu-pytorch-v2.4.0-default_config/llama2-70b-99/offline/performance/run_1 --dtype float16 --model-path /Llama-2-70b-chat-hf 2>&1 ; echo $? > exitstatus | tee '/root/CM/repos/local/cache/7df07c822ee14316/test_results/fa15046c2056-reference-gpu-pytorch-v2.4.0-default_config/llama2-70b-99/offline/performance/run_1/console.out'
--batch-size 100 --batch-size 8
And it seems to respect the latter:
Samples run: 8
BatchMaker time: 0.004299163818359375
Inference time: 43.85818552970886
Postprocess time: 0.0008432865142822266
==== Total time: 43.863327980041504
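The "latter wins" behavior is what Python's `argparse` does by default: when the same option is passed twice, the second value overwrites the first. A minimal standalone sketch (a toy parser, not the actual `main.py` code):

```python
import argparse

# A toy parser with a single --batch-size option, mimicking how a CLI
# built on argparse treats a repeated flag.
parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int)

# Pass the flag twice, as in the generated command above.
args = parser.parse_args(["--batch-size", "100", "--batch-size", "8"])
print(args.batch_size)  # 8 -- the later occurrence overwrites the earlier one
```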
Sorry for that. Can you please try now?
But the reference implementation is not tested for different batch sizes and the behaviour can be unpredictable. The reference implementations are only meant for showcasing the benchmark requirements.
Now it seems to ignore the setting completely and is using batch size 1:
INFO:root: ! call /root/CM/repos/mlcommons@cm4mlops/script/benchmark-program/run-ubuntu.sh from tmp-run.sh
/root/CM/repos/local/cache/13c4acd3a9ce4f47/mlperf/bin/python3 main.py --scenario Offline --dataset-path /root/CM/repos/local/cache/742e52c1c7fb499b/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl.gz --device cuda:0 --mlperf-conf '/root/CM/repos/local/cache/e2331993ea3b4b0e/inference/mlperf.conf' --user-conf '/root/CM/repos/mlcommons@cm4mlops/script/generate-mlperf-inference-user-conf/tmp/afe7f8cc6680467da2cb0190f82adc63.conf' --output-log-dir /root/CM/repos/local/cache/7df07c822ee14316/test_results/fa15046c2056-reference-gpu-pytorch-v2.4.0-default_config/llama2-70b-99/offline/performance/run_1 --dtype float16 --model-path /Llama-2-70b-chat-hf 2>&1 ; echo $? > exitstatus | tee '/root/CM/repos/local/cache/7df07c822ee14316/test_results/fa15046c2056-reference-gpu-pytorch-v2.4.0-default_config/llama2-70b-99/offline/performance/run_1/console.out'
Loading dataset...
Finished loading dataset.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:27<00:00, 1.83s/it]
Loaded model
Loaded tokenizer
INFO:Llama-70B-MAIN:Starting Benchmark run
IssueQuery started with 1000 samples
IssueQuery done
Saving outputs to run_outputs/q4118.pkl
Samples run: 1
BatchMaker time: 0.0007042884826660156
Inference time: 31.035701990127563
Postprocess time: 0.0006494522094726562
==== Total time: 31.037055730819702
Should work now - final time.
Now it is set in a correct manner! Thanks @arjunsuresh.
Now I am running into this error, but as you said the behavior is unpredictable:
Exception in thread Thread-2:
Traceback (most recent call last):
File "/root/mambaforge/envs/myenv/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/root/mambaforge/envs/myenv/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/root/CM/repos/local/cache/e2331993ea3b4b0e/inference/language/llama2-70b/SUT.py", line 205, in process_queries
processed_output = self.data_object.postProcess(pred_output_tokens,
File "/root/CM/repos/local/cache/e2331993ea3b4b0e/inference/language/llama2-70b/dataset.py", line 91, in postProcess
with open(fname, mode='wb') as f:
OSError: [Errno 36] File name too long: 'run_outputs/q4118_3770_22617_4660_21859_18301_9778_6453_8562_18955_7100_13731_4407_19701_7475_10848_18702_4084_193_23006_3177_20890_10818_18814_200_2737_5746_8701_9518_4396_326_11445_10594_15303_17265_16644_9158_19060_316_21895_5409_1781_4312_13741_9429_17230_9285_20113_2111_12954_4186_1925_14560_9625_17576_2828_20126_13464_20755_14847_1833_10862_14586_6106_11723_7494_16183_22833_10566_23337_4972_7147_15613_16167_22714_18997_20852_21283_6598_22134_22665_4125_20359_5553_1787_22688_16035_10054_6777_18189_16263_14135_4653_13306_20705_20630_12614_17635_12577_3587.pkl'
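The `OSError: [Errno 36] File name too long` comes from joining every sample id of the batch into a single filename, which exceeds the filesystem's per-component limit (`NAME_MAX`, typically 255 bytes on Linux). A hypothetical workaround, not the project's actual fix, would be to fall back to a hash of the id list whenever the joined name gets too long:

```python
import hashlib
import os

NAME_MAX = 255  # typical per-component filename limit on Linux filesystems

def safe_output_name(sample_ids, prefix="run_outputs/q", ext=".pkl"):
    """Join sample ids into a filename; hash the list if it would be too long."""
    joined = "_".join(str(i) for i in sample_ids)
    name = f"{prefix}{joined}{ext}"
    if len(os.path.basename(name).encode()) > NAME_MAX:
        # Keep the first id for readability, replace the rest with a digest.
        digest = hashlib.sha1(joined.encode()).hexdigest()
        name = f"{prefix}{sample_ids[0]}_{digest}{ext}"
    return name

print(safe_output_name([4118]))                    # run_outputs/q4118.pkl
long_name = safe_output_name(list(range(10000, 10100)))
print(len(os.path.basename(long_name)) <= NAME_MAX)  # True
```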
Yes, I think we can reduce the batch size but not increase it for the reference implementation.
Are you trying for v4.1 submissions?
No, I am not going to submit for v4.1; my goal is to evaluate different hardware options. But then the best option is probably to use the v4.1 implementations from the different vendors, correct?
@philross yes, that's correct. Also, using `vllm` is a good option; we'll add this support to the documentation next week. By August 28 we'll have the v4.1 implementations public as well.
@arjunsuresh Thanks again! Is the `vllm` option already released?
You're welcome @philross. The docs website is being updated by @anandhu-eng, including the `vllm` changes. Hopefully the changes will be merged by EOD today.
Hi @philross, we couldn't merge the PR during the last inference WG meeting, but you can see the instructions here.
Thanks @arjunsuresh @anandhu-eng! When I follow the instructions, the "Run the Inference Server" step works, but for "Performance Estimation for Offline Scenario" I am getting this error:
hf-FP8 --api-model-name nm-testing/Llama-2-70b-chat-hf-FP8 --vllm 2>&1 ; echo $? > exitstatus | tee '/root/CM/repos/local/cache/8095c510167e4472/test_results/613bbaa74484-reference-gpu-pytorch-v2.4.0-default_config/nm-testing_Llama-2-70b-chat-hf-FP8/offline/performance/run_1/console.out'
usage: main.py [-h] [--scenario {Offline,Server}] [--model-path MODEL_PATH] [--dataset-path DATASET_PATH] [--accuracy] [--dtype DTYPE] [--device {cpu,cuda:0}] [--audit-conf AUDIT_CONF] [--mlperf-conf MLPERF_CONF] [--user-conf USER_CONF]
[--total-sample-count TOTAL_SAMPLE_COUNT] [--batch-size BATCH_SIZE] [--output-log-dir OUTPUT_LOG_DIR] [--enable-log-trace] [--num-workers NUM_WORKERS]
main.py: error: unrecognized arguments: --api-server http://localhost:8000 --api-model-name nm-testing/Llama-2-70b-chat-hf-FP8 --vllm
./run.sh: line 59: 2: command not found
./run.sh: line 65: 2: command not found
Do I maybe need to check out a specific branch for this?
Edit:
I think I made it work by running:
cm pull repo mlcommons@cm4mlops --checkout=6d87fa795ef001d7d76fe10217d2e2fa5e9b9742
cm run script \
--tags=run-mlperf,inference,_full,_compliance \
--model=llama2-70b-99 \
--implementation=reference \
--device=cpu \
--quiet \
--api_server=http://localhost:8000 \
--adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm \
--vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \
--test_query_count=1 \
--server_target_qps=1 \
--num_workers=1 \
--scenario=Server \
--max_test_duration=2000 \
--execution_mode=valid \
--offline_target_qps=3 \
--division=closed \
--category=datacenter
Sorry @philross. We need `--adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm` for `vllm`.
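For context, the earlier `unrecognized arguments` failure is just `argparse` rejecting flags that the checked-out `main.py` does not declare; the `vllm` branch is what adds `--api-server` and friends. A minimal illustration of that failure mode (toy parser, flag names chosen to match the log, not the real `main.py`):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--scenario")  # declared flag: accepted
# --api-server is NOT declared here, mimicking a main.py without vllm support.
# parse_args() would exit with "unrecognized arguments"; parse_known_args()
# instead returns the unknown tokens so we can inspect them.
args, unknown = parser.parse_known_args(
    ["--scenario", "Offline", "--api-server", "http://localhost:8000"])
print(unknown)  # ['--api-server', 'http://localhost:8000']
```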
@anandhu-eng Can you please add this in the docs?
Thank you @philross for pointing out the issue. The additional required tag has been added here.
Hello mlcommons team,
I want to run the "Automated command to run the benchmark via MLCommons CM" (from the example: https://github.com/mlcommons/inference/tree/master/language/llama2-70b) with a different batch size, but I am getting the following error:
It seems that it sets the `max-batchsize` even though I specified `--batch-size=100`.
I am running the following command: