rabaja opened this issue 1 week ago
Can you show the full logs?
cc @ywang96
(myenv) root@3-1-70-benchmark-pod:/benchmarking# ./benchmark_serving.sh -m Meta-Llama-3.1-70b-instruct -r 3 -t 2 -i 10 -d result --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer-path /mnt/models/meta-llama-3-1-70b-instruct/ --endpoint /v1/completions --save-result True --host 10.244.2.102 --port 8000
Using dataset: sharegpt at ShareGPT_V3_unfiltered_cleaned_split.json
Running: python3 vllm/benchmarks/benchmark_serving.py --host 10.244.2.102 --port 8000 --endpoint /v1/completions --model Meta-Llama-3.1-70b-instruct --tokenizer /mnt/models/meta-llama-3-1-70b-instruct/ --random-input-len 10 --random-output-len 512 --request-rate 3 --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --backend vllm --disable-tqdm --save-result --result-dir result/Meta-Llama-3.1-70b-instruct/RR-3-TP-2-PP-1/IL-10
Namespace(backend='vllm', base_url=None, host='10.244.2.102', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='Meta-Llama-3.1-70b-instruct', tokenizer='/mnt/models/meta-llama-3-1-70b-instruct/', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=3.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=True, profile=False, save_result=True, metadata=None, result_dir='result/Meta-Llama-3.1-70b-instruct/RR-3-TP-2-PP-1/IL-10', result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=10, random_output_len=512, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 3.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
============ Serving Benchmark Result ============
Successful requests: 836
Benchmark duration (s): 442.76
Total input tokens: 142455
Total generated tokens: 179340
Request throughput (req/s): 1.89
Output token throughput (tok/s): 405.05
Total Token throughput (tok/s): 726.79
---------------Time to First Token----------------
Mean TTFT (ms): 51604.09
Median TTFT (ms): 59053.37
P99 TTFT (ms): 86353.16
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 134.60
Median TPOT (ms): 133.32
P99 TPOT (ms): 234.13
---------------Inter-token Latency----------------
Mean ITL (ms): 131.78
Median ITL (ms): 103.53
P99 ITL (ms): 669.26
==================================================
(myenv) root@3-1-70-benchmark-pod:/benchmarking# ./benchmark_serving.sh -m Meta-Llama-3.1-70b-instruct -r 3 -t 2 -i 10 -d result --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer-path /mnt/models/meta-llama-3-1-70b-instruct/ --endpoint /v1/chat/completions --save-result True --host 10.244.2.102 --port 8000
Using dataset: sharegpt at ShareGPT_V3_unfiltered_cleaned_split.json
Running: python3 vllm/benchmarks/benchmark_serving.py --host 10.244.2.102 --port 8000 --endpoint /v1/chat/completions --model Meta-Llama-3.1-70b-instruct --tokenizer /mnt/models/meta-llama-3-1-70b-instruct/ --random-input-len 10 --random-output-len 512 --request-rate 3 --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --backend vllm --disable-tqdm --save-result --result-dir result/Meta-Llama-3.1-70b-instruct/RR-3-TP-2-PP-1/IL-10
Namespace(backend='vllm', base_url=None, host='10.244.2.102', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='Meta-Llama-3.1-70b-instruct', tokenizer='/mnt/models/meta-llama-3-1-70b-instruct/', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=3.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=True, profile=False, save_result=True, metadata=None, result_dir='result/Meta-Llama-3.1-70b-instruct/RR-3-TP-2-PP-1/IL-10', result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=10, random_output_len=512, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
Starting initial single prompt test run...
Traceback (most recent call last):
File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 1136, in <module>
main(args)
File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 794, in main
benchmark_result = asyncio.run(
^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 489, in benchmark
raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Bad Request
(myenv) root@3-1-70-benchmark-pod:/benchmarking#
Please use code blocks to format your logs properly. They are difficult to read.
I think you need to set `--backend openai-chat` to use the Chat API.
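For example, something along these lines should exercise the Chat API (a sketch reusing the host, port, model, and tokenizer from the logs above; adjust to your setup). As far as I can tell, the default `vllm` backend sends Completions-style payloads, which would explain the Bad Request when it is pointed at /v1/chat/completions.

```bash
# Sketch: pair the openai-chat backend with an endpoint ending in chat/completions.
# Host/port/model/tokenizer are copied from the logs above; adjust to your environment.
python3 vllm/benchmarks/benchmark_serving.py \
  --host 10.244.2.102 \
  --port 8000 \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --model Meta-Llama-3.1-70b-instruct \
  --tokenizer /mnt/models/meta-llama-3-1-70b-instruct/ \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 \
  --request-rate 3
```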
I will try, but my model is running on a vLLM server.
(myenv) root@3-1-70-benchmark-pod:/benchmarking# ./benchmark_serving.sh \
-m Meta-Llama-3.1-70b-instruct \
-r 1 \
-i 10 \
-d result \
--backend openai-chat \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--tokenizer-path /mnt/models/meta-llama-3-1-70b-instruct/ \
--endpoint /v1/chat/completions \
--save-result True \
--host 10.244.2.102 \
--port 8000
Using dataset: sharegpt at ShareGPT_V3_unfiltered_cleaned_split.json
Running: python3 vllm/benchmarks/benchmark_serving.py --host 10.244.2.102 --port 8000 --endpoint /v1/chat/completions --model Meta-Llama-3.1-70b-instruct --tokenizer /mnt/models/meta-llama-3-1-70b-instruct/ --random-input-len 10 --random-output-len 512 --request-rate 1 --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --backend openai-chat --disable-tqdm --save-result --result-dir result/Meta-Llama-3.1-70b-instruct/RR-1-TP-1-PP-1/IL-10
Namespace(backend='openai-chat', base_url=None, host='10.244.2.102', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='Meta-Llama-3.1-70b-instruct', tokenizer='/mnt/models/meta-llama-3-1-70b-instruct/', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=1.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=True, profile=False, save_result=True, metadata=None, result_dir='result/Meta-Llama-3.1-70b-instruct/RR-1-TP-1-PP-1/IL-10', result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=10, random_output_len=512, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
Starting initial single prompt test run...
Traceback (most recent call last):
File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 1136, in <module>
main(args)
File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 794, in main
benchmark_result = asyncio.run(
^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 489, in benchmark
raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Bad Request
(myenv) root@3-1-70-benchmark-pod:/benchmarking# ./benchmark_serving.sh \
-m Meta-Llama-3.1-70b-instruct \
-r 1 \
-i 10 \
-d result \
--backend openai-chat \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--tokenizer-path /mnt/models/meta-llama-3-1-70b-instruct/ \
--endpoint /v1/completions \
--save-result True \
--host 10.244.2.102 \
--port 8000
Using dataset: sharegpt at ShareGPT_V3_unfiltered_cleaned_split.json
Running: python3 vllm/benchmarks/benchmark_serving.py --host 10.244.2.102 --port 8000 --endpoint /v1/completions --model Meta-Llama-3.1-70b-instruct --tokenizer /mnt/models/meta-llama-3-1-70b-instruct/ --random-input-len 10 --random-output-len 512 --request-rate 1 --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --backend openai-chat --disable-tqdm --save-result --result-dir result/Meta-Llama-3.1-70b-instruct/RR-1-TP-1-PP-1/IL-10
Namespace(backend='openai-chat', base_url=None, host='10.244.2.102', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='Meta-Llama-3.1-70b-instruct', tokenizer='/mnt/models/meta-llama-3-1-70b-instruct/', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=1.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=True, profile=False, save_result=True, metadata=None, result_dir='result/Meta-Llama-3.1-70b-instruct/RR-1-TP-1-PP-1/IL-10', result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=10, random_output_len=512, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
Starting initial single prompt test run...
Traceback (most recent call last):
File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 1136, in <module>
main(args)
File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 794, in main
benchmark_result = asyncio.run(
^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 487, in benchmark
test_output = await request_func(request_func_input=test_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/benchmarking/vllm/benchmarks/backend_request_func.py", line 317, in async_request_openai_chat_completions
assert api_url.endswith(
^^^^^^^^^^^^^^^^^
AssertionError: OpenAI Chat Completions API URL must end with 'chat/completions'.
(myenv) root@3-1-70-benchmark-pod:/benchmarking#
Now it has failed for both endpoints.
Any update on this?
@ywang96 @comaniac can you help debug this? I'm not familiar with this part of the code.
@rabaja Can you share what's inside `./benchmark_serving.sh`? I cannot repro this with our benchmark script on the main branch.
my server launch command:
vllm serve meta-llama/Llama-3.1-8B-Instruct
Benchmark launch command:
python3 benchmark_serving.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 10 \
--backend openai-chat \
--endpoint /v1/chat/completions \
--request-rate 1
Benchmark result
Namespace(backend='openai-chat', base_url=None, host='localhost', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.1-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=1.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 1.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████| 10/10 [00:12<00:00, 1.21s/it]
============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 12.11
Total input tokens: 1369
Total generated tokens: 2275
Request throughput (req/s): 0.83
Output token throughput (tok/s): 187.87
Total Token throughput (tok/s): 300.92
---------------Time to First Token----------------
Mean TTFT (ms): 28.91
Median TTFT (ms): 28.09
P99 TTFT (ms): 36.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.84
Median TPOT (ms): 7.87
P99 TPOT (ms): 7.90
---------------Inter-token Latency----------------
Mean ITL (ms): 7.84
Median ITL (ms): 7.80
P99 ITL (ms): 8.37
==================================================
It's a wrapper script on top of that, which is what we invoke it from. I have attached it below for your reference.
# Function to display help message
show_help() {
    echo "Usage: ./benchmark_serving.sh [options]"
    echo
    echo "Options:"
    echo "  -m, --model             Model name (default: microsoft/Phi-3-mini-4k-instruct)"
    echo "  -r, --request-rates     Comma-separated list of request rates (default: 10,20,30)"
    echo "  -i, --input-lens        Comma-separated list of input lengths (default: 128,256,512,1024,2048)"
    echo "  -t, --tp                Tensor Parallelism size (default: 1)"
    echo "  -p, --pp                Pipeline Parallelism size (default: 1)"
    echo "  -d, --result-dir        Result directory (default: results)"
    echo "  -h, --host              Host IP address (default: 10.150.17.207)"
    echo "  --port                  Port (default: 8080)"
    echo "  --dataset-name          Dataset name (default: random)"
    echo "  --dataset-path          Path to dataset"
    echo "  --num-prompts           Number of prompts to process (default: 1000)"
    echo "  --random-output-len     Random output length (default: 512)"
    echo "  --backend               Backend for serving (default: vllm)"
    echo "  --tokenizer-path        Tokenizer path"
    echo "  --disable-tqdm          Disable TQDM progress bar"
    echo "  --save-result           Save benchmark results to a file"
    echo "  --endpoint              Endpoint to be tested (default: /v1/completions)"
    echo "  --help                  Show this help message"
    echo
    echo "Example:"
    echo "  ./benchmark_serving.sh -m meta-llama/Meta-Llama-3.1-8B-Instruct -r 10,20,30 -i 128,256,512 -d results --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer-path ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/5f0b02c75b57c5855da9ae460ce51323ea669d8a/"
    echo
}

model_name="microsoft/Phi-3-mini-4k-instruct"
request_rates="10,20,30"
input_lens="128,256,512,1024,2048"
result_dir="results"
host="10.150.17.207"
port="8080"
dataset_name="random"
dataset_path=""
num_prompts=1000
random_output_len=512
backend="vllm"
disable_tqdm="--disable-tqdm"
vllm_path="vllm"
tokenizer_path=""  # Tokenizer path to be provided
tp=1
pp=1
endpoint="/v1/completions"

while [[ "$#" -gt 0 ]]; do
    case $1 in
        -m|--model) model_name="$2"; shift ;;
        -r|--request-rates) request_rates="$2"; shift ;;
        -i|--input-lens) input_lens="$2"; shift ;;
        -t|--tp) tp="$2"; shift ;;
        -p|--pp) pp="$2"; shift ;;
        -d|--result-dir) result_dir="$2"; shift ;;
        -h|--host) host="$2"; shift ;;
        --port) port="$2"; shift ;;
        --dataset-name) dataset_name="$2"; shift ;;
        --dataset-path) dataset_path="$2"; shift ;;
        --num-prompts) num_prompts="$2"; shift ;;
        --random-output-len) random_output_len="$2"; shift ;;
        --backend) backend="$2"; shift ;;
        --tokenizer-path) tokenizer_path="$2"; shift ;;
        --disable-tqdm) disable_tqdm="--disable-tqdm"; shift ;;
        --save-result) save_result="--save-result"; shift ;;
        --endpoint) endpoint="$2"; shift ;;
        --help) show_help; exit 0 ;;
        *) echo "Unknown parameter passed: $1"; show_help; exit 1 ;;
    esac
    shift
done

IFS=',' read -r -a request_rate_array <<< "$request_rates"
IFS=',' read -r -a input_lens_array <<< "$input_lens"

if [[ -n "$dataset_path" ]]; then
    if [[ ! -f "$dataset_path" ]]; then
        echo "Dataset not found at $dataset_path. Exiting..."
        exit 1
    else
        echo "Using dataset: $dataset_name at $dataset_path"
    fi
fi

if [[ -z "$tokenizer_path" ]]; then
    echo "Tokenizer path is required. Please specify the tokenizer path with --tokenizer-path."
    exit 1
fi

for rate in "${request_rate_array[@]}"; do
    for input_len in "${input_lens_array[@]}"; do
        rate_result_dir="${result_dir}/${model_name//\//_}/RR-${rate}-TP-${tp}-PP-${pp}/IL-${input_len}"

        # Create the directory structure
        mkdir -p "$rate_result_dir"

        # Build the command to run the benchmark
        cmd="python3 ${vllm_path}/benchmarks/benchmark_serving.py \
            --host ${host} \
            --port ${port} \
            --endpoint ${endpoint} \
            --model ${model_name} \
            --tokenizer ${tokenizer_path} \
            --random-input-len ${input_len} \
            --random-output-len ${random_output_len} \
            --request-rate ${rate} \
            --dataset-name ${dataset_name} \
            --dataset-path ${dataset_path} \
            --num-prompts ${num_prompts} \
            --backend ${backend} \
            ${disable_tqdm} \
            ${save_result} \
            --result-dir ${rate_result_dir}"

        # Echo the command for debugging
        echo "Running: $cmd"

        # Execute the command
        $cmd
    done
done
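One way to keep the two flags consistent in the wrapper would be to derive the backend from the endpoint before the command is built. This is only a sketch of a possible addition, not part of the original script:

```bash
# Illustrative addition (not in the original wrapper): pick the backend from the
# endpoint so that --backend and --endpoint cannot drift apart.
if [[ "$endpoint" == */chat/completions ]]; then
    backend="openai-chat"
fi
```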
It would be great if you could clone the latest main branch and confirm that the benchmark script works for you.
I did take the latest just yesterday.
🐛 Describe the bug
The vLLM benchmark script is failing for the /v1/chat/completions endpoint.
Namespace(backend='vllm', base_url=None, host='10.244.2.102', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='Meta-Llama-3.1-70b-instruct', tokenizer='/mnt/models/meta-llama-3-1-70b-instruct/', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=3.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=True, profile=False, save_result=True, metadata=None, result_dir='result/Meta-Llama-3.1-70b-instruct/RR-3-TP-2-PP-1/IL-10', result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=10, random_output_len=512, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
ERROR
Starting initial single prompt test run...
Traceback (most recent call last):
File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 1136, in <module>
main(args)
File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 794, in main
benchmark_result = asyncio.run(
^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 489, in benchmark
raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Bad Request
But it works fine with the '/v1/completions' endpoint.
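For reference, the two underlying invocations from the logs above differ only in `--endpoint` (both keep the default `--backend vllm`; the remaining flags are as shown in the logs):

```bash
# Works: vllm backend against the Completions route.
python3 vllm/benchmarks/benchmark_serving.py --host 10.244.2.102 --port 8000 \
  --backend vllm --endpoint /v1/completions \
  --model Meta-Llama-3.1-70b-instruct \
  --tokenizer /mnt/models/meta-llama-3-1-70b-instruct/ \
  --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 --request-rate 3

# Fails with "Bad Request": the same backend pointed at the Chat Completions route.
python3 vllm/benchmarks/benchmark_serving.py --host 10.244.2.102 --port 8000 \
  --backend vllm --endpoint /v1/chat/completions \
  --model Meta-Llama-3.1-70b-instruct \
  --tokenizer /mnt/models/meta-llama-3-1-70b-instruct/ \
  --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 --request-rate 3
```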