vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Error in benchmark model with vllm backend for endpoint /v1/chat/completions #10158

Open rabaja opened 1 week ago

rabaja commented 1 week ago

Your current environment

The output of `python collect_env.py`:

```text
Your output of `python collect_env.py` here
```

Model Input Dumps

No response

šŸ› Describe the bug

The vLLM benchmark script is failing for the endpoint /v1/chat/completions.

```text
Namespace(backend='vllm', base_url=None, host='10.244.2.102', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='Meta-Llama-3.1-70b-instruct', tokenizer='/mnt/models/meta-llama-3-1-70b-instruct/', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=3.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=True, profile=False, save_result=True, metadata=None, result_dir='result/Meta-Llama-3.1-70b-instruct/RR-3-TP-2-PP-1/IL-10', result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=10, random_output_len=512, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
```

ERROR

```text
Starting initial single prompt test run...
Traceback (most recent call last):
  File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 1136, in <module>
    main(args)
  File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 794, in main
    benchmark_result = asyncio.run(
                       ^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 489, in benchmark
    raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Bad Request
```

But it works fine for the endpoint '/v1/completions'.


DarkLight1337 commented 1 week ago

Can you show the full logs?

DarkLight1337 commented 1 week ago

cc @ywang96

rabaja commented 1 week ago

```text
(myenv) root@3-1-70-benchmark-pod:/benchmarking# ./benchmark_serving.sh     -m Meta-Llama-3.1-70b-instruct     -r 3     -t 2     -i 10     -d result     --dataset-name sharegpt     --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json     --tokenizer-path /mnt/models/meta-llama-3-1-70b-instruct/     --endpoint /v1/completions     --save-result True     --host 10.244.2.102     --port 8000
Using dataset: sharegpt at ShareGPT_V3_unfiltered_cleaned_split.json
Running: python3 vllm/benchmarks/benchmark_serving.py             --host 10.244.2.102             --port 8000       --endpoint /v1/completions      --model Meta-Llama-3.1-70b-instruct             --tokenizer /mnt/models/meta-llama-3-1-70b-instruct/             --random-input-len 10             --random-output-len 512             --request-rate 3             --dataset-name sharegpt             --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json             --num-prompts 1000             --backend vllm             --disable-tqdm             --save-result             --result-dir result/Meta-Llama-3.1-70b-instruct/RR-3-TP-2-PP-1/IL-10
Namespace(backend='vllm', base_url=None, host='10.244.2.102', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='Meta-Llama-3.1-70b-instruct', tokenizer='/mnt/models/meta-llama-3-1-70b-instruct/', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=3.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=True, profile=False, save_result=True, metadata=None, result_dir='result/Meta-Llama-3.1-70b-instruct/RR-3-TP-2-PP-1/IL-10', result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=10, random_output_len=512, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 3.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
============ Serving Benchmark Result ============
Successful requests:                     836       
Benchmark duration (s):                  442.76    
Total input tokens:                      142455    
Total generated tokens:                  179340    
Request throughput (req/s):              1.89      
Output token throughput (tok/s):         405.05    
Total Token throughput (tok/s):          726.79    
---------------Time to First Token----------------
Mean TTFT (ms):                          51604.09  
Median TTFT (ms):                        59053.37  
P99 TTFT (ms):                           86353.16  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          134.60    
Median TPOT (ms):                        133.32    
P99 TPOT (ms):                           234.13    
---------------Inter-token Latency----------------
Mean ITL (ms):                           131.78    
Median ITL (ms):                         103.53    
P99 ITL (ms):                            669.26    
==================================================
(myenv) root@3-1-70-benchmark-pod:/benchmarking# ./benchmark_serving.sh     -m Meta-Llama-3.1-70b-instruct     -r 3     -t 2     -i 10     -d result     --dataset-name sharegpt     --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json     --tokenizer-path /mnt/models/meta-llama-3-1-70b-instruct/     --endpoint /v1/chat/completions     --save-result True     --host 10.244.2.102     --port 8000 
Using dataset: sharegpt at ShareGPT_V3_unfiltered_cleaned_split.json
Running: python3 vllm/benchmarks/benchmark_serving.py             --host 10.244.2.102             --port 8000       --endpoint /v1/chat/completions         --model Meta-Llama-3.1-70b-instruct             --tokenizer /mnt/models/meta-llama-3-1-70b-instruct/             --random-input-len 10             --random-output-len 512             --request-rate 3             --dataset-name sharegpt             --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json             --num-prompts 1000             --backend vllm             --disable-tqdm             --save-result             --result-dir result/Meta-Llama-3.1-70b-instruct/RR-3-TP-2-PP-1/IL-10
Namespace(backend='vllm', base_url=None, host='10.244.2.102', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='Meta-Llama-3.1-70b-instruct', tokenizer='/mnt/models/meta-llama-3-1-70b-instruct/', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=3.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=True, profile=False, save_result=True, metadata=None, result_dir='result/Meta-Llama-3.1-70b-instruct/RR-3-TP-2-PP-1/IL-10', result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=10, random_output_len=512, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
Starting initial single prompt test run...
Traceback (most recent call last):
  File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 1136, in <module>
    main(args)
  File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 794, in main
    benchmark_result = asyncio.run(
                       ^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 489, in benchmark
    raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Bad Request
(myenv) root@3-1-70-benchmark-pod:/benchmarking#
```

DarkLight1337 commented 1 week ago

Please use code blocks to format your logs properly. They are difficult to read.

DarkLight1337 commented 1 week ago

I think you need to set `--backend openai-chat` to use the Chat API.
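
For example, reusing the arguments already shown in your logs (a sketch, not a verified command for your exact setup):

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --host 10.244.2.102 \
    --port 8000 \
    --model Meta-Llama-3.1-70b-instruct \
    --tokenizer /mnt/models/meta-llama-3-1-70b-instruct/ \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000 \
    --request-rate 3
```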

rabaja commented 1 week ago

I will try, but my model is running on a vLLM server.

rabaja commented 1 week ago

```text
(myenv) root@3-1-70-benchmark-pod:/benchmarking# ./benchmark_serving.sh \
    -m Meta-Llama-3.1-70b-instruct \
    -r 1 \
    -i 10 \
    -d result \
    --backend openai-chat \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --tokenizer-path /mnt/models/meta-llama-3-1-70b-instruct/ \
    --endpoint /v1/chat/completions \
    --save-result True \
    --host 10.244.2.102 \
    --port 8000 
Using dataset: sharegpt at ShareGPT_V3_unfiltered_cleaned_split.json
Running: python3 vllm/benchmarks/benchmark_serving.py             --host 10.244.2.102             --port 8000       --endpoint /v1/chat/completions         --model Meta-Llama-3.1-70b-instruct             --tokenizer /mnt/models/meta-llama-3-1-70b-instruct/             --random-input-len 10             --random-output-len 512             --request-rate 1             --dataset-name sharegpt             --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json             --num-prompts 1000             --backend openai-chat             --disable-tqdm             --save-result             --result-dir result/Meta-Llama-3.1-70b-instruct/RR-1-TP-1-PP-1/IL-10
Namespace(backend='openai-chat', base_url=None, host='10.244.2.102', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='Meta-Llama-3.1-70b-instruct', tokenizer='/mnt/models/meta-llama-3-1-70b-instruct/', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=1.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=True, profile=False, save_result=True, metadata=None, result_dir='result/Meta-Llama-3.1-70b-instruct/RR-1-TP-1-PP-1/IL-10', result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=10, random_output_len=512, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
Starting initial single prompt test run...
Traceback (most recent call last):
  File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 1136, in <module>
    main(args)
  File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 794, in main
    benchmark_result = asyncio.run(
                       ^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 489, in benchmark
    raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Bad Request
(myenv) root@3-1-70-benchmark-pod:/benchmarking# ./benchmark_serving.sh \
    -m Meta-Llama-3.1-70b-instruct \
    -r 1 \
    -i 10 \
    -d result \
    --backend openai-chat \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --tokenizer-path /mnt/models/meta-llama-3-1-70b-instruct/ \
    --endpoint /v1/completions \
    --save-result True \
    --host 10.244.2.102 \
    --port 8000 
Using dataset: sharegpt at ShareGPT_V3_unfiltered_cleaned_split.json
Running: python3 vllm/benchmarks/benchmark_serving.py             --host 10.244.2.102             --port 8000       --endpoint /v1/completions      --model Meta-Llama-3.1-70b-instruct             --tokenizer /mnt/models/meta-llama-3-1-70b-instruct/             --random-input-len 10             --random-output-len 512             --request-rate 1             --dataset-name sharegpt             --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json             --num-prompts 1000             --backend openai-chat             --disable-tqdm             --save-result             --result-dir result/Meta-Llama-3.1-70b-instruct/RR-1-TP-1-PP-1/IL-10
Namespace(backend='openai-chat', base_url=None, host='10.244.2.102', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='Meta-Llama-3.1-70b-instruct', tokenizer='/mnt/models/meta-llama-3-1-70b-instruct/', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=1.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=True, profile=False, save_result=True, metadata=None, result_dir='result/Meta-Llama-3.1-70b-instruct/RR-1-TP-1-PP-1/IL-10', result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=10, random_output_len=512, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
Starting initial single prompt test run...
Traceback (most recent call last):
  File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 1136, in <module>
    main(args)
  File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 794, in main
    benchmark_result = asyncio.run(
                       ^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/benchmarking/vllm/benchmarks/benchmark_serving.py", line 487, in benchmark
    test_output = await request_func(request_func_input=test_input)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/benchmarking/vllm/benchmarks/backend_request_func.py", line 317, in async_request_openai_chat_completions
    assert api_url.endswith(
           ^^^^^^^^^^^^^^^^^
AssertionError: OpenAI Chat Completions API URL must end with 'chat/completions'.
(myenv) root@3-1-70-benchmark-pod:/benchmarking#
```

Now it has failed for both endpoints.
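
Note that the second failure above is expected: with `--backend openai-chat`, the benchmark asserts that the endpoint ends with `chat/completions`, as the traceback shows. A minimal sketch of that check, reconstructed from the assertion message rather than copied from the source (`check_chat_endpoint` is a hypothetical helper; the real check sits inside `async_request_openai_chat_completions` in `backend_request_func.py`):

```python
# Sketch of the endpoint check implied by the AssertionError above.
# Pairing --backend openai-chat with --endpoint /v1/completions fails
# this condition, while /v1/chat/completions passes it.
def check_chat_endpoint(api_url: str) -> None:
    assert api_url.endswith(
        "chat/completions"
    ), "OpenAI Chat Completions API URL must end with 'chat/completions'."


check_chat_endpoint("http://10.244.2.102:8000/v1/chat/completions")  # OK
```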

rabaja commented 3 days ago

Any update on this?

DarkLight1337 commented 3 days ago

@ywang96 @comaniac can you help debug this? I'm not familiar with this part of the code.

ywang96 commented 3 days ago

@rabaja Can you share what's inside ./benchmark_serving.sh? I cannot repro this with our benchmark script in the main branch.

my server launch command:

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Benchmark launch command:

```bash
python3 benchmark_serving.py \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --num-prompts 10 \
        --backend openai-chat \
        --endpoint /v1/chat/completions \
        --request-rate 1
```

Benchmark result

```text
Namespace(backend='openai-chat', base_url=None, host='localhost', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Llama-3.1-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=1.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 1.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 10/10 [00:12<00:00,  1.21s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  12.11     
Total input tokens:                      1369      
Total generated tokens:                  2275      
Request throughput (req/s):              0.83      
Output token throughput (tok/s):         187.87    
Total Token throughput (tok/s):          300.92    
---------------Time to First Token----------------
Mean TTFT (ms):                          28.91     
Median TTFT (ms):                        28.09     
P99 TTFT (ms):                           36.37     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.84      
Median TPOT (ms):                        7.87      
P99 TPOT (ms):                           7.90      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.84      
Median ITL (ms):                         7.80      
P99 ITL (ms):                            8.37      
==================================================
```

rabaja commented 3 days ago

It's a wrapper script on top of that, which is what we call. I have attached it below for your reference.

```bash
# Function to display help message
show_help() {
    echo "Usage: ./benchmark_serving.sh [options]"
    echo
    echo "Options:"
    echo "  -m, --model           Model name (default: microsoft/Phi-3-mini-4k-instruct)"
    echo "  -r, --request-rates   Comma-separated list of request rates (default: 10,20,30)"
    echo "  -i, --input-lens      Comma-separated list of input lengths (default: 128,256,512,1024,2048)"
    echo "  -t, --tp              Tensor Parallelism size (default: 1)"
    echo "  -p, --pp              Pipeline Parallelism size (default: 1)"
    echo "  -d, --result-dir      Result directory (default: results)"
    echo "  -h, --host            Host IP address (default: 10.150.17.207)"
    echo "  --port                Port (default: 8080)"
    echo "  --dataset-name        Dataset name (default: random)"
    echo "  --dataset-path        Path to dataset"
    echo "  --num-prompts         Number of prompts to process (default: 1000)"
    echo "  --random-output-len   Random output length (default: 512)"
    echo "  --backend             Backend for serving (default: vllm)"
    echo "  --tokenizer-path      Tokenizer path"
    echo "  --disable-tqdm        Disable TQDM progress bar"
    echo "  --save-result         Save benchmark results to a file"
    echo "  --endpoint            Endpoint to be tested (default: /v1/completions)"
    echo "  --help                Show this help message"
    echo
    echo "Example:"
    echo "  ./benchmark_serving.sh -m meta-llama/Meta-Llama-3.1-8B-Instruct -r 10,20,30 -i 128,256,512 -d results --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer-path ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/5f0b02c75b57c5855da9ae460ce51323ea669d8a/"
    echo
}

# Default values
model_name="microsoft/Phi-3-mini-4k-instruct"
request_rates="10,20,30"
input_lens="128,256,512,1024,2048"
result_dir="results"
host="10.150.17.207"
port="8080"
dataset_name="random"
dataset_path=""
num_prompts=1000
random_output_len=512
backend="vllm"
disable_tqdm="--disable-tqdm"
vllm_path="vllm"
tokenizer_path=""  # Tokenizer path to be provided
tp=1
pp=1
endpoint="/v1/completions"

# Parse command-line arguments
while [[ "$#" -gt 0 ]]; do
    case $1 in
        -m|--model) model_name="$2"; shift ;;
        -r|--request-rates) request_rates="$2"; shift ;;
        -i|--input-lens) input_lens="$2"; shift ;;
        -t|--tp) tp="$2"; shift ;;
        -p|--pp) pp="$2"; shift ;;
        -d|--result-dir) result_dir="$2"; shift ;;
        -h|--host) host="$2"; shift ;;
        --port) port="$2"; shift ;;
        --dataset-name) dataset_name="$2"; shift ;;
        --dataset-path) dataset_path="$2"; shift ;;
        --num-prompts) num_prompts="$2"; shift ;;
        --random-output-len) random_output_len="$2"; shift ;;
        --backend) backend="$2"; shift ;;
        --tokenizer-path) tokenizer_path="$2"; shift ;;
        --disable-tqdm) disable_tqdm="--disable-tqdm"; shift ;;
        --save-result) save_result="--save-result"; shift ;;
        --endpoint) endpoint="$2"; shift ;;
        --help) show_help; exit 0 ;;
        *) echo "Unknown parameter passed: $1"; show_help; exit 1 ;;
    esac
    shift
done

# Convert request rates and input lengths to arrays
IFS=',' read -r -a request_rate_array <<< "$request_rates"
IFS=',' read -r -a input_lens_array <<< "$input_lens"

# Ensure the dataset is available if specified
if [[ -n "$dataset_path" ]]; then
    if [[ ! -f "$dataset_path" ]]; then
        echo "Dataset not found at $dataset_path. Exiting..."
        exit 1
    else
        echo "Using dataset: $dataset_name at $dataset_path"
    fi
fi

# Ensure the tokenizer path is specified
if [[ -z "$tokenizer_path" ]]; then
    echo "Tokenizer path is required. Please specify the tokenizer path with --tokenizer-path."
    exit 1
fi

# Loop over request rates and input lengths
for rate in "${request_rate_array[@]}"; do
    for input_len in "${input_lens_array[@]}"; do
        # Define directory path based on the model name, request rate, TP, and PP
        rate_result_dir="${result_dir}/${model_name//\//_}/RR-${rate}-TP-${tp}-PP-${pp}/IL-${input_len}"

        # Create the directory structure
        mkdir -p "$rate_result_dir"

        # Build the command to run the benchmark
        cmd="python3 ${vllm_path}/benchmarks/benchmark_serving.py \
            --host ${host} \
            --port ${port} \
            --endpoint ${endpoint} \
            --model ${model_name} \
            --tokenizer ${tokenizer_path} \
            --random-input-len ${input_len} \
            --random-output-len ${random_output_len} \
            --request-rate ${rate} \
            --dataset-name ${dataset_name} \
            --dataset-path ${dataset_path} \
            --num-prompts ${num_prompts} \
            --backend ${backend} \
            ${disable_tqdm} \
            ${save_result} \
            --result-dir ${rate_result_dir}"

        # Echo the command for debugging
        echo "Running: $cmd"

        # Execute the command
        $cmd
    done
done
```

ywang96 commented 2 days ago

It would be great if you could clone the latest main branch and confirm that the benchmark script works for you.
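
For example, something along these lines (illustrative only; the invocation mirrors the command shown above):

```bash
# Clone the current main branch and run the benchmark script directly,
# bypassing the wrapper script (arguments copied from the repro above).
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
python3 benchmark_serving.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 10 \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --request-rate 1
```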

rabaja commented 2 days ago

I did take the latest just yesterday.