vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: benchmark_serving model_id bug for lmdeploy #4001

Closed. zhyncs closed this issue 3 months ago.

zhyncs commented 3 months ago

Your current environment

PyTorch version: 2.1.2+cu118
CUDA used to build PyTorch: 11.8
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11)
Libc version: glibc-2.17

Python version: 3.9.16 (main, Aug 15 2023, 19:38:56)  [GCC 8.3.1 20190311 (Red Hat 8.3.1-3)] (64-bit runtime)
Python platform: Linux-4.18.0-147.mt20200626.413.el8_1.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB

Nvidia driver version: 470.103.01

🐛 Describe the bug

Hi @ywang96. There is currently a small issue in benchmarks/backend_request_func.py when benchmarking LMDeploy with Llama-2-13b-chat-hf.

# server
python3 -m lmdeploy serve api_server /workdir/Llama-2-13b-chat-hf
# client
# https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py
python3 benchmarks/benchmark_serving.py --backend lmdeploy --model /workdir/Llama-2-13b-chat-hf --dataset-name sharegpt --dataset-path /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 128 --num-prompts 1000 --port 23333

I need to manually change request_func_input.model to llama2 here: https://github.com/vllm-project/vllm/blob/f3d0bf7589d6e63a691dcbb9d1db538c184fde29/benchmarks/backend_request_func.py#L222
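Concretely, the change amounts to something like this (a paraphrased sketch of the payload construction, not the exact upstream code):

def build_lmdeploy_payload(prompt: str, output_len: int) -> dict:
    """Sketch of the manual workaround in benchmarks/backend_request_func.py:
    hard-code the lmdeploy template name instead of request_func_input.model."""
    return {
        "model": "llama2",        # was request_func_input.model, i.e. /workdir/Llama-2-13b-chat-hf
        "prompt": prompt,
        "max_tokens": output_len,
        "stream": True,
    }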

After this manual modification, the benchmark runs and produces the correct result:

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  106.18
Total input tokens:                      248339
Total generated tokens:                  198641
Request throughput (req/s):              9.42
Input token throughput (tok/s):          2338.84
Output token throughput (tok/s):         1870.79
---------------Time to First Token----------------
Mean TTFT (ms):                          28614.33
Median TTFT (ms):                        24839.67
P99 TTFT (ms):                           80789.10
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          63.67
Median TPOT (ms):                        59.62
P99 TPOT (ms):                           220.70
==================================================

Without this change, the test results are incorrect because the model name does not match.

ywang96 commented 3 months ago

I wouldn't call this a bug. Unlike other inference backends, which use the Hugging Face model id as the default model name for the API server, the model name for the lmdeploy API server has to be one of the names listed by lmdeploy list, and as far as I know it cannot be user-defined when launching the server either.

lmdeploy list
The older chat template name like "internlm2-7b", "qwen-7b" and so on are deprecated and will be removed in the future. The supported chat template names are:
baichuan2
chatglm
codellama
dbrx
deepseek
deepseek-coder
deepseek-vl
falcon
gemma
internlm
internlm2
llama
llama2
mistral
mixtral
puyu
qwen
solar
ultracm
ultralm
vicuna
wizardlm
yi
yi-vl

The benchmark script already lets users specify --model, and the flag serves only two purposes (see the sketch after this list):

  1. It is used as the value of model in the payload when calling the server via the OpenAI API.
  2. It is used to identify the tokenizer, but only if the user doesn't specify --tokenizer.
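Roughly speaking, the logic is (a simplified sketch, not the literal benchmark code):

from typing import Optional

def resolve_ids(model: str, tokenizer: Optional[str]) -> tuple:
    """Simplified view of how the benchmark treats --model and --tokenizer."""
    payload_model = model              # 1. sent as "model" in the OpenAI-style request payload
    tokenizer_id = tokenizer or model  # 2. tokenizer source, only when --tokenizer is omitted
    return payload_model, tokenizer_id

# e.g. for lmdeploy:
# resolve_ids("llama2", "/workdir/Llama-2-13b-chat-hf")
# -> ("llama2", "/workdir/Llama-2-13b-chat-hf")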

I can make a PR to make this clearer if that helps.

zhyncs commented 3 months ago

@ywang96 To make this benchmark run as expected, perhaps we could add a parameter similar to model_name for the lmdeploy scenario. Do you have any suggestions?
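Hypothetically, something like this (only a sketch; the --model-name flag and the fallback below are illustrative and do not exist in the current script):

import argparse

# Hypothetical sketch of the proposal: a separate flag for the name sent to the
# server, so --model can keep pointing at the HF path used for the tokenizer.
parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True,
                    help="HF model path, also used to load the tokenizer.")
parser.add_argument("--model-name", default=None,
                    help="Name to put in the request payload (e.g. 'llama2' for lmdeploy); "
                         "falls back to --model when omitted.")
args = parser.parse_args()

payload_model = args.model_name or args.model  # what the lmdeploy server sees
tokenizer_id = args.model                      # what the tokenizer is loaded from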

ywang96 commented 3 months ago

> @ywang96 To make this benchmark run as expected, perhaps we could add a parameter similar to model_name for the lmdeploy scenario. Do you have any suggestions? Just adding some user instructions without changing anything will not solve the issue.

Wouldn't this work for lmdeploy? (Modified based on your original command in the issue)

python3 benchmarks/benchmark_serving.py \
               --backend lmdeploy \
               --model llama2 \
               --tokenizer /workdir/Llama-2-13b-chat-hf \
               --dataset-name sharegpt \
               --dataset-path /workdir/ShareGPT_V3_unfiltered_cleaned_split.json \
               --request-rate 128 \
               --num-prompts 1000 \
               --port 23333

Perhaps I can make the intention of --model clearer in its argument help message.

zhyncs commented 3 months ago

Makes sense.

zhyncs commented 3 months ago

> I can make a PR to make this clearer if that helps.

Sounds good.