[Misc]: vLLM performs consistently poor as compared to HF TGI when tested with the DeepSeek Coder Model

vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

https://docs.vllm.ai

Apache License 2.0

31.31k stars 4.75k forks source link

[Misc]: vLLM performs consistently poor as compared to HF TGI when tested with the DeepSeek Coder Model #4030

Closed anindya-saha closed 7 months ago

anindya-saha commented 7 months ago

Anything you want to discuss about vllm.

Hello Folks,

We are using the Deep Seek Coder model for code completions and chat completions. I did try to run the benchmark scripts for that model both for vLLM and TGI and I see that vLLM metrics are consistently poorer as compared to TGI.

Could you please review and comment on the setup ?

Bring up the servers:

MODEL="deepseek-ai/deepseek-coder-1.3b-instruct"

python -m vllm.entrypoints.openai.api_server \
    --model ${MODEL} \
    --swap-space 16 \
    --disable-log-requests

    (TGI backend)
    ./launch_tgi_server.sh ${MODEL} 8192

On the client side, run:

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model ${MODEL} \
    --dataset-name sharegpt \
    --dataset-path /home/anindya/ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 10 \
    --num-prompts 1000 \
    --save-result

python benchmarks/benchmark_serving.py \
    --backend tgi \
    --model ${MODEL} \
    --endpoint /generate_stream \
    --dataset-name sharegpt \
    --dataset-path /home/anindya/ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 10 \
    --num-prompts 1000 \
    --save-result

Results:

With DeepSeek Model vLLM Backend
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  131.96    
Total input tokens:                      243330    
Total generated tokens:                  188582    
Request throughput (req/s):              7.58      
Input token throughput (tok/s):          1844.00   
Output token throughput (tok/s):         1429.11   
---------------Time to First Token----------------
Mean TTFT (ms):                          64.49     
Median TTFT (ms):                        56.54     
P99 TTFT (ms):                           226.61    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          61.00     
Median TPOT (ms):                        57.76     
P99 TPOT (ms):                           142.52      
==================================================

With DeepSeek Model TGI Backend
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  113.41    
Total input tokens:                      243330    
Total generated tokens:                  180307    
Request throughput (req/s):              8.82      
Input token throughput (tok/s):          2145.67   
Output token throughput (tok/s):         1589.94   
---------------Time to First Token----------------
Mean TTFT (ms):                          332.49    
Median TTFT (ms):                        324.67    
P99 TTFT (ms):                           708.42    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.51     
Median TPOT (ms):                        25.55     
P99 TPOT (ms):                           35.88   
==================================================

ywang96 commented 7 months ago

Could you share the hardware you're benchmarking on? I will take a look and try to repro

anindya-saha commented 7 months ago

Hi @ywang96 Its the standard AWS g5.4xlarge instance. Also, I ran multiple times to make sure the stats are consistent. I also did not enable flash-attention or quantization in any of them.

Also, vLLM does not support eetq quantization. Could you also try the deepseek-ai/deepseek-coder-1.3b-instruct with a quantization mechanism.

rbgo404 commented 7 months ago

TTFT is shocking!

youkaichao commented 7 months ago

FYI: you can run https://github.com/vllm-project/vllm/blob/main/collect_env.py to report more environment information.

I believe something is wrong in your environment. It doesn't need 6369.27 ms to get the first token for just a 1.3b model.

ywang96 commented 7 months ago

Hi @ywang96 Its the standard AWS g5.4xlarge instance. Also, I ran multiple times to make sure the stats are consistent.

@anindya-saha I'm actually a bit confused now because the plots you showed make more sense than the original results posted in the issue.

Currently, vllm will always try to schedule requests whenever possible which lead to very low TTFT as expected but also affect TPOT because of prefill blocking decode.

rbgo404 commented 7 months ago

Hey @ywang96, how does batch request happens in OpenAI Client? Can you share some information? I want to see how increasing the batch size impacts the performance

ywang96 commented 7 months ago

Hey @ywang96, how does batch request happens in OpenAI Client? Can you share some information? I want to see how increasing the batch size impacts the performance

IMO for batch processing - benchmarking with benchmark_throughput.py makes more sense.

anindya-saha commented 7 months ago

Hi @ywang96 I ran many experiments when I logged the issue. So the metrics was from one of the run outputs from there.

Anyway, I reran the experiments again to create the charts today. I updated the original post with the metrics from the graph.

anindya-saha commented 7 months ago

FYI: you can run https://github.com/vllm-project/vllm/blob/main/collect_env.py to report more environment information.

I believe something is wrong in your environment. It doesn't need 6369.27 ms to get the first token for just a 1.3b model.

This is the o/p from the env info:

Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.31

Python version: 3.10.13 (main, Feb 23 2024, 00:58:06) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1056-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 535.161.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             16
On-line CPU(s) list:                0-15
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          AuthenticAMD
CPU family:                         23
Model:                              49
Model name:                         AMD EPYC 7R32
Stepping:                           0
CPU MHz:                            2324.061
BogoMIPS:                           5600.00
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          256 KiB
L1i cache:                          256 KiB
L2 cache:                           4 MiB
L3 cache:                           32 MiB
NUMA node0 CPU(s):                  0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collectROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-15    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ywang96 commented 7 months ago

@anindya-saha Could you give the latest vLLM package on PyPI a try? I couldn't quite repro your results on g5 (albeit this is g5.12xlarge, but I'm running a docker image with 12 vCPU + 1 GPU on K8s)

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  115.06    
Total input tokens:                      243330    
Total generated tokens:                  183154    
Request throughput (req/s):              8.69      
Input token throughput (tok/s):          2114.82   
Output token throughput (tok/s):         1591.82   
---------------Time to First Token----------------
Mean TTFT (ms):                          42.66     
Median TTFT (ms):                        37.20     
P99 TTFT (ms):                           126.86    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.00     
Median TPOT (ms):                        30.41     
P99 TPOT (ms):                           85.07     
==================================================

Below is the system info

INFO 04-13 09:49:21 pynccl.py:58] Loading nccl from library libnccl.so.2
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.192-183.736.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.3.103
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             48
On-line CPU(s) list:                0-47
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7R32
CPU family:                         23
Model:                              49
Thread(s) per core:                 2
Core(s) per socket:                 24
Socket(s):                          1
Stepping:                           0
BogoMIPS:                           5599.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          768 KiB (24 instances)
L1i cache:                          768 KiB (24 instances)
L2 cache:                           12 MiB (24 instances)
L3 cache:                           96 MiB (6 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collectROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-47    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

anindya-saha commented 7 months ago

@ywang96 I installed vLLM from source master branch.

Your vLLM results are quite similar to mine. Can you also publish the benchmark results from TGI from your same setup ? Wanted to check are they better or worse in your same setup.

AaronFriel commented 7 months ago

@anindya-saha, if you'd accept some feedback, I was a bit confused as I saw this issue before and after your edit around 11 hours ago. After the edit, the issue title and description don't quite line up with the data. Personally, I find it helpful to append corrected data with context rather than edit-in-place.

anindya-saha commented 7 months ago

@anindya-saha, if you'd accept some feedback, I was a bit confused as I saw this issue before and after your edit around 11 hours ago. After the edit, the issue title and description don't quite line up with the data. Personally, I find it helpful to append corrected data with context rather than edit-in-place.

Hi @AaronFriel That's a good feedback. Thank You. However, after the edit the HF TGI metrics shows it is better than the vLLM metrics (except the TTFT). The graph also show the same. After the edit the issue description and the graph lines up, is that not what you feel ?

AaronFriel commented 7 months ago

I think "consistently poor" suggests to me worse TTFT, TPOT and throughput, but it looks like vLLM and TGI may have made choices to optimize for one or the other. It'd help to clarify in the issue description that this is primarily about lower throughput.

Now that you're using vLLM latest, can you try enabling any of these features to optimize throughput?

enable_chunked_prefill
enable_prefix_caching
scheduler_delay_factor at 0.1, or other values in (0, 0.5]

Given that the performance gap is now only 16%, I think any of these could be sufficient to close the gap.

anindya-saha commented 7 months ago

I think "consistently poor" suggests to me worse TTFT, TPOT and throughput, but it looks like vLLM and TGI may have made choices to optimize for one or the other. It'd help to clarify in the issue description that this is primarily about lower throughput.

Now that you're using vLLM latest, can you try enabling any of these features to optimize throughput?

enable_chunked_prefill

enable_prefix_caching

scheduler_delay_factor at 0.1, or other values in (0, 0.5]

Given that the performance gap is now only 16%, I think any of these could be sufficient to close the gap.

I see what you mean. Yes, we are mainly concerned with the online serving throughput as mentioned in the benchmark_serving.py. TTFT is not a major concern for our use case right now.

I did not want to try any optimization on vLLM yet because I wanted to see how vLLM and TGI compare neck to neck without any optimization and using the vanilla OOTB benchmark scripts provided in the vLLM repo itself. If that shows quite good positive sign already then I can make a case for engineers to allocate time and investigate more with other parameters to bring it down even further.

If I enable any kind of optimization in vLLM then I have to enable similar optimization in the HF TGI server too to make the benchmark study in comparable scenarios, which will be quite time consuming.

I just wanted to check do you see the same behavior with this particular deepseek-ai/deepseek-coder-1.3b-instruct model that default TGI HF performs better w.r.t. throughput as compared to default vLLM without any optimization trick in any of them and using the same setup of EC2 m/c (k8s etc.)

ywang96 commented 7 months ago

@anindya-saha Here are the results I get from TGI, and I actually used their latest version as well (v2.0.0)

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  112.05    
Total input tokens:                      243330    
Total generated tokens:                  184459    
Request throughput (req/s):              8.92      
Input token throughput (tok/s):          2171.58   
Output token throughput (tok/s):         1646.19   
---------------Time to First Token----------------
Mean TTFT (ms):                          329.45    
Median TTFT (ms):                        308.16    
P99 TTFT (ms):                           702.44    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.37     
Median TPOT (ms):                        24.58     
P99 TPOT (ms):                           34.13     
==================================================

One thing I'll have to point out regarding the throughput difference (3% in this case): If your overall concern is just throughput (how many requests you can process over a certain period of time), then benchmark_throughput.py is a better benchmark to run since all requests are sent at time 0.

IMO LLM online inference generally speaking is a lot more complicated than just throughput & latency because of its unique autoregressive nature. For example, one could argue that if the use case is simply generating a few tokens as labels, then TTFT will be significantly more important here. For the ShareGPT dataset, each request has different input and output lengths setting to reflect what it might look like if it's serving a conversation use case.

Therefore, another suggestion I have here is benchmarking on your own dataset that best reflects your use case. Initially I thought about adding a data registry when I refactored benchmark_serving.py so that people can port their own dataset to the benchmark script (I didn't get time to do it unfortunately), but at least I hope the script itself is easy enough to follow and modify to add your own dataset.

AaronFriel commented 7 months ago

I did not want to try any optimization on vLLM yet because I wanted to see how vLLM and TGI compare neck to neck without any optimization and using the vanilla OOTB benchmark scripts provided in the vLLM repo itself. If that shows quite good positive sign already then I can make a case for engineers to allocate time and investigate more with other parameters to bring it down even further.

Definitely, but I think it's important to understand that your optimization goal (maximum throughput) may not be the vLLM projects, it may not be a "bug" for vLLM to default to lower TTFT and lower throughput, and TGI to default to higher TTFT for higher throughput.

I don't know if this project has expressly written down their optimization goals, but that's why options like the ones I shared exist.

For my current use cases, for example, I care deeply about TTFT and significantly less about throughput (within reason). I would not want the vLLM project to change the defaults to favor throughput at any cost. :)

rbgo404 commented 7 months ago

I don't know if this project has expressly written down their optimization goals, but that's why options like the ones I shared exist.

@ywang96 I couldn't able to understand as benchmark_throughput.py usage the ofline serving as compared to the benchmark_serving.py. If I want to use the online serving at prod then it doesn't make sense to compare with ofline serving.

anindya-saha commented 7 months ago

@anindya-saha Here are the results I get from TGI, and I actually used their latest version as well (v2.0.0)
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  112.05    
Total input tokens:                      243330    
Total generated tokens:                  184459    
Request throughput (req/s):              8.92      
Input token throughput (tok/s):          2171.58   
Output token throughput (tok/s):         1646.19   
---------------Time to First Token----------------
Mean TTFT (ms):                          329.45    
Median TTFT (ms):                        308.16    
P99 TTFT (ms):                           702.44    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.37     
Median TPOT (ms):                        24.58     
P99 TPOT (ms):                           34.13     
==================================================
One thing I'll have to point out regarding the throughput difference (3% in this case): If your overall concern is just throughput (how many requests you can process over a certain period of time), then benchmark_throughput.py is a better benchmark to run since all requests are sent at time 0.

IMO LLM online inference generally speaking is a lot more complicated than just throughput & latency because of its unique autoregressive nature. For example, one could argue that if the use case is simply generating a few tokens as labels, then TTFT will be significantly more important here. For the ShareGPT dataset, each request has different input and output lengths setting to reflect what it might look like if it's serving a conversation use case.

Therefore, another suggestion I have here is benchmarking on your own dataset that best reflects your use case. Initially I thought about adding a data registry when I refactored benchmark_serving.py so that people can port their own dataset to the benchmark script (I didn't get time to do it unfortunately), but at least I hope the script itself is easy enough to follow and modify to add your own dataset

Thank you for the suggestions. I will revisit them. Also, the metrics that you posted for HF TGI is similar to what I seeing as well.

I am making an application for Code Completion and also another product for Chat. During inference I need to choose an inference server to host the models. What should I optimize for Latency, Throughput or TTFT ? When you have a chance, I wanted to get your general suggestions and if you have insights on them.

Personally I feel the following:

Code Completion: Optimize for latency to ensure that suggestions appear swiftly as the user types, enhancing the feeling of immediate responsiveness.

Chat Application: Focus primarily on latency and throughput. Latency is critical for a fast conversational experience, and throughput is important to manage multiple users and conversations concurrently without performance bottlenecks.

I am curious to know your thoughts on them.

ywang96 commented 7 months ago

Personally I feel the following:

Code Completion: Optimize for latency to ensure that suggestions appear swiftly as the user types, enhancing the feeling of immediate responsiveness.

Chat Application: Focus primarily on latency and throughput. Latency is critical for a fast conversational experience, and throughput is important to manage multiple users and conversations concurrently without performance bottlenecks.

I am curious to know your thoughts on them.

@anindya-saha One distinction between the two is whether the outputs are streamed.

For code completion, it probably will not be streamed, so end-to-end latency (TTFT + TPOT * # of output tokens - 1) is what you would care about.

For chat application similar to chatgpt where the outputs are streamed, I would actually argue that TTFT is much more significant than TPOT usually since it's the user perceived latency of model response. Take one extreme example: Imagine you have to add 10 seconds to the end-to-end latency of a 100-token streaming output, would you add it to the TTFT, or TPOT by spread out over the 99 tokens?

Of course, the metric to optimize will always depend on the goal of your use case, but I hope the serving benchmark can provide enough insights for you to play around with!