[Performance]: reproducing vLLM performance benchmark

KuntaiDu commented 1 month ago

Misc discussion on performance

To reproduce vLLM's performance benchmark, please launch a shell in the following docker images:

SGlang: lmsysorg/sglang:v0.3.0-cu124
lmdeploy: openmmlab/lmdeploy:v0.6.0a0-cu12
TensorRT-LLM: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
vLLM: vllm/vllm-openai:v0.6.0

Then run the following bash script (remember to replace the placeholder with a Hugging Face token that has Llama-3 model access):

export HF_TOKEN=<your HF TOKEN>
apt update
apt install -y wget unzip 
# download benchmarking code
wget -O benchmarking_code.zip https://buildkite.com/organizations/vllm/pipelines/performance-benchmark/builds/8532/jobs/0191bbbf-c603-4c15-9f5d-e0b2933ba097/artifacts/0191bd2a-d6cd-4f6d-b618-a7aa2c39456c
unzip benchmarking_code.zip
# remove previous results
rm -rf ./benchmarks/results
VLLM_SOURCE_CODE_LOC=$(pwd) bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh

Your benchmarking results will be in ./benchmarks/results, named xxx_nightly_results.json. They can be loaded and converted to a pandas DataFrame with pandas.DataFrame.from_dict(). Each benchmark run takes roughly 1 hour 10 minutes, assuming the model weights are already downloaded (1 hour 30 minutes for TensorRT-LLM, as it needs to convert the model to a Triton inference engine).
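As a rough illustration of the loading step, here is a minimal sketch. It assumes each xxx_nightly_results.json file is a plain JSON document whose layout pandas.DataFrame.from_dict() accepts. The helper names load_nightly_results and to_dataframe are hypothetical and are not part of the benchmark scripts.

```python
import json
from pathlib import Path


def load_nightly_results(results_dir="./benchmarks/results"):
    """Return {file name: parsed JSON} for every *_nightly_results.json file."""
    results = {}
    for path in sorted(Path(results_dir).glob("*_nightly_results.json")):
        with path.open() as f:
            results[path.name] = json.load(f)
    return results


def to_dataframe(parsed):
    """Convert one parsed result file to a DataFrame (requires pandas).

    Assumption: the JSON layout is one that DataFrame.from_dict() accepts.
    """
    import pandas as pd  # imported lazily so loading works without pandas

    return pd.DataFrame.from_dict(parsed)
```

Calling to_dataframe() on each value returned by load_nightly_results() gives one DataFrame per engine, which can then be compared side by side.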

When you run the H100 benchmark inside the TensorRT-LLM docker container, you may run into a memory leak (issue link). In this case, please add the following code

      # temporary fix for trt
      kill_gpu_processes
      bash -c "python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
              --world_size=${tp} \
              --model_repo=/tensorrtllm_backend/triton_model_repo & " </dev/null >/dev/null 2>&1 &
      wait_for_server

to Line 211 (right after the for loop) in ./.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh to force TensorRT-LLM to restart the server more often.

Known issue:

Across serving engines, the number of output tokens does not strictly align, even after setting ignore_eos or max_length, because these two flags are implemented imperfectly in different engines. That said, the number of tokens generated by vLLM is roughly aligned with other engines, as all engines perform greedy sampling using the same model.

zhyncs commented 1 month ago

Hi all @WoosukKwon @zhuohan123 @KuntaiDu @alexm-neuralmagic cc @merrymercy @ying1123 @hnyls2002

First of all, congratulations to vLLM on the improvement in offline throughput over the past month. However, there are some confusing points or errors in this blog post, which I have pointed out in this document.

We reproduced the benchmark results for SGLang v0.3.0 compared to vLLM v0.6.0. In short, with multi-step enabled, in online scenarios the median TTFT of vLLM is 3 times that of SGLang, and the median ITL is 10 times that of SGLang (lower median TTFT and ITL are better). Also, under maximum throughput, if vLLM does not explicitly set GPU utilization to 0.95 and uses the default configuration instead, its maximum throughput is lower than that of SGLang. In other words, vLLM's multi-step optimization did not improve throughput while keeping median TTFT and ITL low.

ref https://x.com/zhyncs42/status/1831754352278839778

https://github.com/sgl-project/sglang/blob/main/benchmark/benchmark_vllm_060/README.md

KuntaiDu commented 1 month ago

AFAIK the current multi-step scheduler sends multiple tokens inside one network packet. Because the benchmark uses 10 steps, this inflates both TTFT (the first token has to wait) and ITL (in the current vLLM implementation, ITL only measures inter-packet latency, so it is 10x higher). That said, I don't think this inflation is fundamental; it can be improved by, for example, streaming output token by token.
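To make the inflation argument concrete, here is a toy model, not vLLM code. It assumes every decoding step takes the same time and that a token is only sent once its packet is full or generation ends. The function names and the prefill_time parameter are hypothetical.

```python
def arrival_times(num_tokens, step_time, tokens_per_packet, prefill_time=0.0):
    """Time at which each token reaches the client in the toy model.

    Token i (0-based) is generated at prefill_time + (i + 1) * step_time,
    but it is only sent together with the last token of its packet.
    """
    times = []
    for i in range(num_tokens):
        packet_end = (i // tokens_per_packet + 1) * tokens_per_packet
        last_in_packet = min(packet_end, num_tokens)
        times.append(prefill_time + last_in_packet * step_time)
    return times


def measured_latency(times):
    """TTFT and inter-packet gaps as seen by a client that timestamps packets."""
    packet_times = sorted(set(times))
    ttft = packet_times[0]
    gaps = [b - a for a, b in zip(packet_times, packet_times[1:])]
    return ttft, gaps
```

Under these assumptions, comparing measured_latency(arrival_times(n, t, 10)) with measured_latency(arrival_times(n, t, 1)) gives a first-packet time and inter-packet gaps that are both 10 times larger in the batched case, even though the per-step generation time t is unchanged.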

zhyncs commented 1 month ago

@KuntaiDu Since each packet carries 10 tokens at once, adding a streaming simulator that outputs tokens sequentially, or introducing an initial delay for the first chunk, would significantly raise inter-token latency or TTFT. The crucial point is that vLLM processes tokens in chunks of ten rather than generating them individually.

KuntaiDu commented 1 month ago

IIUC vLLM currently still generates tokens one by one (the scheduling algorithm runs once per 10 steps, unless a new request arrives) but streams 10 tokens out together. I expect vLLM to stream tokens out one by one in the near future, and both TTFT and ITL should drop after that.

zhyncs commented 1 month ago

@KuntaiDu We are discussing the current situation here, and right now the ITL is very high. For an explanation of ITL, you can refer to https://github.com/sgl-project/sglang/pull/1340#issuecomment-2332455776. In this scenario, it introduces choppiness during online serving, leading to a degraded user experience. By the way, I am not challenging vLLM. On the contrary, I greatly appreciate a lot of the work done by vLLM, and I have always found vLLM committers such as @robertgshaw2-neuralmagic and @ywang96 to be very open-minded. Looking forward to vLLM's future improvements. Cheers.

yao-matrix commented 1 month ago

Since most of the optimizations remove CPU bottlenecks, I suppose CPU information for the benchmark will be important for us to reproduce and analyze the results. Could you add your CPU info alongside the GPU info for your data? Thx.

KuntaiDu commented 1 month ago

> Since most of the optimizations remove CPU bottlenecks, I suppose CPU information for the benchmark will be important for us to reproduce and analyze the results. Could you add your CPU info alongside the GPU info for your data? Thx.

We are running the A100 benchmark on the vLLM CI platform (an AWS 8x A100 instance) and the H100 benchmark on the mosaic platform. However, as these are cloud instances and different machines have different CPUs (even with the same spec), I don't have the exact CPU spec at hand.
