Open KuntaiDu opened 1 month ago
Hi all @WoosukKwon @zhuohan123 @KuntaiDu @alexm-neuralmagic cc @merrymercy @ying1123 @hnyls2002
First of all, congratulations to vLLM on the improvement in offline throughput over the past month. However, there are some confusing points and errors in this blog post, which I have pointed out in this document.
We reproduced the benchmark results for SGLang v0.3.0 compared to vLLM v0.6.0. In short, with multi-step enabled, in online scenarios, the Median TTFT of vLLM is 3 times that of SGLang, and the Median ITL is 10 times that of SGLang (lower Median TTFT and ITL are better). Also, under maximum throughput, if vLLM uses its default configuration instead of explicitly setting gpu util to 0.95, its maximum throughput is lower than that of SGLang. In other words, vLLM's multi-step optimization did not improve throughput while keeping Median TTFT and ITL low.
ref https://x.com/zhyncs42/status/1831754352278839778
https://github.com/sgl-project/sglang/blob/main/benchmark/benchmark_vllm_060/README.md
AFAIK the current multi-step scheduler sends multiple tokens inside one network packet. Because the benchmark uses 10 steps, this inflates both TTFT (the first token needs to wait) and ITL (ITL only measures inter-network-packet latency in the current vLLM implementation, so it will be 10x). That said, to me this inflation is not fundamental and can be improved by, for example, streaming output token-by-token.
@KuntaiDu Since each packet carries 10 tokens at once, adding a streaming simulator that outputs the tokens sequentially, or introducing an initial delay for the first chunk, will significantly raise inter-token latency or TTFT. The crucial point is that vLLM processes chunks of ten tokens together, rather than generating them individually.
IIUC vLLM currently still generates tokens one-by-one (the scheduling algorithm runs once per 10 steps, unless a new request arrives) but streams out 10 tokens together. I expect that vLLM will stream out tokens one by one in the near future, and that both TTFT and ITL will be reduced after that.
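To make the packet-level effect discussed above concrete, here is a rough back-of-the-envelope sketch (not vLLM code). It assumes a fixed per-step decode time, ignores prefill and queueing, and assumes the client derives TTFT and ITL purely from packet arrival times. The names `packet_times` and `ttft_and_median_itl` and the 20 ms step time are made up for illustration.

```python
from statistics import median


def packet_times(num_tokens, step_ms, tokens_per_packet):
    """Arrival time (ms) of each packet, assuming one token per decode step
    and a packet flushed every `tokens_per_packet` tokens (and at the end)."""
    times = []
    for i in range(1, num_tokens + 1):
        if i % tokens_per_packet == 0 or i == num_tokens:
            times.append(i * step_ms)
    return times


def ttft_and_median_itl(times):
    """TTFT is the first packet's arrival; ITL is measured between packets."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    return times[0], median(gaps)


if __name__ == "__main__":
    # Hypothetical numbers: 100 output tokens, 20 ms per decode step.
    for k in (1, 10):
        ttft, itl = ttft_and_median_itl(packet_times(100, 20, k))
        print(f"tokens_per_packet={k}: TTFT={ttft} ms, median ITL={itl} ms")
```

Under these simplified assumptions, packing 10 tokens per packet makes both measured numbers 10x larger even though the per-token generation rate is unchanged. Real TTFT also includes prefill and scheduling time, so the measured ratio will differ.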
@KuntaiDu We are discussing the current situation here, and right now the ITL is very high. For an explanation of ITL, you can refer to https://github.com/sgl-project/sglang/pull/1340#issuecomment-2332455776. In this scenario, the high ITL introduces choppiness during online serving, leading to a degraded user experience. By the way, I am not challenging vLLM. On the contrary, I greatly appreciate a lot of the work done by vLLM, and I have always found the committers of vLLM such as @robertgshaw2-neuralmagic @ywang96 to be very open-minded. Looking forward to vLLM’s future improvements. Cheers.
Since most of the optimizations de-bottleneck the CPU, I suppose CPU information in the benchmark will be important for us to reproduce and analyze the results. Could you add your CPU info alongside the GPU info for your data? Thx.
We are running the A100 benchmark on the vLLM CI platform (AWS 8x A100 instance) and the H100 benchmark on the mosaic platform. However, as these instances come from cloud platforms and different machines have different CPUs (even with the same spec), I don't have the exact CPU spec on hand.
Misc discussion on performance
To reproduce vLLM's performance benchmark, please launch a shell in the following docker images:
lmsysorg/sglang:v0.3.0-cu124
openmmlab/lmdeploy:v0.6.0a0-cu12
nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
vllm/vllm-openai:v0.6.0
And then run the following bash script (don't forget to replace with your huggingface token that has Llama-3 model access):
Your benchmarking results will be in ./benchmarks/results, with the name format of xxx_nightly_results.json, and can be loaded and converted to a pandas dataframe by pandas.DataFrame.from_dict(). Each benchmark run takes roughly 1 hour 10 minutes assuming that the model weights are already downloaded (and 1 hour 30 minutes for TensorRT-LLM as it needs to convert the model to triton inference engine).
When you run the H100 benchmark inside the TensorRT-LLM docker container, you may experience a memory leak issue (issue link). In this case, please add the following code to Line 211 (right after the for loop) in ./.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh to force TensorRT-LLM to restart the server more often.
Known issue: the number of output tokens is not strictly aligned across engines (even after setting ignore_eos or max_length, due to imperfect implementation of these two flags in different engines). That said, the number of tokens generated by vLLM is roughly aligned with other engines as all engines are performing greedy sampling using the same model.
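For anyone who wants to inspect the result files described above, here is a minimal sketch of loading them with pandas. It assumes each xxx_nightly_results.json file holds a dict that pandas.DataFrame.from_dict() accepts directly (for example column name mapped to values); the helper name `load_nightly_results` is hypothetical and not part of the benchmark scripts.

```python
import json
from pathlib import Path

import pandas as pd


def load_nightly_results(results_dir="./benchmarks/results"):
    """Load every *_nightly_results.json file in results_dir.

    Returns a dict mapping file name -> DataFrame. Assumes (not verified
    against the benchmark scripts) that each file is a JSON object that
    pandas.DataFrame.from_dict() can consume as-is.
    """
    frames = {}
    for path in sorted(Path(results_dir).glob("*_nightly_results.json")):
        with path.open() as f:
            data = json.load(f)
        frames[path.name] = pd.DataFrame.from_dict(data)
    return frames


if __name__ == "__main__":
    for name, df in load_nightly_results().items():
        print(name)
        print(df.head())
```

If the files turn out to use a different layout (for example a list of records), pd.DataFrame(data) or a different orient argument may be needed instead.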