[Performance]: What can we learn from OctoAI

hmellor commented 1 month ago

OctoAI use vLLM as a benchmark to demonstrate how fast they are https://octo.ai/blog/acceleration-is-all-you-need-techniques-powering-octostacks-10x-performance-boost:

Single User Throughput	Multi-user Throughput	Inter-Token Latency

Their main optimisations appear to be:

FP8 quantisation of the model (currently we only support KV cache)
The CustomAllReduce kernel from Nvidia TRT LLM
CUDA graphs
Speculative decoding (which we have thanks to @cadedaniel!)
Dynamic SplitFuse (A.K.A. Chunked Prefill, which we have thanks to @rkooo567!)

My question is, what do we need to do to reach performance parity?

Some clear things are:

Make all of these features compatible with eachother
See what can be learned from the TRT LLM CustomAllReduce
Support executing models in FP8

Notable issues:

youkaichao commented 1 month ago

@KuntaiDu is creating our own benchmarks in realworld models and high-end GPUs. We need to know at first the current speed of vLLM. Companies may use an old version of vLLM or don't know how to set some advanced flags in vLLM, leading to poor performance in their benchmark (and they are incentivized to do so :) ).

hmellor commented 1 month ago

That is an excellent point, I've noticed that too in other comparisons.

Will these benchmarks be made available in https://github.com/vllm-project/vllm/tree/main/benchmarks? I've love for that directory to be tidied up a bit and generalised so that they can be used both offline (as most of them are today) and online (which would be more useful).

youkaichao commented 1 month ago

You can track it via https://github.com/vllm-project/vllm/pull/5073 .

rkooo567 commented 3 weeks ago

+1. It is very easy to trick this kind of benchmark tbh. It is the best we compare it ourselves in a fair way

ywang96 commented 3 weeks ago

ICYMI - they were using vLLM 0.3.3 for this benchmark.

zhyncs commented 2 weeks ago

Make all of these features compatible with each other

Make sense. Currently, the biggest issue with vLLM is that many features are not compatible for simultaneous use. For example, when baseline (vanilla fp16) + automatic prefix cache + chunked prefill + int8 kv cache + awq + speculative decoding are all enabled at the same time, there will be significant benefits compared to just using the baseline (vanilla fp16). ref https://github.com/vllm-project/vllm/issues/2614#issuecomment-2155649411

The main issue here is that, when adding each feature, from design and implementation to review, compatibility has not been given enough attention. ref https://github.com/InternLM/lmdeploy/pull/1450#issuecomment-2062979043

At the same time, the advantages of vLLM are also very obvious, more like a higher-performance transformers. In terms of model support, different hardware backend support and community activity, it is so great.

zhyncs commented 2 weeks ago

In fact, our team started using vLLM in the early part of last year, around July 2023. At that time, we also submitted PRs for W8A8 and KV Cache Int8 in September 2023 https://github.com/vllm-project/vllm/pull/1112. Later, to facilitate review, the PR was split into two parts https://github.com/vllm-project/vllm/pull/1507 https://github.com/vllm-project/vllm/pull/1508. This year, we also submitted a PR for W4A8 https://github.com/vllm-project/vllm/pull/5218.

TensorRT LLM has some closed-source components, such as batch manager and attention kernel, and its usability is average. LMDeploy TurboMind has excellent performance but supports fewer models, for example, it lacks support for MOE models. It can be said that each framework has its own advantages and disadvantages. At that time, we combined our own business needs. For example, we did not use MOE models in the short term because our algorithm colleagues found that after applying SFT to the model, the effect on MOE was not as good as Dense models in the short term (this is another topic).

Currently, many startups write blogs to demonstrate that their LLM Inference framework is better, such as FireWorks AI, FriendliAI and the OctoAI you mentioned above. At this time, they will naturally choose the most popular vLLM in the community and then construct scenarios that are favorable to themselves in testing environments and software versions. I don't think these performance comparison blogs have much significance. It's more about Public Relations.

vllm-project / vllm

[Performance]: What can we learn from OctoAI #5167