vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

[Feature]: Adopt Colossal Inference Features (55% speedup over vLLM) #5085

Open casper-hansen opened 5 months ago

casper-hansen commented 5 months ago

🚀 The feature, motivation and pitch

ColossalAI has demonstrated an impressive speedup over vLLM in multi-GPU inference. With TP=2, batch size 64, input length 512, and output length 256, a 55% speedup is observed. I believe vLLM could see a comparable speedup if it were to adopt a more performant batched prefilling approach.

[image: throughput comparison, ColossalAI vs. vLLM (TP=2, batch size 64, input len 512, output len 256)]
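
For concreteness, here is a minimal sketch (not from the issue) of how one might reproduce the quoted benchmark settings with vLLM's offline API: TP=2, 64 prompts, roughly 512 input tokens, 256 output tokens. The model name and prompt construction are placeholders I chose for illustration; the issue does not state which model was benchmarked.

```python
import time

from vllm import LLM, SamplingParams

# Hypothetical model choice; swap in whatever model the benchmark actually used.
llm = LLM(model="meta-llama/Llama-2-13b-hf", tensor_parallel_size=2)

# 64 synthetic prompts of roughly 512 tokens each (approximate, since prompts
# are built from repeated words rather than exact token IDs).
prompts = [" ".join(["hello"] * 512) for _ in range(64)]
sampling_params = SamplingParams(max_tokens=256, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {generated_tokens / elapsed:.1f} output tokens/s")
```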

For reference, here is the continuous batching feature:

[image: ColossalAI continuous batching feature]
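
As a rough illustration of the idea (this is a toy sketch, not ColossalAI's or vLLM's actual scheduler), continuous batching admits and retires requests at the granularity of individual decode steps rather than whole batches, so freed slots are reused immediately:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens


def continuous_batching(waiting: deque, max_batch_size: int) -> None:
    running = []
    while waiting or running:
        # Admit new requests whenever slots free up (iteration-level scheduling).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One decode step for every running request (stand-in for a model forward pass).
        for req in running:
            req.generated += 1

        # Retire finished requests immediately so their slots can be reused.
        running = [r for r in running if not r.finished()]


if __name__ == "__main__":
    reqs = deque(Request(prompt_len=512, max_new_tokens=256) for _ in range(64))
    continuous_batching(reqs, max_batch_size=16)
```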

Alternatives

No response

Additional context

Blog post: https://hpc-ai.com/blog/colossal-inference
Source code: https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/inference

github-actions[bot] commented 6 hours ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!