vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

[Feature]: Adopt Colossal Inference Features (55% speedup over vLLM) #5085

Open casper-hansen opened 5 months ago

casper-hansen commented 5 months ago

🚀 The feature, motivation and pitch

ColossalAI has demonstrated an impressive speedup over vLLM in multi-GPU inference. With TP=2, batch size 64, input length 512, and output length 256, a 55% speedup is observed. I believe vLLM could see a comparable speedup if it were to adopt a more performant batched prefilling approach.

[image: throughput comparison, ColossalAI vs. vLLM (TP=2, batch size 64, input len 512, output len 256)]
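
For concreteness, here is a minimal sketch (not from the issue) of how one might reproduce the quoted benchmark settings with vLLM's offline API: TP=2, 64 prompts, roughly 512 input tokens, 256 output tokens. The model name and prompt construction are placeholders I chose for illustration; the issue does not state which model was benchmarked.

```python
import time

from vllm import LLM, SamplingParams

# Hypothetical model choice; swap in whatever model the benchmark actually used.
llm = LLM(model="meta-llama/Llama-2-13b-hf", tensor_parallel_size=2)

# 64 synthetic prompts of roughly 512 tokens each (approximate, since prompts
# are built from repeated words rather than exact token IDs).
prompts = [" ".join(["hello"] * 512) for _ in range(64)]
sampling_params = SamplingParams(max_tokens=256, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {generated_tokens / elapsed:.1f} output tokens/s")
```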

For reference, here is the continuous batching feature:

[image: ColossalAI continuous batching feature]
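
As a rough illustration of the idea (this is a toy sketch, not ColossalAI's or vLLM's actual scheduler), continuous batching admits and retires requests at the granularity of individual decode steps rather than whole batches, so freed slots are reused immediately:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens


def continuous_batching(waiting: deque, max_batch_size: int) -> None:
    running = []
    while waiting or running:
        # Admit new requests whenever slots free up (iteration-level scheduling).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One decode step for every running request (stand-in for a model forward pass).
        for req in running:
            req.generated += 1

        # Retire finished requests immediately so their slots can be reused.
        running = [r for r in running if not r.finished()]


if __name__ == "__main__":
    reqs = deque(Request(prompt_len=512, max_new_tokens=256) for _ in range(64))
    continuous_batching(reqs, max_batch_size=16)
```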

Alternatives

No response

Additional context

Blog post: https://hpc-ai.com/blog/colossal-inference
Source code: https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/inference

github-actions[bot] commented 6 hours ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!