Open casper-hansen opened 5 months ago
🚀 The feature, motivation and pitch
ColossalAI has demonstrated an impressive speedup over vLLM in multi-GPU inference: with TP=2, batch size 64, input length 512, and output length 256, a 55% speedup is observed. I believe vLLM could see a similar gain if it adopted more performant batched prefilling.
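To make the idea concrete, here is a minimal sketch of a continuous-batching-style loop that admits new requests each step and runs all pending prompt prefills in a single batched pass before the batched decode step. All names here (`Request`, `ContinuousBatcher`, the `_fake_*` passes) are hypothetical placeholders for illustration only, not the vLLM or ColossalAI implementation.

```python
# Minimal sketch of a continuous-batching loop with batched prefill.
# All names (Request, ContinuousBatcher, the _fake_* passes) are hypothetical
# placeholders -- this is NOT the vLLM or ColossalAI implementation.
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    prompt_len: int          # prompt tokens to prefill
    max_new_tokens: int      # decode budget
    generated: int = 0       # tokens decoded so far
    prefilled: bool = False


class ContinuousBatcher:
    """Admits new requests every step and prefills them together in one
    batched forward pass, instead of prefilling one request at a time."""

    def __init__(self, max_batch_size: int = 64):
        self.max_batch_size = max_batch_size
        self.waiting: deque = deque()
        self.running: list = []

    def add_request(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> None:
        # 1) Admit waiting requests up to the batch-size budget.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

        # 2) Batched prefill: one forward pass over every new prompt.
        prefill_batch = [r for r in self.running if not r.prefilled]
        if prefill_batch:
            self._fake_prefill(prefill_batch)

        # 3) Batched decode: one token for every prefilled request.
        decode_batch = [r for r in self.running if r.prefilled]
        if decode_batch:
            self._fake_decode(decode_batch)

        # 4) Retire finished requests so new ones can be admitted next step.
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]

    def _fake_prefill(self, batch) -> None:
        # Stand-in for a single batched forward pass over all prompts.
        for r in batch:
            r.prefilled = True

    def _fake_decode(self, batch) -> None:
        # Stand-in for a single batched forward pass producing one token each.
        for r in batch:
            r.generated += 1


if __name__ == "__main__":
    batcher = ContinuousBatcher(max_batch_size=64)
    for i in range(4):
        batcher.add_request(Request(req_id=i, prompt_len=512, max_new_tokens=256))
    while batcher.waiting or batcher.running:
        batcher.step()
    print("all requests finished")
```

The point of batching the prefill step is that a single large forward pass over several 512-token prompts should keep the GPUs busier under TP=2 than prefilling requests one at a time, which is where the claimed speedup would come from.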
For reference, here is the continuous batching feature:
Alternatives
No response
Additional context
Blog post: https://hpc-ai.com/blog/colossal-inference
Source code: https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/inference