vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: How is the continuous batching feature of vLLM implemented? #4316

Open llx-08 opened 7 months ago

llx-08 commented 7 months ago

Hi, I'm curious about the implementation of continuous batching. The vLLM paper does not cover it in detail, and from the code I can see that the feature is used, but not how it is actually implemented. Does the attention layer run serially per request, as in Orca, or are all requests in a batch collapsed into one dimension, with a tree attention mask controlling which attention scores are valid? Thank you very much!
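
For concreteness, here is a minimal sketch of the second option, the "collapse into one dimension" packing. This is an illustration only, not vLLM's actual code: token IDs from all requests are flattened into a single 1-D tensor, and cumulative sequence-length offsets (the `cu_seqlens` metadata that varlen attention kernels such as FlashAttention's consume) take the place of an explicit mask. The plain Python loop below stands in for the fused kernel.

```python
# Illustrative sketch of 1-D packing with cu_seqlens metadata; the loop
# plays the role of a fused varlen attention kernel.
import torch
import torch.nn.functional as F

def pack_requests(token_id_lists):
    """Flatten per-request token lists into one 1-D tensor plus
    cumulative sequence-length offsets (cu_seqlens)."""
    flat = torch.tensor([t for seq in token_id_lists for t in seq])
    lens = torch.tensor([len(seq) for seq in token_id_lists])
    cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long),
                            lens.cumsum(0)])
    return flat, cu_seqlens

def varlen_causal_attention(q, k, v, cu_seqlens):
    """Reference (unfused) varlen attention: each sequence attends only
    to its own tokens, causally. q/k/v: [total_tokens, num_heads, head_dim]."""
    out = torch.empty_like(q)
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i].item(), cu_seqlens[i + 1].item()
        # Slice out this request's tokens -> [num_heads, seq_len, head_dim].
        qi, ki, vi = (x[s:e].transpose(0, 1) for x in (q, k, v))
        out[s:e] = F.scaled_dot_product_attention(
            qi, ki, vi, is_causal=True).transpose(0, 1)
    return out

# Usage: three requests of lengths 3, 2, and 4 share one flat "batch".
flat, cu = pack_requests([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
num_heads, head_dim = 2, 8
q = k = v = torch.randn(flat.numel(), num_heads, head_dim)
out = varlen_causal_attention(q, k, v, cu)  # [9, 2, 8]
```

With this layout, requests can join or leave between engine steps simply by re-packing the flat tensor, which is what makes iteration-level (continuous) batching cheap: no padding to a common length and no per-batch mask rebuild is needed.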

wxthu commented 2 months ago

Right! I also wonder how vLLM implements Orca-style batching. And when we batch different requests together, how do we asynchronously get each request's generation results?
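
Here is a minimal sketch of one way the async fan-out can work (hypothetical names, not vLLM's actual AsyncLLMEngine internals): a single batched engine loop pushes each request's new tokens into a per-request asyncio.Queue, so every caller streams its own results independently while the batch composition changes between steps.

```python
# Toy illustration of per-request async streaming over a shared batch
# loop; names and logic are hypothetical, not vLLM internals.
import asyncio

class ToyAsyncEngine:
    def __init__(self):
        self.waiting = []   # (request_id, prompt) pairs to schedule
        self.streams = {}   # request_id -> asyncio.Queue of tokens

    def add_request(self, request_id, prompt):
        self.streams[request_id] = asyncio.Queue()
        self.waiting.append((request_id, prompt))
        return self.streams[request_id]

    async def engine_loop(self, steps=3):
        # One iteration = one model step over the *current* batch; new
        # requests join and finished ones leave between steps, which is
        # the essence of continuous (iteration-level) batching.
        for step in range(steps):
            batch, self.waiting = self.waiting, []
            for request_id, prompt in batch:
                token = f"{prompt}-tok{step}"  # stand-in for model output
                await self.streams[request_id].put(token)
                self.waiting.append((request_id, prompt))
            await asyncio.sleep(0)             # yield to consumers
        for q in self.streams.values():
            await q.put(None)                  # signal end of stream

async def consume(engine, request_id, prompt):
    stream = engine.add_request(request_id, prompt)
    while (token := await stream.get()) is not None:
        print(request_id, "got", token)

async def main():
    engine = ToyAsyncEngine()
    await asyncio.gather(
        consume(engine, "req-A", "hello"),
        consume(engine, "req-B", "world"),
        engine.engine_loop(),
    )

asyncio.run(main())
```

As far as I understand, vLLM's AsyncLLMEngine exposes this pattern at the API level: generate() returns a per-request async stream of RequestOutput objects, so each caller awaits only its own results.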