Hi, I'm curious about the implementation of continuous batching.
It isn't described in detail in the vLLM paper, and while the codebase exposes this feature, I couldn't find the code showing exactly how it is implemented. Is the attention layer executed serially per request, as in Orca, or are all requests within a batch collapsed into one token dimension, with a tree-attention mask controlling which attention scores are valid? Thank you very much!
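To make the second alternative concrete, here is a minimal sketch in plain PyTorch (not vLLM code, just an illustration of what I mean): tokens from all requests are concatenated along a single token axis, and a block-diagonal causal mask keeps each request from attending to tokens of other requests.

```python
import torch

def flattened_attention(q, k, v, seq_lens):
    """Toy illustration of 'collapsing all requests into one dimension':
    q, k, v have shape (total_tokens, head_dim), where total_tokens is the
    sum of all request lengths. A block-diagonal causal mask restricts each
    request's tokens to attend only within that request."""
    total, d = q.shape
    # Start with everything masked out (cross-request attention is invalid).
    mask = torch.full((total, total), float("-inf"))
    start = 0
    for n in seq_lens:
        # Causal mask within this request's block: -inf above the diagonal.
        block = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        mask[start:start + n, start:start + n] = block
        start += n
    scores = (q @ k.T) / d ** 0.5 + mask
    return torch.softmax(scores, dim=-1) @ v

# Example: two requests of lengths 3 and 2, collapsed into 5 tokens total.
q = torch.randn(5, 8); k = torch.randn(5, 8); v = torch.randn(5, 8)
out = flattened_attention(q, k, v, seq_lens=[3, 2])
```

Is the actual implementation closer to something like this masked flattened attention, or does it iterate over requests and run attention separately for each one?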