🚀 The feature, motivation and pitch

In advance_step.cu, there is a constraint on the number of sequences, derived from the number of available GPU threads and the block_tables stride:
```cpp
// TODO(will): support arbitrary block_tables stride
if ((blocks * threads) / block_tables.stride(0) < num_queries) {
  TORCH_CHECK(false,
```
This prevents supporting the larger batch sizes for which we recently enabled CUDA graphs, which significantly hurts performance on H100/H200 machines with models like Llama-70B. We would also like to help; @youkaichao, please kindly connect us with Will. Thank you.
Alternatives
Change the kernel to support a larger number of sequences
Additional context
No response
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.