🚀 The feature, motivation and pitch

In advance_step.cu, there is a constraint on the number of sequences, derived from the number of available GPU threads and the block_tables stride:
```cpp
// TODO(will): support arbitrary block_tables stride
if ((blocks * threads) / block_tables.stride(0) < num_queries) {
  TORCH_CHECK(false,
```
This prevents supporting the larger batch sizes for which we recently enabled CUDA graphs, which significantly hurts performance on H100/H200 machines with models like Llama-70B. We would also like to help; @youkaichao, please kindly connect us with Will. Thank you.
Alternatives
Change the kernel to support a larger number of sequences
Additional context
No response
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.