vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Enabling MSS for larger number of sequences (>256) #9164

Open kushanam opened 1 month ago

kushanam commented 1 month ago

🚀 The feature, motivation and pitch

In advance_step.cu, the number of sequences is constrained by the number of available GPU threads and the block_tables stride:

// TODO(will): support arbitrary block_tables stride
if ((blocks * threads) / block_tables.stride(0) < num_queries) {
  TORCH_CHECK(false, ...);  // error message elided in the original quote
}
This prevents the larger batch sizes for which CUDA graphs were recently enabled, which hurts performance significantly on H100/H200 machines with models like Llama-70B. We would also like to help; @youkaichao, please kindly connect us with Will. Thank you.

Alternatives

Change the kernel to support a larger number of sequences.

Additional context

No response


youkaichao commented 1 month ago

@SolitaryThinker can you please take a look and see if it is possible to remove the constraint?

kushanam commented 4 weeks ago

@pavanimajety