vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug] [BlockManagerV2]: Prefill for sliding window models can allocate more blocks than sliding window size #7470

Open sylviayangyy opened 1 month ago

sylviayangyy commented 1 month ago

Your current environment

The output of `python collect_env.py`:

```text
Your output of `python collect_env.py` here
```

🐛 Describe the bug

Hi there, I'm new to vLLM and may have missed something, but in BlockManagerV2 I only see the sliding window taken into account in the can_allocate function, as in the following snippet:

```python
def can_allocate(self, seq_group: SequenceGroup) -> AllocStatus:
    # other code ...
    if self.max_block_sliding_window is not None:
        num_required_blocks = min(num_required_blocks,
                                  self.max_block_sliding_window)
    # other code ...
```

But I don't see the sliding window considered anywhere when the allocation is actually performed. Is this by design or a potential bug? If it's by design, consider a scenario where the entire prompt requires 4 blocks but only 3 blocks are free. With max_block_sliding_window=3, can_allocate would report the allocation as OK, but when it comes to the actual allocation there wouldn't be enough space for the tokens in the 4th block. Is this a known issue, or is it handled somewhere else?
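To make the mismatch concrete, here is the arithmetic of that scenario as a minimal sketch (the block size and prompt length are made-up numbers, chosen so the prompt needs exactly 4 blocks):

```python
# Made-up numbers illustrating the scenario above: the prompt needs
# 4 blocks, the sliding window caps the "required" count at 3, and
# exactly 3 blocks happen to be free.
block_size = 16
prompt_len = 64
num_required_blocks = prompt_len // block_size        # 4
max_block_sliding_window = 3
num_free_blocks = 3

# can_allocate caps the requirement at the sliding-window size ...
capped = min(num_required_blocks, max_block_sliding_window)  # 3
assert capped <= num_free_blocks                      # reported as OK

# ... but the actual allocation builds a block table for the whole
# prompt, which still needs 4 physical blocks -- one more than is free.
assert num_required_blocks > num_free_blocks
```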

Any help would be greatly appreciated! 😊

youkaichao commented 1 month ago

cc @cadedaniel @alexm-neuralmagic

cadedaniel commented 1 month ago

The logic you're looking for resides in the block table: https://github.com/vllm-project/vllm/blob/00c3d68e45bad901989c1afe3c223225dc9a5d6d/vllm/core/block/block_table.py#L132-L142

The TL;DR is that it drops blocks at the beginning of the context once the context length exceeds the sliding window.
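Roughly, a simplified sketch of what those lines do (not the actual vLLM source; `null_block` and `free()` are illustrative stand-ins for the allocator's real interface):

```python
# Simplified sketch of the linked block_table.py behavior: any block
# that now lies entirely before the sliding window is released, and a
# shared placeholder keeps the block table's indexing intact.
# `null_block` and `block.free()` are illustrative stand-ins.
def drop_out_of_window_blocks(blocks, max_block_sliding_window, null_block):
    # Index of the first block still (at least partially) in the window.
    first_live_idx = max(0, len(blocks) - max_block_sliding_window)
    for idx in range(first_live_idx):
        if blocks[idx] is not null_block:
            blocks[idx].free()
            blocks[idx] = null_block
```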

sylviayangyy commented 1 month ago

> The logic you're looking for resides in the block table: https://github.com/vllm-project/vllm/blob/00c3d68e45bad901989c1afe3c223225dc9a5d6d/vllm/core/block/block_table.py#L132-L142
>
> The TL;DR is that it drops blocks at the beginning of the context once the context length exceeds the sliding window.

Hi @cadedaniel, thanks for your reply! Yeah, I did notice that part before, but it's in the append_token_ids function, right? I've gone through the related logic and found just one call path:

Scheduler._schedule_running() / Scheduler._schedule_swapped() -> Scheduler._append_slots() -> BlockManagerV2.append_slots() -> BlockTable.append_token_ids()

However, I didn't see any direct or indirect call to append_token_ids() on the prefill path: _schedule_prefills calls BlockManagerV2.allocate() instead of BlockManagerV2.append_slots(). That's why I'm a bit confused and gave the example scenario in my first comment. If I misunderstood anything, please feel free to point it out ~

cadedaniel commented 1 month ago

I see. What you're looking for is how to prefill a prompt with a sliding window while limiting the number of required blocks to the sliding-window size.

Currently this is not supported in vLLM, in either the v1 or v2 block manager. If this use case is important to you, please file a new issue.

sylviayangyy commented 1 month ago

> I see. What you're looking for is how to prefill a prompt with a sliding window while limiting the number of required blocks to the sliding-window size.
>
> Currently this is not supported in vLLM, in either the v1 or v2 block manager. If this use case is important to you, please file a new issue.

Yes, that is exactly what I'm looking for. Then I still wonder: why doesn't vLLM support this?

sylviayangyy commented 1 month ago

Besides, I think this part of BlockManagerV1 implements prefilling prompts with a sliding window? https://github.com/vllm-project/vllm/blob/fc93e5614374688bddc432279244ba7fbf8169c2/vllm/core/block_manager_v1.py#L304-L309
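If I'm reading those lines right, they do roughly the following (a simplified sketch of my understanding, not the actual source; `allocate_block` stands in for the allocator call):

```python
# Simplified sketch of my reading of the linked BlockManagerV1 lines:
# logical block i at or beyond the window wraps back onto physical
# block i % block_sliding_window, so the table never holds more than
# `block_sliding_window` distinct physical blocks during prefill.
# `allocate_block` is an illustrative stand-in for the allocator call.
def allocate_prompt_blocks(num_logical_blocks, block_sliding_window,
                           allocate_block):
    block_table = []
    for logical_idx in range(num_logical_blocks):
        if (block_sliding_window is not None
                and logical_idx >= block_sliding_window):
            # Re-use an earlier physical block instead of a new one.
            block_table.append(
                block_table[logical_idx % block_sliding_window])
        else:
            block_table.append(allocate_block())
    return block_table
```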

cadedaniel commented 4 weeks ago

> Besides, I think this part of BlockManagerV1 implements prefilling prompts with a sliding window?

Oh, you're totally right. Good catch.

Unfortunately, I think a fix is nontrivial, because block manager v1 relies on poorly defined behavior to accomplish this: during prefill of a sliding-window model, it breaks the immutability of a previously completed block. So the way to fix this is to find a way to encode the same behavior without breaking the immutability of previously written blocks.

I feel the most straightforward way to solve this is a SlidingWindowBlockTable that encapsulates the complexity. Internally, it would manage the rotation of immutable complete blocks into mutable incomplete blocks, and manage the interaction with prefix caching. That last part needs more thought; perhaps the promotion behavior once a block is full would be modified. I'd need to think more to have a better answer, but I'm happy to review proposals for how to do this. One pointer for the design is here: https://github.com/vllm-project/vllm/pull/3492
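For illustration only, here's a very rough sketch of the shape such a class might take (hypothetical names throughout, not an existing vLLM API; the prefix-caching interaction is left as a comment because that's the part needing the most thought):

```python
# Hypothetical sketch of the SlidingWindowBlockTable idea above.
# Class, method, and allocator names are illustrative, not vLLM's API.
class SlidingWindowBlockTable:
    """Caps live physical blocks at `max_blocks` by rotating: once the
    window slides past the oldest block, it is freed (rather than kept
    as an immutable block) and a fresh mutable block takes over."""

    def __init__(self, allocator, max_blocks, block_size):
        self._allocator = allocator
        self._max_blocks = max_blocks
        self._block_size = block_size
        self._blocks = []  # live physical blocks, oldest first

    def append_token_ids(self, token_ids):
        # Partial-block handling omitted for brevity: each chunk is
        # assumed to start a fresh block.
        for chunk in _chunks(token_ids, self._block_size):
            if len(self._blocks) == self._max_blocks:
                # Rotation point: release the oldest block so prefill
                # never holds more than `max_blocks` physical blocks.
                # The interaction with prefix caching (promotion of
                # full blocks) would need to be resolved here.
                self._allocator.free(self._blocks.pop(0))
            block = self._allocator.allocate_mutable()
            block.append_token_ids(chunk)
            self._blocks.append(block)


def _chunks(seq, size):
    # Yield consecutive `size`-length slices of `seq`.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]
```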

sylviayangyy commented 3 weeks ago

Got it, I'll study your proposal for the SlidingWindowBlockTable and share my design if I make any progress. Thank you for your helpful responses!

cadedaniel commented 3 weeks ago

Sounds good. Also happy to schedule a call if it helps you with your design.