Leo-T-Zang opened 3 days ago
Is there any suggestion? I'd really appreciate any ideas on how to solve this problem (@drisspg @Chillee).
For prefill it might be worth just regenerating the block mask. But in general, indexing at a position just gives you the block mask corresponding to that position. I believe we also support using a sequence smaller than the BlockMask with the BlockMask. So if you generate a BlockMask with S=2048, for example, you can pass in a sequence of length 1001.
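Roughly, reusing a larger BlockMask with a shorter sequence looks like this; the causal `mask_mod`, shapes, and dtypes below are just placeholders for illustration:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, H, S_MAX, D = 1, 8, 2048, 64

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# Build the BlockMask once for the maximum sequence length.
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S_MAX, KV_LEN=S_MAX)

# Prefill with a shorter prompt (e.g. 1001 tokens) while reusing the same BlockMask.
seq_len = 1001
q = torch.randn(B, H, seq_len, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, seq_len, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, seq_len, D, device="cuda", dtype=torch.float16)
out = flex_attention(q, k, v, block_mask=block_mask)
```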
> For prefill it might be worth just regenerating the block mask. But in general, indexing at a position just gives you the block mask corresponding to that position. I believe we also support using a sequence smaller than the BlockMask with the BlockMask. So if you generate a BlockMask with S=2048, for example, you can pass in a sequence of length 1001.
Thank you so much for your reply!
I understand that with a BlockMask built for S=2048 we can process a shorter sequence (e.g., length 1000) during the prefill stage. I'm sorry if I did not make that clear in the original comment; the issue I'm facing is during the decoding stage:
During autoregressive decoding, we typically generate one token at a time and need to compute attention scores for a single position (e.g., the 1001st token).
To do this efficiently with a KV cache, we need to slice the BlockMask to get the exact row corresponding to the current position ID, so we can compute attention against the cached K and V.
The challenge is that the BlockMask can only be sliced in 128-token blocks, not at individual rows:

```python
# Slicing in units of 128-token blocks along the query dimension
q_slice = torch.arange(0, position_id // 128 + 1)
block_mask = block_mask[:, :, q_slice]

# What I actually need: the single row for the current position,
# which does not seem to be supported
block_mask = block_mask[:, :, position_id]
```
How can we implement a KV cache in this scenario, given that we can't slice individual rows from the BlockMask?
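For context, this is roughly what I imagine a decode step would look like if I regenerate a single-row BlockMask every step; the offset-based `mask_mod` and the shapes here are just my guess at a workaround, not something I've confirmed:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def decode_step(q, k_cache, v_cache, position_id):
    # q: [B, H, 1, D] -- the single new token at `position_id`
    # k_cache / v_cache: [B, H, S_max, D] with positions [0, position_id] filled
    kv_len = position_id + 1

    # Rebuild a one-row BlockMask for this step. q_idx is always 0 here,
    # so offset it by the absolute position inside the mask logic.
    def causal_at_pos(b, h, q_idx, kv_idx):
        return (q_idx + position_id) >= kv_idx

    block_mask = create_block_mask(
        causal_at_pos, B=None, H=None, Q_LEN=1, KV_LEN=kv_len
    )
    return flex_attention(
        q, k_cache[:, :, :kv_len], v_cache[:, :, :kv_len], block_mask=block_mask
    )
```

I realize rebuilding the mask on every token is probably expensive, which is part of why I'm asking whether there is a better way.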
Additionally, could you help with the error I get when setting `H=H` versus `H=None`?
Thanks!
Is there any example code for doing this? Should I generate a new BlockMask every time?
Thanks!
Essentially, I have a problem slicing the BlockMask. For example, suppose we have a prompt of 1000 tokens (prefill stage); I wrote attention code for this, which may well be wrong. My question is: if I then need to generate the 1001st token (a single token as Q), how do I slice the exact position out of the BlockMask for it?
Another question: if I use a prefix mask for the prompt tokens, it works when I set `H=None`, but it raises errors when I set `H=H`.
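To make the `H=None` vs `H=H` question concrete, my setup looks roughly like this; the `prefix_length` values and shapes are simplified placeholders, not my exact code:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

B, H, S = 2, 8, 2048
prefix_length = torch.tensor([1000, 512], device="cuda")  # per-batch prefix sizes

def prefix_mask(b, h, q_idx, kv_idx):
    # Bidirectional attention inside the prefix, causal afterwards
    return (kv_idx < prefix_length[b]) | (q_idx >= kv_idx)

# This variant works for me:
block_mask = create_block_mask(prefix_mask, B=B, H=None, Q_LEN=S, KV_LEN=S)

# This variant raises the errors mentioned above:
# block_mask = create_block_mask(prefix_mask, B=B, H=H, Q_LEN=S, KV_LEN=S)
```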