Closed: Tomorrowdawn closed this issue 8 hours ago
Supplementary:
It works with DynamicCache.
So the problem must lie in SinkCache or the related control code.
cc @gante @ArthurZucker
Have not worked on the sink cache so will let @gante answer here!
In cache_utils.py, I noticed that

`keys_to_keep = self.key_cache[layer_idx][:, :, -self.window_length + self.num_sink_tokens + key_states.shape[-2] :]`

might go wrong when `-self.window_length + self.num_sink_tokens + key_states.shape[-2] >= 0`. Not sure if it's relevant.
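A minimal sketch (plain Python lists, hypothetical sizes) of the concern above: a slice start that is meant to stay negative silently changes meaning once it crosses zero, because a non-negative start is interpreted as an absolute index rather than "keep nothing".

```python
# Hypothetical sizes for illustration; `cache` stands in for the sequence
# dimension of key_cache[layer_idx].
window_length, num_sink_tokens = 8, 4
cache = list(range(10))

def keys_to_keep(new_len):
    # Mirrors the slice start in the SinkCache update path.
    start = -window_length + num_sink_tokens + new_len
    return cache[start:]

print(keys_to_keep(1))  # start = -3 -> keeps the last 3 entries, as intended
print(keys_to_keep(5))  # start = +1 -> keeps 9 entries, though none should survive
```

With `new_len = 5` the intended number of surviving non-sink keys is `window_length - num_sink_tokens - new_len = -1`, i.e. none, yet the positive slice start keeps almost everything.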
It's been a bit since I worked on this, but I think that `-self.window_length + self.num_sink_tokens + key_states.shape[-2] >= 0` is not really possible:

- `window_length` is the maximum size of the cache, e.g. 1024.
- `num_sink_tokens` is some (usually small) positive integer, e.g. 4.
- `key_states.shape[-2]` is the size of the new additions to the cache.

In the code here: https://github.com/huggingface/transformers/blob/b72752f06830cb6cf8d21c284f68e15faa100c4d/src/transformers/cache_utils.py#L703-L706

We're in the "Shifting cache" phase, i.e. the cache already exists and we're adding enough tokens to make it overflow. However, if the cache already exists, then I think (I'm not 100% on this) we always add exactly 1 new generated token, i.e. `key_states.shape[-2]` is 1. So a non-negative value can only happen if `num_sink_tokens >= window_length - 1`, which is not normal behaviour.
However, if it's somehow possible, when the cache already exists, to add a bunch of tokens in one go, then I think it would be possible to mess this up. In that case `keys_to_keep` should really be empty (we're skipping way ahead and keeping no tokens), but the overflow of `-self.window_length + self.num_sink_tokens + key_states.shape[-2]` into the positives allows some keys to stay. The new tokens then get appended and we accidentally get a cache that's too large here: https://github.com/huggingface/transformers/blob/b72752f06830cb6cf8d21c284f68e15faa100c4d/src/transformers/cache_utils.py#L724

But that should probably cause an easy-to-spot crash, since the cache would then be bigger than the window size, which should not be possible.
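If that multi-token case can actually be reached, a guard along these lines would keep anything from surviving the shift (a hypothetical illustration using plain lists, not the library's actual code):

```python
def keys_to_keep(cache, window_length, num_sink_tokens, new_len):
    # Hypothetical guard: when the computed start is non-negative, the new
    # tokens skip past the whole window, so no non-sink keys should survive.
    start = -window_length + num_sink_tokens + new_len
    if start >= 0:
        return []  # keep nothing instead of wrapping into a positive index
    return cache[start:]

cache = list(range(10))
print(len(keys_to_keep(cache, 8, 4, 1)))  # 3: the usual single-token case
print(len(keys_to_keep(cache, 8, 4, 5)))  # 0 instead of an oversized cache
```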
System Info

`transformers` version: 4.41.0

Who can help?

No response

Information

Tasks

`examples` folder (such as GLUE/SQuAD, ...)

Reproduction
dataset: I concatenate all 'document' text to build a streaming task (to test StreamingLLM). The code is trivial but long, so it's omitted.

model: LlamaForCausalLM; weights: llama2-7b-hf

core run code: the stream object produces 100 tokens per iteration, like a list (containing many tokens).
I plotted the attention scores; however, their strictly upper-triangular part is not zero.
sink num = 0 (local window):

sink num = 16 (for StreamingLLM):
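A small sketch of the check being described, assuming the attention scores are available as a 2-D `[seq, seq]` array (e.g. captured with `output_attentions=True`): with correct causal masking, the strictly upper-triangular part should be exactly zero.

```python
import numpy as np

# `attn` stands in for one head's [seq, seq] attention-score matrix;
# here we build a correctly masked example with np.tril.
attn = np.tril(np.random.rand(5, 5))
upper = np.triu(attn, k=1)        # strictly upper-triangular part
print(bool(np.all(upper == 0)))   # True when causal masking is applied
```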
Expected behavior
The attention score matrix for prompt len = 0 (no KV cache) is correct: