I find that `SinkCache` does not shift the RoPE positional encoding of the new `key_states`. This leads to two problems:

1. The positional encoding is not consecutive across all the keys;
2. In later generation rounds, the positional encoding is not what we would expect.

Since I'm not quite sure about this, I'm raising this issue before making a PR. Please point it out if I'm wrong.
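To illustrate what "shifting" means here, the following is a minimal NumPy sketch (not the transformers implementation) of the half-split RoPE convention. Because RoPE rotations compose additively per frequency pair, a key rotated at absolute position 300 can be re-rotated by the delta -45 to land at the in-window position 255, which is what a StreamingLLM-style cache needs to do when tokens are evicted:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate a 1-D vector of even length to absolute position `pos`,
    using the half-split pairing (as in rotate_half-style RoPE)."""
    half = x.shape[-1] // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angle = pos * inv_freq
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin])

rng = np.random.default_rng(0)
key = rng.standard_normal(64)

# Re-rotating by the position delta is equivalent to rotating directly
# at the shifted position, since per-pair 2-D rotations compose additively.
shifted = rope_rotate(rope_rotate(key, 300), -45)
direct = rope_rotate(key, 255)
assert np.allclose(shifted, direct)
```

This is only meant to show why shifting the cached keys without also accounting for the position of the *new* key leaves the positions inconsistent.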
Expected behavior
It seems that after fixing this, the model behavior would be closer to the original StreamingLLM implementation, but I couldn't find a good way to test this.
I am very confused about how this could pass the test: https://github.com/huggingface/transformers/blob/317e069ee7f4d6d6595b1b03b5d9adcaede043e3/tests/utils/test_cache_utils.py#L336-L376 (it fails on my end, though the generated text is still reasonable). In this test, the `key_states` get position encodings unrelated to the cache size: say we have a `SinkCache` with window size 256, a key state may have RoPE position 300 (may be related to #32315). If we generate text manually so that the input position ids and cache positions are `None`, the key state positions will start from at most 257.
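A tiny sketch of the position mismatch described above (the helper name is hypothetical, not part of the transformers API): once the rolling cache is full, the newest key should land at a position bounded by the window length, not at the unbounded absolute step index.

```python
# Hypothetical helper illustrating the expected behavior under a
# StreamingLLM-style cache; this is NOT how transformers computes it.
def expected_new_key_position(step, window_length):
    # Before the cache fills up, positions grow normally; afterwards,
    # eviction keeps every in-cache position within the window.
    return min(step, window_length)

print(expected_new_key_position(100, 256))  # 100: cache not yet full
print(expected_new_key_position(300, 256))  # 256: capped at the window size
```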
System Info
transformers version: 4.44.2

Who can help?
@ArthurZucker, @zucchini-nlp
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
https://github.com/huggingface/transformers/blob/75b7485cc72bf7122094b943af9f7d26d69ff827/src/transformers/cache_utils.py#L985-L1006