huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

The Sink Cache looks weird #33691

Open why-in-Shanghaitech opened 3 days ago

why-in-Shanghaitech commented 3 days ago

System Info

Who can help?

@ArthurZucker, @zucchini-nlp

Information

Tasks

Reproduction

https://github.com/huggingface/transformers/blob/75b7485cc72bf7122094b943af9f7d26d69ff827/src/transformers/cache_utils.py#L985-L1006

I find that SinkCache does not shift the RoPE positional encoding of the new key_states. This leads to two problems (a small numeric sketch follows the list):

  1. The positional encodings are not consecutive across all the keys;
  2. In later generation rounds, the positional encodings are not what we expect.
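
To make point 1 concrete, here is a tiny numeric sketch of my mental model. This is illustrative only, not the transformers code; `window_length` and `tokens_seen` are just example values:

```python
# Illustrative only -- my mental model of the position bookkeeping, not the
# transformers code. Assumed setup: a SinkCache with window_length = 256
# (sink tokens included) after 300 tokens have been processed.
window_length = 256
tokens_seen = 300

# The keys kept in the cache are re-rotated backwards on eviction, so their
# RoPE positions stay inside the window: somewhere in 0 .. window_length - 1.
max_in_window_position = window_length - 1

# The incoming key_states, however, arrive already rotated with their absolute
# position (the current cache_position), which is outside that range.
new_key_position = tokens_seen

# Positions inside the cache therefore jump from <= 255 straight to 300,
# i.e. they are no longer consecutive.
print(max_in_window_position, new_key_position)  # 255 300
```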

Since I'm not entirely sure about this, I'm raising this issue before opening a PR. Please point it out if I'm wrong.

Expected behavior

It seems that fixing this would bring the model's behavior closer to the original StreamingLLM implementation, but I couldn't find a good way to test it.
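
If the fix goes in the direction of re-rotating the incoming key back into the cache window, a minimal sketch could look like the following. This is not the transformers API; `rerotate_key` is a hypothetical helper, and it assumes rotary cos/sin tables of shape `[max_positions, head_dim]` in the usual rotate_half layout:

```python
import torch

def rotate_half(x):
    """Standard RoPE helper: rotate half the hidden dims of the input."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rerotate_key(key_states, cos, sin, old_pos, new_pos):
    """Move an already-rotated key from RoPE position `old_pos` to `new_pos`.

    RoPE rotations compose additively, so applying a rotation by the angle
    difference (new_pos - old_pos) is enough. The difference is built from
    the cos/sin tables with the angle-subtraction identities, so no negative
    indexing is needed when shifting a key backwards.
    """
    delta_cos = cos[new_pos] * cos[old_pos] + sin[new_pos] * sin[old_pos]
    delta_sin = sin[new_pos] * cos[old_pos] - cos[new_pos] * sin[old_pos]
    return key_states * delta_cos + rotate_half(key_states) * delta_sin

# Example: shift a key that was rotated at absolute position 300 back to an
# in-window position such as 255:
#   key_in_window = rerotate_key(key_states, cos, sin, old_pos=300, new_pos=255)
```

As I read the linked update code, something equivalent is already applied to the keys that stay in the cache; the question is whether the incoming key_states need the same treatment.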

I am very confused about how it could pass this test: https://github.com/huggingface/transformers/blob/317e069ee7f4d6d6595b1b03b5d9adcaede043e3/tests/utils/test_cache_utils.py#L336-L376 (it fails on my end, though the generated text is still reasonable). In this test, the key_states end up with positional encodings unrelated to the cache size: with a SinkCache of window size 256, a key state may get RoPE position 300 (possibly related to #32315). If we generate text manually so that the input position ids and cache positions are None, the key state positions start from at most 257.

LysandreJik commented 3 days ago

cc @tomaarsen maybe :)