freyamom opened this issue 11 months ago
By the way, I have a question about how to use window attention with re-computation. What exactly needs to be re-computed? After reading the code, I also noticed that past_key_values are stored in streaming-llm. What is the difference between re-computation and this cached approach? And where can I find the code for the re-computation?
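Not from the repo itself, but a minimal plain-Python sketch of the conceptual difference being asked about here: "window attention with re-computation" rebuilds the KV state over the last W tokens at every step, while a rolling KV cache computes each token's K/V once and evicts the oldest entry. The toy functions below track only which positions are attended to, not real key/value tensors.

```python
# Conceptual sketch (not streaming-llm code): contrast the two decoding
# strategies for a fixed attention window of size W.

W = 4  # window size, illustrative only


def window_with_recomputation(tokens):
    """At each step, rebuild the KV state from scratch over the last W
    tokens. Every key/value in the window is recomputed each step, so the
    per-step cost grows with W squared."""
    steps = []
    for t in range(1, len(tokens) + 1):
        context = tokens[max(0, t - W):t]  # recompute KV for this whole slice
        steps.append(list(context))
    return steps


def window_with_kv_cache(tokens):
    """Keep a rolling KV cache: compute K/V once per token, then evict the
    oldest entry when the cache exceeds W. Per-step cost is linear in W."""
    cache, steps = [], []
    for tok in tokens:
        cache.append(tok)   # one new KV entry per step
        if len(cache) > W:
            cache.pop(0)    # evict the oldest entry
        steps.append(list(cache))
    return steps


toks = list(range(6))
print(window_with_recomputation(toks)[-1])  # [2, 3, 4, 5]
print(window_with_kv_cache(toks)[-1])       # [2, 3, 4, 5]
```

Both strategies attend to the same window at every step; the difference is purely the cost of rebuilding the keys/values versus reusing them.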
Hi!
Thank you for reaching out and expressing your interest in the streaming-llm's attention sink feature. You're right, when use_cache=True
is set, the model reuses past_key_values
to make subsequent inferences efficient.
If you're encountering unexpected outputs when trying to store and reuse past_key_values
on your own, there could be a discrepancy in how you handle it. To ensure you're using it correctly, please refer to the examples provided in our repository:
Guangxuan
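To illustrate the kind of discrepancy mentioned above: a common pitfall when reusing past_key_values by hand is feeding the full sequence again instead of only the new tokens. This is a toy simulation, not the repo's code or the real transformers API; the KV cache here is just the list of token ids the "model" has already encoded.

```python
# Illustrative sketch (not streaming-llm's implementation) of correct vs.
# incorrect reuse of a KV cache. toy_forward stands in for a call like
# model(input_ids, past_key_values=past, use_cache=True): it encodes only
# the ids it is given and appends them to the cache.


def toy_forward(new_ids, past=None):
    """Encode new_ids on top of the existing cache and return the updated
    cache (the analogue of the returned past_key_values)."""
    past = list(past) if past is not None else []
    past.extend(new_ids)
    return past


# Correct incremental use: prefill the prompt once, then feed exactly one
# new token per decoding step.
past = toy_forward([1, 2, 3])        # prefill
past = toy_forward([4], past=past)   # step 1: only the new token
past = toy_forward([5], past=past)   # step 2
print(past)  # [1, 2, 3, 4, 5]

# The common mistake: re-feeding the whole sequence alongside the cache.
# In a real model this duplicates positions and produces garbled output.
bad = toy_forward([1, 2, 3])
bad = toy_forward([1, 2, 3, 4], past=bad)
print(bad)  # [1, 2, 3, 1, 2, 3, 4]  <- context duplicated
```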
@Guangxuan-Xiao Thanks for your reply. I have another question, about the relationship between input_length and the KV cache size. For example, with input_length = 100 and KV cache size = 30, which of the following will streaming-llm inference do?
Which one is correct?
Sorry, mode 2. You should also add the attention sink for the first 4 tokens :)
Hi! The attention sink is an amazing idea for LLMs. I am confused about past_key_values in streaming-llm. In my understanding, past_key_values would be recomputed for every new input. But I noticed that past_key_values are stored in streaming-llm when use_cache is turned on. I tried my best to store past_key_values and reuse them for inference on new input, but the output came out very strange. In streaming-llm, however, the output is really good. I would really like to know what you did to make reusing past_key_values work. Thanks a lot!