mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

For LLMs already trained with window attention and BOS token #1

Closed GeneZC closed 12 months ago

GeneZC commented 1 year ago

Nice work!

I am wondering whether this attention sink magic is still needed for LLMs that have already been trained with window attention (e.g., Mistral). Even so, I still think the attention sink is the better approach, since it can be applied to almost any LLM, whether trained with window attention or not.

In particular, for Llama or other LLMs with a BOS token, the attention sink can be viewed as a soft version of hard truncation of the farthest tokens: the sink token behaves very much like the BOS token, and the position ids are properly reorganized. This makes me further question whether the attention sink would work well in long-context scenarios (e.g., LongEval). Although StreamEval seems to test long-context modeling ability, I do not understand why StreamingLLM can outperform dense attention when the context length lies between the cache size and the pretrained length.
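To make sure I am reading the mechanism correctly, here is a rough sketch of the cache policy I have in mind (purely illustrative; the function names and defaults are mine, not this repo's actual API):

```python
import torch

def evict_kv(past_key_values, num_sinks=4, window=1020):
    """Keep the first `num_sinks` (sink/BOS) tokens plus the most recent `window`
    tokens in every layer's KV cache; everything in between is dropped."""
    new_past = []
    for k, v in past_key_values:  # k, v: [batch, heads, seq_len, head_dim]
        seq_len = k.size(2)
        if seq_len <= num_sinks + window:
            new_past.append((k, v))
        else:
            k = torch.cat([k[:, :, :num_sinks], k[:, :, -window:]], dim=2)
            v = torch.cat([v[:, :, :num_sinks], v[:, :, -window:]], dim=2)
            new_past.append((k, v))
    return new_past

def position_ids_in_cache(cache_len, num_new):
    # Position ids are assigned by position *within the cache* rather than by
    # absolute position in the stream, so they never exceed num_sinks + window.
    return torch.arange(cache_len, cache_len + num_new).unsqueeze(0)
```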

BTW, I am not very certain what window attention with re-computation does, or why it works.
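To make my confusion concrete, here is my guess at what that baseline does (not code from the paper or this repo; `window_with_recompute` and the numbers are placeholders): for every newly generated token, the model is re-run from scratch over only the most recent window of tokens, with no KV cache carried across steps, which is why each step costs roughly O(L^2).

```python
import torch

@torch.no_grad()
def window_with_recompute(model, token_ids, window=1024, max_new_tokens=64):
    """Sliding window with re-computation (as I understand it): rebuild all KV
    states for the last `window` tokens at every decoding step."""
    generated = list(token_ids)
    for _ in range(max_new_tokens):
        ctx = torch.tensor([generated[-window:]]).to(model.device)  # truncate to the window
        logits = model(input_ids=ctx, use_cache=False).logits       # full recompute, no cache reuse
        next_id = int(logits[0, -1].argmax())
        generated.append(next_id)
    return generated
```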

freckletonj commented 1 year ago

:+1: for Mistral

tomaarsen commented 1 year ago

I think this should still work. I'm running some experiments using these attention sinks here: https://github.com/tomaarsen/attention_sinks. There you can load "Attention Sink"-adapted models like so:

```python
from attention_sinks import AutoModel

model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")
```
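Generation then goes through the regular transformers API. Something like the following should work (an untested sketch; for text generation you would load the causal-LM variant, which is assumed here to mirror transformers' class names):

```python
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM  # assumed to mirror transformers' AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")

inputs = tokenizer("Attention sinks let the model keep generating because", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```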

I hope to add support for Mistral in the coming days.

I'm quite excited about this line of work - my personal experiments match the findings from the paper! There's some information in #5.

Guangxuan-Xiao commented 1 year ago

I've found that @tomaarsen conducted an evaluation of StreamingLLM against the window attention baseline using Mistral-7B. It appears that even a model trained with sliding window attention still requires attention sinks for streaming.

[Figure: Mistral-7B evaluation, StreamingLLM (attention sinks) vs. the window attention baseline]

For more details, please see this reference: Attention Sinks in Transformers for Endless Fluent Generation.

As for why StreamingLLM surpasses dense attention when the input length lies between the cache size and the pre-training length, we are still investigating. One hypothesis is that LLMs might not fully leverage the extensive context provided to them, and in some instances a shorter context can even improve their performance. For further insights, please refer to the "Lost-in-the-Middle" paper and Table 6 in our paper.

Thank you, Guangxuan

sdc17 commented 12 months ago

Hi, thanks for sharing this impressive work!

> BTW, I am not very certain what window attention with re-computation does, or why it works.

Same question after carefully reading the paper. Any explanations or references that elaborate on this would be appreciated!

Guangxuan-Xiao commented 12 months ago

> Hi, thanks for sharing this impressive work!
>
> > BTW, I am not very certain what window attention with re-computation does, or why it works.
>
> Same question after carefully reading the paper. Any explanations or references that elaborate on this would be appreciated!

I provided a detailed explanation in https://github.com/mit-han-lab/streaming-llm/issues/33#issuecomment-1758597666. Please let me know if you have further questions!

sdc17 commented 11 months ago

> > Hi, thanks for sharing this impressive work!
> >
> > > BTW, I am not very certain what window attention with re-computation does, or why it works.
> >
> > Same question after carefully reading the paper. Any explanations or references that elaborate on this would be appreciated!
>
> I provided a detailed explanation in #33 (comment). Please let me know if you have further questions!

Truly helpful, thanks!