mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Results for Section 3.2 Rolling KV Cache (Without Pretraining) #61


timljj commented 11 months ago

Hi,

Do you have any experimental results for attention sinks in the non-pre-training case? From what I read, all the results shown in the paper come from pre-training with attention sinks.

Additionally, did you ever test smaller cache sizes, such as 128? If I understood correctly, the model should not break with smaller cache sizes. A rough sketch of my understanding of the rolling cache is below.
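To make sure I'm reading Section 3.2 correctly: my understanding is that the rolling KV cache keeps the first few "attention sink" tokens plus a recent window and evicts everything in between. The sketch below is my own illustration (function and parameter names are mine, not the repo's API), with `start_size=4` and `recent_size=124` giving the 128-token budget I asked about.

```python
# Hypothetical sketch of the Section 3.2 rolling KV cache as I understand it:
# keep the first `start_size` "attention sink" tokens plus the most recent
# `recent_size` tokens, and evict the middle. Names here are my own.
import torch

def evict_middle(past_key_values, start_size=4, recent_size=124, seq_dim=2):
    """Trim each layer's (key, value) tensors to start_size + recent_size tokens."""
    seq_len = past_key_values[0][0].size(seq_dim)
    if seq_len <= start_size + recent_size:
        return past_key_values  # cache budget not exceeded yet
    trimmed = []
    for k, v in past_key_values:
        k_keep = torch.cat(
            [k.narrow(seq_dim, 0, start_size),
             k.narrow(seq_dim, seq_len - recent_size, recent_size)], dim=seq_dim)
        v_keep = torch.cat(
            [v.narrow(seq_dim, 0, start_size),
             v.narrow(seq_dim, seq_len - recent_size, recent_size)], dim=seq_dim)
        trimmed.append((k_keep, v_keep))
    return trimmed
```

Is this the mechanism the paper evaluates without any pre-training?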

Guangxuan-Xiao commented 11 months ago

We did not pre-train LLMs in most experiments; only Section 4.2 includes pre-training experiments. You can use StreamingLLM with off-the-shelf Llama models, just like our demo.
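Roughly, usage looks like the sketch below. This is a minimal sketch in the spirit of examples/run_streaming_llama.py: the checkpoint name and cache sizes are example values, the greedy-decoding loop is a simplification rather than the repo's generation code, and the exact import path and eviction call may differ from the current code.

```python
# Minimal sketch: StreamingLLM with an off-the-shelf Llama checkpoint.
# Checkpoint name, cache sizes, and the decoding loop are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from streaming_llm.enable_streaming_llm import enable_streaming_llm  # from this repo

model_name = "meta-llama/Llama-2-7b-chat-hf"  # example off-the-shelf checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto").eval()

# 4 attention-sink tokens + a recent window; the total cache budget stays fixed.
kv_cache = enable_streaming_llm(model, start_size=4, recent_size=2000)

input_ids = tokenizer("Tell me a long story.", return_tensors="pt").input_ids.to(model.device)
past_key_values = None
for _ in range(256):  # simplified greedy decoding loop
    with torch.no_grad():
        out = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
    # Evict middle tokens so the cache never grows past start_size + recent_size.
    past_key_values = kv_cache(out.past_key_values)
    input_ids = out.logits[:, -1:].argmax(dim=-1)
    print(tokenizer.decode(input_ids[0]), end="", flush=True)
```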

Guangxuan