Do you have any experimental results for attention sinks in the non-pre-training case? From what I read, all the results shown in the paper are from pre-training with attention sinks.
Additionally, did you ever test smaller cache sizes like 128? If I understood correctly, the model should not break with smaller cache sizes, right?
We did not pre-train LLMs in most of our experiments; only Section 4.2 includes pre-training experiments. You can use StreamingLLM with off-the-shelf Llama models, just as in our demo.
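For intuition on why smaller caches still work, the core mechanism is just a start+recent eviction policy on the KV cache: keep the first few tokens (the attention sinks) plus the most recent tokens, and drop everything in between. Below is a minimal sketch of that policy, not the repo's actual API; the function name `evict_kv`, the chosen sizes (4 sinks + 124 recent, for a total cache of 128), and the legacy tuple-of-tensors cache layout are assumptions for illustration.

```python
import torch

def evict_kv(past_key_values, start_size=4, recent_size=124):
    """Keep the first `start_size` tokens (attention sinks) plus the
    `recent_size` most recent tokens; evict everything in between.

    Assumes `past_key_values` is a tuple of (key, value) pairs per layer,
    each shaped [batch, heads, seq_len, head_dim]."""
    seq_len = past_key_values[0][0].size(2)
    if seq_len <= start_size + recent_size:
        return past_key_values  # cache still fits, nothing to evict
    return tuple(
        (
            torch.cat([k[:, :, :start_size], k[:, :, -recent_size:]], dim=2),
            torch.cat([v[:, :, :start_size], v[:, :, -recent_size:]], dim=2),
        )
        for k, v in past_key_values
    )
```

Applied after every decoding step, this keeps the cache bounded at `start_size + recent_size` tokens regardless of how long the stream gets, which is why the model should not break when the budget is reduced (though quality depends on how much recent context the task needs).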