Hello,
Thank you for your thoughtful questions. Let's delve into the details:
1. Regarding StreamingLLM processing with truncation vs. re-computation:
2. Comparison between dense attention at 16k and StreamingLLM at 4k:
I hope this helps answer your questions.
Guangxuan
@Guangxuan-Xiao Got it, thank you for the answer. By the way, I hope to see some quantitative results for the gap in question 2. : )
Thank you for your nice work. I have read issue #33; thank you for your patient explanation of the difference between StreamingLLM and dense attention. Based on your answer, I have further questions:
As you mentioned in FAQ.3, I guess StreamingLLM processes long input in a truncation manner. But if I don't mind the expensive time cost and handle the long input text in a StreamingLLM manner (re-computation from the very beginning of the text to the end, window by window, with attention sinks; see the sketch below), will it theoretically perform better than truncation? Did you perform any experiments on this?
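To make the re-computation variant concrete, here is a minimal sketch of the cache policy I have in mind (illustrative only, not the repo's actual API): keep the first few attention-sink tokens plus a recent window, and evict everything in between as the text is processed window by window.

```python
import torch

def evict_kv_cache(keys, values, n_sink=4, window=4092):
    """Keep the first `n_sink` tokens (attention sinks) plus the most recent
    `window` tokens of the KV cache; drop everything in between.

    keys/values: tensors of shape [batch, heads, seq_len, head_dim].
    """
    seq_len = keys.size(-2)
    if seq_len <= n_sink + window:
        return keys, values  # nothing to evict yet
    keep = torch.cat([
        torch.arange(n_sink, device=keys.device),                     # sinks
        torch.arange(seq_len - window, seq_len, device=keys.device),  # recent
    ])
    return keys.index_select(-2, keep), values.index_select(-2, keep)
```

Truncation, by contrast, would feed only the last `n_sink + window` tokens once; under re-computation, the surviving window's KV states were built while attending to the earlier (now evicted) context, which is where a quality difference could come from.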
For a model with a large context size (say 16k), we can still run StreamingLLM with a shorter context size (say 4k). Have you performed a comparison between dense attention at 16k and StreamingLLM at 4k? Theoretically, streaming-4k should perform worse than dense-16k, but what is the gap? This matters if one wants to use StreamingLLM to approximate the performance of a larger window; a sketch of how I would measure it follows.
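For what it's worth, the gap could be quantified with something like the sketch below: mean per-token negative log-likelihood over a long text, decoding one token at a time with the evicted 4k cache (re-using `evict_kv_cache` from above) versus one dense forward pass at 16k. This assumes the legacy tuple KV-cache format in transformers, and it does not replicate StreamingLLM's re-assignment of positions within the cache, so it is only an approximation.

```python
import torch

@torch.no_grad()
def streaming_nll(model, input_ids, n_sink=4, window=4092):
    """Mean per-token NLL when decoding one token at a time with a
    StreamingLLM-style evicted cache (uses `evict_kv_cache` from above)."""
    past, nlls = None, []
    for t in range(input_ids.size(1) - 1):
        out = model(input_ids[:, t : t + 1], past_key_values=past, use_cache=True)
        log_probs = out.logits[:, -1].log_softmax(-1)
        nlls.append(-log_probs[0, input_ids[0, t + 1]])
        # Evict after every step so the cache never exceeds n_sink + window.
        past = tuple(
            evict_kv_cache(k, v, n_sink, window) for k, v in out.past_key_values
        )
    return torch.stack(nlls).mean()

# Dense baseline: one full forward pass over the same (<=16k) input with no
# eviction; exp(NLL) gives perplexity, and streaming_ppl - dense_ppl is the gap.
```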