mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

How to answer a question in the middle of a long input #38

Open yangzhj53 opened 9 months ago

yangzhj53 commented 9 months ago

I wonder how streaming-llm answers a question located in the middle of a long input. Specifically, what is the entire decoding process? When the model generates the first answer token, where do the tokens in the KV cache come from?
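
For context, the cache policy described in the paper keeps the first few tokens as "attention sinks" plus a sliding window of the most recent tokens, evicting everything in between. A minimal sketch of that eviction rule follows; the function name `evict_middle` and the defaults for `start_size`/`recent_size` are illustrative, not the repo's actual API:

```python
import torch

def evict_middle(past_key_values, start_size=4, recent_size=1020):
    """Keep the first `start_size` tokens (attention sinks) plus the
    last `recent_size` tokens; drop everything in between.

    `past_key_values` is a per-layer list of (key, value) pairs, each
    shaped [batch, num_heads, seq_len, head_dim] (HF convention).
    """
    seq_len = past_key_values[0][0].size(2)
    if seq_len <= start_size + recent_size:
        return past_key_values  # cache still fits; nothing to evict
    return [
        (
            torch.cat([k[:, :, :start_size], k[:, :, -recent_size:]], dim=2),
            torch.cat([v[:, :, :start_size], v[:, :, -recent_size:]], dim=2),
        )
        for k, v in past_key_values
    ]
```

Under this policy, when the first answer token is generated after a long prompt, the KV cache contains only the sink tokens and the most recent window, so any prompt tokens that fell in the evicted middle are no longer attended to.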