mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

How to answer a question in the middle of a long input #38

Open yangzhj53 opened 9 months ago

yangzhj53 commented 9 months ago

I wonder how streaming-llm answers a question located in the middle of a long input. Specifically, what is the entire decoding process? When the model generates the first answer token, where do the tokens in the KV cache come from?
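
For context, the cache policy described in the paper keeps the first few tokens as "attention sinks" plus a sliding window of the most recent tokens, evicting everything in between. A minimal sketch of that eviction rule follows; the function name `evict_middle` and the defaults for `start_size`/`recent_size` are illustrative, not the repo's actual API:

```python
import torch

def evict_middle(past_key_values, start_size=4, recent_size=1020):
    """Keep the first `start_size` tokens (attention sinks) plus the
    last `recent_size` tokens; drop everything in between.

    `past_key_values` is a per-layer list of (key, value) pairs, each
    shaped [batch, num_heads, seq_len, head_dim] (HF convention).
    """
    seq_len = past_key_values[0][0].size(2)
    if seq_len <= start_size + recent_size:
        return past_key_values  # cache still fits; nothing to evict
    return [
        (
            torch.cat([k[:, :, :start_size], k[:, :, -recent_size:]], dim=2),
            torch.cat([v[:, :, :start_size], v[:, :, -recent_size:]], dim=2),
        )
        for k, v in past_key_values
    ]
```

Under this policy, when the first answer token is generated after a long prompt, the KV cache contains only the sink tokens and the most recent window, so any prompt tokens that fell in the evicted middle are no longer attended to.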