mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

How do you feed long texts to a model? #2

Closed CorentinvdBdO closed 1 year ago

CorentinvdBdO commented 1 year ago

I naively tried to add examples to https://github.com/mit-han-lab/streaming-llm/blob/main/data/mt_bench.jsonl, including examples around 4k tokens long, without changing anything in the script. I get:

```
ASSISTANT: Token indices sequence length is longer than the specified maximum sequence length for this model (3905 > 2048). Running this sequence through the model will result in indexing errors
- - - - - - - - - d - d - d - d - … [the generation then degenerates into a long run of garbled single-character tokens]
```

Did I misunderstand "infinite-length inputs without sacrificing efficiency and performance"?

Guangxuan-Xiao commented 1 year ago

As illustrated in our run_streaming_llama.py, KV cache eviction happens only before each prompt is fed in and its response is generated, not within a single prompt. This means the demo code isn't designed for single, long input samples.
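
For concreteness, here is a toy, self-contained sketch of that schedule (invented names and fake string "tokens", not the demo's actual code): the cache is trimmed once per round, before the next prompt enters, so a single over-long prompt is never evicted while it is being prefilled.

```python
# Toy schedule of the chat demo (invented names, fake string tokens): eviction runs
# once per round, before the next prompt is appended; nothing is evicted while a
# single prompt is being prefilled or while its answer is being generated.
N_SINK, N_RECENT = 4, 12          # tiny cache budget so the effect is easy to see

def evict(cache):
    """Keep the N_SINK oldest entries plus the N_RECENT newest ones."""
    if len(cache) <= N_SINK + N_RECENT:
        return cache
    return cache[:N_SINK] + cache[-N_RECENT:]

cache = []
for rnd, (prompt_len, gen_len) in enumerate([(10, 8), (60, 8), (25, 8)]):
    cache = evict(cache)                      # <- the only point where eviction runs
    cache += [f"p{rnd}"] * prompt_len         # prefill: the whole prompt enters at once
    cache += [f"g{rnd}"] * gen_len            # generation: new tokens are appended
    print(f"after round {rnd}: cache size = {len(cache)}")
# If a single prompt is longer than the model's context window, nothing trims it
# during prefill -- which is what triggers the warning shown above.
```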

However, for long text inputs with LLMs, you can refer to our perplexity evaluation script, where we feed the text in and evict the KV cache token by token.
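
In rough outline, that loop looks like the sketch below (a simplified stand-in, not the eval script's actual code: the checkpoint name and cache sizes are placeholders, it assumes the legacy tuple `past_key_values` format, and it omits the rotary-position re-indexing StreamingLLM applies inside the cache after eviction):

```python
# Simplified sketch of streaming perplexity evaluation with an attention-sink cache.
# NOTE: assumes a transformers version that still returns/accepts the legacy tuple
# KV-cache format; the rotary-position re-indexing used by StreamingLLM is omitted.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"      # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
).eval()

n_sink, recent = 4, 1020                     # cache budget: 4 sinks + 1020 recent tokens
long_text = "your very long document here"   # replace with real text of any length
input_ids = tokenizer(long_text, return_tensors="pt").input_ids.to(model.device)

past_key_values = None
nlls = []
with torch.no_grad():
    for i in range(input_ids.size(1) - 1):
        # Feed exactly one token, reusing the bounded KV cache.
        out = model(input_ids[:, i : i + 1],
                    past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = out.past_key_values
        # Score the next token under the streaming cache.
        nlls.append(F.cross_entropy(out.logits[:, -1, :].float(),
                                    input_ids[:, i + 1]))
        # Evict the middle: keep the first n_sink and last `recent` entries per layer.
        seq_len = past_key_values[0][0].size(2)
        if seq_len > n_sink + recent:
            past_key_values = tuple(
                (torch.cat([k[:, :, :n_sink], k[:, :, -recent:]], dim=2),
                 torch.cat([v[:, :, :n_sink], v[:, :, -recent:]], dim=2))
                for k, v in past_key_values
            )

print("streaming ppl:", torch.exp(torch.stack(nlls).mean()).item())
```

The point is that the cache never grows past `n_sink + recent` entries, so memory stays bounded no matter how long the input is.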

As highlighted in our README's FAQ section, StreamingLLM doesn't enlarge the LLM context window. If you want to expand the context window, consider using a model like Llama-2-7B-32K-Instruct for your experiments.

CorentinvdBdO commented 1 year ago

OK! Thank you for your answer. I knew it was too good to be true, but it's still a great achievement!

gembancud commented 1 year ago

Hijacking this thread (tell me if I need a separate issue for this): would adding more sink tokens similarly act as "state" once the sliding mechanism starts evicting tokens? I bet the model would have to learn to use the register cache during training. Similar to the ViT registers paper, a research direction could be to look for activation outliers and check whether they are similarly removed once registers/sinks are available; that could hopefully make quantization a bit easier. What exciting ideas! Amazing work!

EDIT: In case this wasn't clear: a sink cache plus a sliding window of tokens, computed autoregressively, is similar to an RNN because of the "state". We've somehow circled back to having an RNN-like "hidden state" alongside the attention mechanism.
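
To make the analogy concrete, here is a toy sketch (my own, not the repo's code) of which positions survive eviction: however long the stream gets, the retained cache is the same fixed-size set of sinks plus a recent window, much like a constant-size hidden state.

```python
# Toy sketch of the "sink cache + sliding window as state" view: no matter how long
# the stream gets, the retained KV positions form a fixed-size set (sinks + window),
# analogous to an RNN's constant-size hidden state.
def kept_positions(seq_len: int, n_sink: int = 4, window: int = 8) -> list[int]:
    """Token positions still present in the KV cache after eviction."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

for seq_len in (10, 100, 10_000):
    kept = kept_positions(seq_len)
    print(seq_len, len(kept), kept[:5], "...", kept[-3:])
# The cache saturates at n_sink + window = 12 entries, however long the stream runs.
```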