Closed CorentinvdBdO closed 1 year ago
As illustrated in our run_streaming_llama.py, the KV cache eviction occurs only before the prompt input and generation. This means the demo code isn't designed for single, long input samples.
However, for long text inputs with LLMs, you can refer to our perplexity evaluation script. There, we feed the text in and evict the KV cache token by token.
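To make the eviction policy concrete, here is a minimal sketch of the sink-plus-sliding-window idea: keep the first few "attention sink" tokens forever and evict from the middle so only the most recent tokens remain. This is illustrative only — the `SinkCache` class and its parameters are hypothetical stand-ins, not the repo's actual API, and real tensors would be per-layer key/value states rather than a Python list.

```python
# Hypothetical sketch of sink + sliding-window KV eviction (not the repo's API).
class SinkCache:
    def __init__(self, num_sink=4, window=8):
        self.num_sink = num_sink  # "attention sink" tokens kept permanently
        self.window = window      # most-recent tokens kept
        self.kv = []              # stand-in for per-layer key/value tensors

    def append(self, token_kv):
        self.kv.append(token_kv)
        # Token-by-token eviction: keep the first num_sink entries plus the
        # last `window` entries, dropping the oldest non-sink entry.
        if len(self.kv) > self.num_sink + self.window:
            self.kv = self.kv[:self.num_sink] + self.kv[-self.window:]

cache = SinkCache(num_sink=4, window=8)
for t in range(100):
    cache.append(t)

# The sinks (tokens 0-3) survive; everything else slid out of the window.
print(cache.kv)  # [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
```

Note this is also why the context window doesn't grow: the cache size is bounded at `num_sink + window` no matter how long the stream gets.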
As highlighted in our README's FAQ section, StreamingLLM doesn't enlarge the LLM context window. If you want to expand the context window, consider using a model like Llama-2-7B-32K-Instruct for your experiments.
Ok! Thank you for your answer, I knew it was too good to be true, still a great achievement!
Hijacking this thread (tell me if I need a separate issue for this): would adding more sink tokens similarly act as "state" once the sliding mechanism starts evicting tokens? I suspect the model would have to learn to use such a register cache during training. Similar to the ViT registers paper, one research direction could be looking for activation outliers and checking whether they disappear once registers/sinks are available. That could hopefully make quantization a bit easier! What exciting ideas! Amazing work!
EDIT: In case this wasn't clear, a sink cache plus sliding tokens, computed autoregressively, is similar to an RNN because of that "state". We've somehow circled back to carrying an RNN-style "hidden state" alongside the attention mechanism.
I naively tried adding examples to https://github.com/mit-han-lab/streaming-llm/blob/main/data/mt_bench.jsonl, including examples around 4k tokens long, without changing anything in the script. I receive:
Did I misunderstand "infinite-length inputs without sacrificing efficiency and performance. "?