Yes, we implemented both methods using multi-stage attention with two groups of QKV inputs.
For more details, you can check inf_llm/attention/stream_llm.py (StreamingLLM), inf_llm/attention/infinite_lm.py (InfiniteLM), and inf_llm/attention/dot_production_attention/torch_impl.py (multi-stage attention).
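For intuition, here is a minimal PyTorch sketch (not the repository's actual code) of how attention over two K/V groups, e.g. global/initial tokens plus a local window, can be computed in separate stages and then merged with their softmax normalizers so the result matches attention over the concatenated keys:

```python
import torch

def two_stage_attention(q, k1, v1, k2, v2, scale=None):
    """Hypothetical sketch of multi-stage attention over two K/V groups.

    q: (batch, heads, q_len, dim); k*/v*: (batch, heads, kv_len, dim).
    Each group is attended independently; the partial outputs are then
    combined using their log-sum-exp normalizers, which reproduces full
    softmax attention over the concatenation of both key groups.
    """
    if scale is None:
        scale = q.shape[-1] ** -0.5

    def partial(k, v):
        scores = torch.matmul(q, k.transpose(-1, -2)) * scale   # (b, h, q, kv)
        lse = torch.logsumexp(scores, dim=-1, keepdim=True)     # stage normalizer
        out = torch.matmul(torch.softmax(scores, dim=-1), v)    # stage output
        return out, lse

    out1, lse1 = partial(k1, v1)
    out2, lse2 = partial(k2, v2)

    # Re-weight each stage by its share of the combined softmax mass.
    max_lse = torch.maximum(lse1, lse2)
    w1 = torch.exp(lse1 - max_lse)
    w2 = torch.exp(lse2 - max_lse)
    return (out1 * w1 + out2 * w2) / (w1 + w2)
```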
To improve performance, we implemented multi-stage flash attention with Triton. You can try it by setting fattn: true in your configuration YAML file when using the infinite-lm / stream-llm / inf-llm attention types.
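As a rough illustration, a config entry might look like the snippet below; only fattn: true is taken from the answer above, and the surrounding keys are placeholders, so please check the example YAML files in the repository for the exact schema:

```yaml
model:
  type: inf-llm   # or stream-llm / infinite-lm (placeholder key names)
  fattn: true     # use the Triton multi-stage flash attention kernel
```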
Congrats on the nice work.
I see streamingLLM and InfiniteLLM are used in your experiments.
Have you developed your own implementations for stream and infinite? The original StreamingLLM is only designed for the decoding phase and thus not suitable for prefilling. Do you have a prefilling version of the implementation? Thanks!