thunlp / InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
MIT License

Implementation of `Stream` and `Infinite`? #22

Closed: liyucheng09 closed this issue 6 months ago

liyucheng09 commented 6 months ago

Congrats on the nice work.

I see that StreamingLLM and Infinite-LM are used in your experiments.

Have you developed your own implementations of Stream and Infinite? The original StreamingLLM implementation only covers the decoding phase and is therefore not suitable for prefilling. Do you have a version that supports prefilling?

Thanks!

guyan364 commented 6 months ago

Yes, we implemented both methods using multi-stage attention with two groups of QKV inputs.

For more details, you can check `inf_llm/attention/stream_llm.py` (StreamingLLM), `inf_llm/attention/infinite_lm.py` (Infinite-LM), and `inf_llm/attention/dot_production_attention/torch_impl.py` (multi-stage attention).
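In case it helps, here is a rough sketch of the merging idea, not the actual code in `torch_impl.py`: each K/V group (e.g. the retained initial/memory tokens and the local window) is attended to separately, and the partial outputs are combined with a log-sum-exp correction so that the result matches a single softmax over the concatenated keys. Names are illustrative only, and causal masking and head dimensions are omitted for brevity.

```python
import torch

def multi_stage_attention(q, kv_groups, scale):
    """Merge attention over several K/V groups (illustrative sketch).

    q: (batch, q_len, dim); kv_groups: iterable of (k, v) pairs,
    each (batch, kv_len, dim). Combining per-group results with a
    log-sum-exp correction is numerically equivalent to attending
    over all keys concatenated into one group.
    """
    out, lse = None, None
    for k, v in kv_groups:
        scores = torch.einsum("bqd,bkd->bqk", q, k) * scale
        group_lse = torch.logsumexp(scores, dim=-1, keepdim=True)  # (b, q, 1)
        group_out = torch.softmax(scores, dim=-1) @ v              # (b, q, d)
        if out is None:
            out, lse = group_out, group_lse
        else:
            new_lse = torch.logaddexp(lse, group_lse)
            out = out * torch.exp(lse - new_lse) + group_out * torch.exp(group_lse - new_lse)
            lse = new_lse
    return out
```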

To improve performance, we also implemented multi-stage flash attention with Triton. You can test it by setting `fattn: true` in your configuration YAML file when using infinite-lm / stream-llm / inf-llm attention.
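For illustration, a minimal config snippet might look like the following. Only the `fattn: true` flag is taken from the comment above; the surrounding field names are assumptions, so please check the YAML files shipped in the repo's `config/` directory for the exact schema.

```yaml
# Hedged example; field names other than `fattn` are assumptions.
model:
  type: inf-llm   # or stream-llm / infinite-lm
  fattn: true     # enable the Triton multi-stage flash attention
```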