Yes, we implemented both methods using multi-stage attention with two groups of QKV inputs.
For more details, you can check inf_llm/attention/stream_llm.py (StreamingLLM), inf_llm/attention/infinite_lm.py (InfiniteLM), and inf_llm/attention/dot_production_attention/torch_impl.py (multi-stage attention).
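For intuition, here is a minimal PyTorch sketch (not the repository's actual code) of how attention over two K/V groups, e.g. global/initial tokens plus a local window, can be computed in separate stages and then merged with their softmax normalizers so the result matches attention over the concatenated keys:

```python
import torch

def two_stage_attention(q, k1, v1, k2, v2, scale=None):
    """Hypothetical sketch of multi-stage attention over two K/V groups.

    q: (batch, heads, q_len, dim); k*/v*: (batch, heads, kv_len, dim).
    Each group is attended independently; the partial outputs are then
    combined using their log-sum-exp normalizers, which reproduces full
    softmax attention over the concatenation of both key groups.
    """
    if scale is None:
        scale = q.shape[-1] ** -0.5

    def partial(k, v):
        scores = torch.matmul(q, k.transpose(-1, -2)) * scale   # (b, h, q, kv)
        lse = torch.logsumexp(scores, dim=-1, keepdim=True)     # stage normalizer
        out = torch.matmul(torch.softmax(scores, dim=-1), v)    # stage output
        return out, lse

    out1, lse1 = partial(k1, v1)
    out2, lse2 = partial(k2, v2)

    # Re-weight each stage by its share of the combined softmax mass.
    max_lse = torch.maximum(lse1, lse2)
    w1 = torch.exp(lse1 - max_lse)
    w2 = torch.exp(lse2 - max_lse)
    return (out1 * w1 + out2 * w2) / (w1 + w2)
```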
To improve performance, we implemented multi-stage flash attention with Triton. You can try it by setting fattn: true in your configuration YAML file when using the infinite-lm / stream-llm / inf-llm attention types.
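As a rough illustration, a config entry might look like the snippet below; only fattn: true is taken from the answer above, and the surrounding keys are placeholders, so please check the example YAML files in the repository for the exact schema:

```yaml
model:
  type: inf-llm   # or stream-llm / infinite-lm (placeholder key names)
  fattn: true     # use the Triton multi-stage flash attention kernel
```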
Congrats on the nice work.
I see streamingLLM and InfiniteLLM are used in your experiments.
Have you developed your own implementations for stream and infinite? The original StreamingLLM is only designed for the decoding phase and thus not suitable for prefilling. Do you have a prefilling version of the implementation? Thanks!