Closed Bec-k closed 5 months ago
This PR seems to be in progress and linked to this issue:
https://github.com/huggingface/text-generation-inference/pull/1105/
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Huh, this shouldn't be closed yet.
Hi @OlivierDehaene, can we reopen this?
Feature request
Implementations:
https://github.com/mit-han-lab/streaming-llm/tree/main
https://github.com/tomaarsen/attention_sinks/tree/main
Paper: https://arxiv.org/abs/2309.17453
Article by the author of one of the implementations: https://huggingface.co/blog/tomaarsen/attention-sinks
Motivation
Basically, it's another evolutionary step over the existing sliding-window attention mechanism. The required modifications preserve and forward the key/value attention cache so that it stays at a fixed size while retaining information across multiple calls or long generation cycles. This keeps memory usage bounded and prevents OOM cases, or at least significantly reduces them.
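For reference, the cache policy from the paper can be sketched roughly as follows: keep the first few "attention sink" tokens plus a recent window, and evict everything in between. This is only an illustrative sketch, not TGI code; the function name is made up, and a real implementation would slice per-layer key/value tensors rather than Python lists.

```python
def evict_kv_cache(keys, values, num_sink=4, window=1020):
    """Illustrative attention-sink eviction policy (not the actual TGI API).

    Keeps the first `num_sink` entries (the "attention sinks") plus the
    most recent `window` entries, dropping everything in between, so the
    cache never grows past `num_sink + window` entries.
    """
    if len(keys) <= num_sink + window:
        # Cache still fits within budget; nothing to evict.
        return keys, values
    keep = list(range(num_sink)) + list(range(len(keys) - window, len(keys)))
    return [keys[i] for i in keep], [values[i] for i in keep]
```

Running this policy once per generation step keeps the cache size constant, which is what avoids the unbounded memory growth described above.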
Your contribution
I can help explain the implementation from an architecture perspective, but the article, the implementations, and the paper linked above should be enough.