Closed Bec-k closed 5 months ago
This PR seems to be in progress and linked to this issue:
https://github.com/huggingface/text-generation-inference/pull/1105/
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Huh, this shouldn't be closed yet.
Hi @OlivierDehaene, can we reopen this?
Feature request
Implementations:
https://github.com/mit-han-lab/streaming-llm/tree/main
https://github.com/tomaarsen/attention_sinks/tree/main
Paper: https://arxiv.org/abs/2309.17453
Article by the author of one of the implementations: https://huggingface.co/blog/tomaarsen/attention-sinks
Motivation
Basically, it's another evolutionary step over the existing sliding-window attention mechanism. The required modifications preserve and forward the key/value attention cache so that it stays at a fixed size while retaining information across multiple calls or long generation cycles. This keeps memory usage bounded and prevents OOM cases, or at least significantly reduces them.
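For reference, the cache policy from the paper can be sketched roughly as follows: keep the first few "attention sink" tokens plus a recent window, and evict everything in between. This is only an illustrative sketch, not TGI code; the function name is made up, and a real implementation would slice per-layer key/value tensors rather than Python lists.

```python
def evict_kv_cache(keys, values, num_sink=4, window=1020):
    """Illustrative attention-sink eviction policy (not the actual TGI API).

    Keeps the first `num_sink` entries (the "attention sinks") plus the
    most recent `window` entries, dropping everything in between, so the
    cache never grows past `num_sink + window` entries.
    """
    if len(keys) <= num_sink + window:
        # Cache still fits within budget; nothing to evict.
        return keys, values
    keep = list(range(num_sink)) + list(range(len(keys) - window, len(keys)))
    return [keys[i] for i in keep], [values[i] for i in keep]
```

Running this policy once per generation step keeps the cache size constant, which is what avoids the unbounded memory growth described above.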
Your contribution
I can help explain the implementation from an architecture perspective, but the article, the implementations, and the paper linked above should be enough.