huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

StreamingLLM - Attention sinks #1139

Closed: Bec-k closed this issue 5 months ago

Bec-k commented 9 months ago

Feature request

Implementations:
https://github.com/mit-han-lab/streaming-llm/tree/main
https://github.com/tomaarsen/attention_sinks/tree/main

Paper: https://arxiv.org/abs/2309.17453

Article from the author of one of the implementations: https://huggingface.co/blog/tomaarsen/attention-sinks

Motivation

Essentially, this is an evolution of the existing sliding-window attention mechanism. The required modification preserves the first few tokens of the KV cache (the "attention sinks") and forwards the cache across multiple calls or long generation cycles while keeping it at a fixed size, so generation stays stable on long inputs. This keeps memory at a manageable size and prevents OOM cases, or at least significantly reduces them; see the sketch below.
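For context, here is a minimal sketch of the cache-eviction policy the paper describes, written against the legacy Hugging Face `past_key_values` layout of `(batch, heads, seq_len, head_dim)` tensors. The function name and default sizes are illustrative, not part of TGI's API:

```python
import torch

def evict_kv_cache(past_key_values, num_sink_tokens=4, window_size=1020):
    """Keep the first `num_sink_tokens` cached positions (the attention
    sinks) plus the most recent `window_size` positions, dropping the
    middle so the cache never grows beyond a fixed budget."""
    trimmed = []
    for key, value in past_key_values:
        seq_len = key.shape[2]  # tensors are (batch, heads, seq_len, head_dim)
        if seq_len <= num_sink_tokens + window_size:
            # Cache still fits the budget; nothing to evict.
            trimmed.append((key, value))
            continue
        key = torch.cat(
            [key[:, :, :num_sink_tokens], key[:, :, -window_size:]], dim=2
        )
        value = torch.cat(
            [value[:, :, :num_sink_tokens], value[:, :, -window_size:]], dim=2
        )
        trimmed.append((key, value))
    return tuple(trimmed)
```

Per the paper, position IDs are then assigned relative to positions inside the trimmed cache rather than to the original token positions, which is why the approach keeps working past the pretraining context length.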

Your contribution

I can help explain the implementation from an architecture perspective, but the paper, the article, and the linked implementations should be enough.

jqueguiner commented 9 months ago

This PR seems to be in progress and linked to this issue:

https://github.com/huggingface/text-generation-inference/pull/1105/

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Bec-k commented 6 months ago

Huh, this shouldn't be closed yet.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

jqueguiner commented 3 months ago

Hi @OlivierDehaene, can we reopen this issue?