tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0

Completely refactor injection code #16

Closed by tomaarsen 9 months ago

tomaarsen commented 9 months ago

Hello!

Pull Request overview

- Move the attention sink injection so that it runs after the regular from_pretrained call.
- Allow injection via the AutoModel... classes, including architectures that require trust_remote_code=True (e.g. Qwen).
- Reduce code duplication.

Details

The injection is now done at the end of the regular from_pretrained call, and is even possible on the AutoModel... classes. This was not possible before, and it was the big motivation for this refactor. With this change implemented, architectures that require trust_remote_code=True, such as Qwen, can also benefit from attention_sinks.
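To illustrate the pattern, here is a minimal sketch (not the project's actual implementation) of injecting after the regular from_pretrained call by subclassing the transformers auto class; inject_attention_sinks and its keyword arguments are hypothetical stand-ins:

```python
# A minimal sketch of post-hoc injection, assuming a hypothetical
# inject_attention_sinks helper; not the project's actual code.
from transformers import AutoModelForCausalLM as _AutoModelForCausalLM


def inject_attention_sinks(model, attention_sink_size=4, attention_sink_window_size=1020):
    """Hypothetical helper: patch the loaded model's attention layers in place."""
    ...  # e.g. replace each layer's attention forward with a sink-aware version
    return model


class AutoModelForCausalLM(_AutoModelForCausalLM):
    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        # Pull the attention-sink options out so transformers never sees them.
        sink_kwargs = {
            key: kwargs.pop(key)
            for key in list(kwargs)
            if key.startswith("attention_sink")
        }
        # Let transformers build the model as usual (trust_remote_code and all),
        # then patch the finished model.
        model = super().from_pretrained(*args, **kwargs)
        return inject_attention_sinks(model, **sink_kwargs)
```

Because the patch runs on the already-constructed model, the same path works for remote-code architectures, for example (model name and kwargs chosen for illustration):

```python
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    trust_remote_code=True,
    attention_sink_size=4,
    attention_sink_window_size=1020,
)
```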

This refactor also removes a decent amount of code duplication.