tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0

Completely refactor injection code #16

Closed by tomaarsen 9 months ago

tomaarsen commented 9 months ago

Hello!

Pull Request overview

- Move the attention sink injection so that it runs after the regular from_pretrained call.
- Allow injection via the AutoModel... classes, including architectures that require trust_remote_code=True (e.g. Qwen).
- Reduce code duplication.

Details

The injection is now done at the end of the regular from_pretrained call, and is even possible on the AutoModel... classes. This was not possible before, and it was the big motivation for this refactor. With this change implemented, architectures that require trust_remote_code=True, such as Qwen, can also benefit from attention_sinks.
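To illustrate the pattern, here is a minimal sketch (not the project's actual implementation) of injecting after the regular from_pretrained call by subclassing the transformers auto class; inject_attention_sinks and its keyword arguments are hypothetical stand-ins:

```python
# A minimal sketch of post-hoc injection, assuming a hypothetical
# inject_attention_sinks helper; not the project's actual code.
from transformers import AutoModelForCausalLM as _AutoModelForCausalLM


def inject_attention_sinks(model, attention_sink_size=4, attention_sink_window_size=1020):
    """Hypothetical helper: patch the loaded model's attention layers in place."""
    ...  # e.g. replace each layer's attention forward with a sink-aware version
    return model


class AutoModelForCausalLM(_AutoModelForCausalLM):
    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        # Pull the attention-sink options out so transformers never sees them.
        sink_kwargs = {
            key: kwargs.pop(key)
            for key in list(kwargs)
            if key.startswith("attention_sink")
        }
        # Let transformers build the model as usual (trust_remote_code and all),
        # then patch the finished model.
        model = super().from_pretrained(*args, **kwargs)
        return inject_attention_sinks(model, **sink_kwargs)
```

Because the patch runs on the already-constructed model, the same path works for remote-code architectures, for example (model name and kwargs chosen for illustration):

```python
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    trust_remote_code=True,
    attention_sink_size=4,
    attention_sink_window_size=1020,
)
```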

This refactor also removes a decent amount of code duplication.