Hello!
Pull Request overview
Details
The injection is now done at the end of the regular from_pretrained call, and is even possible on the AutoModel... classes. This was not possible before, and was the big motivation for this refactor. With this change implemented, architectures that require trust_remote_code=True, such as Qwen, can also benefit from attention_sinks.
This also helps a decent bit with code duplication.
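To illustrate why injecting at the end of from_pretrained makes this possible, here is a minimal pure-Python sketch of the "load first, inject afterwards" pattern. DummyModel, DummyAutoModel, inject_attention_sinks and with_attention_sinks are all hypothetical names invented for this example and are not the attention_sinks or transformers API. The attention_sink_size and attention_sink_window_size keyword names and their defaults are assumptions made for illustration.

```python
from typing import Any


class DummyModel:
    """Hypothetical stand-in for the model a regular from_pretrained call returns."""

    def __init__(self, name: str, trust_remote_code: bool = False) -> None:
        self.name = name
        self.trust_remote_code = trust_remote_code
        self.attention_sinks_enabled = False


class DummyAutoModel:
    """Hypothetical stand-in for an AutoModel... class (not the real transformers API)."""

    @classmethod
    def from_pretrained(cls, name: str, **kwargs: Any) -> DummyModel:
        # A real auto class would resolve the architecture here, possibly
        # executing remote code when trust_remote_code=True.
        return DummyModel(name, trust_remote_code=kwargs.get("trust_remote_code", False))


def inject_attention_sinks(model: DummyModel, attention_sink_size: int,
                           attention_sink_window_size: int) -> DummyModel:
    """Hypothetical post-load hook: patch the already-loaded model in place."""
    model.attention_sinks_enabled = True
    model.attention_sink_size = attention_sink_size
    model.attention_sink_window_size = attention_sink_window_size
    return model


def with_attention_sinks(auto_cls: type) -> type:
    """Wrap an auto class so the injection runs after its regular from_pretrained."""

    class Wrapped(auto_cls):
        @classmethod
        def from_pretrained(cls, name: str, *, attention_sink_size: int = 4,
                            attention_sink_window_size: int = 1020, **kwargs: Any):
            # 1. Let the original loader do all of its usual work first
            #    (including any trust_remote_code handling).
            model = super().from_pretrained(name, **kwargs)
            # 2. Only then inject, so whatever architecture was loaded is covered.
            return inject_attention_sinks(model, attention_sink_size, attention_sink_window_size)

    Wrapped.__name__ = auto_cls.__name__
    return Wrapped


# Usage: the model id below is a placeholder, not a real checkpoint.
AutoModel = with_attention_sinks(DummyAutoModel)
model = AutoModel.from_pretrained("hypothetical/remote-code-model", trust_remote_code=True)
print(model.attention_sinks_enabled, model.trust_remote_code)
```

Because the hook only touches the object the original loader hands back, the wrapper never needs to know the architecture in advance. In this sketch, that is what lets one wrapper serve every auto class instead of one subclass per model.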