huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Differential Attention implementation for BERT. #34146

Open Chirayu-Tripathi opened 1 month ago

Chirayu-Tripathi commented 1 month ago

Feature request

The paper "Differential Transformer" proposes a differential attention mechanism that computes attention scores as the difference between two separate softmax attention maps, which the authors show improves long-context modeling and key-information retrieval.

Link: Paper
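For reference, here is a minimal PyTorch sketch of the core idea as I understand it from the paper: two independent Q/K projections produce two softmax attention maps, and the second map is subtracted from the first with a learnable weight λ before being applied to V. The class name `DiffAttention`, the `lambda_init` value, and the use of a single scalar λ are simplifications for illustration only; the paper reparameterizes λ per head and applies a per-head GroupNorm, both omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiffAttention(nn.Module):
    """Simplified, bidirectional (encoder-style) differential attention sketch."""

    def __init__(self, hidden_size: int, num_heads: int, lambda_init: float = 0.8):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # Two sets of query/key projections; their softmax maps get subtracted.
        self.q_proj = nn.Linear(hidden_size, 2 * hidden_size)
        self.k_proj = nn.Linear(hidden_size, 2 * hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.o_proj = nn.Linear(hidden_size, hidden_size)
        # Single learnable scalar weighting the second map (the paper uses a
        # per-head reparameterized lambda; a scalar keeps the sketch short).
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))

    def forward(self, hidden_states, attention_mask=None):
        bsz, seq_len, _ = hidden_states.shape
        # Split the doubled projections into the two Q/K groups.
        q1, q2 = self.q_proj(hidden_states).chunk(2, dim=-1)
        k1, k2 = self.k_proj(hidden_states).chunk(2, dim=-1)
        v = self.v_proj(hidden_states)

        def split_heads(x):
            return x.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q1, q2, k1, k2, v = map(split_heads, (q1, q2, k1, k2, v))
        scale = self.head_dim ** -0.5

        scores1 = torch.matmul(q1, k1.transpose(-1, -2)) * scale
        scores2 = torch.matmul(q2, k2.transpose(-1, -2)) * scale
        if attention_mask is not None:  # additive mask, as in BERT's extended mask
            scores1 = scores1 + attention_mask
            scores2 = scores2 + attention_mask

        # Differential attention: difference of two softmax attention maps.
        attn = F.softmax(scores1, dim=-1) - self.lmbda * F.softmax(scores2, dim=-1)
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).reshape(bsz, seq_len, -1)
        return self.o_proj(out)


# Quick shape check (hypothetical BERT-base-like sizes)
layer = DiffAttention(hidden_size=768, num_heads=12)
x = torch.randn(2, 16, 768)
print(layer(x).shape)  # torch.Size([2, 16, 768])
```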

Motivation

Although the paper focuses on decoder-only models, I think this differential attention mechanism could be helpful for encoder-only models as well.

Your contribution

I can work on this feature if the Hugging Face team considers it a valuable addition.

larin92 commented 1 month ago

I was thinking about implementing/using it for Longformer as well.