Feature request
The paper "Differential Transformer" proposes a differential attention mechanism that computes attention scores as the difference between two separate softmax attention maps, leading to better long-context modeling and key information retrieval. A rough sketch of the mechanism is included below.
Link: Paper
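To make the request concrete, here is a minimal single-head sketch of the core idea in plain PyTorch: two query/key projections yield two softmax attention maps, and the second map is subtracted with a learnable scalar before the value aggregation. The class name `DiffAttention`, the `lambda_init` parameter, and the simplified scalar lambda are my own illustrative assumptions; the paper additionally reparameterizes lambda and applies per-head normalization, which this sketch omits.

```python
import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class DiffAttention(nn.Module):
    """Single-head sketch: softmax(Q1 K1^T / sqrt(d)) - lambda * softmax(Q2 K2^T / sqrt(d))."""

    def __init__(self, embed_dim: int, head_dim: int, lambda_init: float = 0.8):
        super().__init__()
        # Two query/key projections produce the two attention maps that are subtracted.
        self.q_proj = nn.Linear(embed_dim, 2 * head_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, 2 * head_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, head_dim, bias=False)
        self.out_proj = nn.Linear(head_dim, embed_dim, bias=False)
        # Simplified learnable scalar controlling the strength of the subtracted map
        # (the paper uses a reparameterized lambda; this is an assumption for brevity).
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))
        self.head_dim = head_dim

    def forward(self, x: torch.Tensor, attn_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)

        scale = 1.0 / math.sqrt(self.head_dim)
        scores1 = torch.matmul(q1, k1.transpose(-1, -2)) * scale
        scores2 = torch.matmul(q2, k2.transpose(-1, -2)) * scale
        if attn_mask is not None:
            # Additive mask (e.g. large negative values at disallowed positions).
            scores1 = scores1 + attn_mask
            scores2 = scores2 + attn_mask

        # Differential attention: the difference of two softmax attention maps.
        attn = F.softmax(scores1, dim=-1) - self.lmbda * F.softmax(scores2, dim=-1)
        return self.out_proj(torch.matmul(attn, v))


if __name__ == "__main__":
    layer = DiffAttention(embed_dim=64, head_dim=32)
    out = layer(torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 16, 64])
```

In an encoder-only setting this would replace the standard self-attention block, with the usual bidirectional (non-causal) mask passed as `attn_mask`.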
Motivation
Although the paper focuses on decoder-only models, I think this differential attention mechanism could be helpful for encoder-only models as well.
Your contribution
I can work on this feature if the Hugging Face team considers it a valuable addition.