Feature request
The paper "Differential Transformer" proposes a differential attention mechanism that computes attention scores as the difference between two separate softmax attention maps, leading to better long-context modeling and key information retrieval. A rough sketch of the mechanism is included below.
Link: Paper
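To make the request concrete, here is a minimal single-head sketch of the core idea in plain PyTorch: two query/key projections yield two softmax attention maps, and the second map is subtracted with a learnable scalar before the value aggregation. The class name `DiffAttention`, the `lambda_init` parameter, and the simplified scalar lambda are my own illustrative assumptions; the paper additionally reparameterizes lambda and applies per-head normalization, which this sketch omits.

```python
import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class DiffAttention(nn.Module):
    """Single-head sketch: softmax(Q1 K1^T / sqrt(d)) - lambda * softmax(Q2 K2^T / sqrt(d))."""

    def __init__(self, embed_dim: int, head_dim: int, lambda_init: float = 0.8):
        super().__init__()
        # Two query/key projections produce the two attention maps that are subtracted.
        self.q_proj = nn.Linear(embed_dim, 2 * head_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, 2 * head_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, head_dim, bias=False)
        self.out_proj = nn.Linear(head_dim, embed_dim, bias=False)
        # Simplified learnable scalar controlling the strength of the subtracted map
        # (the paper uses a reparameterized lambda; this is an assumption for brevity).
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))
        self.head_dim = head_dim

    def forward(self, x: torch.Tensor, attn_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)

        scale = 1.0 / math.sqrt(self.head_dim)
        scores1 = torch.matmul(q1, k1.transpose(-1, -2)) * scale
        scores2 = torch.matmul(q2, k2.transpose(-1, -2)) * scale
        if attn_mask is not None:
            # Additive mask (e.g. large negative values at disallowed positions).
            scores1 = scores1 + attn_mask
            scores2 = scores2 + attn_mask

        # Differential attention: the difference of two softmax attention maps.
        attn = F.softmax(scores1, dim=-1) - self.lmbda * F.softmax(scores2, dim=-1)
        return self.out_proj(torch.matmul(attn, v))


if __name__ == "__main__":
    layer = DiffAttention(embed_dim=64, head_dim=32)
    out = layer(torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 16, 64])
```

In an encoder-only setting this would replace the standard self-attention block, with the usual bidirectional (non-causal) mask passed as `attn_mask`.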
Motivation
Although the paper focuses on decoder-only models, I think this differential attention mechanism could be helpful for encoder-only models as well.
Your contribution
I can work on this feature if the Hugging Face team considers it a valuable addition.