huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Feature Request: El-Attention #12793

Open suriyakode opened 3 years ago

suriyakode commented 3 years ago

🚀 Feature request

I've looked into the paper "EL-Attention: Memory Efficient Lossless Attention for Generation". It proposes a way of computing attention during generation that avoids building the per-head key/value projections from the cached hidden states; the key projection is folded into the query instead, so only a single copy of the hidden states has to be cached. This saves computation time and frees memory.
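
For intuition, here is a minimal, self-contained PyTorch sketch (not the paper's implementation; shapes, weight layout, and variable names are illustrative assumptions) of the algebraic rewrite EL-attention relies on: the key/value projections are moved onto the query/output side, so attention runs directly over the raw hidden states and no multi-head K/V cache is materialized.

```python
# Illustrative sketch of the EL-attention rewrite (hypothetical shapes/names, not the paper's code).
import math
import torch

torch.manual_seed(0)

B, S, D, H = 2, 6, 16, 4        # batch, cached length, model dim, num heads
Dh = D // H                     # head dim

x = torch.randn(B, 1, D)        # current decoding step (query side)
h = torch.randn(B, S, D)        # cached hidden states (key/value source)

# Projection weights; per-head slices are contiguous column blocks.
W_q, W_k, W_v, W_o = (torch.randn(D, D) for _ in range(4))

def split_heads(t):             # (B, T, D) -> (B, H, T, Dh)
    return t.view(t.size(0), t.size(1), H, Dh).transpose(1, 2)

# Standard multi-head attention: needs per-head K and V (the usual large cache).
q = split_heads(x @ W_q)                                    # (B, H, 1, Dh)
k = split_heads(h @ W_k)                                    # (B, H, S, Dh), cached per head
v = split_heads(h @ W_v)                                    # (B, H, S, Dh), cached per head
attn = torch.softmax(q @ k.transpose(-1, -2) / math.sqrt(Dh), dim=-1)
std_out = (attn @ v).transpose(1, 2).reshape(B, 1, D) @ W_o

# EL-attention style: attend directly over h; no multi-head K/V is ever built.
el_out = torch.zeros(B, 1, D)
for i in range(H):
    Wq_i = W_q[:, i * Dh:(i + 1) * Dh]                      # (D, Dh)
    Wk_i = W_k[:, i * Dh:(i + 1) * Dh]
    Wv_i = W_v[:, i * Dh:(i + 1) * Dh]
    Wo_i = W_o[i * Dh:(i + 1) * Dh, :]                      # (Dh, D)
    q_exp = x @ Wq_i @ Wk_i.T                               # expanded query, (B, 1, D)
    scores = q_exp @ h.transpose(-1, -2) / math.sqrt(Dh)    # (B, 1, S)
    ctx = torch.softmax(scores, dim=-1) @ h                 # (B, 1, D): raw h acts as value
    el_out += ctx @ Wv_i @ Wo_i                             # value/output projection afterwards

print(torch.allclose(std_out, el_out, atol=1e-4))           # True -> the rewrite is lossless
```

The `allclose` check is the "lossless" part: the rewrite is exact, and the saving comes from never materializing or caching the per-head keys and values.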

Motivation

EL-attention is lossless (it produces the same outputs as standard multi-head attention) and promises significant memory savings and speedups during generation/inference.

Your contribution

The main difficulty is that it would either have to be added directly to each model's attention implementation, or would require a large number of new subclasses across models. An easier route might be a hook or pipeline for plugging in custom attention implementations; a rough sketch of that idea follows.
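
To make that second idea concrete, here is a rough sketch (the helper, the class-name matching, and `MyELAttention` are hypothetical, not an existing transformers API): walk a loaded model and replace its attention submodules with a drop-in implementation that mirrors the original forward signature.

```python
# Hypothetical helper; not part of transformers. Matching attention modules by class
# name and the MyELAttention replacement below are assumptions for illustration only.
import torch.nn as nn

def swap_attention(model: nn.Module, is_attention, make_replacement) -> int:
    """Replace every submodule for which is_attention(module) returns True."""
    targets = [
        (parent, child_name, child)
        for parent in model.modules()
        for child_name, child in parent.named_children()
        if is_attention(child)
    ]
    for parent, child_name, child in targets:
        setattr(parent, child_name, make_replacement(child))
    return len(targets)

# Hypothetical usage:
# from transformers import AutoModelForSeq2SeqLM
# model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")
# n = swap_attention(
#     model,
#     is_attention=lambda m: type(m).__name__ == "BartAttention",
#     make_replacement=lambda old: MyELAttention.from_standard(old),  # hypothetical class
# )
```

The hard part remains what is noted above: each replacement has to match the exact forward signature and cache layout of the model it is dropped into.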

tzuhsial commented 11 months ago

Any updates on this one?