rajveer43 opened this issue 3 months ago
Hi Rajveer, thank you for your feedback and for sharing these other variants. We will surely add implementations of other efficient Transformer variants, and you are more than welcome to contribute to this.
Description:
Hello! I appreciate the excellent work on benchmarking Performer and Longformer against the base Transformer. I’d like to propose the implementation of additional efficient Transformer variants to further extend the benchmarking scope. This could provide a more comprehensive comparison and serve as a valuable resource for the community.
Suggested Models:
Reformer:
Description:
Reformer introduces two key innovations: locality-sensitive hashing (LSH) for reducing the attention complexity from O(N^2) to O(N log N) and reversible layers to reduce memory consumption.
Reference Paper:
Reformer: The Efficient Transformer
Implementation Considerations: Implementing the LSH attention mechanism and reversible layers within the current framework could provide significant memory and time savings, especially for long sequences.
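To make the bucketing idea concrete, here is a minimal sketch of LSH-restricted shared-QK attention, assuming a PyTorch setup. For clarity it masks cross-bucket pairs instead of sorting and chunking by bucket, so it does not yet realize the O(N log N) cost; the function names (`lsh_hash`, `lsh_attention`) and the bucket count are placeholders, not part of the existing benchmarking code.

```python
# Sketch of LSH (bucketed) self-attention. The real Reformer sorts tokens by
# bucket and attends within fixed-size chunks to reach O(N log N); here we
# simply mask out cross-bucket pairs to illustrate the idea.
import torch
import torch.nn.functional as F

def lsh_hash(x, n_buckets):
    # x: (batch, seq, dim). Random rotations, then argmax over [R, -R] columns
    # gives each token a bucket id in [0, n_buckets) (Reformer-style hashing).
    b, n, d = x.shape
    rot = torch.randn(d, n_buckets // 2, device=x.device)
    rotated = torch.einsum("bnd,dr->bnr", x, rot)
    return torch.cat([rotated, -rotated], dim=-1).argmax(dim=-1)  # (batch, seq)

def lsh_attention(qk, v, n_buckets=8):
    # Shared-QK attention restricted to tokens that fall in the same bucket.
    buckets = lsh_hash(qk, n_buckets)                                   # (batch, seq)
    same_bucket = buckets.unsqueeze(-1) == buckets.unsqueeze(-2)        # (batch, seq, seq)
    scores = qk @ qk.transpose(-1, -2) / qk.shape[-1] ** 0.5
    scores = scores.masked_fill(~same_bucket, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

qk = torch.randn(2, 128, 64)
v = torch.randn(2, 128, 64)
print(lsh_attention(qk, v).shape)  # torch.Size([2, 128, 64])
```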
Linformer:
Description:
Linformer approximates the self-attention mechanism with linear complexity by projecting the key and value matrices to lower dimensions. This makes the attention computation linear with respect to the sequence length.
Reference Paper:
Linformer: Self-Attention with Linear Complexity
Implementation Considerations: The key challenge will be effectively reducing the dimensionality of the key and value matrices without compromising the model's performance.
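As a starting point for discussion, here is a rough sketch of Linformer-style attention, assuming PyTorch and a fixed maximum sequence length; the class name and the choice of the projected length k are illustrative, not tied to the existing benchmarking code.

```python
# Sketch of Linformer-style attention: K and V are projected along the
# sequence axis from n to k << n, so the attention map is (n x k) rather
# than (n x n), giving linear cost in sequence length.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinformerSelfAttention(nn.Module):
    def __init__(self, dim, seq_len, k=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Learned projections that compress the sequence axis (n -> k).
        self.proj_k = nn.Parameter(torch.randn(seq_len, k) / seq_len ** 0.5)
        self.proj_v = nn.Parameter(torch.randn(seq_len, k) / seq_len ** 0.5)

    def forward(self, x):
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        k = torch.einsum("bnd,nk->bkd", k, self.proj_k)        # (batch, k, dim)
        v = torch.einsum("bnd,nk->bkd", v, self.proj_v)        # (batch, k, dim)
        scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5  # (batch, n, k)
        return F.softmax(scores, dim=-1) @ v                   # (batch, n, dim)

attn = LinformerSelfAttention(dim=64, seq_len=512, k=64)
print(attn(torch.randn(2, 512, 64)).shape)  # torch.Size([2, 512, 64])
```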
BigBird:
Description:
BigBird uses a combination of global, local, and random attention mechanisms to handle sequences of up to thousands of tokens efficiently. It’s especially beneficial for tasks like long document classification.
Reference Paper:
Big Bird: Transformers for Longer Sequences
Implementation Considerations: Adapting the attention mechanism to incorporate global, local, and random attention will be critical. This will allow the model to process longer sequences with improved efficiency.
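Here is a minimal sketch of how the three attention patterns could be combined via a boolean mask, assuming PyTorch. A real implementation would use block-sparse kernels rather than materializing the full N x N mask, and the helper names and window/global/random sizes below are hypothetical.

```python
# Sketch of a BigBird-style sparse attention pattern: global tokens +
# sliding window + a few random keys per query, applied as a mask over
# dense attention purely for illustration.
import torch
import torch.nn.functional as F

def bigbird_mask(seq_len, window=3, n_global=2, n_random=2, device=None):
    idx = torch.arange(seq_len, device=device)
    local = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window       # sliding window
    glob = torch.zeros(seq_len, seq_len, dtype=torch.bool, device=device)
    glob[:n_global, :] = True                                           # global tokens attend everywhere
    glob[:, :n_global] = True                                           # everyone attends to global tokens
    rand = torch.zeros_like(glob)
    rand_cols = torch.randint(0, seq_len, (seq_len, n_random), device=device)
    rand[idx.unsqueeze(1), rand_cols] = True                            # random keys per query
    return local | glob | rand

def sparse_attention(q, k, v, mask):
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 256, 64)
print(sparse_attention(q, k, v, bigbird_mask(256)).shape)  # torch.Size([2, 256, 64])
```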
Synthesizer:
Description:
Synthesizer replaces the dot-product self-attention mechanism with synthetic attention: the attention weights are either predicted from each token via a dense projection (Dense Synthesizer) or learned as input-independent random matrices (Random Synthesizer), aiming to simplify the attention computation while maintaining performance.
Reference Paper:
Synthesizer: Rethinking Self-Attention in Transformer Models
Implementation Considerations: Implementing synthetic attention mechanisms would provide an interesting comparison to traditional attention-based models, especially in terms of performance and computational cost.
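For illustration, a small sketch of the Dense and Random Synthesizer variants, assuming PyTorch; the module names and dimensions are placeholders, and neither variant computes query-key dot products.

```python
# Sketch of Dense and Random Synthesizer attention: the weights are produced
# without any Q.K^T interaction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizer(nn.Module):
    """Each token predicts its own row of attention weights via a small MLP."""
    def __init__(self, dim, seq_len, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, seq_len))
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x):
        weights = F.softmax(self.mlp(x), dim=-1)   # (batch, seq, seq), no dot products
        return weights @ self.to_v(x)

class RandomSynthesizer(nn.Module):
    """Attention weights are a learned matrix, independent of the input."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.attn = nn.Parameter(torch.randn(seq_len, seq_len))
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x):
        weights = F.softmax(self.attn, dim=-1)     # (seq, seq), shared across the batch
        return weights @ self.to_v(x)

x = torch.randn(2, 128, 64)
print(DenseSynthesizer(64, 128)(x).shape, RandomSynthesizer(64, 128)(x).shape)
```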
Looking forward to your thoughts on this!