infocusp / varformers

Library for variants of transformers

Implement Relative Positional Multi-Head Attention in Transformer Variants #2

Status: Open. rajveer43 opened this issue 2 months ago.

rajveer43 commented 2 months ago

Description:

Hello! I’ve been following the development of this repository and appreciate the efforts to benchmark various efficient Transformer variants. I’d like to propose the implementation of Relative Positional Multi-Head Attention as an enhancement to the current models.

What is Relative Positional Multi-Head Attention?

Relative Positional Multi-Head Attention is a modification to the standard self-attention mechanism in Transformers. Traditional Transformers use absolute positional encodings to provide information about the position of tokens in a sequence. However, relative positional encodings allow the model to focus on the relative distance between tokens, which is often more relevant in tasks where the relationship between tokens matters more than their absolute position.

This method enhances the model's ability to capture local dependencies and handle sequences where the relative position of tokens plays a significant role. It is particularly beneficial for tasks like language modeling, where understanding the proximity of words to each other can be crucial.
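To make the proposal concrete, here is a minimal PyTorch sketch of relative positional multi-head attention in the style of Shaw et al. (2018), where attention scores combine the usual content term with a term that scores each query against a learned embedding of the clipped relative offset. The class name, the `max_relative_distance` clipping parameter, and the interface are illustrative assumptions, not this repository's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeMultiHeadAttention(nn.Module):
    """Multi-head self-attention with learned relative position embeddings
    (a sketch in the style of Shaw et al., 2018; not varformers' actual API)."""

    def __init__(self, d_model: int, num_heads: int, max_relative_distance: int = 16):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.max_relative_distance = max_relative_distance

        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # One embedding per clipped relative offset in [-k, k].
        self.rel_embeddings = nn.Embedding(2 * max_relative_distance + 1, self.d_head)

    def _relative_position_index(self, seq_len: int, device) -> torch.Tensor:
        # rel[i, j] = j - i, clipped to [-k, k] and shifted to [0, 2k]
        # so it can index into the embedding table.
        positions = torch.arange(seq_len, device=device)
        rel = positions[None, :] - positions[:, None]            # (L, L)
        rel = rel.clamp(-self.max_relative_distance, self.max_relative_distance)
        return rel + self.max_relative_distance

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        B, L, _ = x.shape
        qkv = self.qkv_proj(x).view(B, L, 3, self.num_heads, self.d_head)
        q, k, v = qkv.unbind(dim=2)                              # each (B, L, H, Dh)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))         # (B, H, L, Dh)

        # Content term: standard dot-product attention scores.
        content_scores = q @ k.transpose(-2, -1)                 # (B, H, L, L)

        # Position term: each query dotted with the embedding of offset (j - i).
        rel_idx = self._relative_position_index(L, x.device)     # (L, L)
        rel_emb = self.rel_embeddings(rel_idx)                   # (L, L, Dh)
        position_scores = torch.einsum("bhid,ijd->bhij", q, rel_emb)

        scores = (content_scores + position_scores) / self.d_head ** 0.5
        if mask is not None:
            # mask broadcastable to (B, H, L, L); 0 marks disallowed positions.
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, L, -1)       # (B, L, d_model)
        return self.out_proj(out)
```

A module like this could drop in where a standard attention block sits in an encoder layer; for example, `RelativeMultiHeadAttention(d_model=512, num_heads=8)` applied to a `(batch, seq_len, 512)` tensor returns a tensor of the same shape. Clipping offsets to `max_relative_distance` keeps the embedding table small and lets the layer generalize to sequence lengths unseen during training.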

Reference Papers:

- Self-Attention with Relative Position Representations (Shaw et al., 2018), which extends the "Attention Is All You Need" architecture with relative position encodings
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)
- Implementation example: Relative Positional Encodings in Transformer Models

shashank3009 commented 1 month ago

Hi Rajveer, thank you for your feedback and suggestions. We will surely consider implementing this enhancement in the near future. Also, you are more than welcome to contribute it yourself.