This paper proposes relative position representations for the self-attention mechanism of the Transformer.
Relation-aware Self-Attention
Self-Attention
The input sequence x = (x_1, ..., x_n) is mapped to a new sequence z = (z_1, ..., z_n); scaled dot products (attention logits) between all pairs of positions are computed as in eq. (2).
Attention weights (alpha) are computed with a softmax over these logits.
The final token representation is the weighted sum of linearly transformed inputs, as in eq. (1).
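For reference, the baseline self-attention equations from the paper (eqs. (1) and (2); the softmax line is unnumbered there):

```latex
% Baseline self-attention (paper eqs. (1) and (2))
\begin{align*}
  z_i &= \sum_{j=1}^{n} \alpha_{ij} \, (x_j W^V) \tag{1} \\
  \alpha_{ij} &= \frac{\exp e_{ij}}{\sum_{k=1}^{n} \exp e_{ik}} \\
  e_{ij} &= \frac{(x_i W^Q)(x_j W^K)^{\top}}{\sqrt{d_z}} \tag{2}
\end{align*}
```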
Relation-aware Self-Attention
The input and output are the same as in self-attention, but edge information between positions is injected via addition: eq. (2) becomes eq. (4), where a relative position representation a_ij^K is added to the key.
The final token representation also incorporates edge information: eq. (1) becomes eq. (3), where a_ij^V is added to the value.
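The relation-aware versions, with the learned edge vectors added to keys and values:

```latex
% Relation-aware self-attention (paper eqs. (3) and (4)): edge vectors
% a_{ij}^V and a_{ij}^K are added to the values and keys, respectively
\begin{align*}
  z_i &= \sum_{j=1}^{n} \alpha_{ij} \, (x_j W^V + a_{ij}^{V}) \tag{3} \\
  e_{ij} &= \frac{x_i W^Q \, (x_j W^K + a_{ij}^{K})^{\top}}{\sqrt{d_z}} \tag{4}
\end{align*}
```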
Relative Position Representation
The relative position j - i is clipped to a maximum absolute value of k, so only 2k+1 distinct representations need to be learned.
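Concretely, the edge vectors are lookups into 2k+1 learned embeddings, indexed by the clipped relative distance:

```latex
% Clipped relative position representations (Section 3.2 of the paper);
% w^K and w^V are sets of 2k+1 learned embedding vectors
\begin{align*}
  a_{ij}^{K} &= w^{K}_{\mathrm{clip}(j-i,\,k)} \\
  a_{ij}^{V} &= w^{V}_{\mathrm{clip}(j-i,\,k)} \\
  \mathrm{clip}(x, k) &= \max\!\big(-k, \min(k, x)\big)
\end{align*}
```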
Efficient Implementation
Splitting the computation of eq. (4) into two terms (eq. (5)) allows an efficient implementation; with it, adding relative position representations costs only a ~7% decrease in steps per second.
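The split is e_ij = (x_i W^Q (x_j W^K)^T + x_i W^Q (a_ij^K)^T) / sqrt(d_z): the first term is the usual batched matmul, and the second reuses the same a^K across the batch and all heads. Below is a minimal NumPy sketch of this idea under my own naming and shape assumptions; it is not the authors' implementation, which uses its own tensor reshaping.

```python
# Minimal NumPy sketch of the two-term split (eq. (5)); names and shapes are assumptions.
import numpy as np

def gather_rel_embeddings(w_k, n, k):
    """w_k: (2k+1, d) learned relative-key embeddings; returns a_{ij}^K as (n, n, d)."""
    idx = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -k, k) + k
    return w_k[idx]

def relative_attention_logits(q, k_proj, rel_k):
    """q, k_proj: (batch, heads, n, d); rel_k: (n, n, d). Returns logits e_{ij}."""
    d = q.shape[-1]
    content = np.einsum("bhid,bhjd->bhij", q, k_proj)   # x_i W^Q (x_j W^K)^T
    position = np.einsum("bhid,ijd->bhij", q, rel_k)    # x_i W^Q (a_{ij}^K)^T
    return (content + position) / np.sqrt(d)
```

The second einsum avoids materializing a per-batch, per-head copy of a^K, which is the point of the split.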
Experiments
WMT14 EnDe +1.3 BLEU
WMT14 EnFr +0.3 BLEU
Ablation Experiments
clipping distance k
for k >= 2, BLEU does not improve significantly with larger clipping distances
Position of Relative Representation
a^K alone (relative positions added to the keys) appears to suffice; the contribution of a^V needs further investigation
Personal Thoughts
I was always curious about the role and effectiveness of the sinusoidal positional encoding, so it is good to see an improvement in BLEU, but I wish there were a qualitative analysis of how the relative position representations are learned.
Link : https://arxiv.org/pdf/1803.02155.pdf
Authors : Shaw et al., 2018