This paper proposes relative position representations for the self-attention mechanism of the Transformer.
Relation-aware Self-Attention
Self-Attention
The input sequence x = (x_1, ..., x_n) is mapped to a new sequence z = (z_1, ..., z_n); scaled dot products (attention logits) between all pairs of positions are computed as in eq. (2).
Attention weights (alpha) are computed with a softmax over these logits.
The final token representation is the weighted sum of linearly transformed inputs, as in eq. (1).
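For reference, the baseline self-attention equations from the paper (eqs. (1) and (2); the softmax line is unnumbered there):

```latex
% Baseline self-attention (paper eqs. (1) and (2))
\begin{align*}
  z_i &= \sum_{j=1}^{n} \alpha_{ij} \, (x_j W^V) \tag{1} \\
  \alpha_{ij} &= \frac{\exp e_{ij}}{\sum_{k=1}^{n} \exp e_{ik}} \\
  e_{ij} &= \frac{(x_i W^Q)(x_j W^K)^{\top}}{\sqrt{d_z}} \tag{2}
\end{align*}
```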
Relation-aware Self-Attention
The input and output are the same as in self-attention, but edge information between positions is injected via addition: eq. (2) becomes eq. (4), where a relative position representation a_ij^K is added to the key.
The final token representation also incorporates edge information: eq. (1) becomes eq. (3), where a_ij^V is added to the value.
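The relation-aware versions, with the learned edge vectors added to keys and values:

```latex
% Relation-aware self-attention (paper eqs. (3) and (4)): edge vectors
% a_{ij}^V and a_{ij}^K are added to the values and keys, respectively
\begin{align*}
  z_i &= \sum_{j=1}^{n} \alpha_{ij} \, (x_j W^V + a_{ij}^{V}) \tag{3} \\
  e_{ij} &= \frac{x_i W^Q \, (x_j W^K + a_{ij}^{K})^{\top}}{\sqrt{d_z}} \tag{4}
\end{align*}
```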
Relative Position Representation
The relative position j - i is clipped to a maximum absolute value of k, so only 2k+1 distinct representations need to be learned.
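Concretely, the edge vectors are lookups into 2k+1 learned embeddings, indexed by the clipped relative distance:

```latex
% Clipped relative position representations (Section 3.2 of the paper);
% w^K and w^V are sets of 2k+1 learned embedding vectors
\begin{align*}
  a_{ij}^{K} &= w^{K}_{\mathrm{clip}(j-i,\,k)} \\
  a_{ij}^{V} &= w^{V}_{\mathrm{clip}(j-i,\,k)} \\
  \mathrm{clip}(x, k) &= \max\!\big(-k, \min(k, x)\big)
\end{align*}
```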
Efficient Implementation
Splitting the computation of eq. (4) into two terms (eq. (5)) allows an efficient implementation; with it, adding relative position representations costs only a ~7% decrease in steps per second.
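The split is e_ij = (x_i W^Q (x_j W^K)^T + x_i W^Q (a_ij^K)^T) / sqrt(d_z): the first term is the usual batched matmul, and the second reuses the same a^K across the batch and all heads. Below is a minimal NumPy sketch of this idea under my own naming and shape assumptions; it is not the authors' implementation, which uses its own tensor reshaping.

```python
# Minimal NumPy sketch of the two-term split (eq. (5)); names and shapes are assumptions.
import numpy as np

def gather_rel_embeddings(w_k, n, k):
    """w_k: (2k+1, d) learned relative-key embeddings; returns a_{ij}^K as (n, n, d)."""
    idx = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -k, k) + k
    return w_k[idx]

def relative_attention_logits(q, k_proj, rel_k):
    """q, k_proj: (batch, heads, n, d); rel_k: (n, n, d). Returns logits e_{ij}."""
    d = q.shape[-1]
    content = np.einsum("bhid,bhjd->bhij", q, k_proj)   # x_i W^Q (x_j W^K)^T
    position = np.einsum("bhid,ijd->bhij", q, rel_k)    # x_i W^Q (a_{ij}^K)^T
    return (content + position) / np.sqrt(d)
```

The second einsum avoids materializing a per-batch, per-head copy of a^K, which is the point of the split.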
Experiments
WMT14 EnDe +1.3 BLEU
WMT14 EnFr +0.3 BLEU
Ablation Experiments
clipping distance k
for k >= 2, BLEU does not improve significantly with larger clipping distances
Position of Relative Representation
a^K alone (relative positions added to the keys) appears to suffice; the contribution of a^V needs further investigation
Personal Thoughts
I was always curious about the role and effectiveness of the sinusoidal positional encoding, so it is good to see an improvement in BLEU, but I wish there were a qualitative analysis of how the relative position representations are learned.
Link : https://arxiv.org/pdf/1803.02155.pdf
Authors : Shaw et al., 2018