kweonwooj opened this issue 6 years ago
## Abstract

## Details
In short, the Transformer is Self-Attention + Positional Encoding + Multi-Head Attention.
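For reference, the self-attention building block here is the scaled dot-product attention of the original Transformer; multi-head attention runs several of these in parallel over learned projections:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
```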
### Multi-head Attention + FFN
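A minimal sketch of this vanilla sublayer pair, one multi-head attention block followed by a position-wise FFN, assuming standard hyperparameters (`d_model=512`, 8 heads, `d_ff=2048`); class and variable names are illustrative, not the paper's code, and residual connections / layer norm are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)  # W^O over concatenated heads

    def forward(self, x):
        B, T, D = x.shape
        # Project, then split into heads: (B, n_heads, T, d_k)
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention per head
        scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
        heads = F.softmax(scores, dim=-1) @ v  # (B, n_heads, T, d_k)
        # Concatenate heads and apply the shared output projection
        concat = heads.transpose(1, 2).reshape(B, T, D)
        return self.out_proj(concat)

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

x = torch.randn(2, 10, 512)
y = PositionwiseFFN()(MultiHeadAttention()(x))  # -> (2, 10, 512)
```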
### Proposed Multi-branch Attention
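As I read it, the paper replaces the concat-then-project multi-head block with M parallel branches: each head output is scaled by a learned weight κ_i, passed through its own FFN, scaled by a second learned weight α_i, and the branches are summed, with Σκ_i = Σα_i = 1. A hedged sketch follows; the per-branch output projections and the softmax parameterization of (κ, α) are my assumptions about one way to satisfy the simplex constraint:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBranchAttention(nn.Module):
    def __init__(self, d_model=512, n_branches=8, d_ff=2048):
        super().__init__()
        self.d_k = d_model // n_branches
        self.n_branches = n_branches
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One output projection and one FFN per branch (no shared concat W^O)
        self.branch_proj = nn.ModuleList(
            [nn.Linear(self.d_k, d_model) for _ in range(n_branches)])
        self.branch_ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                           nn.Linear(d_ff, d_model)) for _ in range(n_branches)])
        # Free logits for the simplex-constrained branch weights (assumption)
        self.kappa_logits = nn.Parameter(torch.zeros(n_branches))
        self.alpha_logits = nn.Parameter(torch.zeros(n_branches))

    def forward(self, x):
        B, T, D = x.shape
        q = self.q_proj(x).view(B, T, self.n_branches, self.d_k).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_branches, self.d_k).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_branches, self.d_k).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
        heads = F.softmax(scores, dim=-1) @ v          # (B, M, T, d_k)
        kappa = F.softmax(self.kappa_logits, dim=0)    # sums to 1
        alpha = F.softmax(self.alpha_logits, dim=0)    # sums to 1
        # BranchedAttention = sum_i alpha_i * FFN_i(kappa_i * head_i)
        out = 0
        for i in range(self.n_branches):
            branch = self.branch_proj[i](heads[:, i])  # (B, T, d_model)
            out = out + alpha[i] * self.branch_ffn[i](kappa[i] * branch)
        return out
```

Summing α-weighted branches instead of concatenating heads is what lets the model learn to emphasize some branches over others, which is presumably what the (κ, α) trend and regularization sections below visualize.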
### Weighted Transformer Architecture (diagram)
### Training Details

### Results

### Parameter Search Results

### Regularization Effect shown via Visualization

### (κ, α) Trend during Training

### Gating
## Personal Thoughts
Link : https://openreview.net/forum?id=SkYMnLxRW&noteId=SkYMnLxRW
Authors : Anonymous