Abstract
propose RAML (reward augmented maximum likelihood), a simple approach to incorporating task reward into the maximum likelihood framework for sequence prediction tasks
show that the optimal regularized expected reward is achieved when the conditional distribution of the outputs given the inputs is proportional to their exponentiated scaled rewards
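In symbols (a transcription of the paper's exponentiated payoff distribution; r is the task reward, τ the temperature, Z the normalizing constant):

```latex
p^{*}(y \mid x) \;=\; \frac{\exp\left( r(y, y^{*}) / \tau \right)}{Z(y^{*}, \tau)},
\qquad
Z(y^{*}, \tau) \;=\; \sum_{y' \in \mathcal{Y}} \exp\left( r(y', y^{*}) / \tau \right)
```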
Details
Motivation
ML (maximum likelihood) training picks a single ground-truth token as correct and treats all other outputs as equally wrong, which may not capture the smooth space of acceptable answers in machine translation
it is fundamentally limited by not incorporating a task reward
RL (reinforcement learning) training uses a reward function (e.g. BLEU) as the sole training signal, which makes optimization challenging due to the large variance of the gradients and the sparsity of rewards
RAML generalizes the sharp ML objective by folding an exponentiated payoff distribution into the objective function; as the temperature τ → 0 this distribution collapses onto the ground truth and plain ML is recovered. The model now sees the correct answer, somewhat-wrong answers, and completely wrong answers, weighted by their rewards.
Exponentiated Payoff Distribution is a central bridge between ML and RL
it is a data-augmenting distribution over the output space with a softmax-like form; it assigns higher probability to augmentations with higher reward
In this paper, (negative) edit distance is used as the reward: a ground-truth label of length m is edited e times, with samples drawn via stratified sampling (sketched below).
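A minimal sketch of that two-stage (stratified) sampler, assuming substitution-only edits over a fixed vocabulary of size ≥ 2, so every sample keeps the original length m; the function name `sample_edited_target` and the default τ are illustrative choices, not the paper's exact implementation:

```python
import math
import random

def sample_edited_target(y_star, vocab, tau=0.9):
    """Draw one augmented target from q(y | y*) ∝ exp(-HammingDist(y, y*)/tau),
    restricted to substitution-only edits (samples keep length m).
    Hypothetical sketch; assumes len(vocab) >= 2."""
    m, v = len(y_star), len(vocab)
    # Stratum e (Hamming distance e) contains C(m, e) * (v - 1)^e sequences,
    # each weighted exp(-e / tau); work in log space for stability.
    log_w = [math.log(math.comb(m, e)) + e * math.log(v - 1) - e / tau
             for e in range(m + 1)]
    mx = max(log_w)
    weights = [math.exp(lw - mx) for lw in log_w]
    # Step 1: sample the number of edits e from the stratum weights.
    e = random.choices(range(m + 1), weights=weights, k=1)[0]
    # Step 2: substitute e tokens at distinct random positions.
    y = list(y_star)
    for i in random.sample(range(m), e):
        y[i] = random.choice([t for t in vocab if t != y_star[i]])
    return y

# Example: augment a toy 5-token target over a 4-token vocabulary.
print(sample_edited_target(list("abcab"), list("abcd"), tau=0.8))
```

With a small τ the stratum weights concentrate on e near 0, so most samples stay close to the ground truth.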
Mathematics
the paper supports its claims with solid mathematical proofs
Loss functions
ML optimizes the negative log-likelihood of the ground-truth label
RL optimizes the expected reward under the model (optionally entropy-regularized)
RAML optimizes the negative log-likelihood of samples drawn from the exponentiated payoff distribution (formulas below)
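Side by side (notation paraphrases the paper; q is the exponentiated payoff distribution from the Abstract):

```latex
\mathcal{L}_{\mathrm{ML}}(\theta)   = -\sum_{(x,\,y^{*})} \log p_{\theta}(y^{*} \mid x)

\mathcal{L}_{\mathrm{RL}}(\theta)   = -\sum_{(x,\,y^{*})} \sum_{y} p_{\theta}(y \mid x)\, r(y, y^{*})

\mathcal{L}_{\mathrm{RAML}}(\theta) = -\sum_{(x,\,y^{*})} \sum_{y} q(y \mid y^{*};\, \tau) \log p_{\theta}(y \mid x)
```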
KL Divergence
all three optimization objectives can be rewritten and analyzed in terms of KL divergence
ML optimizes the model distribution toward a sharp (delta) payoff distribution (the spirit of "the ground truth is the only correct answer")
RL has the model distribution on the left and the payoff distribution on the right of the KL
RAML has the model distribution on the right and the payoff distribution on the left, the opposite direction from RL
the RAML direction of the KL divergence has practical advantages: 1) one samples from a stationary distribution rather than from the evolving model distribution, and 2) sampling from the model is computationally heavy, has large gradient variance, and its reward signal is sparse (see the KL forms below)
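Up to additive constants, the two objectives are KL divergences in opposite directions (as derived in the paper, for the entropy-regularized RL objective), which makes the asymmetry explicit:

```latex
\mathcal{L}_{\mathrm{RAML}}(\theta) \;\cong\; \sum_{(x,\,y^{*})} D_{\mathrm{KL}}\big(\, q(y \mid y^{*};\, \tau) \,\big\|\, p_{\theta}(y \mid x) \,\big)

\mathcal{L}_{\mathrm{RL}}(\theta)  \;\cong\; \tau \sum_{(x,\,y^{*})} D_{\mathrm{KL}}\big(\, p_{\theta}(y \mid x) \,\big\|\, q(y \mid y^{*};\, \tau) \,\big)
```

RAML's expectation is over the fixed q, so augmented targets can be pre-sampled once per ground truth; RL's is over the moving p_θ.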
Result
Machine Translation
corpus : WMT14 EnFr
test set : newstest-2014
model : 3-layer 1024 LSTM
result : RAML outperforms the SoTA attention-RNN baseline
Notes
average/peak BLEU is calculated over the range of 1.1 ~ 1.3 million steps : reflects the volatility of the BLEU score
rare words are replaced with special UNK tokens based on their first and last characters : special treatment of UNKs
Personal Thoughts
the motivation is convincing mathematically, but the improvement is not significant; why is it not used in other SoTA papers?
the implementation is simply target-side data augmentation
treating the edit-augmented target data as merely "less wrong" is not convincing
Link : https://arxiv.org/pdf/1609.00150.pdf
Authors : Norouzi et al. 2017