Abstract
propose RAML (reward augmented maximum likelihood), a simple approach to incorporating task reward into the maximum likelihood framework for sequence prediction tasks
show that the optimal regularized expected reward is achieved when the conditional distribution of the outputs given the inputs is proportional to their exponentiated scaled rewards
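In symbols (a transcription of the paper's exponentiated payoff distribution; r is the task reward, τ the temperature, Z the normalizing constant):

```latex
p^{*}(y \mid x) \;=\; \frac{\exp\left( r(y, y^{*}) / \tau \right)}{Z(y^{*}, \tau)},
\qquad
Z(y^{*}, \tau) \;=\; \sum_{y' \in \mathcal{Y}} \exp\left( r(y', y^{*}) / \tau \right)
```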
Details
Motivation
ML (maximum likelihood) training picks a single ground-truth token as correct and treats all other outputs as equally wrong, which may not capture the smooth space of acceptable answers in machine translation
it is fundamentally limited by not incorporating a task reward
RL (reinforcement learning) training uses a reward function (e.g. BLEU) as the sole training signal, which makes optimization challenging due to the large variance of the gradients and the sparsity of rewards
RAML generalizes the sharp ML objective by folding an exponentiated payoff distribution into the objective function; as the temperature τ → 0 this distribution collapses onto the ground truth and plain ML is recovered. The model now sees the correct answer, somewhat-wrong answers, and completely wrong answers, weighted by their rewards.
Exponentiated Payoff Distribution is a central bridge between ML and RL
it is a data-augmenting distribution over the output space with a softmax-like form; it assigns higher probability to augmentations with higher reward
In this paper, (negative) edit distance is used as the reward: a ground-truth label of length m is edited e times, with samples drawn via stratified sampling (sketched below).
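A minimal sketch of that two-stage (stratified) sampler, assuming substitution-only edits over a fixed vocabulary of size ≥ 2, so every sample keeps the original length m; the function name `sample_edited_target` and the default τ are illustrative choices, not the paper's exact implementation:

```python
import math
import random

def sample_edited_target(y_star, vocab, tau=0.9):
    """Draw one augmented target from q(y | y*) ∝ exp(-HammingDist(y, y*)/tau),
    restricted to substitution-only edits (samples keep length m).
    Hypothetical sketch; assumes len(vocab) >= 2."""
    m, v = len(y_star), len(vocab)
    # Stratum e (Hamming distance e) contains C(m, e) * (v - 1)^e sequences,
    # each weighted exp(-e / tau); work in log space for stability.
    log_w = [math.log(math.comb(m, e)) + e * math.log(v - 1) - e / tau
             for e in range(m + 1)]
    mx = max(log_w)
    weights = [math.exp(lw - mx) for lw in log_w]
    # Step 1: sample the number of edits e from the stratum weights.
    e = random.choices(range(m + 1), weights=weights, k=1)[0]
    # Step 2: substitute e tokens at distinct random positions.
    y = list(y_star)
    for i in random.sample(range(m), e):
        y[i] = random.choice([t for t in vocab if t != y_star[i]])
    return y

# Example: augment a toy 5-token target over a 4-token vocabulary.
print(sample_edited_target(list("abcab"), list("abcd"), tau=0.8))
```

With a small τ the stratum weights concentrate on e near 0, so most samples stay close to the ground truth.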
Mathematics
the paper supports its claims with solid mathematical proofs
Loss functions
ML optimizes the negative log-likelihood of the ground-truth label
RL optimizes the expected reward under the model (optionally entropy-regularized)
RAML optimizes the negative log-likelihood of samples drawn from the exponentiated payoff distribution (formulas below)
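Side by side (notation paraphrases the paper; q is the exponentiated payoff distribution from the Abstract):

```latex
\mathcal{L}_{\mathrm{ML}}(\theta)   = -\sum_{(x,\,y^{*})} \log p_{\theta}(y^{*} \mid x)

\mathcal{L}_{\mathrm{RL}}(\theta)   = -\sum_{(x,\,y^{*})} \sum_{y} p_{\theta}(y \mid x)\, r(y, y^{*})

\mathcal{L}_{\mathrm{RAML}}(\theta) = -\sum_{(x,\,y^{*})} \sum_{y} q(y \mid y^{*};\, \tau) \log p_{\theta}(y \mid x)
```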
KL Divergence
all three optimization objectives can be rewritten and analyzed in terms of KL divergence
ML optimizes the model distribution toward a sharp (delta) payoff distribution (the spirit of "the ground truth is the only correct answer")
RL has the model distribution on the left and the payoff distribution on the right of the KL
RAML has the model distribution on the right and the payoff distribution on the left, the opposite direction from RL
the RAML direction of the KL divergence has practical advantages: 1) one samples from a stationary distribution rather than from the evolving model distribution, and 2) sampling from the model is computationally heavy, has large gradient variance, and its reward signal is sparse (see the KL forms below)
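Up to additive constants, the two objectives are KL divergences in opposite directions (as derived in the paper, for the entropy-regularized RL objective), which makes the asymmetry explicit:

```latex
\mathcal{L}_{\mathrm{RAML}}(\theta) \;\cong\; \sum_{(x,\,y^{*})} D_{\mathrm{KL}}\big(\, q(y \mid y^{*};\, \tau) \,\big\|\, p_{\theta}(y \mid x) \,\big)

\mathcal{L}_{\mathrm{RL}}(\theta)  \;\cong\; \tau \sum_{(x,\,y^{*})} D_{\mathrm{KL}}\big(\, p_{\theta}(y \mid x) \,\big\|\, q(y \mid y^{*};\, \tau) \,\big)
```

RAML's expectation is over the fixed q, so augmented targets can be pre-sampled once per ground truth; RL's is over the moving p_θ.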
Result
Machine Translation
corpus : WMT14 EnFr
test set : newstest-2014
model : 3-layer 1024 LSTM
result : RAML outperforms the SoTA attention-RNN baseline
Notes
average/peak BLEU is calculated over the range of 1.1 ~ 1.3 million steps : reflects the volatility of the BLEU score
rare words are replaced with special UNK tokens based on their first and last characters : special treatment of UNKs
Personal Thoughts
the motivation is convincing mathematically, but the improvement is not significant; why is it not used in other SoTA papers?
the implementation is simply target-side data augmentation
treating the edit-augmented target data as merely "less wrong" is not convincing
Link : https://arxiv.org/pdf/1609.00150.pdf
Authors : Norouzi et al. 2017