Abstract
Presents a method for training an NN to generate sequences using the actor-critic method from RL
Introduces a critic network trained to predict the value of an output token, given the policy of an actor network
This training procedure is much closer to the test phase and allows direct optimization of a task-specific score such as BLEU
Unlike the traditional RL setting, the technique also uses supervision: the ground-truth output is available during training
Shows improved performance on De-En machine translation
Details
Standard way : Maximizing Log-likelihood
a.k.a. teacher forcing: during training, each step is conditioned on the ground-truth prefix
during inference, the model is instead conditioned on its own previous guesses, which can lead to compounding errors, especially for long sentences
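For reference, this is the usual per-token log-likelihood objective, with each step conditioned on the ground-truth prefix (notation mine):

$$\mathcal{L}_{\mathrm{ML}}(\theta) = -\sum_{t=1}^{T} \log p\left(y_t \mid y_1,\ldots,y_{t-1}, X;\, \theta\right)$$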
Our approach
Actor-Critic
Critic network outputs the value of each token
under the assumption that the critic computes exact values, the expression used to train the actor is an unbiased estimate of the gradient of the expected task-specific score
the ground-truth answer is given to the critic as an input, so this combines RL with supervised learning
Actor-Critic for Sequence Prediction
Actor
an RNN (the usual sequence-to-sequence model) that actually does the work of generating the output sequence
Given a value function Q that returns the exact value of a candidate token given the current prefix, the actor's gradient can be written as an expectation over the actor's own samples
in practice this expectation is approximated by sampling sequences from the actor and plugging in the critic's estimate of Q (see the reconstruction below)
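As best I can reconstruct them (the original equations did not survive copy-paste), the exact gradient and its sampled approximation are roughly:

$$\frac{dV}{d\theta} = \mathbb{E}_{\hat{Y}\sim p(\hat{Y}\mid X)} \sum_{t=1}^{T}\sum_{a\in\mathcal{A}} \frac{dp(a\mid \hat{Y}_{1\ldots t-1}, X)}{d\theta}\, Q(a;\hat{Y}_{1\ldots t-1})$$

$$\frac{dV}{d\theta} \approx \sum_{t=1}^{T}\sum_{a\in\mathcal{A}} \frac{dp(a\mid \hat{Y}_{1\ldots t-1}, X)}{d\theta}\, \hat{Q}(a;\hat{Y}_{1\ldots t-1}), \qquad \hat{Y}\sim p(\cdot\mid X)$$

with the critic's estimate $\hat{Q}$ substituted for the exact $Q$ in the second line.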
Critic
a separate RNN that consumes the tokens the actor outputs and produces estimates of the value function Q
the correct answer (ground-truth sequence) is given to the critic as an additional input so its estimates can be grounded in the reference, whereas the actor never sees it
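A minimal sketch of what such a critic could look like (PyTorch-style; the module names and the single-vector encoding of the reference are my simplifications, the paper's critic attends over the ground truth):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Reads the reference with one RNN, the actor's sampled prefix with another,
    and predicts a value Q-hat for every candidate next token at every step."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.truth_encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.prefix_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.q_head = nn.Linear(2 * hid_dim, vocab_size)  # one Q value per token

    def forward(self, ground_truth, sampled_prefix):
        # encode the reference translation (the critic, unlike the actor, sees it)
        _, truth_state = self.truth_encoder(self.embed(ground_truth))
        # run over the tokens the actor has emitted so far
        prefix_out, _ = self.prefix_rnn(self.embed(sampled_prefix))
        # Q-hat(a; Y-hat_{1..t-1}) for every action a at every step t
        ctx = truth_state.transpose(0, 1).expand(-1, prefix_out.size(1), -1)
        return self.q_head(torch.cat([prefix_out, ctx], dim=-1))
        # usage: Critic(vocab)(reference_tokens, actor_samples) -> (batch, steps, vocab)
```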
Policy Evaluation
a crucial component of this method is training the critic so that it produces useful estimates of Q
temporal-difference (TD) learning is used for policy evaluation
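If I read the paper right, the one-step TD target and critic loss are roughly of the form (the target-network variant comes in the next subsection):

$$q_t = r_t + \hat{Q}\left(\hat{y}_{t+1};\, \hat{Y}_{1\ldots t}\right), \qquad \mathcal{L}_{\mathrm{critic}} = \sum_{t=1}^{T}\left(\hat{Q}(\hat{y}_t;\hat{Y}_{1\ldots t-1}) - q_t\right)^2$$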
Additional Techniques
Applying Deep RL Techniques
if Q is non-linear, TD policy evaluation might diverge; the problem can be alleviated with an additional target network Q' used to compute the TD target q_t, updated less often and more slowly than Q
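A sketch of the slow target update in the spirit of deep-RL target tracking (the update rate and function name are my placeholders):

```python
import torch.nn as nn

def update_target(critic: nn.Module, target_critic: nn.Module, rate: float = 0.001):
    """Slowly track the online critic: theta' <- rate * theta + (1 - rate) * theta'."""
    for p, p_t in zip(critic.parameters(), target_critic.parameters()):
        p_t.data.mul_(1.0 - rate).add_(p.data, alpha=rate)

# toy usage: the target starts as a copy and then lags behind the online critic
critic, target = nn.Linear(4, 4), nn.Linear(4, 4)
target.load_state_dict(critic.state_dict())
update_target(critic, target)
```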
Dealing with large action spaces
a very large action space (the vocabulary) hinders convergence of the RL model, so constraints are placed on the critic's values for actions that are rarely sampled
specifically, a term C_t is added for every step t to the critic's optimization objective, driving all of the critic's value predictions toward their mean; this effectively penalizes the variance of the critic's outputs
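As I understand it, the penalty measures the spread of the critic's outputs at step t, roughly:

$$C_t = \sum_{a}\left(\hat{Q}(a;\hat{Y}_{1\ldots t-1}) - \frac{1}{|\mathcal{A}|}\sum_{b}\hat{Q}(b;\hat{Y}_{1\ldots t-1})\right)^2$$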
Reward Shaping
instead of a single sparse reward at the last step of sequence prediction, potential-based reward shaping is used so that intermediate rewards are available even for incomplete sentences
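Concretely, the per-step reward is the increase in the task score when the t-th token is appended, so the shaped rewards telescope back to the final score:

$$r_t(\hat{y}_t;\hat{Y}_{1\ldots t-1}, Y) = R(\hat{Y}_{1\ldots t}, Y) - R(\hat{Y}_{1\ldots t-1}, Y)$$

where $R$ is the score (e.g. BLEU) of a partial hypothesis against the reference $Y$.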
Putting it all together
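A toy, self-contained illustration of one joint update step with random tensors (the shapes, the 0.001 penalty weight, and all names are mine; the real system uses encoder-decoder RNNs with attention and pre-trains both networks before joint training):

```python
import torch

B, T, V = 2, 6, 20                                   # batch, sampled length, vocab size
logits = torch.randn(B, T, V, requires_grad=True)    # actor outputs on its own samples
samples = torch.randint(V, (B, T))                   # tokens the actor actually emitted
q_hat = torch.randn(B, T, V, requires_grad=True)     # critic estimates Q-hat(a; prefix)
q_target = torch.randn(B, T)                         # TD targets r_t + Q'(y_hat_{t+1}; prefix)

# critic update: squared TD error on the emitted tokens, plus the variance penalty
q_taken = q_hat.gather(-1, samples.unsqueeze(-1)).squeeze(-1)
variance_penalty = ((q_hat - q_hat.mean(dim=-1, keepdim=True)) ** 2).sum(dim=-1).mean()
critic_loss = ((q_taken - q_target) ** 2).mean() + 0.001 * variance_penalty
critic_loss.backward()

# actor update: maximize sum_t sum_a p(a | prefix) * Q-hat(a; prefix)
probs = torch.softmax(logits, dim=-1)
actor_loss = -(probs * q_hat.detach()).sum(dim=(-1, -2)).mean()
actor_loss.backward()
```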
Related Works
REINFORCE is a popular RL method, but its gradient estimate has high variance and it does not exploit the ground-truth label the way the critic network does
the actor-critic estimate trades this off: higher bias but lower variance
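For comparison, the REINFORCE estimator (with a baseline $b_t$) only uses the return of the single sampled token, which is where the variance comes from; roughly:

$$\frac{dV}{d\theta} \approx \sum_{t=1}^{T} \frac{d\log p(\hat{y}_t\mid \hat{Y}_{1\ldots t-1},X)}{d\theta}\left(\sum_{\tau=t}^{T} r_\tau - b_t\right)$$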
Experiments
IWSLT14 and WMT14 MT tasks
Contributions
describes how the actor-critic (RL) method can be applied to supervised learning problems with structured outputs
investigates the performance and behavior of the new method on both a synthetic task and a real-world machine-translation task, comparing maximum-likelihood, REINFORCE, and actor-critic training
Personal Thoughts
difficult paper, but RL is an important domain that I need to be prepared for.
In our environment we only have sequential observations and must generate actions in sequence.
How do we train the same model when we don't have output labels / true actions?
Link: https://arxiv.org/pdf/1607.07086.pdf
Authors: Bahdanau et al., 2017