Abstract
Presents a method for training an NN to generate sequences using the actor-critic method from RL
Introduces a critic network trained to predict the value of an output token, given the policy of an actor network
This training procedure is much closer to the test phase and allows direct optimization of a task-specific score such as BLEU
Unlike the traditional RL setting, the technique also uses supervision: the ground-truth output is available during training
Shows improved performance on De-En machine translation
Details
Standard way : Maximizing Log-likelihood
a.k.a. teacher forcing: during training, each step is conditioned on the ground-truth prefix
during inference, the model is instead conditioned on its own previous guesses, which can lead to compounding errors, especially for long sentences
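For reference, this is the usual per-token log-likelihood objective, with each step conditioned on the ground-truth prefix (notation mine):

$$\mathcal{L}_{\mathrm{ML}}(\theta) = -\sum_{t=1}^{T} \log p\left(y_t \mid y_1,\ldots,y_{t-1}, X;\, \theta\right)$$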
Our approach
Actor-Critic
Critic network outputs the value of each token
under the assumption that the critic computes exact values, the expression used to train the actor is an unbiased estimate of the gradient of the expected task-specific score
the ground-truth answer is given to the critic as an input, so this combines RL with supervised learning
Actor-Critic for Sequence Prediction
Actor
an RNN (the usual sequence-to-sequence model) that actually does the work of generating the output sequence
Given a value function Q that returns the exact value of a candidate token given the current prefix, the actor's gradient can be written as an expectation over the actor's own samples
in practice this expectation is approximated by sampling sequences from the actor and plugging in the critic's estimate of Q (see the reconstruction below)
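As best I can reconstruct them (the original equations did not survive copy-paste), the exact gradient and its sampled approximation are roughly:

$$\frac{dV}{d\theta} = \mathbb{E}_{\hat{Y}\sim p(\hat{Y}\mid X)} \sum_{t=1}^{T}\sum_{a\in\mathcal{A}} \frac{dp(a\mid \hat{Y}_{1\ldots t-1}, X)}{d\theta}\, Q(a;\hat{Y}_{1\ldots t-1})$$

$$\frac{dV}{d\theta} \approx \sum_{t=1}^{T}\sum_{a\in\mathcal{A}} \frac{dp(a\mid \hat{Y}_{1\ldots t-1}, X)}{d\theta}\, \hat{Q}(a;\hat{Y}_{1\ldots t-1}), \qquad \hat{Y}\sim p(\cdot\mid X)$$

with the critic's estimate $\hat{Q}$ substituted for the exact $Q$ in the second line.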
Critic
a separate RNN that consumes the tokens the actor outputs and produces estimates of the value function Q
the correct answer (ground-truth sequence) is given to the critic as an additional input so its estimates can be grounded in the reference, whereas the actor never sees it
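A minimal sketch of what such a critic could look like (PyTorch-style; the module names and the single-vector encoding of the reference are my simplifications, the paper's critic attends over the ground truth):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Reads the reference with one RNN, the actor's sampled prefix with another,
    and predicts a value Q-hat for every candidate next token at every step."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.truth_encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.prefix_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.q_head = nn.Linear(2 * hid_dim, vocab_size)  # one Q value per token

    def forward(self, ground_truth, sampled_prefix):
        # encode the reference translation (the critic, unlike the actor, sees it)
        _, truth_state = self.truth_encoder(self.embed(ground_truth))
        # run over the tokens the actor has emitted so far
        prefix_out, _ = self.prefix_rnn(self.embed(sampled_prefix))
        # Q-hat(a; Y-hat_{1..t-1}) for every action a at every step t
        ctx = truth_state.transpose(0, 1).expand(-1, prefix_out.size(1), -1)
        return self.q_head(torch.cat([prefix_out, ctx], dim=-1))
        # usage: Critic(vocab)(reference_tokens, actor_samples) -> (batch, steps, vocab)
```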
Policy Evaluation
a crucial component of this method is training the critic so that it produces useful estimates of Q
temporal-difference (TD) learning is used for policy evaluation
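If I read the paper right, the one-step TD target and critic loss are roughly of the form (the target-network variant comes in the next subsection):

$$q_t = r_t + \hat{Q}\left(\hat{y}_{t+1};\, \hat{Y}_{1\ldots t}\right), \qquad \mathcal{L}_{\mathrm{critic}} = \sum_{t=1}^{T}\left(\hat{Q}(\hat{y}_t;\hat{Y}_{1\ldots t-1}) - q_t\right)^2$$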
Additional Techniques
Applying Deep RL Techniques
if Q is non-linear, TD policy evaluation might diverge; the problem can be alleviated with an additional target network Q' used to compute the TD target q_t, updated less often and more slowly than Q
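A sketch of the slow target update in the spirit of deep-RL target tracking (the update rate and function name are my placeholders):

```python
import torch.nn as nn

def update_target(critic: nn.Module, target_critic: nn.Module, rate: float = 0.001):
    """Slowly track the online critic: theta' <- rate * theta + (1 - rate) * theta'."""
    for p, p_t in zip(critic.parameters(), target_critic.parameters()):
        p_t.data.mul_(1.0 - rate).add_(p.data, alpha=rate)

# toy usage: the target starts as a copy and then lags behind the online critic
critic, target = nn.Linear(4, 4), nn.Linear(4, 4)
target.load_state_dict(critic.state_dict())
update_target(critic, target)
```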
Dealing with large action spaces
a very large action space (the vocabulary) hinders convergence of the RL model, so constraints are placed on the critic's values for actions that are rarely sampled
specifically, a term C_t is added for every step t to the critic's optimization objective, driving all of the critic's value predictions toward their mean; this effectively penalizes the variance of the critic's outputs
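As I understand it, the penalty measures the spread of the critic's outputs at step t, roughly:

$$C_t = \sum_{a}\left(\hat{Q}(a;\hat{Y}_{1\ldots t-1}) - \frac{1}{|\mathcal{A}|}\sum_{b}\hat{Q}(b;\hat{Y}_{1\ldots t-1})\right)^2$$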
Reward Shaping
instead of a single sparse reward at the last step of sequence prediction, potential-based reward shaping is used so that intermediate rewards are available even for incomplete sentences
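Concretely, the per-step reward is the increase in the task score when the t-th token is appended, so the shaped rewards telescope back to the final score:

$$r_t(\hat{y}_t;\hat{Y}_{1\ldots t-1}, Y) = R(\hat{Y}_{1\ldots t}, Y) - R(\hat{Y}_{1\ldots t-1}, Y)$$

where $R$ is the score (e.g. BLEU) of a partial hypothesis against the reference $Y$.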
Putting it all together
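A toy, self-contained illustration of one joint update step with random tensors (the shapes, the 0.001 penalty weight, and all names are mine; the real system uses encoder-decoder RNNs with attention and pre-trains both networks before joint training):

```python
import torch

B, T, V = 2, 6, 20                                   # batch, sampled length, vocab size
logits = torch.randn(B, T, V, requires_grad=True)    # actor outputs on its own samples
samples = torch.randint(V, (B, T))                   # tokens the actor actually emitted
q_hat = torch.randn(B, T, V, requires_grad=True)     # critic estimates Q-hat(a; prefix)
q_target = torch.randn(B, T)                         # TD targets r_t + Q'(y_hat_{t+1}; prefix)

# critic update: squared TD error on the emitted tokens, plus the variance penalty
q_taken = q_hat.gather(-1, samples.unsqueeze(-1)).squeeze(-1)
variance_penalty = ((q_hat - q_hat.mean(dim=-1, keepdim=True)) ** 2).sum(dim=-1).mean()
critic_loss = ((q_taken - q_target) ** 2).mean() + 0.001 * variance_penalty
critic_loss.backward()

# actor update: maximize sum_t sum_a p(a | prefix) * Q-hat(a; prefix)
probs = torch.softmax(logits, dim=-1)
actor_loss = -(probs * q_hat.detach()).sum(dim=(-1, -2)).mean()
actor_loss.backward()
```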
Related Works
REINFORCE is a popular RL method, but its gradient estimate has high variance and it does not exploit the ground-truth label the way the critic network does
the actor-critic estimate trades this off: higher bias but lower variance
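For comparison, the REINFORCE estimator (with a baseline $b_t$) only uses the return of the single sampled token, which is where the variance comes from; roughly:

$$\frac{dV}{d\theta} \approx \sum_{t=1}^{T} \frac{d\log p(\hat{y}_t\mid \hat{Y}_{1\ldots t-1},X)}{d\theta}\left(\sum_{\tau=t}^{T} r_\tau - b_t\right)$$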
Experiments
IWSLT14 and WMT14 MT tasks
Contributions
describes how the actor-critic (RL) method can be applied to supervised learning problems with structured outputs
investigates the performance and behavior of the new method on both a synthetic task and a real-world machine-translation task, comparing maximum-likelihood, REINFORCE, and actor-critic training
Personal Thoughts
difficult paper, but RL is an important domain that I need to be prepared for.
In our environment we only have sequential observations and must generate actions in sequence.
How do we train the same model when we don't have output labels / true actions?
Link: https://arxiv.org/pdf/1607.07086.pdf
Authors: Bahdanau et al., 2017