SEQ2SEQ models are trained by predicting the next dialogue turn in a given conversational context using the maximum-likelihood estimation (MLE) objective function. However, it is not clear how well MLE approximates the real-world goal of chatbot development.
We view the generated sentences as actions taken according to a policy defined by an encoder-decoder recurrent neural network language model.
Policy gradient methods are more appropriate for our scenario than Q-learning, because we can initialize the encoder-decoder RNN with MLE parameters that already produce plausible responses before changing the objective and tuning towards a policy that maximizes long-term reward. Q-learning, by contrast, directly estimates the future expected reward of each action, which can differ from the MLE objective by orders of magnitude, making MLE parameters inappropriate for initialization.
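For reference, the objective being maximized and a standard likelihood-ratio (REINFORCE) gradient estimate take the following form; the baseline b used for variance reduction is an assumption of this sketch rather than a quantity defined above:

J(\theta) = \mathbb{E}_{a \sim p_\theta(\cdot \mid \text{state})}\big[R(a)\big], \qquad \nabla_\theta J(\theta) \approx \big(R(a) - b\big)\,\nabla_\theta \log p_\theta(a \mid \text{state})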
A turn generated by a machine should be easy to respond to. We propose to measure the ease of answering a generated turn by the negative log likelihood of responding to that utterance with a dull response. We manually constructed a list S of dull responses consisting of 8 turns such as “I don’t know what you are talking about” and “I have no idea”. The reward function is given as follows:
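Based on this description, the reward can be reconstructed as follows, where N_S denotes the cardinality of S, N_s the number of tokens in dull response s, and p_seq2seq the likelihood under a pretrained SEQ2SEQ model (the normalization choices are assumptions of this sketch):

r_1 = -\frac{1}{N_S} \sum_{s \in S} \frac{1}{N_s} \log p_{\text{seq2seq}}(s \mid a)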
We want each agent to contribute new information at each turn to keep the dialogue moving and avoid repetitive sequences. We therefore propose penalizing semantic similarity between consecutive turns from the same agent. Let h_{p_i} and h_{p_{i+1}} denote the representations obtained from the encoder for two consecutive turns p_i and p_{i+1}. The reward is given by the negative log of the cosine similarity between them:
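Concretely, writing cos(·, ·) for cosine similarity, this reward can be expressed as:

r_2 = -\log \cos\!\big(h_{p_i}, h_{p_{i+1}}\big) = -\log \frac{h_{p_i} \cdot h_{p_{i+1}}}{\lVert h_{p_i} \rVert \, \lVert h_{p_{i+1}} \rVert}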
We also need to measure the adequacy of responses to avoid situations in which the generated replies are highly rewarded but are ungrammatical or not coherent. We therefore consider the mutual information between the action a and previous turns in the history to ensure the generated responses are coherent and appropriate.
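One way to write this mutual-information reward, under the assumption that the history consists of the two preceding turns p_i and q_i, with N_a and N_{q_i} denoting token counts and p^{backward}_{seq2seq} a backward model trained to predict the previous turn from the response (all of these symbols are assumptions of this reconstruction):

r_3 = \frac{1}{N_a} \log p_{\text{seq2seq}}(a \mid q_i, p_i) + \frac{1}{N_{q_i}} \log p^{\text{backward}}_{\text{seq2seq}}(q_i \mid a)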
A curriculum learning strategy is adopted: for every sequence of length T, we use the MLE loss for the first L tokens and the reinforcement algorithm for the remaining T − L tokens, gradually annealing the value of L to zero.
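As a minimal Python sketch (not the authors' implementation), assuming per-token log-probabilities for the reference and sampled tokens and a sequence-level reward are already computed, the mixed objective and the annealing of L might look as follows; all function and argument names are illustrative:

from typing import Sequence

def mixed_sequence_loss(
    ref_log_probs: Sequence[float],      # log-probabilities of the reference tokens (MLE targets)
    sampled_log_probs: Sequence[float],  # log-probabilities of the sampled tokens (policy actions)
    reward: float,                       # sequence-level reward R(a) from the reward function above
    baseline: float,                     # variance-reduction baseline (an assumption of this sketch)
    L: int,                              # number of leading tokens trained with the MLE loss
) -> float:
    """Combine MLE loss on the first L tokens with a REINFORCE term on the remaining T - L tokens."""
    mle_loss = -sum(ref_log_probs[:L])                            # cross-entropy on tokens 1..L
    rl_loss = -(reward - baseline) * sum(sampled_log_probs[L:])   # policy-gradient surrogate on tokens L+1..T
    return mle_loss + rl_loss

def annealed_L(initial_L: int, step: int, anneal_every: int = 1000) -> int:
    """Gradually reduce L toward zero as training proceeds (the schedule itself is assumed)."""
    return max(0, initial_L - step // anneal_every)

For example, mixed_sequence_loss(ref_log_probs=[-2.1, -1.3, -0.8], sampled_log_probs=[-1.9, -2.5, -0.7], reward=0.4, baseline=0.1, L=2) applies the MLE term to the first two tokens and the policy-gradient term to the last one.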
Since the goal of the proposed system is not to predict the highest-probability response, but rather to encourage the long-term success of the dialogue, we do not employ BLEU or perplexity for evaluation.