representation degeneration == in NLG (Natural Language Generation) models trained with MLE (Maximum Likelihood Estimation), weight tying (sharing the word embedding and pre-softmax layer; sketched below) and a large training dataset, most of the learnt word embeddings concentrate into a narrow cone, which limits the representation power of the word embeddings
propose a novel regularization method to address this problem
WMT14 EnDe +1.08 BLEU with Transformer Base, +0.54 BLEU with Transformer Big
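A minimal sketch of the weight-tying setup referenced above (module name and sizes are illustrative assumptions, not from the paper):

```python
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Weight tying: the pre-softmax projection reuses the word embedding
    matrix, so one parameter matrix serves both as input embeddings and as
    output (softmax) weights. Sizes below are illustrative defaults."""
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embed.weight   # tie embedding and pre-softmax weights

    def forward(self, hidden):                 # hidden: (..., d_model)
        return self.proj(hidden)               # logits over the vocabulary
```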
Details
Representation Degeneration
word2vec embeddings (b) and the softmax parameters learnt from a classification task on MNIST (c) are diversely distributed around the origin in the paper's 2D SVD projections
whereas the Transformer word embeddings are concentrated in a narrow cone
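A rough sketch of how such a 2D SVD projection can be reproduced for any embedding matrix (function name and plotting details are my own; the paper's exact procedure may differ):

```python
import torch
import matplotlib.pyplot as plt

def plot_svd_projection(embedding_weight, title="top-2 SVD directions"):
    # embedding_weight: (V, d) tensor, e.g. a Transformer's tied word embedding matrix.
    w = embedding_weight.detach().float()
    # Project onto the top-2 right-singular directions; no centering is applied,
    # so the scatter shows how the embeddings sit relative to the origin
    # (a narrow cone shows up as points fanning out to one side).
    _, _, vh = torch.linalg.svd(w, full_matrices=False)
    proj = w @ vh[:2].T                        # (V, 2)
    plt.scatter(proj[:, 0], proj[:, 1], s=2)
    plt.title(title)
    plt.show()
```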
Understanding the Problem
since the word embedding is tied to the softmax layer:
word representations should be widely distributed to express different semantic meanings
a softmax layer with a more diverse distribution is expected to obtain a larger-margin result
however, in practice the learnt word embeddings cluster into a narrow cone, so the model faces limited expressiveness
Cause : Intuitively speaking, during the training process of a model with likelihood loss, for any given hidden state, the embedding of the corresponding ground-truth word will be pushed towards the direction of the hidden state in order to get a larger likelihood, while the embeddings of all other words will be pushed towards the negative direction of the hidden state to get a smaller likelihood. Since in natural language most words occur with very low frequency, the embedding of such a word will be pushed towards the negative directions of most hidden states, which vary drastically. As a result, the embeddings of most words in the vocabulary will be pushed towards similar directions negatively correlated with most hidden states and thus are clustered together in a local region of the embedding space.
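To make the intuition concrete, a short worked gradient (my notation, standard softmax algebra): for hidden state $h$, tied word embeddings $w_1,\dots,w_N$ and ground-truth word $y$,

$$\mathcal{L} = -\log\frac{\exp(h^\top w_y)}{\sum_{j=1}^{N}\exp(h^\top w_j)}, \qquad \frac{\partial\mathcal{L}}{\partial w_y} = (p_y - 1)\,h, \qquad \frac{\partial\mathcal{L}}{\partial w_i} = p_i\,h \;\; (i \neq y),$$

where $p_i$ is the softmax probability of word $i$. A gradient-descent step therefore moves $w_y$ towards $+h$ and every other embedding towards $-h$; a rare word plays the "other word" role for nearly all hidden states, which is exactly the clustering mechanism described above.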
MLE with Cosine Regularization
propose MLE-CosReg, which adds the pairwise cosine similarity between word embeddings as an extra loss term, pushing the embeddings apart from each other (away from clustering into a narrow cone)
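A minimal sketch of such a regularized loss, assuming the regularizer is the mean pairwise cosine similarity between distinct word embeddings and `lam` is the regularization weight (names and exact normalization are my assumptions, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def mle_cosreg_loss(logits, targets, embedding_weight, lam=1.0):
    """Cross-entropy (MLE) loss plus a cosine regularizer over the tied
    word embedding matrix. A sketch, not the paper's exact formulation:
    the regularizer is the mean pairwise cosine similarity between
    distinct word embeddings, and `lam` is an assumed hyperparameter name."""
    # Standard MLE term.
    nll = F.cross_entropy(logits, targets)

    # Mean pairwise cosine similarity, computed without forming the V x V matrix:
    # sum_{i,j} cos(w_i, w_j) = ||sum_i w_hat_i||^2 for unit-norm rows w_hat_i,
    # and the V diagonal terms (each equal to 1) are subtracted out.
    w = F.normalize(embedding_weight, dim=-1)    # (V, d)
    v = w.size(0)
    total = w.sum(dim=0).pow(2).sum()
    reg = (total - v) / (v * (v - 1))

    return nll + lam * reg
```

With weight tying, `embedding_weight` is the shared matrix (e.g. `TiedLMHead.embed.weight` from the earlier sketch), so penalizing it also spreads out the pre-softmax layer.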
Experimental Result
WMT14 EnDe & DeEn tasks with Transformer base lead to an increase in BLEU score
word embeddings are now distributed more evenly around the origin
Personal Thoughts
interesting phenomenon, and a simple solution
paper is well-written
not sure whether the impact will be visible in NMT outputs
Link : https://openreview.net/pdf?id=SkEYojRqtm
Authors : Gao et al. 2018