representation degeneration == in NLG (Natural Language Generation) models trained with MLE (Maximum Likelihood Estimation), weight tying (sharing the word embedding and pre-softmax layer; sketched below) and a large training dataset, most of the learnt word embeddings concentrate into a narrow cone, which limits the representation power of the word embeddings
propose a novel regularization method to address this problem
WMT14 EnDe +1.08 BLEU with Transformer Base, +0.54 BLEU with Transformer Big
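A minimal sketch of the weight-tying setup referenced above (module name and sizes are illustrative assumptions, not from the paper):

```python
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Weight tying: the pre-softmax projection reuses the word embedding
    matrix, so one parameter matrix serves both as input embeddings and as
    output (softmax) weights. Sizes below are illustrative defaults."""
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embed.weight   # tie embedding and pre-softmax weights

    def forward(self, hidden):                 # hidden: (..., d_model)
        return self.proj(hidden)               # logits over the vocabulary
```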
Details
Representation Degeneration
word2vec embeddings (b) and the softmax parameters learnt from a classification task on MNIST (c) are diversely distributed around the origin in the paper's 2D SVD projections
whereas the Transformer word embeddings are concentrated in a narrow cone
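A rough sketch of how such a 2D SVD projection can be reproduced for any embedding matrix (function name and plotting details are my own; the paper's exact procedure may differ):

```python
import torch
import matplotlib.pyplot as plt

def plot_svd_projection(embedding_weight, title="top-2 SVD directions"):
    # embedding_weight: (V, d) tensor, e.g. a Transformer's tied word embedding matrix.
    w = embedding_weight.detach().float()
    # Project onto the top-2 right-singular directions; no centering is applied,
    # so the scatter shows how the embeddings sit relative to the origin
    # (a narrow cone shows up as points fanning out to one side).
    _, _, vh = torch.linalg.svd(w, full_matrices=False)
    proj = w @ vh[:2].T                        # (V, 2)
    plt.scatter(proj[:, 0], proj[:, 1], s=2)
    plt.title(title)
    plt.show()
```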
Understanding the Problem
since the word embedding is tied to the softmax layer:
word representations should be widely distributed to express different semantic meanings
a softmax layer with a more diverse distribution is expected to obtain a larger-margin result
however, in practice the learnt word embeddings cluster into a narrow cone, so the model faces limited expressiveness
Cause : Intuitively speaking, during the training process of a model with likelihood loss, for any given hidden state, the embedding of the corresponding ground-truth word will be pushed towards the direction of the hidden state in order to get a larger likelihood, while the embeddings of all other words will be pushed towards the negative direction of the hidden state to get a smaller likelihood. Since in natural language most words occur with very low frequency, the embedding of such a word will be pushed towards the negative directions of most hidden states, which vary drastically. As a result, the embeddings of most words in the vocabulary will be pushed towards similar directions negatively correlated with most hidden states and thus are clustered together in a local region of the embedding space.
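To make the intuition concrete, a short worked gradient (my notation, standard softmax algebra): for hidden state $h$, tied word embeddings $w_1,\dots,w_N$ and ground-truth word $y$,

$$\mathcal{L} = -\log\frac{\exp(h^\top w_y)}{\sum_{j=1}^{N}\exp(h^\top w_j)}, \qquad \frac{\partial\mathcal{L}}{\partial w_y} = (p_y - 1)\,h, \qquad \frac{\partial\mathcal{L}}{\partial w_i} = p_i\,h \;\; (i \neq y),$$

where $p_i$ is the softmax probability of word $i$. A gradient-descent step therefore moves $w_y$ towards $+h$ and every other embedding towards $-h$; a rare word plays the "other word" role for nearly all hidden states, which is exactly the clustering mechanism described above.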
MLE with Cosine Regularization
propose MLE-CosReg, which adds the pairwise cosine similarity between word embeddings as an extra loss term, pushing the embeddings apart from each other (away from clustering into a narrow cone)
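A minimal sketch of such a regularized loss, assuming the regularizer is the mean pairwise cosine similarity between distinct word embeddings and `lam` is the regularization weight (names and exact normalization are my assumptions, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def mle_cosreg_loss(logits, targets, embedding_weight, lam=1.0):
    """Cross-entropy (MLE) loss plus a cosine regularizer over the tied
    word embedding matrix. A sketch, not the paper's exact formulation:
    the regularizer is the mean pairwise cosine similarity between
    distinct word embeddings, and `lam` is an assumed hyperparameter name."""
    # Standard MLE term.
    nll = F.cross_entropy(logits, targets)

    # Mean pairwise cosine similarity, computed without forming the V x V matrix:
    # sum_{i,j} cos(w_i, w_j) = ||sum_i w_hat_i||^2 for unit-norm rows w_hat_i,
    # and the V diagonal terms (each equal to 1) are subtracted out.
    w = F.normalize(embedding_weight, dim=-1)    # (V, d)
    v = w.size(0)
    total = w.sum(dim=0).pow(2).sum()
    reg = (total - v) / (v * (v - 1))

    return nll + lam * reg
```

With weight tying, `embedding_weight` is the shared matrix (e.g. `TiedLMHead.embed.weight` from the earlier sketch), so penalizing it also spreads out the pre-softmax layer.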
Experimental Result
WMT14 EnDe & DeEn tasks with Transformer base lead to an increase in BLEU score
word embeddings are now distributed more evenly around the origin
Personal Thoughts
interesting phenomenon, and a simple solution
paper is well-written
not sure whether the impact will be visible in NMT outputs
Link : https://openreview.net/pdf?id=SkEYojRqtm
Authors : Gao et al. 2018