
Strategies for Training Large Vocabulary Neural Language Models #19


Metadata


Summary

This paper presents a systematic comparison of strategies for representing and training large vocabularies, including classical softmax, hierarchical softmax, target sampling, noise contrastive estimation, and self-normalization (infrequent normalization). They further extend self-normalization to be a proper estimator of the likelihood, and introduce an efficient variant of softmax, differentiated softmax. They evaluate on three language modeling benchmark datasets: Penn Treebank, Gigaword, and One Billion Word. They examine performance on rare words, the speed/accuracy trade-off, and complementarity with Kneser-Ney. Many insightful plots are presented in this paper (must read). Well-written paper!

Problems of softmax

Overview of Different Softmax Approximation Methods

Findings

In-depth Hierarchical Softmax (HSM)

HSM organizes the output vocabulary into a tree where the leaves are the words and the intermediate nodes are latent variables, or classes. The tree can have many levels, and there is a unique path from the root to each word (leaf). The probability of a word is the product of the probabilities along the path from the root, through its class(es), to the word itself.

They experiment with a 2-level tree, which first predicts the class of the word c^t, and then predicts the actual word w^t. Thus, the word probability (usually conditioned on a context x) is computed as: p(w^t|x) = p(c^t|x) p(w^t|c^t, x).
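
A minimal PyTorch sketch of this two-level factorization (my own illustration, not the paper's implementation; `cluster_sizes`, the number of words in each class, is an assumed input from a separate clustering step):

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelHSM(nn.Module):
    """Two-level hierarchical softmax: p(w|x) = p(c|x) * p(w|c, x)."""
    def __init__(self, hidden_dim, cluster_sizes):
        super().__init__()
        # Softmax over classes (clusters).
        self.class_proj = nn.Linear(hidden_dim, len(cluster_sizes))
        # One small softmax per class, over the words it contains.
        self.word_projs = nn.ModuleList(
            [nn.Linear(hidden_dim, size) for size in cluster_sizes]
        )

    def log_prob(self, x, cls, word_in_cls):
        """x: context vector (hidden_dim,); cls: class id of the target word;
        word_in_cls: index of the target word within its class."""
        log_p_class = F.log_softmax(self.class_proj(x), dim=-1)[cls]
        log_p_word = F.log_softmax(self.word_projs[cls](x), dim=-1)[word_in_cls]
        # log p(w|x) = log p(c|x) + log p(w|c, x)
        return log_p_class + log_p_word
```

At training time only the class softmax and the target word's within-class softmax need to be evaluated, rather than a full |V|-way softmax, which is where the speedup comes from.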

Strategies for Clustering Words

HSM requires a pre-defined tree, so we need to assign a class to each word. Several clustering strategies can be used; one simple option, frequency-based binning, is sketched below.
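
A hedged sketch of frequency-based binning (a common way to build classes that each cover roughly equal probability mass; not necessarily the paper's exact procedure), assuming a `word_counts` dictionary computed from the training corpus:

```python
def frequency_clusters(word_counts, num_classes):
    """Assign each word to a class so classes hold ~equal probability mass.
    word_counts: dict mapping word -> corpus count (assumed precomputed)."""
    total = sum(word_counts.values())
    budget = total / num_classes
    word2class, mass, cls = {}, 0.0, 0
    # Most frequent words first, so frequent words share the first classes.
    for word, count in sorted(word_counts.items(), key=lambda kv: -kv[1]):
        word2class[word] = cls
        mass += count
        if mass >= budget and cls < num_classes - 1:
            cls += 1
            mass = 0.0
    return word2class  # word -> class id, usable to build the 2-level tree
```

The per-class vocabulary sizes derived from `word2class` would provide the `cluster_sizes` used in the HSM sketch above.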

Dataset Statistics

Table 1

Experiment Settings: Training Time

They train for 168 hours (one week) on the large datasets (billionW, gigaword) and for 24 hours (one day) on Penn Treebank.

Figure 2

Personal Thoughts

We may improve hierarchical softmax by building the tree from semantically meaningful clusters (e.g., by clustering word2vec, GloVe, or fastText embeddings with a standard clustering algorithm), as sketched below.
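
A hedged sketch of that idea (illustrative only; it assumes pretrained embeddings are available as a word -> vector dict and uses scikit-learn's KMeans):

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_clusters(embeddings, num_classes, seed=0):
    """Cluster pretrained word vectors (word2vec / GloVe / fastText) into HSM classes.
    embeddings: dict mapping word -> 1-D numpy vector (assumed pretrained)."""
    words = list(embeddings)
    vectors = np.stack([embeddings[w] for w in words])
    labels = KMeans(n_clusters=num_classes, n_init=10, random_state=seed).fit_predict(vectors)
    return dict(zip(words, labels.tolist()))  # word -> class id
```

The resulting assignment could replace the frequency-based one above, so that words sharing an HSM class are semantically related rather than just similarly frequent.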

Reference