Abstract
Proposes SVD-Softmax, a fast approximation method for the softmax function over large vocabularies, based on singular value decomposition (SVD)
Accurately approximates the probabilities of the top-ranked words during inference in neural network language models
Reduces arithmetic operations by ~80% for an 800K vocabulary, giving a x3.00 inference speed-up without performance degradation
Details
Softmax is Time-consuming
softmax computation dominates the complexity of LM and MT tasks when the vocabulary is large
because the normalization sum (the sigma in the denominator) requires exponentiated logits for all V words, even when only a few top words are needed
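Concretely, with hidden state h in R^D, output weight matrix A in R^{V x D}, and bias b in R^V (the same symbols used in the Algorithm section below):

```latex
z = A h + b, \qquad
p_k = \frac{\exp(z_k)}{\sum_{v=1}^{V} \exp(z_v)}
```

every p_k needs the full logit vector z, so each decoding step costs O(V x D) multiply-adds just to normalize.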
Related Works
Sampling-based approximation
choose a subset of possible outputs and train only on them (importance sampling, noise contrastive estimation, negative sampling, BlackOut)
improves training speed, but offers no improvement at inference, where the full softmax is still required
Hierarchical Softmax
constructs a tree structure over the words (binary HS gives O(log V) depth)
but to obtain the full distribution or exact top-k words at inference, scores for all words still have to be computed
Self-Normalization approach
employs an additional training loss term that pushes the normalization factor Z (the denominator of the softmax) toward 1, so the denominator can be skipped at inference; however, Z being close to 1 is not guaranteed for every input
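A typical form of such a loss term (an illustrative formulation; the exact term is not spelled out in these notes) penalizes deviations of log Z from zero:

```latex
\mathcal{L} = -\log p(w \mid h) + \alpha \left( \log Z(h) \right)^2
```

driving log Z toward 0 so the unnormalized score can be used directly as a log-probability at inference.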
Differentiated Softmax
restricts the effective parameters by assigning each word only a fraction of the full output embedding dimension (more capacity for frequent words)
closest to the idea of this paper, but D-softmax is somewhat heuristic and requires a dedicated training procedure, whereas SVD-Softmax is applied to an already-trained model at inference
SVD-Softmax
In short: cheaply compute "preview" outputs for every word to decide where to focus, then re-compute exact outputs only for the important words
Computational complexity : O(V x D) -> O(V x W + N x D), where W (preview window width) and N (number of full-view words) are hyperparameters
Algorithm
1) Factorize the output weight matrix by SVD, A = U * Sigma * V^T, and set B = U * Sigma, h~ = V^T * h
2) Compute preview outputs for all words from the first W columns (the largest singular values come first): z~ = B[:, :W] * h~[:W] + b
3) For the top-N words by preview score, compute full-view outputs: z~[id] = B[id, :] * h~ + b[id]
4) Approximate the softmax denominator Z by the sum of exp(z~) (full-view logits for the top-N words, preview logits for the rest)
5) Compute the softmax values p = exp(z~) / Z
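A minimal NumPy sketch of the five steps above (function name and the defaults for W and N are illustrative; in practice the SVD of A is computed once, offline, not per call):

```python
import numpy as np

def svd_softmax(A, b, h, W=64, N=256):
    """Sketch of SVD-Softmax; argument names are illustrative.

    A : (V, D) output weight matrix
    b : (V,)   output bias
    h : (D,)   hidden state for the current step
    W : preview window width      (paper suggests ~D/8)
    N : number of full-view words (paper suggests ~V/16)
    """
    # Step 1 (offline in practice): A = U @ diag(s) @ Vt, B = U * Sigma.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    B = U * s                       # (V, D), columns sorted by singular value
    h_tilde = Vt @ h                # transformed hidden state, (D,)

    # Step 2: preview logits from the W most significant columns, O(V * W).
    z = B[:, :W] @ h_tilde[:W] + b  # (V,)

    # Step 3: exact logits for the top-N preview words, O(N * D).
    top = np.argpartition(z, -N)[-N:]
    z[top] = B[top] @ h_tilde + b[top]

    # Steps 4-5: normalize over the mixed (full-view + preview) logits.
    e = np.exp(z - z.max())         # subtract max for numerical stability
    return e / e.sum()

# Usage on random data (tiny vocabulary, shapes only):
V, D = 1000, 64
rng = np.random.default_rng(0)
p = svd_softmax(rng.normal(size=(V, D)), rng.normal(size=V),
                rng.normal(size=D), W=8, N=64)
assert abs(p.sum() - 1.0) < 1e-6
```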
Experiments (Empirical)
Language Model task
ratio of approximate to exact Z, KL divergence, negative log-likelihood (NLL), and top-k preservation ratio all show the approximation is accurate
Machine Translation task
BLEU score is not hurt by SVD-Softmax
x3.19 speed-up on GPU
Hyperparameters W and N
best value of W is about 1/8 of D
best value of N is about 1/16 of V
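Plugging these recommended values into the complexity expression above shows where the ~80% reduction quoted in the abstract comes from:

```latex
V \cdot W + N \cdot D
= V \cdot \tfrac{D}{8} + \tfrac{V}{16} \cdot D
= \tfrac{3}{16}\, V D \approx 0.19\, V D
```

i.e., roughly 19% of the full O(V x D) cost.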
Personal Thoughts
Must implement for Papago
how can I build a custom CUDA kernel to make it even faster?
Link : http://papers.nips.cc/paper/7130-svd-softmax-fast-softmax-approximation-on-large-vocabulary-neural-networks.pdf
Authors : Shim et al., NIPS 2017