Abstract
Proposes SVD-Softmax, a fast approximation method for the softmax function over large vocabularies, based on singular value decomposition (SVD)
Accurately approximates the probabilities of the top-ranked words during inference in neural network language models
Reduces arithmetic operations by ~80% for an 800K vocabulary, giving a x3.00 inference speed-up without performance degradation
Details
Softmax is Time-consuming
softmax computation dominates the complexity of LM and MT tasks when the vocabulary is large
because the normalization sum (the sigma in the denominator) requires exponentiated logits for all V words, even when only a few top words are needed
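Concretely, with hidden state h in R^D, output weight matrix A in R^{V x D}, and bias b in R^V (the same symbols used in the Algorithm section below):

```latex
z = A h + b, \qquad
p_k = \frac{\exp(z_k)}{\sum_{v=1}^{V} \exp(z_v)}
```

every p_k needs the full logit vector z, so each decoding step costs O(V x D) multiply-adds just to normalize.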
Related Works
Sampling-based approximation
choose a subset of possible outputs and train only on them (importance sampling, noise contrastive estimation, negative sampling, BlackOut)
improves training speed, but offers no improvement at inference, where the full softmax is still required
Hierarchical Softmax
constructs a tree structure over the words (binary HS gives O(log V) depth)
but to obtain the full distribution or exact top-k words at inference, scores for all words still have to be computed
Self-Normalization approach
employs an additional training loss term that pushes the normalization factor Z (the denominator of the softmax) toward 1, so the denominator can be skipped at inference; however, Z being close to 1 is not guaranteed for every input
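A typical form of such a loss term (an illustrative formulation; the exact term is not spelled out in these notes) penalizes deviations of log Z from zero:

```latex
\mathcal{L} = -\log p(w \mid h) + \alpha \left( \log Z(h) \right)^2
```

driving log Z toward 0 so the unnormalized score can be used directly as a log-probability at inference.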
Differentiated Softmax
restricts the effective parameters by assigning each word only a fraction of the full output embedding dimension (more capacity for frequent words)
closest to the idea of this paper, but D-softmax is somewhat heuristic and requires a dedicated training procedure, whereas SVD-Softmax is applied to an already-trained model at inference
SVD-Softmax
In short: cheaply compute "preview" outputs for every word to decide where to focus, then re-compute exact outputs only for the important words
Computational complexity : O(V x D) -> O(V x W + N x D), where W (preview window width) and N (number of full-view words) are hyperparameters
Algorithm
1) Factorize the output weight matrix by SVD, A = U * Sigma * V^T, and set B = U * Sigma, h~ = V^T * h
2) Compute preview outputs for all words from the first W columns (the largest singular values come first): z~ = B[:, :W] * h~[:W] + b
3) For the top-N words by preview score, compute full-view outputs: z~[id] = B[id, :] * h~ + b[id]
4) Approximate the softmax denominator Z by the sum of exp(z~) (full-view logits for the top-N words, preview logits for the rest)
5) Compute the softmax values p = exp(z~) / Z
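A minimal NumPy sketch of the five steps above (function name and the defaults for W and N are illustrative; in practice the SVD of A is computed once, offline, not per call):

```python
import numpy as np

def svd_softmax(A, b, h, W=64, N=256):
    """Sketch of SVD-Softmax; argument names are illustrative.

    A : (V, D) output weight matrix
    b : (V,)   output bias
    h : (D,)   hidden state for the current step
    W : preview window width      (paper suggests ~D/8)
    N : number of full-view words (paper suggests ~V/16)
    """
    # Step 1 (offline in practice): A = U @ diag(s) @ Vt, B = U * Sigma.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    B = U * s                       # (V, D), columns sorted by singular value
    h_tilde = Vt @ h                # transformed hidden state, (D,)

    # Step 2: preview logits from the W most significant columns, O(V * W).
    z = B[:, :W] @ h_tilde[:W] + b  # (V,)

    # Step 3: exact logits for the top-N preview words, O(N * D).
    top = np.argpartition(z, -N)[-N:]
    z[top] = B[top] @ h_tilde + b[top]

    # Steps 4-5: normalize over the mixed (full-view + preview) logits.
    e = np.exp(z - z.max())         # subtract max for numerical stability
    return e / e.sum()

# Usage on random data (tiny vocabulary, shapes only):
V, D = 1000, 64
rng = np.random.default_rng(0)
p = svd_softmax(rng.normal(size=(V, D)), rng.normal(size=V),
                rng.normal(size=D), W=8, N=64)
assert abs(p.sum() - 1.0) < 1e-6
```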
Experiments (Empirical)
Language Model task
ratio of approximate to exact Z, KL divergence, negative log-likelihood (NLL), and top-k preservation ratio all show the approximation is accurate
Machine Translation task
BLEU score is not hurt by SVD-Softmax
x3.19 speed-up on GPU
Hyperparameters W and N
best value of W is about 1/8 of D
best value of N is about 1/16 of V
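Plugging these recommended values into the complexity expression above shows where the ~80% reduction quoted in the abstract comes from:

```latex
V \cdot W + N \cdot D
= V \cdot \tfrac{D}{8} + \tfrac{V}{16} \cdot D
= \tfrac{3}{16}\, V D \approx 0.19\, V D
```

i.e., roughly 19% of the full O(V x D) cost.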
Personal Thoughts
Must implement for Papago
how can I build a custom CUDA kernel to make it even faster?
Link : http://papers.nips.cc/paper/7130-svd-softmax-fast-softmax-approximation-on-large-vocabulary-neural-networks.pdf
Authors : Shim et al., NIPS 2017