Abstract
Proposes new compression methods that jointly leverage weight quantization and distillation
Quantized Distillation distills and quantizes at the same time
Differentiable Quantization optimizes the location of quantization points via SGD
Both methods achieve accuracy similar to the state-of-the-art full-precision teacher model on convolutional and recurrent architectures
CIFAR, ImageNet, OpenNMT, WMT tasks
Details
Introduction
Is a large model necessary for good accuracy?
Yes: overcomplete representations are necessary because they transform local minima into saddle points (Dauphin et al. 2014) or help discover robust solutions
If large models are needed only to find robust solutions, then significant compression of these models should be achievable afterwards without impacting accuracy
Two research areas are active: 1) training quantized networks, where low-bit-precision model parameters are trained from scratch, and 2) compressing networks, where a fully trained full-precision teacher network is compressed into a smaller student model
Quantized Distillation
The key point is that the forward/backward pass and the gradients are computed on the quantized model, while the actual gradient-descent update is applied to the full-precision model.
Reason: computing gradients on the full-precision model and applying gradient descent to the quantized model leads to accumulation of projection error.
The distillation loss is a weighted average of the cross-entropy against the teacher's soft targets (softened with temperature T) and the cross-entropy against the correct labels.
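As an illustration of this scheme, here is a minimal PyTorch-style sketch of one quantized-distillation step (my own sketch, not the authors' code; `uniform_quantize`, the loss weights `T` and `alpha`, and the teacher/student modules are assumptions):

```python
import torch
import torch.nn.functional as F

def uniform_quantize(w, bits=4):
    # Project a tensor onto a uniform grid with 2**bits levels (per-tensor, for illustration).
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (2 ** bits - 1) + 1e-12
    return torch.round((w - w_min) / scale) * scale + w_min

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Weighted average of soft-target loss (temperature T) and hard-label cross-entropy.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def quantized_distillation_step(student, teacher, optimizer, x, labels, bits=4):
    # 1) Keep a copy of the full-precision weights and project them onto the quantization grid.
    full_precision = [p.data.clone() for p in student.parameters()]
    for p in student.parameters():
        p.data = uniform_quantize(p.data, bits)

    # 2) Forward/backward pass on the *quantized* student.
    with torch.no_grad():
        teacher_logits = teacher(x)
    loss = distillation_loss(student(x), teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()

    # 3) Restore the full-precision weights and apply the gradient step to them.
    for p, w in zip(student.parameters(), full_precision):
        p.data = w
    optimizer.step()
    return loss.item()
```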
Differentiable Quantization
In standard quantization schemes, the choice of quantization points is a hyperparameter.
Here, the quantization function Q(v, p) (which decides which points p are used to quantize the weights v) is optimized over p via SGD; however, the assignment of weights to points is discrete, so the gradient is zero almost everywhere.
To resolve this, a variant of the straight-through estimator is used, under which the model is treated as continuous and the gradient is well defined.
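As a toy illustration (my own sketch, not the paper's exact estimator): the forward pass below keeps the discrete nearest-point assignment, while the backward pass accumulates gradients onto the learnable points p and passes gradients straight through to the weights v.

```python
import torch

class DifferentiableQuantize(torch.autograd.Function):
    # Nearest-point quantization with a straight-through-style backward pass.
    @staticmethod
    def forward(ctx, v, p):
        # Discrete assignment: each value of v is replaced by its closest point p[j].
        idx = (v.reshape(-1, 1) - p.reshape(1, -1)).abs().argmin(dim=1)
        ctx.save_for_backward(idx)
        ctx.n_points = p.numel()
        return p[idx].reshape(v.shape)

    @staticmethod
    def backward(ctx, grad_out):
        (idx,) = ctx.saved_tensors
        # dL/dp[j] accumulates the output gradients of all values assigned to p[j].
        grad_p = torch.zeros(ctx.n_points, dtype=grad_out.dtype, device=grad_out.device)
        grad_p.scatter_add_(0, idx, grad_out.reshape(-1))
        # Straight-through estimate for v: pass the gradient unchanged.
        return grad_out, grad_p

# Usage: p is a learnable vector of quantization points, optimized jointly with SGD.
v = torch.randn(100, requires_grad=True)
p = torch.nn.Parameter(torch.linspace(-1.0, 1.0, 16))  # 16 points = 4-bit quantization
q = DifferentiableQuantize.apply(v, p)
q.sum().backward()  # p.grad[j] now counts how many values were assigned to point j
```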
Heuristics in Training
Training with distillation and quantization together does not converge easily, so several heuristics are used:
Uniform quantization is a good initialization scheme for the quantization points.
Distributing the bit budget according to each layer's sensitivity is important; a layer's sensitivity is estimated by averaging the norm of the gradients in that layer (a rough sketch follows below).
Using the distillation loss during training helps train a better model.
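One possible way to implement the sensitivity heuristic, as a sketch under my own assumptions (the exact averaging window and bit-allocation rule are not given in these notes; `estimate_layer_sensitivity` and `allocate_bits` are hypothetical helpers):

```python
import torch

def estimate_layer_sensitivity(model, loss_fn, batches):
    # Average the gradient norm of each layer over a few batches as a sensitivity proxy.
    totals = {name: 0.0 for name, _ in model.named_parameters()}
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                totals[name] += p.grad.norm().item()
    return {name: total / len(batches) for name, total in totals.items()}

def allocate_bits(sensitivity, low=2, high=4):
    # Give more bits to the more sensitive half of the layers (a simple rank-based rule).
    ranked = sorted(sensitivity, key=sensitivity.get)
    cutoff = len(ranked) // 2
    return {name: (low if i < cutoff else high) for i, name in enumerate(ranked)}
```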
Compression Gain
b = number of bits per quantized weight, k = bucket size, f = size of a full-precision weight (32 bits), N = number of elements in the vector being quantized
The full-precision model requires fN bits.
The quantized model requires bN + 2fN/k bits (the extra term stores two full-precision values per bucket for rescaling).
Example: with bucket size 512, 4-bit quantization yields roughly 7.75x compression.
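A quick arithmetic check of the quoted figure, using the two sizes above:

```python
# compression ratio = f*N / (b*N + 2*f*N/k) = f / (b + 2*f/k)
f, b, k = 32, 4, 512
print(f / (b + 2 * f / k))  # 7.7575..., i.e. the ~7.75x quoted above
```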
Result
CIFAR-10 (small image dataset)
The percentages below each student model definition are the accuracies of the normally trained and the distilled model.
Teacher model : 5.3M params / 21MB / accuracy 89.71%
Personal Thoughts
Link : https://openreview.net/pdf?id=S1XolQbRW
Authors : Anonymous et al. 2018