Abstract
Proposes new compression methods that jointly leverage weight quantization and distillation
Quantized Distillation distills and quantizes at the same time
Differentiable Quantization optimizes the location of quantization points via SGD
Both methods achieve accuracy similar to the state-of-the-art full-precision teacher model on convolutional and recurrent architectures
CIFAR, ImageNet, OpenNMT, WMT tasks
Details
Introduction
Is a large model necessary for good accuracy?
Yes: overcomplete representations are necessary because they transform local minima into saddle points (Dauphin et al. 2014) or help discover robust solutions
If large models are needed only to find robust solutions, then significant compression of these models should be achievable afterwards without impacting accuracy
Two research areas are active: 1) training quantized networks, where low-bit-precision model parameters are trained from scratch, and 2) compressing networks, where a fully trained full-precision teacher network is compressed into a smaller student model
Quantized Distillation
The key point is that the forward/backward pass and the gradients are computed on the quantized model, while the actual gradient-descent update is applied to the full-precision model.
Reason: computing gradients on the full-precision model and applying gradient descent to the quantized model leads to accumulation of projection error.
The distillation loss is a weighted average of the cross-entropy against the teacher's soft targets (softened with temperature T) and the cross-entropy against the correct labels.
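As an illustration of this scheme, here is a minimal PyTorch-style sketch of one quantized-distillation step (my own sketch, not the authors' code; `uniform_quantize`, the loss weights `T` and `alpha`, and the teacher/student modules are assumptions):

```python
import torch
import torch.nn.functional as F

def uniform_quantize(w, bits=4):
    # Project a tensor onto a uniform grid with 2**bits levels (per-tensor, for illustration).
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (2 ** bits - 1) + 1e-12
    return torch.round((w - w_min) / scale) * scale + w_min

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Weighted average of soft-target loss (temperature T) and hard-label cross-entropy.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def quantized_distillation_step(student, teacher, optimizer, x, labels, bits=4):
    # 1) Keep a copy of the full-precision weights and project them onto the quantization grid.
    full_precision = [p.data.clone() for p in student.parameters()]
    for p in student.parameters():
        p.data = uniform_quantize(p.data, bits)

    # 2) Forward/backward pass on the *quantized* student.
    with torch.no_grad():
        teacher_logits = teacher(x)
    loss = distillation_loss(student(x), teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()

    # 3) Restore the full-precision weights and apply the gradient step to them.
    for p, w in zip(student.parameters(), full_precision):
        p.data = w
    optimizer.step()
    return loss.item()
```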
Differentiable Quantization
In standard quantization schemes, the choice of quantization points is a hyperparameter.
Here, the quantization function Q(v, p) (which decides which points p are used to quantize the weights v) is optimized over p via SGD; however, the assignment of weights to points is discrete, so the gradient is zero almost everywhere.
To resolve this, a variant of the straight-through estimator is used, under which the model is treated as continuous and the gradient is well defined.
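As a toy illustration (my own sketch, not the paper's exact estimator): the forward pass below keeps the discrete nearest-point assignment, while the backward pass accumulates gradients onto the learnable points p and passes gradients straight through to the weights v.

```python
import torch

class DifferentiableQuantize(torch.autograd.Function):
    # Nearest-point quantization with a straight-through-style backward pass.
    @staticmethod
    def forward(ctx, v, p):
        # Discrete assignment: each value of v is replaced by its closest point p[j].
        idx = (v.reshape(-1, 1) - p.reshape(1, -1)).abs().argmin(dim=1)
        ctx.save_for_backward(idx)
        ctx.n_points = p.numel()
        return p[idx].reshape(v.shape)

    @staticmethod
    def backward(ctx, grad_out):
        (idx,) = ctx.saved_tensors
        # dL/dp[j] accumulates the output gradients of all values assigned to p[j].
        grad_p = torch.zeros(ctx.n_points, dtype=grad_out.dtype, device=grad_out.device)
        grad_p.scatter_add_(0, idx, grad_out.reshape(-1))
        # Straight-through estimate for v: pass the gradient unchanged.
        return grad_out, grad_p

# Usage: p is a learnable vector of quantization points, optimized jointly with SGD.
v = torch.randn(100, requires_grad=True)
p = torch.nn.Parameter(torch.linspace(-1.0, 1.0, 16))  # 16 points = 4-bit quantization
q = DifferentiableQuantize.apply(v, p)
q.sum().backward()  # p.grad[j] now counts how many values were assigned to point j
```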
Heuristics in Training
Training with distillation and quantization together does not converge easily, so several heuristics are used:
Uniform quantization is a good initialization scheme for the quantization points.
Distributing the bit budget according to each layer's sensitivity is important; a layer's sensitivity is estimated by averaging the norm of the gradients in that layer (a rough sketch follows below).
Using the distillation loss during training helps train a better model.
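One possible way to implement the sensitivity heuristic, as a sketch under my own assumptions (the exact averaging window and bit-allocation rule are not given in these notes; `estimate_layer_sensitivity` and `allocate_bits` are hypothetical helpers):

```python
import torch

def estimate_layer_sensitivity(model, loss_fn, batches):
    # Average the gradient norm of each layer over a few batches as a sensitivity proxy.
    totals = {name: 0.0 for name, _ in model.named_parameters()}
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                totals[name] += p.grad.norm().item()
    return {name: total / len(batches) for name, total in totals.items()}

def allocate_bits(sensitivity, low=2, high=4):
    # Give more bits to the more sensitive half of the layers (a simple rank-based rule).
    ranked = sorted(sensitivity, key=sensitivity.get)
    cutoff = len(ranked) // 2
    return {name: (low if i < cutoff else high) for i, name in enumerate(ranked)}
```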
Compression Gain
b = number of bits per quantized weight, k = bucket size, f = size of a full-precision weight (32 bits), N = number of elements in the vector being quantized
The full-precision model requires fN bits.
The quantized model requires bN + 2fN/k bits (the extra term stores two full-precision values per bucket for rescaling).
Example: with bucket size 512, 4-bit quantization yields roughly 7.75x compression.
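A quick arithmetic check of the quoted figure, using the two sizes above:

```python
# compression ratio = f*N / (b*N + 2*f*N/k) = f / (b + 2*f/k)
f, b, k = 32, 4, 512
print(f / (b + 2 * f / k))  # 7.7575..., i.e. the ~7.75x quoted above
```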
Result
CIFAR-10 (small image dataset)
The percentages below each student model definition are the accuracies of the normally trained and the distilled model.
Teacher model : 5.3M params / 21MB / accuracy 89.71%
Personal Thoughts
Link : https://openreview.net/pdf?id=S1XolQbRW
Authors : Anonymous et al. 2018