Proposes Apprentice, a SoTA model compression technique using ternary precision on ResNet architectures
One-liner : a combination of low-precision quantization and knowledge distillation
low precision reduces storage and memory footprint, at the cost of some accuracy
knowledge distillation recovers the accuracy lost during low-precision compression
Details
Background on Model Compression
Quantization == low-precision
an active research area where model weights are compressed significantly while sacrificing only a small margin of accuracy
quantization reduces memory footprint and speeds up inference thanks to optimized low-precision arithmetic (2-, 3-, 4-, or 8-bit)
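As a concrete illustration, below is a minimal PyTorch sketch of threshold-based ternary weight quantization. The threshold rule and scale here are assumptions for illustration (in the spirit of ternary-weight methods), not the paper's exact scheme.

```python
import torch

def ternarize(w: torch.Tensor, t: float = 0.05) -> torch.Tensor:
    """Map full-precision weights to {-s, 0, +s} (storable in 2 bits).

    Assumed rule for illustration: zero out weights whose magnitude is
    below a threshold proportional to max|w|; the scale s is the mean
    magnitude of the surviving weights.
    """
    delta = t * w.abs().max()                 # ternarization threshold
    mask = (w.abs() > delta).float()          # 1 where the weight survives
    s = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    return s * torch.sign(w) * mask           # values in {-s, 0, +s}

w = torch.randn(64, 64, 3, 3)    # e.g. a conv weight tensor
print(ternarize(w).unique())     # at most 3 distinct values
```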
Knowledge Distillation
transferring knowledge from a teacher network to a student network, which outperforms training the student network from scratch on the same data
the classic method of learning from the logits of the teacher model (Hinton et al., 2015) is the baseline
Sparsity and Hashing
pruning, hashing, weight sharing, and training with sparsity-inducing tricks form another branch of model compression
to realize benefits comparable to quantization, sparsity of > 95% must be achieved, and custom hardware support is often required
Motivation
Benefit of low-precision : less memory footprint
the memory footprint during inference consists of weights and activation values. When the batch size is small, the weights account for the majority of memory, so reducing the precision of the weights directly lowers the overall footprint (see the arithmetic sketch below).
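A back-of-the-envelope check of that claim. The parameter count is the standard ResNet-50 figure; the per-image activation count is a rough assumption.

```python
# Rough inference memory-footprint arithmetic for ResNet-50.
PARAMS = 25.6e6          # ResNet-50 weight count (standard figure)
ACTS_PER_IMAGE = 10e6    # activations per image: rough assumption

def footprint_mb(batch_size: int, weight_bits: int, act_bits: int = 32) -> float:
    weights_mb = PARAMS * weight_bits / 8 / 2**20             # batch-independent
    acts_mb = batch_size * ACTS_PER_IMAGE * act_bits / 8 / 2**20
    return weights_mb + acts_mb

print(footprint_mb(1, 32))  # fp32 weights, batch 1: ~136 MB, weights dominate
print(footprint_mb(1, 2))   # ternary (2-bit) weights, batch 1: ~44 MB
```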
Knowledge Distillation + low-precision
knowledge distillation recovers the accuracy lost by the low-precision network
Knowledge Distillation
Apprentice uses the following scheme for knowledge distillation
temperature = 1; distillation is applied to the logits of the teacher model
loss function L = a*H(y, p_T) + b*H(y, p_A) + c*H(z_T, p_A) with a = 1, b = 0.5, c = 0.5, where y is the ground-truth label, z_T the teacher's logits, and p_T, p_A the softmax outputs of the teacher and the apprentice
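A minimal PyTorch sketch of that loss under the definitions above (function and variable names are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

def apprentice_loss(z_t, z_a, y, a=1.0, b=0.5, c=0.5):
    """a*H(y, p_T) + b*H(y, p_A) + c*H(z_T, p_A), with temperature = 1."""
    loss_teacher = F.cross_entropy(z_t, y)          # H(y, p_T)
    loss_student = F.cross_entropy(z_a, y)          # H(y, p_A)
    p_t = F.softmax(z_t, dim=1)                     # teacher soft targets
    loss_distill = -(p_t * F.log_softmax(z_a, dim=1)).sum(dim=1).mean()
    return a * loss_teacher + b * loss_student + c * loss_distill

z_t, z_a = torch.randn(8, 1000), torch.randn(8, 1000)   # dummy logits
y = torch.randint(0, 1000, (8,))                        # dummy labels
print(apprentice_loss(z_t, z_a, y))
```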
Proposed Schemes
A) jointly train a full-precision teacher and a low-precision student model from scratch
reason : to guide the student network from initialization all the way to convergence
worry : the teacher's loss function now differs from training the teacher alone, and the extra loss terms from the student could degrade the teacher's training --> empirically, the teacher network always stays within 0.1% accuracy of its standalone version
B) transfer knowledge from a pre-trained full-precision teacher to a low-precision student network trained from scratch
similar performance to A, but converges faster
C) teacher and student are both initialized from a pre-trained full-precision model; the student is then fine-tuned via knowledge distillation while its precision is lowered (see the sketch after this list)
best performance
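A sketch of scheme C under these assumptions: both networks start from the same pre-trained full-precision weights, the teacher is frozen, and the student is fine-tuned with the distillation loss sketched earlier (`apprentice_loss`). The ternarization of the student's weights in its forward pass (e.g. the `ternarize` sketch with a straight-through gradient estimator) is omitted for brevity; the data is a dummy batch.

```python
import torch
import torchvision

teacher = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()
student = torchvision.models.resnet50(weights="IMAGENET1K_V1")  # same init
opt = torch.optim.SGD(student.parameters(), lr=1e-3, momentum=0.9)

# Dummy stand-in for an ImageNet loader.
loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))]

for x, y in loader:
    with torch.no_grad():
        z_t = teacher(x)                 # frozen teacher logits
    z_a = student(x)                     # apprentice logits (in the real
                                         # scheme, from ternarized weights)
    loss = apprentice_loss(z_t, z_a, y)  # from the earlier sketch
    opt.zero_grad()
    loss.backward()
    opt.step()
```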
Tips
several choices of a, b, c were tested; a = 1, b = 0.5, c = 0.5 performs best
replacing the distillation term with H(z_T, z_A), a direct logit-to-logit comparison instead of logit-to-probability, showed no improvement
in quantization, the first and last layers are NOT quantized, since quantizing them causes significant accuracy loss
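A sketch of how that tip might be applied when selecting layers to quantize (module names match torchvision's ResNet-50; the selection logic itself is an assumption):

```python
import torch.nn as nn
import torchvision

model = torchvision.models.resnet50()
SKIP = {"conv1", "fc"}   # first conv and final classifier stay full-precision

to_quantize = [
    name for name, m in model.named_modules()
    if isinstance(m, (nn.Conv2d, nn.Linear)) and name not in SKIP
]
print(f"{len(to_quantize)} layers selected for ternarization")
```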
Result
with ResNet variants (18, 34, 50) on the ImageNet task, Apprentice models achieve SoTA accuracy among low-precision networks
Personal Thoughts
Link : https://arxiv.org/pdf/1711.05852.pdf Authors : Mishra and Marr, 2017