Proposes Apprentice, a SoTA model compression technique using ternary precision on ResNet architectures
One-liner : a combination of low-precision quantization and knowledge distillation
low precision reduces storage and memory footprint, at the cost of some accuracy
knowledge distillation recovers the accuracy lost during low-precision compression
Details
Background on Model Compression
Quantization == low-precision
an active research area where model weights are compressed significantly while sacrificing only a small margin of accuracy
quantization reduces memory footprint and speeds up inference thanks to optimized low-precision arithmetic (2-, 3-, 4-, or 8-bit)
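As a concrete illustration, below is a minimal PyTorch sketch of threshold-based ternary weight quantization. The threshold rule and scale here are assumptions for illustration (in the spirit of ternary-weight methods), not the paper's exact scheme.

```python
import torch

def ternarize(w: torch.Tensor, t: float = 0.05) -> torch.Tensor:
    """Map full-precision weights to {-s, 0, +s} (storable in 2 bits).

    Assumed rule for illustration: zero out weights whose magnitude is
    below a threshold proportional to max|w|; the scale s is the mean
    magnitude of the surviving weights.
    """
    delta = t * w.abs().max()                 # ternarization threshold
    mask = (w.abs() > delta).float()          # 1 where the weight survives
    s = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    return s * torch.sign(w) * mask           # values in {-s, 0, +s}

w = torch.randn(64, 64, 3, 3)    # e.g. a conv weight tensor
print(ternarize(w).unique())     # at most 3 distinct values
```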
Knowledge Distillation
transferring knowledge from a teacher network to a student network, which outperforms training the student network from scratch on the same data
the classic method of learning from the logits of the teacher model (Hinton et al., 2015) is the baseline
Sparsity and Hashing
pruning, hashing, weight sharing, and training with sparsity-inducing tricks form another branch of model compression
to realize benefits comparable to quantization, sparsity of > 95% must be achieved, and custom hardware support is often required
Motivation
Benefit of low-precision : less memory footprint
the memory footprint during inference consists of weights and activation values. When the batch size is small, the weights account for the majority of memory, so reducing the precision of the weights directly lowers the overall footprint (see the arithmetic sketch below).
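A back-of-the-envelope check of that claim. The parameter count is the standard ResNet-50 figure; the per-image activation count is a rough assumption.

```python
# Rough inference memory-footprint arithmetic for ResNet-50.
PARAMS = 25.6e6          # ResNet-50 weight count (standard figure)
ACTS_PER_IMAGE = 10e6    # activations per image: rough assumption

def footprint_mb(batch_size: int, weight_bits: int, act_bits: int = 32) -> float:
    weights_mb = PARAMS * weight_bits / 8 / 2**20             # batch-independent
    acts_mb = batch_size * ACTS_PER_IMAGE * act_bits / 8 / 2**20
    return weights_mb + acts_mb

print(footprint_mb(1, 32))  # fp32 weights, batch 1: ~136 MB, weights dominate
print(footprint_mb(1, 2))   # ternary (2-bit) weights, batch 1: ~44 MB
```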
Knowledge Distillation + low-precision
knowledge distillation recovers the accuracy lost by the low-precision network
Knowledge Distillation
Apprentice uses the following scheme for knowledge distillation
temperature = 1; distillation is applied to the logits of the teacher model
loss function L = a*H(y, p_T) + b*H(y, p_A) + c*H(z_T, p_A) with a = 1, b = 0.5, c = 0.5, where y is the ground-truth label, z_T the teacher's logits, and p_T, p_A the softmax outputs of the teacher and the apprentice
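A minimal PyTorch sketch of that loss under the definitions above (function and variable names are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

def apprentice_loss(z_t, z_a, y, a=1.0, b=0.5, c=0.5):
    """a*H(y, p_T) + b*H(y, p_A) + c*H(z_T, p_A), with temperature = 1."""
    loss_teacher = F.cross_entropy(z_t, y)          # H(y, p_T)
    loss_student = F.cross_entropy(z_a, y)          # H(y, p_A)
    p_t = F.softmax(z_t, dim=1)                     # teacher soft targets
    loss_distill = -(p_t * F.log_softmax(z_a, dim=1)).sum(dim=1).mean()
    return a * loss_teacher + b * loss_student + c * loss_distill

z_t, z_a = torch.randn(8, 1000), torch.randn(8, 1000)   # dummy logits
y = torch.randint(0, 1000, (8,))                        # dummy labels
print(apprentice_loss(z_t, z_a, y))
```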
Proposed Schemes
A) jointly train a full-precision teacher and a low-precision student model from scratch
reason : to guide the student network from initialization all the way to convergence
worry : the teacher's loss function now differs from training the teacher alone, and the extra loss terms from the student could degrade the teacher's training --> empirically, the teacher network always stays within 0.1% accuracy of its standalone version
B) transfer knowledge from a pre-trained full-precision teacher to a low-precision student network trained from scratch
similar performance to A, but converges faster
C) teacher and student are both initialized from a pre-trained full-precision model; the student is then fine-tuned via knowledge distillation while its precision is lowered (see the sketch after this list)
best performance
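A sketch of scheme C under these assumptions: both networks start from the same pre-trained full-precision weights, the teacher is frozen, and the student is fine-tuned with the distillation loss sketched earlier (`apprentice_loss`). The ternarization of the student's weights in its forward pass (e.g. the `ternarize` sketch with a straight-through gradient estimator) is omitted for brevity; the data is a dummy batch.

```python
import torch
import torchvision

teacher = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()
student = torchvision.models.resnet50(weights="IMAGENET1K_V1")  # same init
opt = torch.optim.SGD(student.parameters(), lr=1e-3, momentum=0.9)

# Dummy stand-in for an ImageNet loader.
loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))]

for x, y in loader:
    with torch.no_grad():
        z_t = teacher(x)                 # frozen teacher logits
    z_a = student(x)                     # apprentice logits (in the real
                                         # scheme, from ternarized weights)
    loss = apprentice_loss(z_t, z_a, y)  # from the earlier sketch
    opt.zero_grad()
    loss.backward()
    opt.step()
```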
Tips
several choices of a, b, c were tested; a = 1, b = 0.5, c = 0.5 performs best
replacing the distillation term with H(z_T, z_A), a direct logit-to-logit comparison instead of logit-to-probability, showed no improvement
in quantization, the first and last layers are NOT quantized, since quantizing them causes significant accuracy loss
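A sketch of how that tip might be applied when selecting layers to quantize (module names match torchvision's ResNet-50; the selection logic itself is an assumption):

```python
import torch.nn as nn
import torchvision

model = torchvision.models.resnet50()
SKIP = {"conv1", "fc"}   # first conv and final classifier stay full-precision

to_quantize = [
    name for name, m in model.named_modules()
    if isinstance(m, (nn.Conv2d, nn.Linear)) and name not in SKIP
]
print(f"{len(to_quantize)} layers selected for ternarization")
```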
Result
with ResNet variants (18, 34, 50) on the ImageNet task, Apprentice models achieve SoTA accuracy among low-precision networks
Personal Thoughts
Link : https://arxiv.org/pdf/1711.05852.pdf Authors : Mishra and Marr, 2017