Proposes a method for distilling the knowledge in an ensemble of models into a single model, demonstrated on MNIST and on an acoustic model for speech recognition
Proposes a Mixture of Specialists (MoS) model in which one generalist model and many specialist models jointly make inferences; unlike a Mixture of Experts, the specialists can be trained rapidly and in parallel
Details
Introduction
Many insects have a larval form that is optimized for extracting energy and nutrients from the environment and a completely different adult form that is optimized for the very different requirements of traveling and reproduction; similarly, in machine learning, training and deployment have very different requirements (Hinton loves this analogy to nature)
The idea: train a large ensemble of models using as many cores as needed within an acceptable time, then distill its knowledge into a single model that preserves the accuracy of the ensemble while meeting the latency requirements of deployment
Distillation
Overview
an obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as "soft targets" for training the small model
"soft targets" are the class probabilities produced by a softmax layer with temperature T; raising T produces a softer probability distribution over classes
when the correct labels are known for all or some of the transfer set, a weighted average of the "soft target" objective and the ground-truth ("hard target") objective can significantly improve training (see the sketch after this list)
matching the logits directly, instead of the class probabilities, is a special case of distillation (the high-temperature limit)
the value of the temperature T plays an important role in filtering noise and capturing patterns in the soft target values. when the distilled model is much too small to capture all the knowledge in the cumbersome model, intermediate temperatures work best, which suggests that ignoring the large negative logits can be helpful
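A minimal NumPy sketch (mine, not the paper's code) of the distillation objective described above: soften both the teacher's and the student's logits with temperature T, penalize cross-entropy against the soft targets, and mix in the usual hard-label cross-entropy. The weighting alpha, the temperature value, and the toy logits are illustrative assumptions; the T^2 factor follows the paper's note that soft-target gradients scale as 1/T^2.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    """Weighted average of soft-target and hard-target cross-entropy for one example."""
    soft_targets = softmax(teacher_logits, T)        # teacher probabilities at temperature T
    student_soft = softmax(student_logits, T)
    soft_loss = -np.sum(soft_targets * np.log(student_soft + 1e-12))

    student_hard = softmax(student_logits, T=1.0)    # ordinary softmax for the true label
    hard_loss = -np.log(student_hard[hard_label] + 1e-12)

    # multiply the soft term by T^2 to keep both terms on a comparable scale
    return alpha * (T ** 2) * soft_loss + (1.0 - alpha) * hard_loss

# toy example with 4 classes
teacher = np.array([9.0, 4.0, 1.0, -2.0])
student = np.array([6.0, 3.0, 0.5, -1.0])
print(distillation_loss(student, teacher, hard_label=0, T=4.0, alpha=0.7))
```

In practice the same loss would be computed inside the student's training loop so that gradients flow through the student's logits only; the teacher's soft targets are held fixed.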
Experiments
Preliminary experiments on MNIST
smaller distilled models required lower temperatures for good distillation
omitting certain classes from the transfer set during distillation still transfers knowledge about them, because the soft targets of the remaining examples indirectly carry information about the omitted classes
Speech Recognition
the single distilled model performs comparably to the full ensemble
Specialist Models (Mixture of Specialists)
Overview
Training
cluster the covariance matrix of the generalist model's predictions to generate subsets of 'confusable' classes; each subset is assigned to a specialist model, which is fine-tuned on those classes while treating all other classes as a single dustbin class
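A rough sketch of how the confusable subsets might be computed, using scikit-learn's batch KMeans in place of the online K-means variant the paper mentions; the prediction matrix, class counts, and function name are assumptions for illustration. Classes whose generalist predictions co-vary strongly end up in the same subset.

```python
import numpy as np
from sklearn.cluster import KMeans

def confusable_subsets(generalist_probs, num_specialists, seed=0):
    """generalist_probs: [num_examples, num_classes] soft predictions of the generalist."""
    # covariance between classes across examples; column j describes how class j
    # co-varies with every other class under the generalist's predictions
    cov = np.cov(generalist_probs, rowvar=False)              # [num_classes, num_classes]
    labels = KMeans(n_clusters=num_specialists, n_init=10,
                    random_state=seed).fit_predict(cov.T)     # cluster the columns
    return [np.where(labels == m)[0] for m in range(num_specialists)]

# toy example: random predictions over 20 classes, split into 4 confusable subsets
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(20), size=500)
for m, subset in enumerate(confusable_subsets(probs, num_specialists=4)):
    print(f"specialist {m}: classes {subset.tolist()}")
```

Each specialist is then fine-tuned on its own subset plus the dustbin class, as described above.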
Algorithm
For each test case, take the n most probable classes according to the generalist model (call this set k)
choose the specialist models whose confusable-class subsets intersect k, and combine their predictions with the generalist's to form the final distribution (the paper solves for the distribution that minimizes the summed KL divergence to the generalist and the active specialists; see the sketch below)
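A toy sketch of that inference step, under my reading of the paper: select the active specialists, then optimize a full distribution q to agree with the generalist and with each active specialist (whose view of q lumps all non-specialist classes into a dustbin). SciPy's L-BFGS-B optimizer, the toy probabilities, and the variable names are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def combine(p_gen, specialists, n_top=1):
    """p_gen: generalist distribution over all classes.
    specialists: list of (subset, p_spec) where subset lists the confusable classes and
    p_spec is a distribution over those classes plus one trailing dustbin entry."""
    top = set(np.argsort(p_gen)[-n_top:])                       # set k of most probable classes
    active = [(s, p) for s, p in specialists if top & set(s)]   # specialists covering k

    def objective(z):
        q = softmax(z)
        loss = kl(p_gen, q)
        for subset, p_spec in active:
            # specialist view of q: its own classes plus everything else lumped as dustbin
            q_view = np.append(q[subset], 1.0 - q[subset].sum())
            loss += kl(p_spec, q_view)
        return loss

    z0 = np.log(p_gen + 1e-12)                                  # warm start at the generalist
    res = minimize(objective, z0, method="L-BFGS-B")
    return softmax(res.x)

# toy example: 6 classes, one specialist for the confusable pair {2, 3}
p_gen = np.array([0.02, 0.03, 0.45, 0.40, 0.05, 0.05])
spec = ([2, 3], np.array([0.15, 0.80, 0.05]))                   # [class 2, class 3, dustbin]
print(combine(p_gen, [spec], n_top=2))
```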
Result
test accuracy improves with Specialist Models
the more specialist models cover a test example's class, the larger the relative improvement in accuracy
Soft Targets as Regularizers
Q. Why are soft targets better than hard targets in knowledge distillation?
Experiment : train the distilled model on 3% of the original data using either soft or hard targets
Result : hard targets overfit, while soft targets learn well (soft targets have a regularization effect)
Personal Thoughts
one of the earliest papers on distillation by Hinton
it's interesting that Hinton played with the temperature in the softmax computation; I had personally assumed T=1 is always best
I should implement distillation using soft targets as well
Link : https://arxiv.org/pdf/1503.02531.pdf
Authors : Hinton et al., 2015