Proposes a method for distilling the knowledge in an ensemble of models into a single model, demonstrated on MNIST and on an acoustic model for speech recognition
Proposes a Mixture of Specialists (MoS) model in which one generalist model and many specialist models jointly make inferences; unlike a Mixture of Experts, the specialists can be trained rapidly and in parallel
Details
Introduction
Many insects have a larval form that is optimized for extracting energy and nutrients from the environment and a completely different adult form that is optimized for the very different requirements of traveling and reproduction; similarly, in machine learning, training and deployment have very different requirements (Hinton loves this analogy to nature)
The idea: train a large ensemble of models using as many cores as needed within an acceptable time, then distill its knowledge into a single model that preserves the accuracy of the ensemble while meeting the latency requirements of deployment
Distillation
Overview
an obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as "soft targets" for training the small model
"soft targets" are the class probabilities produced by a softmax layer with temperature T; raising T produces a softer probability distribution over classes
when the correct labels are known for all or some of the transfer set, a weighted average of the "soft target" objective and the ground-truth ("hard target") objective can significantly improve training (see the sketch after this list)
matching the logits directly, instead of the class probabilities, is a special case of distillation (the high-temperature limit)
the value of the temperature T plays an important role in filtering noise and capturing patterns in the soft target values. when the distilled model is much too small to capture all the knowledge in the cumbersome model, intermediate temperatures work best, which suggests that ignoring the large negative logits can be helpful
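A minimal NumPy sketch (mine, not the paper's code) of the distillation objective described above: soften both the teacher's and the student's logits with temperature T, penalize cross-entropy against the soft targets, and mix in the usual hard-label cross-entropy. The weighting alpha, the temperature value, and the toy logits are illustrative assumptions; the T^2 factor follows the paper's note that soft-target gradients scale as 1/T^2.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    """Weighted average of soft-target and hard-target cross-entropy for one example."""
    soft_targets = softmax(teacher_logits, T)        # teacher probabilities at temperature T
    student_soft = softmax(student_logits, T)
    soft_loss = -np.sum(soft_targets * np.log(student_soft + 1e-12))

    student_hard = softmax(student_logits, T=1.0)    # ordinary softmax for the true label
    hard_loss = -np.log(student_hard[hard_label] + 1e-12)

    # multiply the soft term by T^2 to keep both terms on a comparable scale
    return alpha * (T ** 2) * soft_loss + (1.0 - alpha) * hard_loss

# toy example with 4 classes
teacher = np.array([9.0, 4.0, 1.0, -2.0])
student = np.array([6.0, 3.0, 0.5, -1.0])
print(distillation_loss(student, teacher, hard_label=0, T=4.0, alpha=0.7))
```

In practice the same loss would be computed inside the student's training loop so that gradients flow through the student's logits only; the teacher's soft targets are held fixed.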
Experiments
Preliminary experiments on MNIST
smaller distilled models required lower temperatures for good distillation
omitting certain classes from the transfer set during distillation still transfers knowledge about them, because the soft targets of the remaining examples indirectly carry information about the omitted classes
Speech Recognition
the single distilled model performs comparably to the full ensemble
Specialist Models (Mixture of Specialists)
Overview
Training
cluster the covariance matrix of the generalist model's predictions to generate subsets of 'confusable' classes; each subset is assigned to a specialist model, which is fine-tuned on those classes while treating all other classes as a single dustbin class
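A rough sketch of how the confusable subsets might be computed, using scikit-learn's batch KMeans in place of the online K-means variant the paper mentions; the prediction matrix, class counts, and function name are assumptions for illustration. Classes whose generalist predictions co-vary strongly end up in the same subset.

```python
import numpy as np
from sklearn.cluster import KMeans

def confusable_subsets(generalist_probs, num_specialists, seed=0):
    """generalist_probs: [num_examples, num_classes] soft predictions of the generalist."""
    # covariance between classes across examples; column j describes how class j
    # co-varies with every other class under the generalist's predictions
    cov = np.cov(generalist_probs, rowvar=False)              # [num_classes, num_classes]
    labels = KMeans(n_clusters=num_specialists, n_init=10,
                    random_state=seed).fit_predict(cov.T)     # cluster the columns
    return [np.where(labels == m)[0] for m in range(num_specialists)]

# toy example: random predictions over 20 classes, split into 4 confusable subsets
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(20), size=500)
for m, subset in enumerate(confusable_subsets(probs, num_specialists=4)):
    print(f"specialist {m}: classes {subset.tolist()}")
```

Each specialist is then fine-tuned on its own subset plus the dustbin class, as described above.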
Algorithm
For each test case, take the n most probable classes according to the generalist model (call this set k)
choose the specialist models whose confusable-class subsets intersect k, and combine their predictions with the generalist's to form the final distribution (the paper solves for the distribution that minimizes the summed KL divergence to the generalist and the active specialists; see the sketch below)
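A toy sketch of that inference step, under my reading of the paper: select the active specialists, then optimize a full distribution q to agree with the generalist and with each active specialist (whose view of q lumps all non-specialist classes into a dustbin). SciPy's L-BFGS-B optimizer, the toy probabilities, and the variable names are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def combine(p_gen, specialists, n_top=1):
    """p_gen: generalist distribution over all classes.
    specialists: list of (subset, p_spec) where subset lists the confusable classes and
    p_spec is a distribution over those classes plus one trailing dustbin entry."""
    top = set(np.argsort(p_gen)[-n_top:])                       # set k of most probable classes
    active = [(s, p) for s, p in specialists if top & set(s)]   # specialists covering k

    def objective(z):
        q = softmax(z)
        loss = kl(p_gen, q)
        for subset, p_spec in active:
            # specialist view of q: its own classes plus everything else lumped as dustbin
            q_view = np.append(q[subset], 1.0 - q[subset].sum())
            loss += kl(p_spec, q_view)
        return loss

    z0 = np.log(p_gen + 1e-12)                                  # warm start at the generalist
    res = minimize(objective, z0, method="L-BFGS-B")
    return softmax(res.x)

# toy example: 6 classes, one specialist for the confusable pair {2, 3}
p_gen = np.array([0.02, 0.03, 0.45, 0.40, 0.05, 0.05])
spec = ([2, 3], np.array([0.15, 0.80, 0.05]))                   # [class 2, class 3, dustbin]
print(combine(p_gen, [spec], n_top=2))
```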
Result
test accuracy improves with Specialist Models
the more specialist models cover a test example's class, the larger the relative improvement in accuracy
Soft Targets as Regularizers
Q. Why are soft targets better than hard targets in knowledge distillation?
Experiment : train the distilled model on 3% of the original data using either soft or hard targets
Result : hard targets overfit, while soft targets learn well (soft targets have a regularization effect)
Personal Thoughts
one of the earliest papers on distillation by Hinton
it's interesting that Hinton played with the temperature in the softmax computation; I had personally assumed T=1 is always best
I should implement distillation using soft targets as well
Link : https://arxiv.org/pdf/1503.02531.pdf
Authors : Hinton et al., 2015