Abstract
Omni-supervised learning: a form of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data.
The core idea is self-training the model with an external unlabeled dataset.
Details
Omni-Supervised Learning
performance is lower-bounded by that of training on the existing labeled dataset alone
Model Distillation vs Data Distillation
Model distillation: use multiple models and ensemble their predictions to generate new training annotations
Data distillation: a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations (the transformations are largely specific to image tasks)
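A minimal sketch of the distinction, using toy tensors and placeholder linear "models" (nothing here comes from the paper's actual Mask R-CNN setup; the horizontal flip just illustrates mapping predictions back to the original geometry):

```python
import torch

torch.manual_seed(0)
img = torch.rand(8, 8, 3)  # toy "image" (H, W, C), stand-in for real detection inputs

# Toy stand-ins: each "model" maps the image to a per-pixel score map.
models = [torch.nn.Linear(3, 1) for _ in range(4)]
def predict(model, x):
    return torch.sigmoid(model(x)).squeeze(-1)       # (8, 8) heatmap

# Model distillation: ensemble MULTIPLE models on the SAME input.
model_distilled = torch.stack([predict(m, img) for m in models]).mean(dim=0)

# Data distillation: ensemble ONE model over MULTIPLE transforms of the input,
# mapping each prediction back to the original geometry before averaging.
model = models[0]
pred_identity = predict(model, img)
pred_flip = torch.flip(predict(model, torch.flip(img, dims=[1])), dims=[1])  # flip, predict, un-flip
data_distilled = torch.stack([pred_identity, pred_flip]).mean(dim=0)

print(model_distilled.shape, data_distilled.shape)   # torch.Size([8, 8]) for both
```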
Related Works
Related to knowledge distillation, which uses the soft predictions of a teacher model as the student model's targets. Like semi-supervised learning, it exploits an external unlabeled dataset, but it uses a single model to generate the new training annotations, whereas previous works used multiple models.
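For reference, a standard soft-target knowledge distillation loss (the Hinton-style formulation; the temperature value and toy logits below are illustrative, not from this paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: the student matches the teacher's temperature-softened predictions."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2

# Example with random logits for a batch of 4 examples, 10 classes.
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```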
Data Distillation
1) train a model on manually labeled data
2) apply the trained model to multiple transformations of the unlabeled data
3) convert the predictions on the unlabeled data into labels by ensembling the multiple predictions
The ensembling takes different forms depending on the task:
averaging soft predictions, voting on hard predictions, taking the top-k predictions during inference, etc.
4) retrain the model on the union of the manually labeled data and the automatically labeled data (an end-to-end sketch follows this list)
retrain from scratch, assuming that the previously trained model sits in a sub-optimal region
the new model is trained on the UNION of the manually labeled and generated data
ensure that each training mini-batch contains a mixture of manually labeled data and automatically labeled data
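An end-to-end sketch of the four steps on a toy classification problem. In the paper the task is keypoint detection with Mask R-CNN, the transforms are geometric (horizontal flips, multiple scales) with predictions mapped back, and pseudo-labels are selected by a detection score threshold; the transforms, model sizes, and 0.5 confidence cutoff below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, ConcatDataset

# --- Toy setup: stand-ins for the paper's Mask R-CNN teacher and COCO data --
torch.manual_seed(0)
teacher = torch.nn.Linear(16, 3)                       # step 1: the "trained" model
labeled_x, labeled_y = torch.randn(64, 16), torch.randint(0, 3, (64,))
unlabeled_x = torch.randn(256, 16)

# Hypothetical input transforms; in the paper these are geometric
# (horizontal flips, multiple scales) with predictions mapped back.
transforms = [lambda x: x, lambda x: x + 0.01 * torch.randn_like(x)]

# Steps 2-3: ensemble predictions over transforms into pseudo-labels.
with torch.no_grad():
    probs = torch.stack([F.softmax(teacher(t(unlabeled_x)), dim=-1) for t in transforms])
    avg_probs = probs.mean(dim=0)                      # average the soft predictions
    conf, pseudo_y = avg_probs.max(dim=-1)
    keep = conf > 0.5                                  # keep only confident pseudo-labels

# Step 4: retrain a fresh model on the union of real and generated labels.
student = torch.nn.Linear(16, 3)                       # retrained from scratch, not fine-tuned
union = ConcatDataset([TensorDataset(labeled_x, labeled_y),
                       TensorDataset(unlabeled_x[keep], pseudo_y[keep])])
# The paper enforces a fixed per-mini-batch ratio of manually vs. automatically
# labeled examples; plain shuffling of the union is a simplification here.
loader = DataLoader(union, batch_size=32, shuffle=True)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
for epoch in range(5):
    for x, y in loader:
        opt.zero_grad()
        F.cross_entropy(student(x), y).backward()
        opt.step()
```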
Experiments on Keypoint Detection
1) small scale data as a sanity-check
data distillation helps, but is upper-bounded by fully supervised learning on all of the data
2) large scale data with similar distribution
data distillation improves performance in a saturating manner
3) large scale data with dissimilar distribution
performance also improves when the unlabeled data comes from a dissimilar distribution
Ablation Experiments on Number of Iterations
the fully supervised baseline reaches high performance faster (around 90k iterations), but overall performance is better with data distillation and longer training (around 270k iterations)
the more unlabeled data used, the better the performance
the quality of the teacher model is important
Personal Thoughts
Love the idea of self-training the model using unlimited internet-scale external data
The improvement in performance saturates and is not highly significant, likely due to the quality of the unlabeled data being used
How to build an ensemble from a single model is the key question; in NLP there is no direct analogue of the data transformations used in steps 2-3 of data distillation
Link: https://arxiv.org/pdf/1712.04440.pdf | Authors: Radosavovic et al., 2017