Abstract
Omni-supervised learning: a form of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data.
The core idea is self-training the model with an external unlabeled dataset.
Details
Omni-Supervised Learning
performance is lower-bounded by that of training on the existing labeled dataset alone
Model Distillation vs Data Distillation
Model distillation: use multiple models and ensemble their predictions to generate new training annotations
Data distillation: a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations (the transformations are largely specific to image tasks)
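A minimal sketch of the distinction, using toy tensors and placeholder linear "models" (nothing here comes from the paper's actual Mask R-CNN setup; the horizontal flip just illustrates mapping predictions back to the original geometry):

```python
import torch

torch.manual_seed(0)
img = torch.rand(8, 8, 3)  # toy "image" (H, W, C), stand-in for real detection inputs

# Toy stand-ins: each "model" maps the image to a per-pixel score map.
models = [torch.nn.Linear(3, 1) for _ in range(4)]
def predict(model, x):
    return torch.sigmoid(model(x)).squeeze(-1)       # (8, 8) heatmap

# Model distillation: ensemble MULTIPLE models on the SAME input.
model_distilled = torch.stack([predict(m, img) for m in models]).mean(dim=0)

# Data distillation: ensemble ONE model over MULTIPLE transforms of the input,
# mapping each prediction back to the original geometry before averaging.
model = models[0]
pred_identity = predict(model, img)
pred_flip = torch.flip(predict(model, torch.flip(img, dims=[1])), dims=[1])  # flip, predict, un-flip
data_distilled = torch.stack([pred_identity, pred_flip]).mean(dim=0)

print(model_distilled.shape, data_distilled.shape)   # torch.Size([8, 8]) for both
```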
Related Works
Related to knowledge distillation, which uses the soft predictions of a teacher model as the student model's targets. Like semi-supervised learning, it exploits an external unlabeled dataset, but it uses a single model to generate the new training annotations, whereas previous works used multiple models.
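For reference, a standard soft-target knowledge distillation loss (the Hinton-style formulation; the temperature value and toy logits below are illustrative, not from this paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: the student matches the teacher's temperature-softened predictions."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2

# Example with random logits for a batch of 4 examples, 10 classes.
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```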
Data Distillation
1) train a model on manually labeled data
2) apply the trained model to multiple transformations of the unlabeled data
3) convert the predictions on the unlabeled data into labels by ensembling the multiple predictions
The ensembling takes different forms depending on the task:
averaging soft predictions, voting on hard predictions, taking the top-k predictions during inference, etc.
4) retrain the model on the union of the manually labeled data and the automatically labeled data (an end-to-end sketch follows this list)
retrain from scratch, assuming that the previously trained model sits in a sub-optimal region
the new model is trained on the UNION of the manually labeled and generated data
ensure that each training mini-batch contains a mixture of manually labeled data and automatically labeled data
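An end-to-end sketch of the four steps on a toy classification problem. In the paper the task is keypoint detection with Mask R-CNN, the transforms are geometric (horizontal flips, multiple scales) with predictions mapped back, and pseudo-labels are selected by a detection score threshold; the transforms, model sizes, and 0.5 confidence cutoff below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, ConcatDataset

# --- Toy setup: stand-ins for the paper's Mask R-CNN teacher and COCO data --
torch.manual_seed(0)
teacher = torch.nn.Linear(16, 3)                       # step 1: the "trained" model
labeled_x, labeled_y = torch.randn(64, 16), torch.randint(0, 3, (64,))
unlabeled_x = torch.randn(256, 16)

# Hypothetical input transforms; in the paper these are geometric
# (horizontal flips, multiple scales) with predictions mapped back.
transforms = [lambda x: x, lambda x: x + 0.01 * torch.randn_like(x)]

# Steps 2-3: ensemble predictions over transforms into pseudo-labels.
with torch.no_grad():
    probs = torch.stack([F.softmax(teacher(t(unlabeled_x)), dim=-1) for t in transforms])
    avg_probs = probs.mean(dim=0)                      # average the soft predictions
    conf, pseudo_y = avg_probs.max(dim=-1)
    keep = conf > 0.5                                  # keep only confident pseudo-labels

# Step 4: retrain a fresh model on the union of real and generated labels.
student = torch.nn.Linear(16, 3)                       # retrained from scratch, not fine-tuned
union = ConcatDataset([TensorDataset(labeled_x, labeled_y),
                       TensorDataset(unlabeled_x[keep], pseudo_y[keep])])
# The paper enforces a fixed per-mini-batch ratio of manually vs. automatically
# labeled examples; plain shuffling of the union is a simplification here.
loader = DataLoader(union, batch_size=32, shuffle=True)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
for epoch in range(5):
    for x, y in loader:
        opt.zero_grad()
        F.cross_entropy(student(x), y).backward()
        opt.step()
```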
Experiments on Keypoint Detection
1) small scale data as a sanity-check
data distillation helps, but is upper-bounded by fully supervised learning on all of the data
2) large scale data with similar distribution
data distillation improves performance in a saturating manner
3) large scale data with dissimilar distribution
performance also improves when the unlabeled data comes from a dissimilar distribution
Ablation Experiments on Number of Iterations
the fully supervised baseline reaches high performance faster (around 90k iterations), but overall performance is better with data distillation and longer training (around 270k iterations)
the more unlabeled data used, the better the performance
the quality of the teacher model is important
Personal Thoughts
Love the idea of self-training the model using unlimited internet-scale external data
The improvement in performance saturates and is not highly significant, likely due to the quality of the unlabeled data being used
How to build an ensemble from a single model is the key question; in NLP there is no direct analogue of the data transformations used in steps 2-3 of data distillation
Link: https://arxiv.org/pdf/1712.04440.pdf | Authors: Radosavovic et al., 2017