Do we need labels when the existing labels are class-imbalanced (some classes have many more labeled examples than others) and we have a lot of unlabeled data?
Positive. Yes, we need labels: self-train on the unlabeled data and you are golden. (Self-training is a process where an intermediate model, trained on the human-labeled data, is used to create labels for the unlabeled data (thus, pseudo labels), and the final model is then trained on both the human-labeled and the pseudo-labeled data.)
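The self-training loop above can be sketched in a few lines. This is a toy illustration on made-up 1-D data (the data, the 0.9 confidence threshold, and the use of logistic regression are all my assumptions for brevity, not the paper's setup):

```python
# Self-training sketch:
# 1) train an intermediate model on the human-labeled set,
# 2) pseudo-label the unlabeled set where the model is confident,
# 3) train the final model on labeled + pseudo-labeled data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, mean):  # hypothetical toy data: 1-D Gaussian blobs
    return rng.normal(mean, 1.0, size=(n, 1))

X_lab = np.vstack([sample(20, -2.0), sample(20, 2.0)])
y_lab = np.array([0] * 20 + [1] * 20)
X_unlab = np.vstack([sample(200, -2.0), sample(200, 2.0)])  # labels unknown

# Step 1: intermediate model on human-labeled data only.
intermediate = LogisticRegression().fit(X_lab, y_lab)

# Step 2: keep only confident pseudo labels (threshold is an assumption).
proba = intermediate.predict_proba(X_unlab)
confident = proba.max(axis=1) > 0.9
pseudo_y = proba.argmax(axis=1)[confident]

# Step 3: final model on the combined data.
X_all = np.vstack([X_lab, X_unlab[confident]])
y_all = np.concatenate([y_lab, pseudo_y])
final = LogisticRegression().fit(X_all, y_all)
```

In practice this is iterated (retrain, re-pseudo-label, repeat) and the confidence threshold controls how much label noise leaks into the final training set.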
Negative. We can do away with the labels. One can use self-supervised pretraining on all the available data (labeled and unlabeled, ignoring the labels) to learn meaningful representations, and then train on the actual classification task. This approach is shown to improve performance.
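A rough sketch of that two-stage recipe, pretrain a representation without labels, then train the classifier on top. Here the self-supervised pretext task (e.g. rotation prediction or contrastive learning in the real setting) is stood in by unsupervised PCA purely to show the flow; the toy data and imbalance ratio are my assumptions:

```python
# Stage 1: learn a representation from ALL data, using no labels.
# Stage 2: train the actual classifier on the encoded labeled data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def sample(n, mean):  # hypothetical toy data: 5-D Gaussian blobs
    return rng.normal(mean, 1.0, size=(n, 5))

X_lab = np.vstack([sample(30, -1.5), sample(5, 1.5)])  # class-imbalanced labels
y_lab = np.array([0] * 30 + [1] * 5)
X_unlab = np.vstack([sample(300, -1.5), sample(300, 1.5)])

# Stage 1: "pretraining" stand-in (PCA instead of a real pretext task),
# fit on labeled + unlabeled inputs with the labels ignored.
encoder = PCA(n_components=2).fit(np.vstack([X_lab, X_unlab]))

# Stage 2: supervised fine-tuning on the (imbalanced) labeled set.
clf = LogisticRegression().fit(encoder.transform(X_lab), y_lab)
```

The point of the recipe is that stage 1 never sees the imbalanced labels, so the learned representation is not skewed toward the majority classes.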
Takeaway: If you have class-imbalanced labels and plenty of unlabeled data, do self-training or self-supervised pretraining. (Self-training is shown to beat self-supervised pretraining on CIFAR-10-LT, though.)
Below are notes from here.
Keywords: imbalanced classification,
What is