google-research / uda

Unsupervised Data Augmentation (UDA)
https://arxiv.org/abs/1904.12848
Apache License 2.0

Possible Overlap in CIFAR Data #30

Closed. wilbry closed this issue 5 years ago

wilbry commented 5 years ago

Looking at `main` in https://github.com/google-research/uda/blob/master/image/preprocess.py, it seems that the supervised and unsupervised sets are drawn independently from the same data, so some images are likely to appear in both sets. Do you think this affects your results in any way? Or am I misreading the code?

michaelpulsewidth commented 5 years ago

Yes, we followed ICT and included labeled training data in the unlabeled set. The code for the data split of ICT is available here.

We also tried excluding the labeled training data from the unlabeled set. The accuracy on CIFAR-10 with 4,000 labeled examples is 94.66 ± 0.17, similar to the original performance of 94.73 ± 0.11.
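
For readers comparing the two setups, here is a minimal sketch of the difference. It is not the code in `preprocess.py` or in the ICT repository. The function name `split_indices` and its parameters are hypothetical, the unlabeled set is assumed to be the full training set in the overlapping case, and the labeled subset is sampled uniformly rather than balanced per class.

```python
import random


def split_indices(num_train, num_sup, exclude_labeled=False, seed=0):
    """Return (sup_idx, unsup_idx) for a semi-supervised split.

    Hypothetical helper for illustration only; not part of the UDA code.
    """
    rng = random.Random(seed)
    all_idx = list(range(num_train))

    # Labeled subset: a uniform random sample of the training indices.
    # (Assumption: no per-class balancing, unlike many real CIFAR splits.)
    sup_idx = sorted(rng.sample(all_idx, num_sup))

    if exclude_labeled:
        # Disjoint variant: drop every labeled index from the unlabeled set.
        sup_set = set(sup_idx)
        unsup_idx = [i for i in all_idx if i not in sup_set]
    else:
        # Overlapping variant: the labeled images are also used,
        # without their labels, as unlabeled data.
        unsup_idx = all_idx

    return sup_idx, unsup_idx


if __name__ == "__main__":
    # CIFAR-10 has 50,000 training images; 4,000 are labeled here.
    sup, unsup = split_indices(50000, 4000)
    print(len(sup), len(unsup), len(set(sup) & set(unsup)))

    sup, unsup = split_indices(50000, 4000, exclude_labeled=True)
    print(len(sup), len(unsup), len(set(sup) & set(unsup)))
```

In the overlapping variant all 4,000 labeled images also appear in the unlabeled set; in the disjoint variant the unlabeled set shrinks to 46,000 images and shares none with the labeled set.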

wilbry commented 5 years ago

Thanks for the explanation!