google-research / mixmatch


How to choose total number of training steps #33

Closed hzhz2020 closed 4 years ago

hzhz2020 commented 4 years ago

Dear Authors,

Thanks for your very inspiring work. I saw in the implementation details section of your paper "MixMatch: A Holistic Approach to Semi-Supervised Learning" that "we checkpoint every 2^16 training samples and report the median error rate of the last 20 checkpoints. This simplifies the analysis at a potential cost to accuracy...". Since you already mention that reporting the median may come at a cost to accuracy, I am wondering: do you have an estimate of how large that cost would be, and do you have any suggestions or rules of thumb for choosing the total number of training steps? With a poorly chosen total number of training steps, the last 20 checkpoints could fall into a region where the model is already overfitting, and the performance on the test set would be much worse than it could be.
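
For reference, a minimal sketch of the reporting scheme described in the paper (the numbers and variable names below are illustrative, not taken from the MixMatch code):

```python
import numpy as np

# Illustrative test error rates, one per checkpoint; in the paper a checkpoint
# is taken every 2^16 training images, so these would be the evaluation results
# logged over the course of training.
checkpoint_error_rates = np.array(
    [0.12, 0.10, 0.09, 0.085, 0.08] + [0.072, 0.070, 0.069, 0.071, 0.068] * 4
)

# Report the median error rate of the last 20 checkpoints instead of the
# single best checkpoint, which reduces the variance of the reported number.
reported_error = np.median(checkpoint_error_rates[-20:])
print(f"reported error rate: {reported_error:.4f}")
```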

Thanks!

david-berthelot commented 4 years ago

This methodology is not MixMatch-specific; we used it for all methods. Its purpose was not to stop overfitting but to minimize the variance of the measurements. Generally we found that the longer you train, the better; we trained for 2^16K images (2^26 images) due to time constraints (that's roughly 1000 epochs). Parameters and settings in our experiments seemed robust across all datasets. The best suggestion I can offer is to experiment on your data; there's no secret recipe.
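
For anyone following along, a quick sketch of that arithmetic (the constant names below are mine, not the repository's):

```python
# Training budget quoted above: 2^16 kimg, i.e. 2^16 * 1024 = 2^26 images in total.
KIMG = 1 << 16                    # 65,536 "thousands of images"
TOTAL_IMAGES = KIMG * 1024        # 67,108,864 = 2^26 images

# With a checkpoint/"epoch" defined as 2^16 images (see the reply below),
# that budget corresponds to 1024 such epochs -- "roughly 1000".
IMAGES_PER_EPOCH = 1 << 16
print(TOTAL_IMAGES // IMAGES_PER_EPOCH)  # 1024
```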

hzhz2020 commented 4 years ago

Thanks a lot for your response and advice.

hzhz2020 commented 4 years ago

Hello again

Regarding your previous reply, I am a little confused: do you mean 2^16K images is one epoch? If so, why is that (if we think of an epoch as going through the whole dataset once)?

'we trained for 2^16K images (2^26 images) due to time constraints (that's roughly 1000 epochs)'.

Thank you!

david-berthelot commented 4 years ago

CIFAR-10 has 50K images, and 2^26 / 50,000 ≈ 1342, which is the exact number of training epochs. However, since almost every dataset has a different number of samples, to keep the code simple we decided to call an "epoch" 65,536 images (2^16) and share that number across all datasets, so that every experiment on every dataset is trained on the same number of images.
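
As a rough illustration of that convention (the second dataset size below is an example I am adding, not a figure from this thread):

```python
IMAGES_PER_EPOCH = 1 << 16        # 65,536 images: one "epoch" as defined in the codebase
NUM_EPOCHS = 1024                 # fixed for every dataset -> 2^26 images total
TOTAL_IMAGES = IMAGES_PER_EPOCH * NUM_EPOCHS

# For CIFAR-10 (50,000 training images) this is ~1342 passes over the real dataset.
print(TOTAL_IMAGES / 50_000)      # 1342.17...

# A dataset of a different size sees a different number of real passes, but the
# same total number of training images, so the training budget stays comparable.
print(TOTAL_IMAGES / 73_257)      # e.g. SVHN: ~916 passes
```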

hzhz2020 commented 4 years ago

Thanks for clearing things up!