Closed: xyzacademic closed this issue 6 years ago
Hi, I went through my code and logs for the STL-10 experiments and found two things:
1) In the paper I stated the patch size used for STL-10 as 24 for the no-data-augmentation case and 32 with data augmentation. Looking at my logs, it appears the values actually used were 48 and 60, respectively.
2) It appears that I accidentally used the CIFAR-10 normalization parameters for STL-10 instead of calculating the new mean and std. While the CIFAR-10 values are quite close to what you used, they could still have caused a non-negligible change in model performance, especially given STL-10's small train set. That being said, these test results should not be compared to other STL-10 results that normalize the dataset properly. You could try substituting the CIFAR-10 normalization values into your pipeline to see if that raises the score at all, since it may be what is causing the difference.
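For reference, the per-channel mean/std can be computed directly from the train split. A minimal sketch (assuming images are stored as a `(N, H, W, 3)` uint8 array; this is not the actual repo code):

```python
import numpy as np

def channel_stats(images):
    """Per-channel mean and std of a (N, H, W, 3) uint8 array, scaled to [0, 1]."""
    x = images.astype(np.float64) / 255.0
    return x.mean(axis=(0, 1, 2)), x.std(axis=(0, 1, 2))

# toy random data standing in for the STL-10 train split
rng = np.random.default_rng(0)
fake_train = rng.integers(0, 256, size=(16, 96, 96, 3), dtype=np.uint8)
mean, std = channel_stats(fake_train)
print(mean, std)
```

Running this over the real STL-10 train split (instead of the toy array) gives the values to plug into the normalization transform.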
Let me know if those changes allow you to reproduce the results, otherwise we can look into it further.
I found the reason: I was using FP16 (even though the BN layers are float32) rather than FP32. When I use FP32, the final error rate is about 12%. But I have no idea why the gap is so big, since I also use FP16 for the CIFAR-10 experiment and there the result matches what you posted.
Okay, cool.
I've tried messing around with the FP16 in PyTorch before, but it seems very finicky whenever using it with batchnorm. Strange that it works for CIFAR-10 but not STL-10.
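One plausible contributor (an illustration of general FP16 behavior, not a diagnosis of this specific repo): float16 has only about 3 decimal digits of precision, so small weight updates can be rounded away entirely, which is one reason loss scaling is typically used with FP16 training:

```python
import numpy as np

update = 1e-4  # a typical small gradient step

# in float32 the update survives; in float16 it is lost to rounding,
# because the spacing between float16 values near 1.0 is ~9.8e-4
print(np.float32(1.0) + np.float32(update))  # slightly above 1.0
print(np.float16(1.0) + np.float16(update))  # exactly 1.0
```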
Hi @TDeVries and @xyzacademic, I'm trying to reproduce the STL-10 result with no data augmentation or cutout. I adopted the settings described in the paper with the changes mentioned above. However, I cannot get the 23.48% ± 0.68% error reported in the paper; instead, I get errors of about 30%. The test error gets stuck around 30% after epoch 400 (fluctuating within 1%), while the training error stays below 0.1% with cross-entropy < 0.01. Could you help me reproduce this, or possibly upload your code? More specifically, I set the parameters as follows.
- image size = 48
- normalization mean = [0.44671097, 0.4398105, 0.4066468], std = [0.2603405, 0.25657743, 0.27126738]
- Wide ResNet depth = 16, widen factor = 8, dropRate = 0.3
- initial learning rate = 0.1, momentum = 0.9, weight_decay = 5e-4
- number of epochs = 1000, data type = FP32
- learning rate scheduler = MultiStepLR(cnn_optimizer, milestones=[300, 400, 600, 800], gamma=0.2)
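As a sanity check on the schedule, a hypothetical helper (just reproducing MultiStepLR's arithmetic in plain Python, not PyTorch code) shows the learning rate at each stage:

```python
def lr_at(epoch, base_lr=0.1, milestones=(300, 400, 600, 800), gamma=0.2):
    """Learning rate after `epoch` epochs under a MultiStepLR-style schedule."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

for e in (0, 300, 400, 600, 800):
    print(e, lr_at(e))  # 0.1, 0.02, 0.004, 0.0008, 0.00016
```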
Could it be the image size? You said you are using 48x48 resolution; for the results in the paper I used the original image size of 96x96.
@TDeVries Thanks! I'll give it a try. Just to confirm: with the image size changed (from 32 to 96 compared to the published code), do you keep nChannels and the avg_pool kernel size unchanged but increase the input dimension of the fully connected layer by a factor of 3*3 = 9? Or do you increase the avg_pool kernel size from 8 to 24?
nChannels is unchanged. I think the main differences are that I changed the stride in block1 from 1 to 2, and the avg_pool kernel size from 8 to 12.
# stride 2 in block1 (instead of 1) to downsample the larger 96x96 inputs
self.block1 = NetworkBlock(n, nChannels[0], nChannels[1], block, 2, dropRate)
...
# feature maps are now 12x12, so pool over the full map
out = F.avg_pool2d(out, 12)
Hopefully that should give the correct output size.
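The stride arithmetic behind those changes can be checked in isolation (a sketch of the shape bookkeeping only, not the actual model): with 96x96 inputs, stride 2 in all three blocks leaves a 12x12 feature map, matching the avg_pool kernel of 12.

```python
def wrn_feature_size(input_size, block_strides=(2, 2, 2)):
    """Spatial size after WRN's initial 3x3 conv (stride 1, padding 1)
    and its three residual blocks with the given strides."""
    size = input_size  # the initial conv preserves resolution
    for s in block_strides:
        size //= s
    return size

print(wrn_feature_size(96))             # 12 -> avg_pool2d(out, 12) gives 1x1
print(wrn_feature_size(32, (1, 2, 2)))  # 8, the original CIFAR-10 setting
```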
Another thing you could try to improve results is to increase the dropout probability from 0.3 to 0.5. I'm not sure how much of an effect it has though.
Thanks for your advice. FYI: I am actually trying to test my hyperparameter optimization algorithm on this problem :)
I only modified the global pooling according to STL-10's image size, and I followed the implementation details you mentioned in the paper, but I cannot reproduce the STL10+ result of 14.21 ± 0.29; my result is about 18. So I was wondering: did you make any changes to the WRN? My normalization parameters are mean = [0.44671097, 0.4398105, 0.4066468] and std = [0.2603405, 0.25657743, 0.27126738]. Could you help me reproduce the results?