uoguelph-mlrg / Cutout

2.56%, 15.20%, 1.30% on CIFAR10, CIFAR100, and SVHN https://arxiv.org/abs/1708.04552

Is there any change in Wide ResNet for STL10? #2

Closed xyzacademic closed 6 years ago

xyzacademic commented 6 years ago

I only modified the global pooling to match STL-10's image size and otherwise followed the implementation details you mentioned in the paper. However, I cannot reproduce the STL10+ result of 14.21 ± 0.29; my error rate is about 18%. So I was wondering whether you made any changes to the WRN. My normalization parameters are mean = [0.44671097, 0.4398105, 0.4066468] and std = [0.2603405, 0.25657743, 0.27126738]. Could you help me reproduce the results?
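
(For reference, these per-channel statistics can be computed from the STL-10 train split; below is a rough sketch of that computation, assuming torchvision's `STL10` dataset, with illustrative loader settings.)

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Rough sketch: estimate per-channel mean/std over the STL-10 train split.
train_set = datasets.STL10(root='data/', split='train', download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, num_workers=2)

count, mean, sq_mean = 0, torch.zeros(3), torch.zeros(3)
for images, _ in loader:
    b = images.size(0)
    flat = images.view(b, 3, -1)           # (batch, channel, pixels)
    mean += flat.mean(dim=2).sum(dim=0)    # accumulate per-image channel means
    sq_mean += flat.pow(2).mean(dim=2).sum(dim=0)
    count += b

mean /= count
std = (sq_mean / count - mean.pow(2)).sqrt()
print(mean.tolist(), std.tolist())  # should land near the values quoted above
```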

TDeVries commented 6 years ago

Hi, I went through my code and logs for the STL-10 experiments and found two things:

1) In the paper I stated that the cutout patch size used for STL-10 was 24 for the no-data-augmentation case and 32 with data augmentation. Looking at my logs, it appears the values actually used were 48 and 60, respectively.

2) It appears that I accidentally used the CIFAR-10 normalization parameters for STL-10 instead of calculating a new mean and std. While the CIFAR-10 values are pretty close to what you used, it could still have caused a non-negligible change in model performance, especially considering STL-10's small train set. That being said, these test results should not be compared to other STL-10 results that normalize the dataset properly. You could try substituting the CIFAR-10 normalization values into your pipeline to see if that increases the score at all, since this may be what is causing the difference.

Let me know if those changes allow you to reproduce the results, otherwise we can look into it further.
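
To make those two corrections concrete, here is a rough sketch of what the training transform could look like, assuming this repo's `Cutout` transform from `util/cutout.py` (with its `n_holes`/`length` interface) and approximate CIFAR-10 statistics; the crop/flip augmentation shown is only illustrative, not taken from the paper:

```python
from torchvision import transforms
from util.cutout import Cutout  # Cutout transform shipped with this repo

# Approximate CIFAR-10 statistics (what the paper's STL-10 runs effectively used)
cifar10_mean = [0.4914, 0.4822, 0.4465]
cifar10_std = [0.2470, 0.2435, 0.2616]

train_transform = transforms.Compose([
    transforms.RandomCrop(96, padding=12),   # illustrative augmentation for 96x96 STL-10
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(cifar10_mean, cifar10_std),
    Cutout(n_holes=1, length=60),            # 60 with augmentation, 48 without
])
```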

xyzacademic commented 6 years ago

I found the reason: I was using FP16 (even though BatchNorm was kept in float32) rather than FP32. When I use float32, the final error rate is a bit over 12%. But I have no idea why the gap is so big, since I also use FP16 in my CIFAR-10 experiments and the result matches what you posted.
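
For reference, the mixed-precision setup I am describing (model in FP16 with BatchNorm kept in FP32) looks roughly like the sketch below; names are placeholders, not the exact code I ran:

```python
import torch
import torch.nn as nn

def half_except_batchnorm(model: nn.Module) -> nn.Module:
    """Cast parameters to FP16 but keep BatchNorm layers in FP32 for stability."""
    model.half()
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            module.float()
    return model

# Usage sketch (model/loader names are placeholders):
# model = half_except_batchnorm(WideResNet(...)).cuda()
# for images, labels in train_loader:
#     images = images.cuda().half()   # inputs must also be cast to FP16
#     labels = labels.cuda()
#     ...
```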

TDeVries commented 6 years ago

Okay, cool.

I've tried messing around with FP16 in PyTorch before, but it seems very finicky when used with batchnorm. Strange that it works for CIFAR-10 but not STL-10.

ghost commented 5 years ago

Hi @TDeVries and @xyzacademic, I'm trying to reproduce the result for STL-10 with neither data augmentation nor cutout. I followed the settings described in the paper, along with the changes mentioned above. However, I cannot get the 23.48% ± 0.68% error reported in the paper; instead, I get errors of about 30%. The test error gets stuck around 30% after epoch 400, fluctuating within 1%, while the training error stays below 0.1% and the cross-entropy loss below 0.01. Could you help me reproduce this, or possibly upload your code? More specifically, I set the parameters as follows.

- image size = 48
- normalization mean = [0.44671097, 0.4398105, 0.4066468], std = [0.2603405, 0.25657743, 0.27126738]
- Wide ResNet depth = 16, widen factor = 8, dropRate = 0.3
- initial learning rate = 0.1, momentum = 0.9, weight_decay = 5e-4
- number of epochs = 1000
- data type = FP32
- learning rate scheduler = MultiStepLR(cnn_optimizer, milestones=[300, 400, 600, 800], gamma=0.2)
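
For reference, a minimal sketch of that optimizer/scheduler setup in PyTorch (the model and training loop are placeholders, not my exact code):

```python
import torch.optim as optim
from torch.optim.lr_scheduler import MultiStepLR

# `model` is assumed to be a WRN-16-8 instance; hyperparameters as listed above.
cnn_optimizer = optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=5e-4)
scheduler = MultiStepLR(cnn_optimizer, milestones=[300, 400, 600, 800], gamma=0.2)

for epoch in range(1000):
    train_one_epoch(model, cnn_optimizer)  # placeholder for the actual training loop
    scheduler.step()                        # multiply LR by 0.2 at each milestone
```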

TDeVries commented 5 years ago

Could it be the image size? You said you are using 48x48 resolution; for the results in the paper I used the original image size of 96x96.

ghost commented 5 years ago

@TDeVries Thanks! I'll give it a try. Just to confirm: with the image size changed (from 32 to 96 compared to the published code), did you keep nChannels and the avg_pool kernel size unchanged but increase the input dimension of the fully connected layer by a factor of 3*3 = 9? Or did you increase the avg_pool kernel size from 8 to 24?

TDeVries commented 5 years ago

nChannels is unchanged. I think the main differences are that I changed the stride in block1 from 1 to 2, and the avg_pool kernel size from 8 to 12.

```python
self.block1 = NetworkBlock(n, nChannels[0], nChannels[1], block, 2, dropRate)
...
out = F.avg_pool2d(out, 12)
```

Hopefully that should give the correct output size.
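
To see why a 12x12 pool gives the right output size with 96x96 inputs, here is a rough sketch of the forward pass with the spatial sizes tracked in comments (layer names follow the WideResNet code, channel counts assume WRN-16-8, and `F` is `torch.nn.functional`):

```python
def forward(self, x):                   # x: (N, 3, 96, 96)
    out = self.conv1(x)                 # stride 1 -> (N, 16, 96, 96)
    out = self.block1(out)              # stride 2 -> (N, 128, 48, 48)
    out = self.block2(out)              # stride 2 -> (N, 256, 24, 24)
    out = self.block3(out)              # stride 2 -> (N, 512, 12, 12)
    out = self.relu(self.bn1(out))
    out = F.avg_pool2d(out, 12)         # 12x12 pool -> (N, 512, 1, 1)
    out = out.view(-1, self.nChannels)  # flatten -> (N, 512)
    return self.fc(out)
```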

Another thing you could try to improve results is increasing the dropout probability from 0.3 to 0.5. I'm not sure how much of an effect it has, though.

ghost commented 5 years ago

Thanks for your advice. FYI: I am actually trying to test my hyperparameter optimization algorithm on this problem :)