prlz77 / ResNeXt.pytorch

Reproduces ResNet-V3 with pytorch
MIT License

Questions about the performances. #1

Open D-X-Y opened 7 years ago

D-X-Y commented 7 years ago

Hi,

May I ask about your final performance? The curves are a little confusing. I also implemented a different version (https://github.com/D-X-Y/ResNeXt), and my results are a little lower than the official code, about 0.2 on CIFAR-10 and 1.0 on CIFAR-100. I really want to know what causes the differences.

I also tried training resnet20/32/44/56. I'm pretty sure the model architecture is the same as in the official code, but I still obtain a much lower accuracy.

Would you mind giving me some suggestions?

wangdelp commented 7 years ago

I am also curious about the training performance. BTW, I need to run the training many times with different hyper-parameters, and running 300 epochs takes days even with four Titan X GPUs. Did you try using fewer epochs or a different learning rate schedule? Please let me know if you have any suggestions. Thank you.

prlz77 commented 7 years ago

@D-X-Y On CIFAR-10 it reaches 96.44%, and on CIFAR-100 81.62%. However, I am not fixing the random seed between runs, so it sometimes does better than the baseline and sometimes worse.

As for what could be causing the difference in performance: I talked with the author of the original paper, and he told me (he was right) that since I was using batch_size = 128 instead of 256, the lr should be divided by two. I have checked your code and I don't see much difference from mine, so could it just be a matter of finding the right random seed? Is the initialization of the weights exactly the same as in their code?
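
To make those two points concrete, here is a minimal sketch (not this repo's actual training script; the seed value and variable names are mine):

```python
import random
import numpy as np
import torch

# Fix the random seed so runs are comparable (the seed value is arbitrary).
seed = 0
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Linear learning-rate scaling: the paper's base setting is lr=0.1 at
# batch size 256, so halving the batch size to 128 halves the lr to 0.05.
base_lr, base_batch_size = 0.1, 256
batch_size = 128
lr = base_lr * batch_size / base_batch_size  # -> 0.05
```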

prlz77 commented 7 years ago

@wangdelp Using a single Titan X it takes me roughly one day on CIFAR. What are your batch size and learning rate?

D-X-Y commented 7 years ago

@prlz77 Thanks for your responses. The initialization is the same, and I only trained on CIFAR-10 once, so maybe the average performance will be better.

There are two versions of the ResNeXt paper; they changed the batch size for CIFAR from 256 to 128 in version 2. I notice that your performance on CIFAR-100 is about 1 point lower than the original paper. Do you think this is caused by the learning rate and multi-GPU training?

prlz77 commented 7 years ago

@D-X-Y Since the performance on CIFAR-10 is correct, it is difficult to guess what is happening on CIFAR-100. Some possibilities are:

prlz77 commented 7 years ago

By the way, take into account that the results I am providing are for the small net (cardinality 8, widen factor 4), so it gets 0.1 better on CIFAR-10 and 0.6 worse on CIFAR-100. When I have some time, I will provide multi-run results to see if it is always like this.

wangdelp commented 7 years ago

@prlz77 I was using batch size 64 since I want to reduce memory consumption, distributed among 4 GPUs. I am using the default learning rate 0.1 with decay at [0.5, 0.75] * args.epochs, running for 300 epochs. It sounds like I need two days to complete training on CIFAR-100. Maybe it's because other lab members are also using the GPUs.

Using batch size 256 would lead to out of memory on a 12GB GPU. Maybe I should try batch size 128 on two GPUs.
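
For reference, that decay schedule can be written with torch.optim.lr_scheduler.MultiStepLR; this is only a sketch, and the optimizer hyper-parameters and placeholder model below are illustrative, not necessarily what this repo uses:

```python
import torch
from torch import nn, optim

epochs = 300
model = nn.Linear(10, 10)  # placeholder for the actual network
optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9,
                      weight_decay=5e-4, nesterov=True)

# Decay the lr by 10x at 50% and 75% of training, i.e. [0.5, 0.75] * epochs.
milestones = [int(0.5 * epochs), int(0.75 * epochs)]
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones,
                                           gamma=0.1)

for epoch in range(epochs):
    # train_one_epoch(model, optimizer)  # training loop omitted
    scheduler.step()
```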

prlz77 commented 7 years ago

@wangdelp In my experience, bs=128 distributed over two 1080 Tis takes about one day. bs=128 on a single GPU takes a little longer. bs=64 takes almost double the time for the same 300 epochs. I would suggest using bs=128 (note that with ngpu=4 each GPU only holds 128/4 samples, which is a small amount of memory). If the GPUs are already in use, that could be causing a performance issue, as you say. Even so, check that data loading is not the bottleneck, for instance by increasing the number of prefetching threads.
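
A minimal sketch of what I mean by raising the prefetching threads, using torchvision's CIFAR-10 loader (the transform and worker count here are just examples, not necessarily this repo's settings):

```python
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = torchvision.datasets.CIFAR10(root='./data', train=True,
                                         download=True, transform=transform)

# num_workers controls the number of prefetching processes; raising it
# often helps when the GPU is sitting idle waiting for data.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                           shuffle=True, num_workers=4,
                                           pin_memory=True)
```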

wangdelp commented 7 years ago

@prlz77 Thank you. Should I use an initial lr of 0.05 when batch size is 128, and lr 0.025 when batch size is 64?

prlz77 commented 7 years ago

@wangdelp Exactly!

Queequeg92 commented 7 years ago

Hi, guys. I have a question about the results reported in the paper. Did they report the median of the best test error during training, or the median of the test error after training? @prlz77 @wangdelp

prlz77 commented 7 years ago

@Queequeg92 I think it is the median of the best test error during training.

Queequeg92 commented 7 years ago

@prlz77 I agree with you, since models are likely to be overfitting at the end of the training process. I have emailed some of the authors to confirm.

Queequeg92 commented 7 years ago

@prlz77 I think Part D of this paper gives the answer.

wandering007 commented 6 years ago

@D-X-Y @prlz77 I'm facing the same problem when reproducing the performance of DenseNet-40 on CIFAR-100. With exactly the same configuration, the accuracy of the PyTorch version is often 1 point lower than the Torch version. I don't think it is caused by random seeds. However, after digging into the implementation details of the two frameworks, I found no differences. I am so confused...

prlz77 commented 6 years ago

In the past I've noticed up to a 1% difference just from using the cudnn fastest options, due to noise introduced by numerical imprecision.

wandering007 commented 6 years ago

@prlz77 I set cudnn.benchmark = True and cudnn.deterministic = True. Is that ok?

prlz77 commented 6 years ago

@wandering007 maybe with cudnn.deterministic = False you get better results.
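
For completeness, the flags in question (only a sketch; whether they matter depends on the cuDNN version):

```python
import torch.backends.cudnn as cudnn

# benchmark=True lets cuDNN pick the fastest convolution algorithms for
# the given input sizes; deterministic=False allows non-deterministic
# (often faster) algorithms, at the cost of bit-exact reproducibility.
cudnn.benchmark = True
cudnn.deterministic = False
```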

wandering007 commented 6 years ago

@prlz77 No improvements from my experiments. Thank you anyway.

prlz77 commented 6 years ago

@wandering007 I'm sorry to hear that. I observed this behaviour some years ago; maybe the library has changed, or the noise is not as important in this model.

boluoweifenda commented 6 years ago

@wandering007 I'm also confused about the differences between the two CIFAR datasets. I got similar accuracy with Wide-DenseNet on CIFAR-10, but on CIFAR-100, with exactly the same model and training details, the accuracy is always about 1% lower than reported in the paper. Do you have any suggestions? BTW, I'm using TensorFlow.

wandering007 commented 6 years ago

@boluoweifenda I haven't trained it with TensorFlow. There are a lot of ways to improve performance if you don't care about a fair comparison, like using dropout, a better lr schedule, or better data augmentation. Personally, I think a 1% difference between two frameworks is acceptable. BTW, using the same settings across different frameworks is not entirely fair in itself :-)

boluoweifenda commented 6 years ago

@wandering007 Thanks for your reply~ But I do care about the fair comparison. Maybe I need to dig deeper to find the differences between frameworks. However, I got the same accuracy on CIFAR-10 using TensorFlow, so it's quite strange that the accuracy drops on CIFAR-100.
(╯°Д°)╯︵┻━┻