tdeboissiere / DeepLearningImplementations

Implementation of recent Deep Learning papers

Max pooling/Avg pooling #1

Closed by liuzhuang13 7 years ago

liuzhuang13 commented 7 years ago

Hi, thanks for reimplementing the DenseNets. I took a look at your structure diagram and noticed you used max pooling in the transition layers. We used avg pooling instead. Max pooling might work too, though. Just to let you know :)
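
For illustration, a transition block with average pooling could look like the sketch below (a minimal example using current `tf.keras` names, not the exact code in this repo):

```python
from tensorflow.keras import layers

def transition_layer(x, n_filters):
    """DenseNet-style transition: BN -> ReLU -> 1x1 conv -> 2x2 average pooling."""
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(n_filters, (1, 1), padding="same")(x)
    # Average pooling here, not max pooling.
    x = layers.AveragePooling2D(pool_size=(2, 2), strides=(2, 2))(x)
    return x
```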

Cheers,

tdeboissiere commented 7 years ago

Thanks for pointing that out! I had trouble reproducing your results with L = 40, k = 12 on CIFAR-10 (no data augmentation), as my test accuracy saturated around 90%.

I'm going to run tests with both to find out which works best. If you find any other mistakes, please let me know.

Edit:

Changed the model and figure to average pooling.

liuzhuang13 commented 7 years ago

Hi, I saw your description and results in the README and noticed one more difference: we didn't use any continuous lr decay. The learning rate is only divided twice; the rest of the time it's constant. The 1e-4 "decay" in our setting is the weight decay.

In our curve (in the paper) you can see big error drops at epochs 150 and 225, but in yours there aren't. That's probably due to your lr decay.
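
In code, that schedule is just a piecewise-constant learning rate; a sketch with Keras' `LearningRateScheduler`, assuming the paper's 300-epoch run with an initial lr of 0.1 divided by 10 at epochs 150 and 225:

```python
from tensorflow.keras.callbacks import LearningRateScheduler

BASE_LR = 0.1  # initial learning rate from the paper

def step_schedule(epoch, lr=None):
    """Constant lr, divided by 10 at epochs 150 and 225 of a 300-epoch run."""
    if epoch >= 225:
        return BASE_LR / 100.0
    if epoch >= 150:
        return BASE_LR / 10.0
    return BASE_LR

# No continuous decay on the optimizer; the paper's 1e-4 is an L2 weight decay
# on the layers, not a learning-rate decay.
lr_callback = LearningRateScheduler(step_schedule)
# model.fit(..., callbacks=[lr_callback])
```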

Cheers,

tdeboissiere commented 7 years ago

Thanks!

Two more questions, if you can spare the time:

liuzhuang13 commented 7 years ago

> Do you apply weight_decay to all layers (input, inside the dense blocks, in the transitions, and the fully connected output)?

Yes, even including the batch norm layers' scale parameters.

> Do you apply bias decay as well?

Yes, Torch applies weight decay to biases too. However, note that in the convolutional layers we didn't use biases, since batch norm undoes what a bias would do.
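
In current Keras terms, that setup could be sketched as below (argument names follow today's `tf.keras` API and the layer sizes are placeholders, not the exact code in this repo):

```python
from tensorflow.keras import layers, regularizers

wd = regularizers.l2(1e-4)  # the 1e-4 from the paper is this L2 weight decay

# Convolution: no bias (batch norm makes it redundant), decay on the kernel.
conv = layers.Conv2D(48, (3, 3), padding="same", use_bias=False, kernel_regularizer=wd)

# Batch norm: decay on the scale (gamma) and shift (beta) parameters as well.
bn = layers.BatchNormalization(gamma_regularizer=wd, beta_regularizer=wd)

# Fully connected output: decay on both the weights and the bias.
fc = layers.Dense(10, activation="softmax", kernel_regularizer=wd, bias_regularizer=wd)
```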

Good luck!

tdeboissiere commented 7 years ago

All right, so now I have:

- average pooling in the transition layers
- no continuous lr decay; the lr is only divided twice (at epochs 150 and 225)
- a weight decay of 1e-4 on all layers, including the batch norm scale parameters and the biases
- no bias in the convolutional layers

Sounds good?

Edit:

I applied the above and reproduced your CIFAR-10 (no augmentation) results: sweet!

ruudvlutters commented 7 years ago

Hi,

I'm also trying to replicate DenseNet in Keras, and I found that Keras uses a default momentum of 0.99 in the batch norm layers, whereas Torch uses 0.1 (which corresponds to 0.9 in Keras, as it uses the opposite definition). Maybe another reason for a difference?
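
A sketch of matching Torch's setting in Keras (assuming the current `momentum` argument name):

```python
from tensorflow.keras import layers

# Torch updates the running statistics as new = 0.9 * old + 0.1 * batch,
# which in Keras' convention is momentum = 1 - 0.1 = 0.9 (not the 0.99 default).
bn = layers.BatchNormalization(momentum=0.9)
```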

liuzhuang13 commented 7 years ago

@ruudvlutters In the Torch code we apply a momentum of 0.9 to all weights and biases, including the batch norm layers. Yes, that could be a reason for the difference. Thanks for pointing it out.

And I just found that the initialization is different too: the initialization used in our Torch code (which is copied from fb.resnet.torch) is indeed different from the commonly used "he"/"msra" scheme.
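
If I read the fb.resnet.torch code correctly, the convolution weights are drawn from a normal distribution with std sqrt(2 / (kH * kW * nOutputPlane)), i.e. a He-style init computed over the fan-out, while Keras' `he_normal` draws a truncated normal over the fan-in. A hedged sketch of a matching custom initializer:

```python
import numpy as np
from tensorflow.keras import backend as K

def fb_resnet_conv_init(shape, dtype=None):
    """Conv kernel init as in fb.resnet.torch (my reading): normal(0, sqrt(2 / (kH*kW*n_out)))."""
    kh, kw, _, n_out = shape  # Keras conv kernel shape: (kH, kW, n_in, n_out)
    stddev = np.sqrt(2.0 / (kh * kw * n_out))
    return K.random_normal(shape, mean=0.0, stddev=stddev, dtype=dtype)

# usage: layers.Conv2D(..., kernel_initializer=fb_resnet_conv_init, use_bias=False)
```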

tdeboissiere commented 7 years ago

Given that I was able to reproduce your results (see the update) without matching the momentum/initialization settings, it seems these differences do not matter that much.