liuzhuang13 / DenseNet

Densely Connected Convolutional Networks, In CVPR 2017 (Best Paper Award).
BSD 3-Clause "New" or "Revised" License

Why did you use MomentumOptimizer? and dropout... #28

Open taki0112 opened 6 years ago

taki0112 commented 6 years ago

Hello. When I saw DenseNet, I implemented it in TensorFlow (using MNIST data).

My questions are:

  1. In my experiments, AdamOptimizer performed better than MomentumOptimizer. Is this specific to MNIST? I have not run experiments on CIFAR yet.

  2. In the case of dropout, I apply it only to the bottleneck layers, not to the transition layers. Is this right?

  3. Is Batch Normalization applied only during training, or during both training and testing?

  4. What is global average pooling, and how do I do it in TensorFlow?

Please let me know if there was any particular reason for these choices. And if you can look at my TensorFlow code, I'd appreciate you checking whether I implemented it correctly: https://github.com/taki0112/Densenet-Tensorflow

Thank you

liuzhuang13 commented 6 years ago

Hello @taki0112

A1. As we mentioned in the paper, we directly followed ResNet's optimization settings (https://github.com/facebook/fb.resnet.torch), except that we train for 300 epochs instead of ~160. We didn't try any other optimizers.
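For reference, here is a minimal TF 1.x sketch of those fb.resnet.torch-style settings as the paper describes them (SGD with Nesterov momentum 0.9, weight decay 1e-4, learning rate 0.1 divided by 10 at 50% and 75% of the 300 epochs). The function name and arguments are mine, not the authors' Torch code:

    import tensorflow as tf

    def build_train_op(total_loss, steps_per_epoch):
        # total_loss: scalar loss tensor; steps_per_epoch: e.g. 50000 // 64 for CIFAR
        global_step = tf.train.get_or_create_global_step()
        # learning rate 0.1, divided by 10 at 50% and 75% of the 300 training epochs
        boundaries = [150 * steps_per_epoch, 225 * steps_per_epoch]
        learning_rate = tf.train.piecewise_constant(global_step, boundaries, [0.1, 0.01, 0.001])
        # weight decay 1e-4, implemented here as an explicit L2 penalty
        l2_loss = 1e-4 * tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
        optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9, use_nesterov=True)
        return optimizer.minimize(total_loss + l2_loss, global_step=global_step)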

A2. In our experiments, we applied dropout after every conv layer except the first one in the network. But I would guess there is no significant difference whether or not you apply dropout in the transition layers.
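To make that concrete for TensorFlow, a hedged sketch (my own translation, not the authors' Torch code) of the composite BN-ReLU-Conv function with dropout after the convolution, at the paper's rate of 0.2:

    import tensorflow as tf

    def conv_with_dropout(x, filters, kernel_size, is_training, drop_rate=0.2):
        # composite function: BN -> ReLU -> Conv, with dropout after the conv
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.relu(x)
        x = tf.layers.conv2d(x, filters, kernel_size, padding='same', use_bias=False)
        # tf.layers.dropout is an identity op when training=False
        return tf.layers.dropout(x, rate=drop_rate, training=is_training)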

A3. This depends on the package you are using. Sorry, I'm not familiar with TensorFlow's details.
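For what it's worth, in TensorFlow batch normalization uses batch statistics during training and moving averages at test time; you switch between the two with the `training` flag and must run the moving-average update ops. A minimal TF 1.x sketch (variable names are mine):

    import tensorflow as tf

    is_training = tf.placeholder(tf.bool, name='is_training')
    inputs = tf.placeholder(tf.float32, [None, 32, 32, 3])

    # training=True uses batch statistics; training=False uses the moving averages
    h = tf.layers.batch_normalization(inputs, training=is_training)

    # the moving-average updates live in UPDATE_OPS and must run alongside the train op
    loss = tf.reduce_mean(h)  # stand-in for a real loss
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = tf.train.MomentumOptimizer(0.1, 0.9).minimize(loss)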

A4. Global average pooling means you pool a feature map down to a single number by taking the average. For example, if you have an 8x8 feature map, you take the average of those 64 numbers to produce one number.
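In TensorFlow terms (my own sketch, since the answer above is framework-agnostic), one common way to write this is to average over the spatial axes of an NHWC tensor:

    import tensorflow as tf

    def global_average_pooling(x):
        # [batch, H, W, C] -> [batch, C]: e.g. an 8x8 map becomes one number per channel
        return tf.reduce_mean(x, axis=[1, 2])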

For TensorFlow usage questions like 3 and 4, you can probably find answers by looking at the third-party TensorFlow implementations we list on our README page. Thanks

taki0112 commented 6 years ago

Thank you. I think I can do global average pooling as follows.

    import tensorflow as tf

    def Global_Average_Pooling(x, stride=1):
        # Pool over the full spatial extent of an NHWC tensor, leaving a 1x1 map per channel.
        width = x.get_shape().as_list()[1]
        height = x.get_shape().as_list()[2]
        pool_size = [width, height]
        # With a single window covering the whole feature map, the stride value does not matter.
        return tf.layers.average_pooling2d(inputs=x, pool_size=pool_size, strides=stride)

But I have some questions.

  1. I experimented with MNIST data using a total of 100 layers and growth_k = 12. However, the result is worse than with 20 layers. Training is very slow and accuracy barely increases.

  2. Why is there no Transition Layer (4) in the paper? There are only 3 (Dense Block + Transition Layer) pairs, followed by the final dense block and the classification layer.

What is the reason?

liuzhuang13 commented 6 years ago

@taki0112

  1. Most people train networks with fewer than 5 layers and achieve very high accuracy on MNIST because it is such a simple dataset. If you train too large a network on MNIST, it may overfit the training set and the accuracy may get worse. Thanks

  2. Because transition layers serve the purpose of downsampling. At the end we have global average pooling to do the downsampling, but we don't call it a transition layer.
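A hedged TF 1.x sketch of that structural difference (my own helpers, not the authors' Torch code): a transition layer downsamples with a 1x1 conv plus 2x2 average pooling, while after the last dense block only global average pooling and the linear classifier remain:

    import tensorflow as tf

    def transition_layer(x, out_filters, is_training):
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.relu(x)
        x = tf.layers.conv2d(x, out_filters, 1, use_bias=False)        # 1x1 conv
        return tf.layers.average_pooling2d(x, pool_size=2, strides=2)  # 2x2 downsample

    def classification_head(x, num_classes, is_training):
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.relu(x)
        x = tf.reduce_mean(x, axis=[1, 2])       # global average pooling
        return tf.layers.dense(x, num_classes)   # linear classifier (logits)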

John1231983 commented 6 years ago

I think the author has a good explanation. Regarding dropout, why didn't you use dropout in the ImageNet case? It is a big dataset, so we do not need it, right? Dropout is often used before the fully connected layer, but you did not use it for either ImageNet or CIFAR10. Why? Thanks

liuzhuang13 commented 6 years ago

@John1231983 Because ImageNet is big and we also use heavy data augmentation, we don't use dropout. This also follows our base code framework, fb.resnet.torch.

For CIFAR10, when we use data augmentation (C10+), we don't use dropout. When we don't use data augmentation (C10), we actually use dropout. We've mentioned this in the paper.