liuzhuang13 / DenseNet

Densely Connected Convolutional Networks, In CVPR 2017 (Best Paper Award).

The layers within the second and third dense block don't assign the least weight to the outputs of the transition layer in my trained model #53

Open seasonyc opened 5 years ago

seasonyc commented 5 years ago

I am not sure whether it is appropriate to open this as a GitHub issue; it is a question about the heatmap in your paper.

I trained a DenseNet on C10+ with L = 40 and k = 12, the same configuration as yours, and then inspected the weights of a trained model that reaches 94.6% accuracy. I did not reproduce your observation 3: in my model, the layers within the second and third dense blocks assign considerable weight to the outputs of the transition layers.

For example, the first conv layer in the second dense block has an average weight of 0.013281956 on the output of the first transition layer (168 channels, i.e. all of its input channels). The second conv layer has an average weight of 0.011933382 on the first transition layer's output (its first 168 input channels) and 0.024417713 on the 12 channels output by the first conv layer. The latter being larger is reasonable, since closer channels should matter more, but the former is far from negligible. The remaining layers show similar weight distributions over the old and new channels, and the same holds in dense block 3.
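For concreteness, by "average weight" I mean the mean absolute kernel value over a slice of input channels, conceptually like the sketch below (see weights-verify.py linked further down for the actual code; the layer name here is illustrative, not the real name in my model):

```python
import numpy as np
from tensorflow.keras.models import load_model

def mean_abs_weight(conv_layer, ch_start, ch_end):
    """Mean |w| over input channels [ch_start, ch_end) of a Conv2D kernel,
    whose shape is (kh, kw, in_channels, out_channels)."""
    kernel = conv_layer.get_weights()[0]
    return np.mean(np.abs(kernel[:, :, ch_start:ch_end, :]))

model = load_model("dense_augmodel-ep0300-loss0.112-acc0.999-val_loss0.332-val_acc0.946.h5")
layer = model.get_layer("conv2d_15")      # illustrative name: 2nd conv layer of dense block 2
w_old = mean_abs_weight(layer, 0, 168)    # weight on transition layer 1's output
w_new = mean_abs_weight(layer, 168, 180)  # weight on the 12 newest channels
print(w_old, w_new)
```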

My DenseNet and training code follow yours, including the data augmentation and input normalization; see https://github.com/seasonyc/densenet/blob/master/densenet.py and https://github.com/seasonyc/densenet/blob/master/cifar10-test.py. The trained model file is at https://github.com/seasonyc/densenet/blob/master/dense_augmodel-ep0300-loss0.112-acc0.999-val_loss0.332-val_acc0.946.h5, and my script for computing the weight statistics is https://github.com/seasonyc/densenet/blob/master/weights-verify.py.

I know that models trained at different times differ, even in the features their conv filters learn, but I believe the weight distributions should be statistically similar. So even though we have different models, we should see similar results.

I ran this verification because observation 3 seems a little unreasonable to me. The first conv layer makes heavy use of the information from the previous dense block, yet the second conv layer supposedly ignores the information in those hundreds of channels and uses only the 12 new ones. Can the first conv layer really learn to concentrate hundreds of channels into 12?
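To put rough numbers on that doubt, here is the channel bookkeeping for dense block 2 in my L = 40, k = 12 model (a sketch; the 168 figure comes from my own model):

```python
# L = 40, k = 12 => 12 conv layers per dense block.
growth_rate = 12
channels_in = 168  # output channels of transition layer 1 in my model
for i in range(12):
    # Layer i sees all `channels_in` accumulated channels but adds only
    # `growth_rate` new ones; observation 3 would mean later layers rely
    # almost entirely on those few new channels.
    print(f"layer {i}: {channels_in} in -> {growth_rate} new "
          f"({channels_in / growth_rate:.0f}:1 compression)")
    channels_in += growth_rate
```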

Could you double-check this?

Thanks,
YC