Hi @lizhenstat - the first set of convolutions (the 1x1 bottlenecks) has a quadratic memory cost over the whole network, whereas the second set (the 3x3 convolutions) only has a linear cost.

To see this: the first convolution maps from num_previous_filters -> bn_size * growth_rate channels, and num_previous_filters grows by growth_rate with every layer. Storing the normalized inputs to this operation requires storing a feature map of size num_previous_filters, so summing over all the layers incurs a quadratic cost. The second convolution, by contrast, always takes an input of bn_size * growth_rate channels. Since this is constant across layers, it incurs only a linear cost.
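A quick back-of-the-envelope sketch of this accounting (the values for num_init_features, growth_rate, bn_size, and num_layers below are made-up examples, not pinned to any config in the repo):

```python
# Tally the input channels that must be stored for each set of convolutions
# across one dense block. Example hyperparameters (assumptions, not the repo's):
num_init_features, growth_rate, bn_size, num_layers = 64, 32, 4, 24

conv1_inputs = 0  # stored input channels for all 1x1 (bottleneck) convolutions
conv2_inputs = 0  # stored input channels for all 3x3 convolutions

for layer in range(num_layers):
    num_previous_filters = num_init_features + layer * growth_rate
    conv1_inputs += num_previous_filters   # grows with depth -> quadratic total
    conv2_inputs += bn_size * growth_rate  # constant per layer -> linear total

print(conv1_inputs, conv2_inputs)  # 10368 vs. 3072 channels here
```

The first total grows quadratically in the number of layers while the second grows linearly, which is why checkpointing targets the first set.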
See the tech report for more details :)
Hi, thanks for your great work.
I have one question about the forward function (https://github.com/gpleiss/efficient_densenet_pytorch/blob/master/models/densenet.py#L35): why do you use cp.checkpoint only on the 1x1 convolution? Is there a problem with applying it to the 3x3 convolution?
Thanks in advance!
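For context, the pattern being asked about looks roughly like this - a condensed sketch of the idea rather than the linked file verbatim (the self.efficient flag and the norm1/relu1/conv1 module names are assumed here to follow the repo):

```python
import torch
import torch.utils.checkpoint as cp

def _bn_function_factory(norm, relu, conv):
    # Wraps concat + norm + relu + 1x1 conv in one function so checkpointing
    # can recompute it during backward instead of storing its large inputs.
    def bn_function(*inputs):
        concated_features = torch.cat(inputs, 1)
        return conv(relu(norm(concated_features)))
    return bn_function

# Sketch of _DenseLayer.forward:
def forward(self, *prev_features):
    bn_function = _bn_function_factory(self.norm1, self.relu1, self.conv1)
    if self.efficient and any(f.requires_grad for f in prev_features):
        # Checkpoint only the 1x1 path: its concatenated input grows with depth.
        bottleneck_output = cp.checkpoint(bn_function, *prev_features)
    else:
        bottleneck_output = bn_function(*prev_features)
    # The 3x3 path sees a constant-size input, so it runs without checkpointing.
    return self.conv2(self.relu2(self.norm2(bottleneck_output)))
```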