dkatsios opened this issue 4 years ago (status: Open)
From here it appears that the size of the squeezed tensor is indeed input_channels/SE_ratio, but per the original paper's implementation this should be expanded_size/SE_ratio. The SE size appears to be consistent with the one it would have if it were applied to the residual layer, as proposed in MobileNetV3. However, from here this does not seem to be the case, though I am not entirely sure about it. I am also a bit puzzled by this.
Hi @dkatsios and @Renthal,
Yes, we intentionally use the SE ratio with respect to the input filters, following the same design as MnasNet. Our goal is to minimize the runtime overhead caused by SE.
MobileNetV3 uses the SE ratio with respect to expanded_size, as explained in its Section 5.3. Notably, MobileNetV3 uses hard-swish, and we had to write a highly optimized, dedicated low-level kernel implementation of hard-swish to make it fast on mobile CPUs. With all of these extra system-level (TF & TFLite) optimizations, MobileNetV3 can afford an effectively larger SE ratio.
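For reference, hard-swish is the cheap piecewise-linear approximation of swish that MobileNetV3 uses; a minimal NumPy sketch (not the optimized TFLite kernel mentioned above) is:

```python
import numpy as np

def hard_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, as defined in the MobileNetV3 paper.
    # ReLU6 is implemented here as a clip to [0, 6].
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

# For x >= 3 it reduces to the identity; for x <= -3 it is exactly zero.
print(hard_swish(np.array([-4.0, 0.0, 3.0])))
```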
Why is the squeezed number of filters in the SE block based on the block's input filter count and not on the tensor's channels? According to the SE-Nets paper (https://arxiv.org/pdf/1709.01507.pdf), Fig. 3, it is C/r where C is the number of input channels. For example, in the 2nd block of EfficientNetB0, input_filters is 16, but inside the SE block the tensor has 96 (expanded) filters. The squeezed tensor is (1, 1, 4) (16/4) instead of (1, 1, 24) (96/4). So it goes 96 -> 4 -> 96 instead of 96 -> 24 -> 96 (which seems more intuitive).
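The two conventions being compared can be sketched as follows. The helper names are illustrative, not the actual repo API; the numbers follow the EfficientNet-B0 example above (input_filters=16, expand ratio 6, SE ratio 0.25):

```python
def squeezed_filters_efficientnet(input_filters, se_ratio):
    # EfficientNet/MnasNet convention: squeeze relative to the block's
    # INPUT filters, which keeps the SE bottleneck small and cheap.
    return max(1, int(input_filters * se_ratio))

def squeezed_filters_mobilenetv3(expanded_filters, se_ratio):
    # MobileNetV3 convention: squeeze relative to the EXPANDED filters,
    # giving a larger effective SE bottleneck.
    return max(1, int(expanded_filters * se_ratio))

input_filters, expand_ratio, se_ratio = 16, 6, 0.25
expanded = input_filters * expand_ratio  # 96 channels inside the block

print(squeezed_filters_efficientnet(input_filters, se_ratio))  # 4:  96 -> 4 -> 96
print(squeezed_filters_mobilenetv3(expanded, se_ratio))        # 24: 96 -> 24 -> 96
```

Both produce a valid SE block; the difference is only where the reduction ratio is anchored, which trades SE capacity against runtime overhead.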