mit-han-lab / once-for-all

[ICLR 2020] Once for All: Train One Network and Specialize it for Efficient Deployment
https://ofa.mit.edu/
MIT License

Rationale for not having the first MB layer Dynamic in mbv3 backbone #44

Open vinodganesan opened 3 years ago

vinodganesan commented 3 years ago

Hi,

Is there a design rationale for not making the first bottleneck layer dynamic? Instead, the first bottleneck layer is used as a plain residual block (Link). I believe a similar setup was used in ProxylessNAS as well. I'd be interested to hear the reasoning behind this.

Thanks, Vinod

xuefei1 commented 3 years ago

I think it is because the input H/W to the first layer is quite large, e.g., 112x112. If you allowed this layer to have larger kernel sizes or expansion ratios, it would hurt FLOPs and latency and make training slower, so people usually do not search over this first layer and just fix it to something small.
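A quick back-of-the-envelope calculation makes this concrete. The sketch below is purely illustrative (my own helper function and channel counts, not code from this repo): it compares the approximate MAC count of an inverted-bottleneck (MBConv) block at 112x112 input, with the small fixed configuration versus the largest searchable one, and against the same large configuration at a deeper 14x14 stage.

```python
# Back-of-the-envelope MAC counts for an inverted-bottleneck (MBConv) block,
# illustrating why searching kernel size / expand ratio at 112x112 input is
# costly. Function and channel counts are illustrative, not from the OFA code.

def mbconv_macs(h, w, c_in, c_out, expand_ratio, kernel_size):
    """Approximate multiply-accumulates for expand -> depthwise -> project."""
    c_mid = c_in * expand_ratio
    expand = 0 if expand_ratio == 1 else h * w * c_in * c_mid  # 1x1 expansion
    depthwise = h * w * c_mid * kernel_size ** 2               # k x k depthwise
    project = h * w * c_mid * c_out                            # 1x1 projection
    return expand + depthwise + project

# Fixed first block (expand 1, kernel 3) vs. a searchable one (expand 6,
# kernel 7) at 112x112, and the same large config at a deeper 14x14 stage:
early_small = mbconv_macs(112, 112, 16, 16, expand_ratio=1, kernel_size=3)
early_big   = mbconv_macs(112, 112, 16, 16, expand_ratio=6, kernel_size=7)
late_big    = mbconv_macs(14, 14, 112, 112, expand_ratio=6, kernel_size=7)

print(f"fixed first block @112x112:      {early_small / 1e6:.1f} MMACs")
print(f"searchable first block @112x112: {early_big / 1e6:.1f} MMACs")
print(f"large block @14x14:              {late_big / 1e6:.1f} MMACs")
```

Because every MAC here scales with H*W, letting the first block grow to the largest searchable configuration inflates its cost by roughly an order of magnitude, while the same configuration is far cheaper at a 14x14 stage.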

vinodganesan commented 3 years ago

Ok, thanks for the insight.
If you did make it searchable, would it have an impact on the final accuracy of the trained supernet? Do you have any insights or observations on this?

Thanks, Vinod

xuefei1 commented 3 years ago

I never tried that, but I think it shouldn't impact the final accuracy too much. It's just that the training would be more costly (longer time, larger VRAM usage) because of the larger FLOPs in the first block.
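The VRAM point can be sketched the same way: during training, the expanded feature map inside the block dominates activation memory, and it must be kept for the backward pass. A minimal sketch, assuming fp32 and a batch size of 64 (both illustrative, not measured from this repo):

```python
# Rough activation-memory estimate for the expanded feature map inside an
# MBConv block, illustrating why a dynamic first block raises VRAM use
# during training. Batch size and dtype are assumptions for illustration.

def expanded_activation_mb(batch, h, w, c_in, expand_ratio, bytes_per_el=4):
    """Size in MB of the expanded (c_in * expand_ratio)-channel activation."""
    return batch * h * w * c_in * expand_ratio * bytes_per_el / 2**20

fixed = expanded_activation_mb(64, 112, 112, 16, expand_ratio=1)  # fixed block
dyn   = expanded_activation_mb(64, 112, 112, 16, expand_ratio=6)  # dynamic max

print(f"fixed first block:   {fixed:.0f} MB per batch")
print(f"dynamic first block: {dyn:.0f} MB per batch")
```

The expanded tensor scales linearly with the expand ratio, so the largest searchable configuration stores several times more activations at the network's highest resolution, which is where the memory cost hurts most.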