Open vinodganesan opened 3 years ago
Hi,
Is there a design rationale for not making the first bottleneck layer dynamic? Instead, the first bottleneck layer is implemented as a plain residual block (Link). I believe ProxylessNAS used a similar setup as well. I would be interested to hear the insights on why.
Thanks, Vinod
I think it is because the input H/W to the first layer is quite large, e.g., 112x112. If you allow this layer to have larger kernel sizes or expansion ratios, it hurts the FLOPs and latency and makes training slower, so people usually do not search over this first layer and just fix it to something small.
Ok, thanks for the insight.
In case you do make it searchable, would it have an impact on the final accuracy of the trained supernet? Do you have any insights or observations on this?
Thanks, Vinod
I never tried that, but I think it shouldn't impact the final accuracy too much. It's just that training would be more costly (longer time, larger VRAM usage) because of the larger FLOPs in the first block.
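For a rough sense of the cost being discussed, here is a minimal back-of-the-envelope sketch comparing the multiply-accumulates (MACs) of a fixed small first block against the largest searchable choice at 112x112 resolution. The channel counts, stride, and kernel/expansion options are illustrative assumptions, not the repo's exact configuration:

```python
# Rough MAC count of an inverted-bottleneck (MBConv-style) block.
# Assumptions: stride 1, no bias, channel counts chosen for illustration.
def mbconv_macs(h, w, c_in, c_out, expand, k):
    c_mid = c_in * expand
    macs = h * w * c_in * c_mid      # 1x1 expansion conv
    macs += h * w * c_mid * k * k    # k x k depthwise conv
    macs += h * w * c_mid * c_out    # 1x1 projection conv
    return macs

# Fixed first block: expansion 1, 3x3 kernel (something small).
base = mbconv_macs(112, 112, 16, 16, expand=1, k=3)
# Largest choice if the block were searchable: expansion 6, 7x7 kernel.
big = mbconv_macs(112, 112, 16, 16, expand=6, k=7)

print(f"fixed  : {base / 1e6:.1f} MMACs")
print(f"largest: {big / 1e6:.1f} MMACs ({big / base:.1f}x)")
```

Because the 112x112 feature map multiplies every term, the largest searchable variant is roughly an order of magnitude more expensive than the fixed block, which matches the reasoning above about FLOPs, latency, and training cost.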