idstcv / GPU-Efficient-Networks


Small questions #5

Closed bonlime closed 4 years ago

bonlime commented 4 years ago

First of all, thanks for a very interesting and thorough paper. A fair comparison with other models is extremely rare lately, and you did it well. I'm also running a lot of experiments myself and have several questions about your experimental setup.

1) Did you check how much of a difference you get from the random seed? The reason I ask is that the difference between your hand-crafted Net 3 and Net 1 is very subtle, yet the former looks like the better option (IMHO).
2) Why didn't you include some kind of attention? Both MobileNet and EfficientNet have it. SE attention, for example, doesn't give much of a slowdown while boosting performance significantly.
3) Why didn't you make your inverted bottlenecks inverted linear bottlenecks? Most of the recent SOTA papers follow MobileNet and don't add an activation after the residual.

P.S. I don't know whether you did it consciously or not, but training on 192px images actually boosts validation performance at 224px compared to training on 224px directly, in line with "Fixing the train-test resolution discrepancy". I've validated your pretrained weights for GENet-normal and got 80.7 Acc@1 at 224px, which is quite impressive.
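Roughly, the check looks like this (a generic torchvision-style evaluation sketch, not this repo's own script; the model is passed in after loading the pretrained weights):

```python
# Minimal sketch: validate a model trained at 192x192 on ImageNet-val at 224x224.
# Assumes standard torchvision ImageNet preprocessing; not the repo's evaluation code.
import torch
from torchvision import datasets, transforms

def evaluate_at_224(model, val_dir, device="cuda"):
    tf = transforms.Compose([
        transforms.Resize(256),        # resize the shorter side
        transforms.CenterCrop(224),    # test at 224 even though training used 192
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(val_dir, tf), batch_size=128, num_workers=8)
    model.eval().to(device)
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return 100.0 * correct / total     # Acc@1 in percent
```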

MingLin-home commented 4 years ago

Hi bonlime, thank you for your feedback! Regarding your questions:

A1: We did not test different random seeds. In our experience, the results are very stable after long enough training. The hand-crafted Net1 and Net3 were trained for 120 epochs, so there is still some variance.

A2: The attention module considerably reduces the inference speed, so SE attention should be used only when absolutely necessary. In this work we wanted to keep things simple. We believe GENet could be even better if SE were added to the final few layers.
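For reference, this is a common form of the squeeze-and-excitation block being discussed; a minimal PyTorch sketch, not code from this repository:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block (Hu et al.), as used in
    MobileNetV3/EfficientNet. Not part of GENet; shown only for reference."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)   # squeeze: global average pool
        s = torch.relu(self.fc1(s))            # excitation MLP
        s = torch.sigmoid(self.fc2(s))         # per-channel gates in [0, 1]
        return x * s                           # reweight the feature maps
```

The extra FLOPs are small, but the global pooling and gating add memory traffic on GPU, which is the latency concern mentioned above.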

A3: We found that the reslink + ReLU structure usually has better accuracy (not 100% guaranteed). Besides, it makes the structure more regularized to a conventional ResNet block. Again, we tried to keep things simple here. It is possible to further improve the accuracy by smartly adopting the linear-bottleneck structure, as you said.
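To make the comparison concrete, here is a generic sketch of the two block variants being discussed (ReLU after the residual addition vs. a linear bottleneck with no activation after the addition); this is an illustration, not the exact GENet block definition:

```python
import torch.nn as nn

def conv_bn(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1, bias=False),
                         nn.BatchNorm2d(cout))

class ResLinkReLUBlock(nn.Module):
    """ResNet-style block: ReLU is applied after the residual addition."""
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(conv_bn(channels, channels),
                                    nn.ReLU(inplace=True),
                                    conv_bn(channels, channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.branch(x))   # non-linearity after the add

class LinearBottleneckBlock(nn.Module):
    """MobileNetV2-style block: the output of the addition stays linear."""
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(conv_bn(channels, channels),
                                    nn.ReLU(inplace=True),
                                    conv_bn(channels, channels))

    def forward(self, x):
        return x + self.branch(x)              # no activation after the add
```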

A4: Inference at 224x224 rather than 192x192 improves accuracy but slows down inference. We also noted that a model trained at low resolution can achieve better accuracy when tested at high resolution. But the best performance is still achieved by using the same resolution for both training and testing (that is, high resolution in training and high resolution in testing).

bonlime commented 4 years ago

A3: What do you mean by "more regularized"? Is it "more similar"?

The fact that the linear bottleneck doesn't work is very confusing. I've tried training your original (GENet normal) model and a version without activations in the stem, and I also got a significant drop in accuracy, while for a default ResNet50 the linear bottleneck gives +0.5%. Do you have any thoughts on why the linear bottleneck doesn't work for your model? The idea from MobileNetV2 about not introducing extra distortions in the stem is quite believable.

I personally think it may be due to the roughly one-third reduction in the number of activations, which prevents the model from capturing difficult classes, but your thoughts would be highly valuable.

MingLin-home commented 4 years ago

Hi bonlime,

"More regularized" means "more close to/similar to a conventional ResNet structure".

We do not understand why the linear bottleneck is problematic in our structure or, more generally, in some compact networks. Our conjecture is that in a compact network there are no redundant parameters, so the over-parameterization assumption no longer holds and optimization becomes more difficult. It would then be necessary to use an over-parameterized teacher network to help the compact network escape bad local minima during the early training stages. As for the activation, removing the activation in the linear bottleneck considerably reduces the representation power of a compact network; this might not be a big issue for an over-parameterized network.
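Teacher-assisted training of this kind is usually done with knowledge distillation; a minimal sketch of a standard soft-label distillation loss, purely illustrative and not the training code used in the paper:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-label knowledge distillation (Hinton et al.): blend cross-entropy
    on the ground truth with KL divergence to the teacher's softened outputs."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```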

yxchng commented 4 years ago

@MingLin-home Isn't the linear bottleneck introduced in MobileNetV2 to prevent information loss in compact networks? Does this experiment contradict the conjecture proposed in MobileNetV2?

bonlime commented 4 years ago

Why the linear bottleneck works well for MobileNet but not for GENet is difficult to understand. Since I asked the question, I've tried using the pre-activation idea from the ResNet v2 paper, and surprisingly it doesn't work for MobileNet (it gives lower accuracy than the linear bottleneck) but does increase the accuracy of GENet. Without a proper theory of deep learning, trial and error is the only way to understand what works and what doesn't.
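For concreteness, the pre-activation ordering from ResNet v2 puts BN and ReLU before each convolution and keeps the shortcut a clean identity; a minimal generic sketch, not my exact modification of GENet:

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block (He et al., ResNet v2):
    BN -> ReLU -> Conv twice, identity shortcut, no activation after the add."""
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.branch(x)   # identity shortcut; no post-add activation
```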

MingLin-home commented 4 years ago

@MingLin-home Isn't the linear bottleneck introduced in MobileNetV2 to prevent information loss in compact networks? Does this experiment contradict the conjecture proposed in MobileNetV2?

Thank you @bonlime for sharing your interesting findings! We empirically find that using ReLU in the bottleneck block improves the representation power of compact networks, especially narrow ones. Of course, nothing is 100% guaranteed before we really understand why deep learning works.