Callidior closed this issue 5 years ago
Hi Callidior, interesting observation. In general, a moving average is mainly used to stabilize the training rather than to improve accuracy. Of course, if you train with EMA but eval without EMA, then the accuracy will drop a lot. (Similarly, if you train with batch norm but eval without batch norm, then the accuracy will also drop a lot.) For EfficientNets, I use EMA purely because EfficientNet is based on MnasNet, which also uses EMA.
If you are interested in how much accuracy is gained from EMA, maybe you can train Inception-V4 or ResNet-50 with EMA and let me know how much gain you get. Thanks!
Empirically, when a model is trained with EMA, we should use the EMA variables for evaluation. There are some studies that relate EMA to linear learning rate decay.
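To make "use the EMA variables for evaluation" concrete, here is a minimal sketch of temporarily swapping the averaged weights into a model and restoring the raw ones afterwards. It is not taken from the EfficientNet code: the helper `use_ema_weights`, the dict-of-floats weight representation, and the toy values are all hypothetical and chosen only for illustration.

```python
from contextlib import contextmanager


@contextmanager
def use_ema_weights(model_weights, shadow):
    """Temporarily replace model_weights (a dict) with the EMA copy in shadow.

    Both arguments are hypothetical stand-ins for a real framework's
    variable store; the raw weights are restored on exit.
    """
    backup = dict(model_weights)      # remember the raw training weights
    model_weights.update(shadow)      # evaluate with the averaged weights
    try:
        yield model_weights
    finally:
        model_weights.clear()         # restore exactly what was there before
        model_weights.update(backup)


# Toy values, assumed for illustration only.
raw = {"w": 1.0}
ema = {"w": 2.0}

with use_ema_weights(raw, ema) as eval_weights:
    value_during_eval = eval_weights["w"]   # the EMA value is active here

value_after_eval = raw["w"]                 # raw weights are back
```

Restoring the raw weights matters if training continues after the evaluation, since the optimizer keeps updating the raw weights, not the averaged copy.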
Hi @mingxingtan, I am wondering what "moving average is mainly used to stabilize the training" and "train with EMA but eval without EMA, then the accuracy will drop a lot" mean. I thought EMA doesn't affect the normal training process and is just an independent copy of the weights. So how does EMA stabilize the training? @saberkun, could you kindly point to the papers you mentioned relating to EMA? I also observe a similar phenomenon.
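For reference, the "independent weight copy" view of EMA can be sketched as follows. This is a toy illustration under stated assumptions, not the EfficientNet training code: the function `ema_update`, the decay of 0.98, the learning rate, and the noisy quadratic objective are all made up for the example. The optimizer step only ever reads and writes the raw weights; the shadow copy is updated on the side and never feeds back into the gradient.

```python
import random


def ema_update(shadow, weights, decay):
    """Return a new shadow dict: decay * shadow + (1 - decay) * weights."""
    return {
        name: decay * shadow[name] + (1.0 - decay) * weights[name]
        for name in weights
    }


random.seed(0)
weights = {"w": 0.0}       # raw weights, updated by the optimizer
shadow = dict(weights)     # separate EMA copy, initialised from the raw weights
lr = 0.1                   # assumed learning rate

for step in range(200):
    # Noisy gradient of the toy objective (w - 3)^2.
    grad = 2.0 * (weights["w"] - 3.0) + random.gauss(0.0, 1.0)
    weights["w"] -= lr * grad                         # uses raw weights only
    shadow = ema_update(shadow, weights, decay=0.98)  # side update, no feedback

# shadow["w"] is a smoothed trajectory of weights["w"]; both end up near 3.
```

In this sketch the averaging does not change the optimization trajectory at all; it only yields a smoother set of weights that can be used at evaluation time.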
Hi @mingxingtan, I have a question regarding the comparison of EfficientNet with other architectures in Table 2 of the EfficientNet paper.
It does not seem to be mentioned in the paper but is clear from the code and the checkpoints you released that you are using an exponential moving average (EMA) over the network weights for inference instead of the final weights obtained after the last epoch. If I use the final weights for inference with EfficientNet-B5 instead, I obtain a top-1 accuracy on ImageNet of 80%, compared to 83% with EMA. This would be on par with the performance of Inception-V4.
Thus, I wonder whether the competitor architectures in that table also use an EMA of the weights for inference? Otherwise, it is difficult to judge whether the advantage of EfficientNet stems from the architecture itself or from the use of EMA.