megvii-model / MABN


Replacing only GroupNorm with MABN #4

Closed andgitchang closed 4 years ago

andgitchang commented 4 years ago

Hi. Do you think it's possible to replace only the GroupNorm layers with MABN and the corresponding Conv layers with CenConv, while leaving the rest of the well-pretrained Conv+BN layers unchanged? I tried this setup on Mask R-CNN with a MobileNet backbone and ran into NaNs during training, and I can't tell which part of the code the NaNs come from. Does MABN only work when the whole network uses Centralized Weights, or should applying it to only part of the network be fine in theory? Thanks.

RuosOne commented 4 years ago

CenConv only exists for MABN's theoretical benefits; you can use MABN with an ordinary conv. That may cost a little performance, I think, but the code should run normally. MABN behaves just like the original BN during the first several iterations, so I think the NaN problem is still caused by some bug, especially since you changed the backbone and are using pretrained models.
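For what it's worth, a minimal sketch of that kind of partial replacement, swapping only the GroupNorm layers while leaving the pretrained Conv+BN untouched; the `MABN2d` name mentioned in the comment is an assumption here, pass in whatever class this repo actually exports:

```python
import torch.nn as nn

def replace_groupnorm(module, norm_factory):
    """Recursively replace every nn.GroupNorm with norm_factory(num_channels),
    e.g. a hypothetical `lambda c: MABN2d(c)`; all other layers (pretrained
    Conv+BN included) are left as they are."""
    for name, child in module.named_children():
        if isinstance(child, nn.GroupNorm):
            setattr(module, name, norm_factory(child.num_channels))
        else:
            replace_groupnorm(child, norm_factory)
    return module
```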

Note that the original maskrcnn-benchmark code doesn't build BN layers that hold the pretrained BN checkpoint (weights, running_mean, running_var, etc.). We modified the code for ResNet-50 in our demo; if you want to try other structures, you need to register them first. You can refer to #2.

If you are sure the implementation is correct, I suggest training the model from scratch first, or retraining a new MobileNet backbone on ImageNet with CenConv2d and MABN, or increasing the warmup iterations of MABN to see whether training goes normally.

Feel free to contact us if you have further questions.

andgitchang commented 4 years ago

Thanks for your detailed reply. The NaN problem is fixed. However, following your suggestions, I did both: (1) trained the model equipped with CenConv/MABN from scratch, and (2) pretrained a MobileNetV2 backbone on ImageNet and then trained on COCO. Neither converged under a 2x schedule. Is there any training recipe for Weight Centralization? The accuracy of the pretrained WC MobileNetV2 on ImageNet (120 epochs, cosine LR schedule) only reached 56.7%. In your experience, is MABN easy to apply?

RuosOne commented 4 years ago

We only ran MABN on ResNet-50 in our experiments. In my experience, MABN should be very easy to apply, since it has only two extra hyperparameters to tune: the warm-up iterations and the buffer size of the simple moving average statistics. The rest of the hyperparameters are the same as for regular BN (except momentum, which needs to be close to 1; 0.98 should work well). In the worst case, MABN should get performance comparable to regular BN. So, detection aside, the severe performance drop on ImageNet is very weird to me. Could you please provide more implementation details of your experiments so that I can try to reproduce the result?
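To make the "buffer size" knob concrete, here is a toy illustration (not this repo's implementation) of keeping a fixed-size buffer of per-batch statistics and normalizing with their simple moving average:

```python
from collections import deque
import torch

class SMANormalizer:
    """Toy sketch of simple-moving-average statistics: keep the last
    `buffer_size` per-channel batch variances and normalize with their mean."""
    def __init__(self, buffer_size=16, eps=1e-5):
        self.buffer = deque(maxlen=buffer_size)
        self.eps = eps

    def __call__(self, x):
        # x: (N, C, H, W); per-channel batch statistic for this iteration
        var = x.var(dim=(0, 2, 3), unbiased=False)
        self.buffer.append(var.detach())
        sma_var = torch.stack(list(self.buffer)).mean(dim=0)
        return x / torch.sqrt(sma_var + self.eps).view(1, -1, 1, 1)
```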

andgitchang commented 4 years ago

In my experiments, I appended MABN after CenConv/Conv in the cls_tower and bbox_tower of the RetinaNet heads in the maskrcnn-benchmark project, and followed the usual FrozenBN convention throughout the backbone of the pretrained MobileNetV2/ResNet50. None of the MobileNetV2/ResNet50 settings, with or without CenConv, converged.

RuosOne commented 4 years ago

I tried to reproduce MobileNetV2 on ImageNet with BN and MABN respectively, using a large batch size (1024) and hyperparameter settings following ShuffleNetV2. My results show MobileNetV2 with BN reaches 72% while MABN reaches 68%, so MABN doesn't seem well suited to small models. But your MABN experiments must have some mistake, so you need to double-check the settings; I suspect you applied weight decay to the weights of BN. As for the COCO experiments, RetinaNet itself is really unstable, and you need to tune the hyperparameters carefully whenever you modify the model. Try the original BN first to see whether training goes normally.

andgitchang commented 4 years ago

Thanks for pointing out the weight decay concern for BN parameters. Both the cls and det experiments in MABN apply weight decay to the weights of BN, see cls and det. If I understand correctly, your suggestion of filtering the BN weights out of the weight decay parameter group applies especially to compact models, e.g. ShuffleNets and MobileNets? As for RetinaNet with MABN, I totally agree that the conv towers are unstable, so it might not be an issue with MABN. Thanks.
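For reference, one common way to filter norm-layer parameters out of the weight decay group in PyTorch looks roughly like this (the MABN class would need to be added to the isinstance check; its exact name is an assumption here):

```python
import torch
import torch.nn as nn

def param_groups_without_norm_decay(model, weight_decay=1e-4):
    """Split parameters into a regular weight-decay group and a zero-decay
    group holding norm-layer parameters and all biases."""
    norm_types = (nn.BatchNorm2d, nn.GroupNorm)  # extend with MABN's class
    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if not param.requires_grad:
                continue
            if isinstance(module, norm_types) or name.endswith("bias"):
                no_decay.append(param)
            else:
                decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# e.g. torch.optim.SGD(param_groups_without_norm_decay(model), lr=0.1, momentum=0.9)
```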

RuosOne commented 4 years ago

Yes, you got my point. Weight decay on BN won't hurt the performance of ResNet-50, but it does hurt ShuffleNet or MobileNet variants with a small number of weights (1.0x, 1.4x). There's no fixed rule for deciding whether BN needs weight decay; you need to check it carefully whenever you try other structures.