megvii-model / MABN

det finetune's result is poor #2

Closed qianyizhang closed 4 years ago

qianyizhang commented 4 years ago

det from_scratch seems fine, with 36.4 AP, but fine_tune only gets 19.6 AP. The only thing I modified is reusing the MSRA weights:

WEIGHT: "catalog://ImageNetPretrained/MSRA/R-50"

and registering "R-50-FPN-MABN" to load_resnet_c2_format accordingly.
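For reference, a sketch of what that registration looks like (assuming maskrcnn-benchmark's decorator-based C2_FORMAT_LOADER registry in c2_model_loading.py; the exact names may differ in this fork):

```python
# det/maskrcnn_benchmark/utils/c2_model_loading.py -- sketch of the extra registration.
# One more decorator makes "R-50-FPN-MABN" reuse the stock Caffe2-format loader:
@C2_FORMAT_LOADER.register("R-50-FPN-MABN")   # new entry for the MABN conv body
@C2_FORMAT_LOADER.register("R-50-FPN")        # existing entries kept as-is
@C2_FORMAT_LOADER.register("R-50-C4")
def load_resnet_c2_format(cfg, f):
    state_dict = _load_c2_pickled_weights(f)
    # simplified: the real loader derives the stage names from cfg.MODEL.BACKBONE.CONV_BODY
    stages = _C2_STAGE_NAMES["R-50"]
    state_dict = _rename_weights_for_resnet(state_dict, stages)
    return dict(model=state_dict)
```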

Furthermore, the native e2e_mask_rcnn_R_50_FPN_1x doesn't work anymore; the box_roi gradients blow up within a few iterations. It seems to be an FPN problem, since the C4 model can still be trained.

PS: I am running on the coco2017 dataset, but that can't be the problem.

RuosOne commented 4 years ago

For the first problem, I think you used the wrong pretrained weights. When using MABN, weight centralization is required, which means the pretrained weights also need weight centralization. We have provided pretrained R-50 weights with weight centralization at https://www.dropbox.com/sh/fbsi6935vmatbi9/AAA2jv0EBcSgySTgZnNZ3lmPa?dl=0.
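For anyone else who lands here: weight centralization subtracts the per-output-filter mean from each conv kernel before the convolution. A minimal sketch of the operation (illustrative only, not the repo's implementation, and not a substitute for the provided centralized pretrained weights):

```python
import torch
import torch.nn.functional as F

def centralized_conv2d(x, weight, bias=None, stride=1, padding=0):
    # Weight centralization: remove the mean of each output filter
    # over its (in_channels, kH, kW) dimensions before convolving.
    mean = weight.mean(dim=(1, 2, 3), keepdim=True)
    return F.conv2d(x, weight - mean, bias, stride=stride, padding=padding)

# Example: a 3x3 conv on a dummy feature map
x = torch.randn(2, 64, 56, 56)
w = torch.randn(128, 64, 3, 3)
y = centralized_conv2d(x, w, padding=1)  # shape: (2, 128, 56, 56)
```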

As for the second problem, there are some bugs we are working on. We will let you know as soon as we fix them.

By the way, the MSRA weights are also not suitable for fine-tuning with syncbn, because the MSRA weights have absorbed the running_mean, running_var, and BN weights into the conv weights. It's OK to use the MSRA weights for fine-tuning with frozen BN, but they will degrade the performance of syncbn. Besides, I suggest running fine-tuning experiments with at least a 2x schedule when using syncbn or MABN. In our experiments, it was impossible to get good results with syncbn or MABN with only 1x. On the other hand, to our knowledge, no paper has reported fine-tuning results with syncbn at only 1x; we think those authors might have met the same problem we did.
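To make the "absorbed" point concrete: the Caffe2 MSRA checkpoints ship conv weights into which the BN statistics and affine parameters have already been folded, roughly as in this standard BN-fusion sketch (illustrative, not the exact script used to produce those weights):

```python
import torch

def fold_bn_into_conv(conv_w, conv_b, bn_gamma, bn_beta, bn_mean, bn_var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding conv (standard BN fusion).

    After folding, running_mean/running_var are gone, so the checkpoint
    only behaves correctly with frozen BN; syncbn/MABN cannot recover
    the original statistics from it.
    """
    scale = bn_gamma / torch.sqrt(bn_var + eps)        # per-channel scale
    fused_w = conv_w * scale.reshape(-1, 1, 1, 1)      # scale each output filter
    if conv_b is None:
        conv_b = torch.zeros_like(bn_mean)
    fused_b = (conv_b - bn_mean) * scale + bn_beta     # fold mean/shift into the bias
    return fused_w, fused_b
```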

qianyizhang commented 4 years ago

I see, thanks for your timely reply.

For the 1st problem: can you upload it to a pan.baidu.com account? XD

For the 2nd problem: can you be more specific? What exactly did you change that prevents the native training from converging? I've been debugging this for a day but everything seems fine imo...

RuosOne commented 4 years ago

Here's a baiduyun link with the pretrained weights: https://pan.baidu.com/s/1Md_UzwWEiZZKu84R0yZ6aw password: zww2

For the 2nd problem, we don't know why yet either. We only added the implementations of syncbn and MABN to the original maskrcnn repo and didn't modify any of the rest, so technically the original experiments should run as before. But in our experiments, "e2e_mask_rcnn_R_50_FPN_1x" ran normally at first; after I removed the build files, re-compiled the project, and ran the experiment again, NaNs came out occasionally. There must be some hidden bugs we introduced into the original part of the code. We are still looking for them.

qianyizhang commented 4 years ago

Exactly, I think the original code is mostly untouched, which should leave the native experiment as it was. I narrowed the problem down to the FPN experiments (the C4 ones seem to be fine).

I am not familiar with the compilation; do you think your extra syncbn operator could change a native op, specifically the pooler (ROIAlign)?

RuosOne commented 4 years ago

@qianyizhang We have fixed the bugs. Now you can pull and rebuild the repo.

The problem was caused by a modification we made in det/maskrcnn_benchmark/utils/c2_model_loading.py: we changed the _rename function to load pre-trained weights without fusing BN into the conv layers, but that modification is not suitable for MSRA weights.
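A quick way to tell the two kinds of checkpoints apart is to check whether the BN layers still carry running statistics; the MSRA weights keep only the BN scale/bias, with everything else already folded into the convs. A hypothetical check, with illustrative key-name patterns rather than the repo's actual ones:

```python
def looks_bn_fused(state_dict):
    """Heuristic: a checkpoint whose BN layers keep no running statistics
    has most likely had BN folded/absorbed into the conv weights.

    Key-name patterns here are illustrative; actual Caffe2/Detectron and
    PyTorch checkpoints may use different suffixes.
    """
    has_affine = any(k.endswith(("_bn_s", "_bn_b", "bn.weight", "bn.bias"))
                     for k in state_dict)
    has_stats = any("running_mean" in k or "running_var" in k
                    for k in state_dict)
    return has_affine and not has_stats
```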