hustvl / MIMDet

[ICCV 2023] You Only Look at One Partial Sequence
https://arxiv.org/abs/2204.02964
MIT License
336 stars 31 forks

Inf/NaN error happens during the training #7

Closed junchen14 closed 2 years ago

junchen14 commented 2 years ago

File "MIMDet/detectron2/detectron2/modeling/proposal_generator/proposal_utils.py", line 99, in find_top_rpn_proposals
    raise FloatingPointError(
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

Hi, first of all, thanks for your work. When I train your model with the default configuration, I get the error above.

The command I ran with the default configuration is: 'python lazyconfig_train_net.py --config-file configs/mimdet/mimdet_vit_base_mask_rcnn_fpn_mr_0p25_800_1333_4xdec_coco_3x.py --num-gpus 4 --num-machines 1 --master_addr 127.0.0.1 --master_port 9998 mae_checkpoint.path=mae_pretrain_vit_base.pth'

Do you have any idea why this happens?

simonJJJ commented 2 years ago

Please use the full MAE pretrained weight (including the decoder, not just the encoder) as mentioned in https://github.com/hustvl/MIMDet#training.
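One way to tell whether a downloaded checkpoint is the full MAE weight or only the encoder is to look at its parameter names. The sketch below is a minimal illustration, not part of MIMDet: it assumes the checkpoint follows the naming of the reference MAE implementation, where decoder-side tensors are called `decoder_*` plus `mask_token`, and that the weights sit under a `"model"` key. The helper names `decoder_keys`, `has_decoder`, and `inspect_checkpoint` are hypothetical.

```python
from typing import Iterable, List

# Assumption: parameter names follow the reference MAE implementation,
# where decoder-side tensors are named "decoder_*" plus "mask_token".
DECODER_PREFIXES = ("decoder_", "mask_token")


def decoder_keys(param_names: Iterable[str]) -> List[str]:
    """Return the sorted parameter names that belong to the MAE decoder."""
    return sorted(n for n in param_names if n.startswith(DECODER_PREFIXES))


def has_decoder(param_names: Iterable[str]) -> bool:
    """True if at least one decoder-side parameter name is present."""
    return bool(decoder_keys(param_names))


def inspect_checkpoint(path: str) -> bool:
    """Load a checkpoint file and report whether it contains decoder weights."""
    import torch  # imported lazily; only needed to read the file

    ckpt = torch.load(path, map_location="cpu")
    # Assumption: weights are stored under "model"; otherwise use the top level.
    state = ckpt.get("model", ckpt)
    found = decoder_keys(state.keys())
    print(f"{len(found)} decoder tensors found in {path}")
    return bool(found)
```

Calling `inspect_checkpoint("mae_pretrain_vit_base.pth")` on the file from the command above would return `False` if that file holds only encoder weights, which would be consistent with the advice in this comment.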

junchen14 commented 2 years ago

Thanks for the quick response and the clear pointer.

I will follow your instructions and re-run the training!

Yuxin-CV commented 2 years ago

Hi @junchen14, it looks like you are using 4 GPUs to train our model with sample ratio = 0.25.

We just added a new training config with a batch size of 16 instead of 64, whose results match those of our default settings.

Specifically, the results (49.9 Box AP / 44.6 Mask AP) match our default settings (49.9 Box AP / 44.7 Mask AP), and are better than the Swin-Base counterpart (49.2 Box AP / 43.5 Mask AP) under a similar total training time (~2d6h on 8x V100).

You can try our new config :)

junchen14 commented 2 years ago

Thanks very much. I will give it a try.

mike-huangdj commented 1 year ago

Hello @junchen14, did you solve this problem? I'm having the same problem. When I run my default configuration, "python lazyconfig_train_net.py --num-gpus 1 --config-file configs/mimdet/mimdet_vit_base_mask_rcnn_fpn_sr_0p5_800_1333_4xdec_coco_3x.py --num-machines 1", I get "FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged." I understand that I should use the full MAE pretrained weights, as the authors advised (@Yuxin-CV), but I am confused: I could not find the decoder weights in question. Where is the decoder, and where should I place its weights?

mike-huangdj commented 1 year ago

When running "python lazyconfig_train_net.py --num-gpus 1 --config-file configs/mimdet/mimdet_vit_base_mask_rcnn_fpn_sr_0p5_800_1333_4xdec_coco_3x.py --num-machines 1", the following appears: [images: 11, 22, 33]

I found the definitions of the encoder and decoder in "mimdet_vit_base_mask_rcnn_fpn_sr_0p5_800_1333_4xdec_coco_3x.py", as shown in the figure: [image: 44]

The encoder weights are set in common.py: [image: 55]

I look forward to your reply. Apologies for any disturbance, and thank you for your patience.