microsoft / SoftTeacher

Semi-Supervised Learning, Object Detection, ICCV2021
MIT License

sup_loss_bbox: nan and loss:nan #232

Simeon340703 closed this issue 1 year ago

Simeon340703 commented 1 year ago

I am training the SoftTeacher model on a custom dataset with a single GPU. I reduced the image scale to (512, 200) and (512, 400), and I converted the dataset to COCO format. I am using the config file configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_180k.py. I lowered the learning rate to lr=0.0001 and set fp16=None, but neither helped.

If the NaN appeared in unsup_loss_bbox, it would make sense. However, it appears in sup_loss_bbox, the supervised bbox loss. Is it related to my data annotations, from when I converted them to MS COCO? Any help is appreciated.

Here is the training output:

'single_level_grid_anchors would be deprecated soon. '
2022-10-19 17:16:35,728 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
2022-10-19 17:16:45,023 - mmdet.ssod - INFO - Iter [50/180000] lr: 9.890e-06, eta: 9:56:46, time: 0.199, data_time: 0.008, memory: 4210, ema_momentum: 0.9800, sup_loss_rpn_cls: 0.6945, sup_loss_rpn_bbox: 0.2861, sup_loss_cls: 3.2772, sup_acc: 52.2891, sup_loss_bbox: nan, unsup_loss_rpn_cls: 2.8103, unsup_loss_rpn_bbox: 0.5344, unsup_loss_cls: 6.9318, unsup_acc: 61.1875, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:16:54,061 - mmdet.ssod - INFO - Iter [100/180000] lr: 1.988e-05, eta: 9:29:24, time: 0.181, data_time: 0.006, memory: 4210, ema_momentum: 0.9900, sup_loss_rpn_cls: 0.6529, sup_loss_rpn_bbox: 0.2539, sup_loss_cls: 1.4665, sup_acc: 80.2716, sup_loss_bbox: nan, unsup_loss_rpn_cls: 2.3978, unsup_loss_rpn_bbox: 0.4155, unsup_loss_cls: 0.7863, unsup_acc: 92.9492, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:17:03,632 - mmdet.ssod - INFO - Iter [150/180000] lr: 2.987e-05, eta: 9:30:35, time: 0.191, data_time: 0.007, memory: 4210, ema_momentum: 0.9933, sup_loss_rpn_cls: 0.6296, sup_loss_rpn_bbox: 0.2572, sup_loss_cls: 1.5224, sup_acc: 80.0698, sup_loss_bbox: nan, unsup_loss_rpn_cls: 1.9233, unsup_loss_rpn_bbox: 0.3586, unsup_loss_cls: 0.7335, unsup_acc: 95.7422, unsup_loss_bbox: 0.0207, loss: nan
2022-10-19 17:17:13,179 - mmdet.ssod - INFO - Iter [200/180000] lr: 3.986e-05, eta: 9:30:52, time: 0.191, data_time: 0.007, memory: 4210, ema_momentum: 0.9950, sup_loss_rpn_cls: 0.6550, sup_loss_rpn_bbox: 0.2446, sup_loss_cls: 1.2197, sup_acc: 82.3859, sup_loss_bbox: nan, unsup_loss_rpn_cls: 1.1558, unsup_loss_rpn_bbox: 0.1479, unsup_loss_cls: 0.4731, unsup_acc: 97.7305, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:17:22,559 - mmdet.ssod - INFO - Iter [250/180000] lr: 4.985e-05, eta: 9:28:59, time: 0.188, data_time: 0.007, memory: 4210, ema_momentum: 0.9960, sup_loss_rpn_cls: 0.7083, sup_loss_rpn_bbox: 0.2544, sup_loss_cls: 1.0939, sup_acc: 84.3413, sup_loss_bbox: nan, unsup_loss_rpn_cls: 0.6311, unsup_loss_rpn_bbox: 0.0342, unsup_loss_cls: 0.2559, unsup_acc: 98.9141, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:17:31,845 - mmdet.ssod - INFO - Iter [300/180000] lr: 5.984e-05, eta: 9:26:43, time: 0.186, data_time: 0.006, memory: 4210, ema_momentum: 0.9967, sup_loss_rpn_cls: 0.6974, sup_loss_rpn_bbox: 0.2387, sup_loss_cls: 1.1349, sup_acc: 82.1629, sup_loss_bbox: nan, unsup_loss_rpn_cls: 0.3992, unsup_loss_rpn_bbox: 0.0052, unsup_loss_cls: 0.1377, unsup_acc: 99.8633, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:17:40,439 - mmdet.ssod - INFO - Iter [350/180000] lr: 6.983e-05, eta: 9:19:08, time: 0.172, data_time: 0.007, memory: 4210, ema_momentum: 0.9971, sup_loss_rpn_cls: 0.6573, sup_loss_rpn_bbox: 0.2187, sup_loss_cls: 0.9288, sup_acc: 85.2164, sup_loss_bbox: nan, unsup_loss_rpn_cls: 0.3262, unsup_loss_rpn_bbox: 0.0021, unsup_loss_cls: 0.0993, unsup_acc: 99.9805, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:17:49,335 - mmdet.ssod - INFO - Iter [400/180000] lr: 7.982e-05, eta: 9:15:41, time: 0.178, data_time: 0.008, memory: 4210, ema_momentum: 0.9975, sup_loss_rpn_cls: 0.6640, sup_loss_rpn_bbox: 0.2234, sup_loss_cls: 1.0066, sup_acc: 83.5162, sup_loss_bbox: nan, unsup_loss_rpn_cls: 0.3610, unsup_loss_rpn_bbox: 0.0165, unsup_loss_cls: 0.1516, unsup_acc: 99.7539, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:17:58,275 - mmdet.ssod - INFO - Iter [450/180000] lr: 8.981e-05, eta: 9:13:15, time: 0.179, data_time: 0.006, memory: 4210, ema_momentum: 0.9978, sup_loss_rpn_cls: 0.7262, sup_loss_rpn_bbox: 0.2608, sup_loss_cls: 1.1092, sup_acc: 81.7103, sup_loss_bbox: nan, unsup_loss_rpn_cls: 0.2871, unsup_loss_rpn_bbox: 0.0102, unsup_loss_cls: 0.1245, unsup_acc: 99.7812, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:18:07,025 - mmdet.ssod - INFO - Iter [500/180000] lr: 9.980e-05, eta: 9:10:12, time: 0.175, data_time: 0.008, memory: 4210, ema_momentum: 0.9980, sup_loss_rpn_cls: 0.6046, sup_loss_rpn_bbox: 0.2242, sup_loss_cls: 0.8927, sup_acc: 84.4848, sup_loss_bbox: nan, unsup_loss_rpn_cls: 0.2420, unsup_loss_rpn_bbox: 0.0060, unsup_loss_cls: 0.1130, unsup_acc: 99.7812, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:18:17,033 - mmdet.ssod - INFO - Iter [550/180000] lr: 1.000e-04, eta: 9:14:25, time: 0.200, data_time: 0.011, memory: 4210, ema_momentum: 0.9982, sup_loss_rpn_cls: 0.5994, sup_loss_rpn_bbox: 0.2472, sup_loss_cls: 1.0185, sup_acc: 83.1883, sup_loss_bbox: nan, unsup_loss_rpn_cls: 0.3435, unsup_loss_rpn_bbox: 0.0157, unsup_loss_cls: 0.3107, unsup_acc: 99.6367, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:18:26,438 - mmdet.ssod - INFO - Iter [600/180000] lr: 1.000e-04, eta: 9:15:00, time: 0.188, data_time: 0.008, memory: 4210, ema_momentum: 0.9983, sup_loss_rpn_cls: 0.6129, sup_loss_rpn_bbox: 0.2450, sup_loss_cls: 1.1227, sup_acc: 81.1367, sup_loss_bbox: nan, unsup_loss_rpn_cls: 0.2543, unsup_loss_rpn_bbox: 0.0024, unsup_loss_cls: 0.1428, unsup_acc: 99.9102, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:18:35,738 - mmdet.ssod - INFO - Iter [650/180000] lr: 1.000e-04, eta: 9:14:52, time: 0.186, data_time: 0.006, memory: 4210, ema_momentum: 0.9985, sup_loss_rpn_cls: 0.5956, sup_loss_rpn_bbox: 0.2429, sup_loss_cls: 1.0238, sup_acc: 82.1718, sup_loss_bbox: nan, unsup_loss_rpn_cls: 0.2639, unsup_loss_rpn_bbox: 0.0052, unsup_loss_cls: 0.1380, unsup_acc: 99.8750, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:18:45,028 - mmdet.ssod - INFO - Iter [700/180000] lr: 1.000e-04, eta: 9:14:45, time: 0.186, data_time: 0.009, memory: 4210, ema_momentum: 0.9986, sup_loss_rpn_cls: 0.5572, sup_loss_rpn_bbox: 0.2352, sup_loss_cls: 1.0872, sup_acc: 81.8292, sup_loss_bbox: nan, unsup_loss_rpn_cls: 0.2256, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.1419, unsup_acc: 99.8711, unsup_loss_bbox: 0.0000, loss: nan
2022-10-19 17:18:54,566 - mmdet.ssod - INFO - Iter [750/180000] lr: 1.000e-04, eta: 9:15:37, time: 0.191, data_time: 0.010, memory: 4210, ema_momentum: 0.9987, sup_loss_rpn_cls: 0.5361, sup_loss_rpn_bbox: 0.2191, sup_loss_cls: 1.0985, sup_acc: 83.1454, sup_loss_bbox: nan, unsup_loss_rpn_cls: 0.1921, unsup_loss_rpn_bbox: 0.0012, unsup_loss_cls: 0.0937, unsup_acc: 99.9688, unsup_loss_bbox: 0.0000, loss: nan
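
Because the NaN is already present in sup_loss_bbox at the first logged iteration, and the question raises the COCO conversion as a possible cause, one way to investigate is to scan the annotation file for boxes that a regression loss cannot handle. The following is a minimal sketch, not part of SoftTeacher or MMDetection. It assumes the standard COCO layout, where each annotation's bbox is [x, y, width, height] in pixels. The function name find_bad_boxes is hypothetical, and nothing in this thread confirms that degenerate boxes were the actual cause here.

```python
import math


def find_bad_boxes(coco):
    """Return (annotation id, reason) pairs for suspicious boxes.

    `coco` is a dict loaded from a COCO-style annotation file. This is a
    hypothetical helper for illustration; it only reads the "images" and
    "annotations" lists and assumes bbox = [x, y, width, height].
    """
    images = {img["id"]: img for img in coco.get("images", [])}
    problems = []
    for ann in coco.get("annotations", []):
        ann_id = ann.get("id")
        bbox = ann.get("bbox")
        if not isinstance(bbox, (list, tuple)) or len(bbox) != 4:
            problems.append((ann_id, "missing or malformed bbox"))
            continue
        # NaN or inf coordinates propagate straight into the loss.
        if not all(isinstance(v, (int, float)) and math.isfinite(v) for v in bbox):
            problems.append((ann_id, "non-finite coordinate"))
            continue
        x, y, w, h = bbox
        # Zero or negative sizes often come from a wrong box convention,
        # e.g. writing [x1, y1, x2, y2] where [x, y, w, h] is expected.
        if w <= 0 or h <= 0:
            problems.append((ann_id, "non-positive width or height"))
            continue
        img = images.get(ann.get("image_id"))
        if img is None:
            problems.append((ann_id, "unknown image_id"))
        elif x < 0 or y < 0 or x + w > img["width"] or y + h > img["height"]:
            problems.append((ann_id, "box extends outside image"))
    return problems
```

To run it on a real file, load the annotations with json.load and pass the resulting dict to the function. An empty result does not rule out other dataset problems, such as category ids that do not match the classes in the config.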

Simeon340703 commented 1 year ago

The problem was related to the dataset. I