microsoft / SoftTeacher

Semi-Supervised Learning, Object Detection, ICCV2021
MIT License

Same loss_cls error? #168

Open qq1243196045 opened 2 years ago

qq1243196045 commented 2 years ago

I also ran into KeyError: 'loss_cls' during training, so I added a print of loss right before the line that raises the error. The loss looks normal in the iterations before the crash, but in the failing iteration loss_cls is missing from loss. This is where I added the print: ssod/models/soft_teacher.py:243-245

    loss = self.student.roi_head.bbox_head.loss(
        bbox_results["cls_score"],
        bbox_results["bbox_pred"],
        rois,
        *bbox_targets,
        reduction_override="none",
    )
   print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`")
    print(loss)
    print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`")
    loss["loss_cls"] = loss["loss_cls"].sum() / max(bbox_targets[1].sum(), 1.0)
    loss["loss_bbox"] = loss["loss_bbox"].sum() / max(
        bbox_targets[1].size()[0], 1.0
    )
    if len(gt_bboxes[0]) > 0:
        log_image_with_boxes(
            "rcnn_cls",
            student_info["img"][0],
            gt_bboxes[0],
            bbox_tag="pseudo_label",
            labels=gt_labels[0],
            class_names=self.CLASSES,
            interval=500,
            img_norm_cfg=student_info["img_metas"][0]["img_norm_cfg"],
        )
    return loss

This is the printed output:

{'loss_cls': tensor([3.6832e-05, 1.8751e-05, -0.0000e+00,  ..., -0.0000e+00, -0.0000e+00,
        -0.0000e+00], device='cuda:0', grad_fn=<MulBackward0>), 'acc': tensor([100.], device='cuda:0'), 'loss_bbox': tensor(0., device='cuda:0', grad_fn=<SumBackward0>)}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
{'loss_cls': tensor([2.9819e-05, 5.8828e-07, 4.6680e-07,  ..., 3.0180e-08, -0.0000e+00,
        1.9123e-03], device='cuda:0', grad_fn=<MulBackward0>), 'acc': tensor([100.], device='cuda:0'), 'loss_bbox': tensor(0., device='cuda:0', grad_fn=<SumBackward0>)}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
{'loss_cls': tensor([1.6779e-09, 1.6779e-09, 8.3553e-04,  ..., 1.1414e-07, 5.6461e-07,
        5.7065e-08], device='cuda:0', grad_fn=<MulBackward0>), 'acc': tensor([100.], device='cuda:0'), 'loss_bbox': tensor(0., device='cuda:0', grad_fn=<SumBackward0>)}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
{'loss_cls': tensor([-0., -0., -0., -0., -0., -0., -0., -0., -0.], device='cuda:0',
       grad_fn=<MulBackward0>), 'acc': tensor([100.], device='cuda:0'), 'loss_bbox': tensor(0., device='cuda:0', grad_fn=<SumBackward0>)}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
{'loss_bbox': tensor(0., device='cuda:0', grad_fn=<SumBackward0>)}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
Traceback (most recent call last):

In the last print before the crash, loss_cls is missing from loss. In the print just before that, the values are all -0., and in the two prints before that the losses are extremely small (around 1e-9). I don't know how these are connected, but every time before the error occurs the same pattern shows up: the loss in the preceding iteration becomes -0. Setting fp16 to false does not make the problem go away either.
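
If it helps anyone debugging this, a small guard right after the `self.student.roi_head.bbox_head.loss(...)` call can flag the bad iteration (missing key or non-finite values) before the KeyError surfaces downstream. This is only a sketch; the helper name and the NaN/Inf check are mine, not part of the repo:

    import torch

    def sanity_check_loss(loss, tag="unsup_rcnn_cls"):
        """Hypothetical debug helper: flag a missing or non-finite loss_cls early."""
        if "loss_cls" not in loss:
            print(f"[{tag}] loss_cls missing, keys present: {list(loss.keys())}")
            return False
        if not torch.isfinite(loss["loss_cls"]).all():
            print(f"[{tag}] loss_cls contains NaN/Inf: {loss['loss_cls']}")
            return False
        return True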

phelps-matthew commented 2 years ago

I have encountered similar issues but have found that mine are difficult to reproduce. As per #101 you may want to try increasing the batch size.

The KeyError occurs so far into the training run (~8,000 steps) that I cannot yet confirm whether batch size is the core issue.
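
For reference, the per-GPU batch size lives in the data config in mmdetection-style projects. A minimal sketch of the override, assuming the layout of the released SoftTeacher configs (field names, defaults, and the sampler block may differ in your checkout; the values below are illustrative):

    # Partial config override; effective batch size is gpus * samples_per_gpu.
    data = dict(
        samples_per_gpu=8,   # raise this to increase the per-GPU batch size
        workers_per_gpu=4,
        sampler=dict(
            train=dict(
                sample_ratio=[1, 4],  # labeled:unlabeled images per batch in the released configs
            )
        ),
    )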

ericpresas commented 2 years ago

I have the same error, and I noticed that in the last logged iteration before the error is raised the computed loss is really high. Could it be related to the data?


2022-03-19 17:07:50,410 - mmdet.ssod - INFO - Iter [100/180000] lr: 1.988e-03, eta: 4 days, 16:52:48, time: 2.144, data_time: 0.074, memory: 6457, ema_momentum: 0.9900, sup_loss_rpn_cls: 0.4799, sup_loss_rpn_bbox: 0.2061, sup_loss_cls: 0.4395, sup_acc: 88.4707, sup_loss_bbox: 0.3006, unsup_loss_rpn_cls: 0.1637, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0843, unsup_acc: 99.9947, unsup_loss_bbox: 0.0000, loss: 1.6742
2022-03-19 17:09:35,866 - mmdet.ssod - INFO - Iter [150/180000] lr: 2.987e-03, eta: 4 days, 14:21:18, time: 2.109, data_time: 0.074, memory: 6457, ema_momentum: 0.9933, sup_loss_rpn_cls: 0.4306, sup_loss_rpn_bbox: 0.1952, sup_loss_cls: 1.1652, sup_acc: 83.9617, sup_loss_bbox: 0.4929, unsup_loss_rpn_cls: 0.1454, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.1593, unsup_acc: 99.1749, unsup_loss_bbox: 0.0000, loss: 2.5886
2022-03-19 17:11:20,942 - mmdet.ssod - INFO - Iter [200/180000] lr: 3.986e-03, eta: 4 days, 12:58:59, time: 2.102, data_time: 0.072, memory: 6457, ema_momentum: 0.9950, sup_loss_rpn_cls: 0.3917, sup_loss_rpn_bbox: 0.1895, sup_loss_cls: 0.9367, sup_acc: 83.9802, sup_loss_bbox: 0.5335, unsup_loss_rpn_cls: 0.1688, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.1491, unsup_acc: 99.9062, unsup_loss_bbox: 0.0000, loss: 2.3693
2022-03-19 17:13:05,290 - mmdet.ssod - INFO - Iter [250/180000] lr: 4.985e-03, eta: 4 days, 12:00:09, time: 2.087, data_time: 0.070, memory: 6457, ema_momentum: 0.9960, sup_loss_rpn_cls: 3.6690, sup_loss_rpn_bbox: 2.2216, sup_loss_cls: 380.6252, sup_acc: 82.1258, sup_loss_bbox: 28.5039, unsup_loss_rpn_cls: 15.4285, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 1.3266, unsup_acc: 95.7066, unsup_loss_bbox: 0.0000, loss: 431.7747
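
If the loss is genuinely diverging (as the jump at iter 250 suggests), one common mitigation in mmdetection-based projects, alongside lowering the learning rate, is gradient clipping via optimizer_config. A sketch of that knob (the max_norm value is illustrative, and this is not a fix documented by the SoftTeacher authors):

    # mmdetection-style gradient clipping; max_norm is illustrative
    optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
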
phelps-matthew commented 2 years ago

If it is of any help, I believe I have also seen mention of needing to ensure the learning rate is not too high when adjusting the batch size (i.e. larger batch size, smaller learning rate). For my data, I have found the settings of sample ratio, batch size, and lr to be unstable, such that only a very small volume of hyperparameter space yielded decent results, and even those were underwhelming given the amount of labeled data I had.
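
One common convention in mmdetection-based repos is to scale the learning rate linearly with the total batch size; however it is framed, the practical point is to re-derive the lr whenever the batch size or GPU count changes. A tiny sketch of that arithmetic (a general heuristic, not a SoftTeacher-specific recommendation; the reference values are illustrative):

    def scaled_lr(base_lr, base_total_batch, gpus, samples_per_gpu):
        """Linear scaling heuristic: keep lr proportional to the total batch size."""
        return base_lr * (gpus * samples_per_gpu) / base_total_batch

    # e.g. a config tuned at lr=0.01 for a total batch of 40 images,
    # reproduced on 1 GPU with 5 images per GPU:
    print(scaled_lr(0.01, base_total_batch=40, gpus=1, samples_per_gpu=5))  # 0.00125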

ericpresas commented 2 years ago

Setting workers_per_gpu=0, I was able to see that loss=nan. Using a smaller learning rate solved the problem in my case. Thanks @phelps-matthew!!!
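
For completeness, a sketch of what those two changes might look like in an mmdetection-style config (the lr value is illustrative, not a recommended setting; pick it based on your batch size and data):

    data = dict(
        samples_per_gpu=5,
        workers_per_gpu=0,  # load data in the main process, which made the nan easier to see here
    )
    # reduced learning rate; 0.0025 is illustrative only
    optimizer = dict(type="SGD", lr=0.0025, momentum=0.9, weight_decay=0.0001)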