I have encountered similar issues, but mine are difficult to reproduce. As suggested in #101, you may want to try increasing the batch size.
The KeyError occurs so far into the training run (~8000 steps) that I cannot yet confirm whether the batch size is the core issue.
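For reference, increasing the batch size in an mmdetection-style config is typically done through `samples_per_gpu`. The fragment below is an illustrative sketch with example values, not this repository's exact config:

```python
# Illustrative mmdetection-style config fragment (example values only).
data = dict(
    samples_per_gpu=8,   # per-GPU batch size; raising this is what #101 suggests
    workers_per_gpu=2,   # dataloader workers per GPU
)
```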
I have the same error, and I noticed that in the last training iteration before the error is raised, the computed loss is extremely high. Could it be related to the data?
```
2022-03-19 17:07:50,410 - mmdet.ssod - INFO - Iter [100/180000] lr: 1.988e-03, eta: 4 days, 16:52:48, time: 2.144, data_time: 0.074, memory: 6457, ema_momentum: 0.9900, sup_loss_rpn_cls: 0.4799, sup_loss_rpn_bbox: 0.2061, sup_loss_cls: 0.4395, sup_acc: 88.4707, sup_loss_bbox: 0.3006, unsup_loss_rpn_cls: 0.1637, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0843, unsup_acc: 99.9947, unsup_loss_bbox: 0.0000, loss: 1.6742
2022-03-19 17:09:35,866 - mmdet.ssod - INFO - Iter [150/180000] lr: 2.987e-03, eta: 4 days, 14:21:18, time: 2.109, data_time: 0.074, memory: 6457, ema_momentum: 0.9933, sup_loss_rpn_cls: 0.4306, sup_loss_rpn_bbox: 0.1952, sup_loss_cls: 1.1652, sup_acc: 83.9617, sup_loss_bbox: 0.4929, unsup_loss_rpn_cls: 0.1454, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.1593, unsup_acc: 99.1749, unsup_loss_bbox: 0.0000, loss: 2.5886
2022-03-19 17:11:20,942 - mmdet.ssod - INFO - Iter [200/180000] lr: 3.986e-03, eta: 4 days, 12:58:59, time: 2.102, data_time: 0.072, memory: 6457, ema_momentum: 0.9950, sup_loss_rpn_cls: 0.3917, sup_loss_rpn_bbox: 0.1895, sup_loss_cls: 0.9367, sup_acc: 83.9802, sup_loss_bbox: 0.5335, unsup_loss_rpn_cls: 0.1688, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.1491, unsup_acc: 99.9062, unsup_loss_bbox: 0.0000, loss: 2.3693
2022-03-19 17:13:05,290 - mmdet.ssod - INFO - Iter [250/180000] lr: 4.985e-03, eta: 4 days, 12:00:09, time: 2.087, data_time: 0.070, memory: 6457, ema_momentum: 0.9960, sup_loss_rpn_cls: 3.6690, sup_loss_rpn_bbox: 2.2216, sup_loss_cls: 380.6252, sup_acc: 82.1258, sup_loss_bbox: 28.5039, unsup_loss_rpn_cls: 15.4285, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 1.3266, unsup_acc: 95.7066, unsup_loss_bbox: 0.0000, loss: 431.7747
```
If it is of any help, I believe I have also seen mention of needing to ensure the learning rate is not too high when adjusting the batch size (i.e., larger batch size, smaller learning rate). For my data, I found the settings of sample ratio, batch size, and learning rate to be unstable, such that only a very small region of hyperparameter space yielded decent results, and even those were underwhelming given the amount of labeled data I had.
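In config terms, the three coupled knobs referred to above look roughly like the sketch below. Field names and values are illustrative examples in mmdetection-style syntax, not the repository's recommended settings:

```python
# Illustrative sketch of the coupled hyperparameters (example values only).
data = dict(
    samples_per_gpu=4,                      # batch size per GPU
    sampler=dict(
        train=dict(sample_ratio=[1, 4]),    # labeled:unlabeled ratio per batch
    ),
)
optimizer = dict(type='SGD', lr=0.005)      # lowered if the loss diverges
```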
Setting `workers_per_gpu=0`, I was able to see that `loss=nan`. Using a smaller learning rate solved the problem in my case. Thanks @phelps-matthew!!!
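For others debugging this: `workers_per_gpu=0` keeps data loading in the main process, which makes prints and stack traces around the failure easier to read. Below is a minimal sketch of a guard that fails fast on non-finite losses; the helper is hypothetical, not code from this repository:

```python
import torch

def assert_finite_losses(losses, step):
    """Raise immediately if any loss value is NaN/Inf, naming the culprit key."""
    for name, value in losses.items():
        # mmdet loss dicts may hold a tensor or a list of tensors per key
        tensors = value if isinstance(value, (list, tuple)) else [value]
        for t in tensors:
            if torch.is_tensor(t) and not torch.isfinite(t).all():
                raise RuntimeError(
                    f'{name} became non-finite at step {step}: {t}')
```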
I ran into the same KeyError: 'loss_cls' during training, so I printed the loss right before the point where the error is raised. The loss is normal until the error occurs, but at the failing iteration loss_cls is missing from the loss dict. This is where I added the print: ssod/models/soft_teacher.py:243-245
Here is the printed output:
In the last print before the error, loss_cls is missing from the loss dict; in the print before that, a value of -0 appears; and in the two prints before that, the loss is extremely small, around 1e-9. I don't know how these are connected, but every time right before the error, the same pattern shows up: the preceding loss becomes -0. Setting fp16 to false does not make the problem go away either.
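A sketch of the kind of print/check described above; the helper and its placement are hypothetical, and only the path ssod/models/soft_teacher.py:243-245 comes from the comment:

```python
# Hypothetical debugging helper; call it where the loss dict is assembled
# (around ssod/models/soft_teacher.py:243-245 per the comment above).
def debug_loss_dict(losses, step):
    print(f'[step {step}] loss keys: {sorted(losses.keys())}')
    if 'loss_cls' not in losses:
        # This is the condition that later surfaces as KeyError: 'loss_cls'
        print(f'[step {step}] WARNING: loss_cls missing, full dict: {losses}')
```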