yazidoudou18 opened this issue 3 years ago
I had a similar problem. You can try using a bigger batch size via samples_per_gpu in the config file. You could also try the suggestions from #69.
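For reference, a minimal sketch of what that change could look like in an mmdetection-style config file (the key names follow mmdetection's data dict; the values below are illustrative, not taken from the SoftTeacher configs):

```python
# Illustrative override of the dataloader settings in an mmdetection-style config.
# Adjust the numbers to whatever your GPU memory allows.
data = dict(
    samples_per_gpu=4,  # batch size per GPU; try increasing this
    workers_per_gpu=4,  # dataloader workers per GPU; tune alongside the batch size
)
```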
Thank you for responding. I will give it a try. Do you know how to define the number and IDs of the GPUs used during training? Looks like if I defined
Hello. The training script tools/dist_train.sh launches multiple distributed training processes via torch.distributed.launch. In that script, the parameter nproc_per_node is the number of processes per node; it must be less than or equal to the number of GPUs on the current system, and each process operates on a single GPU, from GPU 0 to GPU (nproc_per_node - 1).
If you want to change the default GPU used by the first process, you can add --node_rank=<ANOTHER_GPU_NUMBER> in tools/dist_train.sh. Another option is to select the device before the training code starts, e.g. torch.cuda.set_device(YOUR_GPU_ID), where YOUR_GPU_ID is an integer device index; see the sketch below.
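For illustration only (this snippet is not from the SoftTeacher repo): a common way to control which physical GPUs a process can see is to mask devices with CUDA_VISIBLE_DEVICES before CUDA is initialized, and then pick an index within the visible set with torch.cuda.set_device.

```python
import os

# Expose only physical GPUs 2 and 3 to this process. This must happen
# before the first CUDA call, ideally before importing anything that
# initializes CUDA.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3")

import torch

# Within the visible set, index 0 now maps to physical GPU 2.
# torch.cuda.set_device accepts a device index (or a torch.device).
torch.cuda.set_device(0)
print(torch.cuda.current_device())  # -> 0, i.e. physical GPU 2
```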
As the tutorial suggested, I started training the soft_teacher_faster_rcnn_r50_caffe_fpn_coco_180k.py model with the default settings on the COCO dataset.
The script that I used for training:

```bash
for FOLD in 1 2 3 4 5; do
    bash tools/dist_train_partially.sh semi ${FOLD} 10 1
done
```

After 10m 33s of training, this is the error that I received. Can anyone help me debug it, please?
```
2021-11-09 22:53:41,757 - mmdet.ssod - INFO - Iter [150/180000] lr: 2.987e-03, eta: 2 days, 2:53:22, time: 0.962, data_time: 0.038, memory: 6776, ema_momentum: 0.9933, sup_loss_rpn_cls: 0.3320, sup_loss_rpn_bbox: 0.1104, sup_loss_cls: 0.5242, sup_acc: 94.4134, sup_loss_bbox: 0.2345, unsup_loss_rpn_cls: 0.1124, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0598, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.3733
2021-11-09 22:54:27,720 - mmdet.ssod - INFO - Iter [200/180000] lr: 3.986e-03, eta: 2 days, 1:38:03, time: 0.919, data_time: 0.037, memory: 6776, ema_momentum: 0.9950, sup_loss_rpn_cls: 0.2594, sup_loss_rpn_bbox: 0.0862, sup_loss_cls: 0.5218, sup_acc: 94.4144, sup_loss_bbox: 0.2317, unsup_loss_rpn_cls: 0.1035, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0579, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.2605
2021-11-09 22:55:17,469 - mmdet.ssod - INFO - Iter [250/180000] lr: 4.985e-03, eta: 2 days, 1:37:55, time: 0.995, data_time: 0.037, memory: 6776, ema_momentum: 0.9960, sup_loss_rpn_cls: 0.2289, sup_loss_rpn_bbox: 0.0875, sup_loss_cls: 0.5456, sup_acc: 94.2955, sup_loss_bbox: 0.2385, unsup_loss_rpn_cls: 0.0753, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0549, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.2307
2021-11-09 22:55:58,398 - mmdet.ssod - INFO - Iter [300/180000] lr: 5.984e-03, eta: 2 days, 0:09:31, time: 0.819, data_time: 0.036, memory: 6776, ema_momentum: 0.9967, sup_loss_rpn_cls: 0.2532, sup_loss_rpn_bbox: 0.0975, sup_loss_cls: 0.5785, sup_acc: 93.4809, sup_loss_bbox: 0.2680, unsup_loss_rpn_cls: 0.0862, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0669, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.3503
2021-11-09 22:56:44,082 - mmdet.ssod - INFO - Iter [350/180000] lr: 6.983e-03, eta: 1 day, 23:46:51, time: 0.914, data_time: 0.035, memory: 6776, ema_momentum: 0.9971, sup_loss_rpn_cls: 0.2057, sup_loss_rpn_bbox: 0.0756, sup_loss_cls: 0.5141, sup_acc: 94.4669, sup_loss_bbox: 0.2291, unsup_loss_rpn_cls: 0.0679, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0541, unsup_acc: 99.9980, unsup_loss_bbox: 0.0000, loss: 1.1465
2021-11-09 22:57:32,200 - mmdet.ssod - INFO - Iter [400/180000] lr: 7.982e-03, eta: 1 day, 23:47:52, time: 0.962, data_time: 0.035, memory: 6776, ema_momentum: 0.9975, sup_loss_rpn_cls: 0.2563, sup_loss_rpn_bbox: 0.1113, sup_loss_cls: 0.5627, sup_acc: 93.1599, sup_loss_bbox: 0.2776, unsup_loss_rpn_cls: 0.0925, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0738, unsup_acc: 99.9895, unsup_loss_bbox: 0.0000, loss: 1.3742
2021-11-09 22:58:20,723 - mmdet.ssod - INFO - Iter [450/180000] lr: 8.981e-03, eta: 1 day, 23:51:11, time: 0.970, data_time: 0.036, memory: 6776, ema_momentum: 0.9978, sup_loss_rpn_cls: 0.2339, sup_loss_rpn_bbox: 0.1020, sup_loss_cls: 0.5518, sup_acc: 94.3613, sup_loss_bbox: 0.2326, unsup_loss_rpn_cls: 0.0830, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0477, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.2511
2021-11-09 22:59:08,703 - mmdet.ssod - INFO - Iter [500/180000] lr: 9.980e-03, eta: 1 day, 23:50:25, time: 0.960, data_time: 0.035, memory: 6776, ema_momentum: 0.9980, sup_loss_rpn_cls: 0.2801, sup_loss_rpn_bbox: 0.1220, sup_loss_cls: 0.4242, sup_acc: 95.6570, sup_loss_bbox: 0.1740, unsup_loss_rpn_cls: 0.0971, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0479, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.1452
Traceback (most recent call last):
  File "./train.py", line 198, in <module>
    main()
  File "./train.py", line 186, in main
    train_detector(
  File "/raid/watk681/Chip_Terminator/Object_Detection/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/watk681/anaconda3/envs/mage-mage-mmdetection+lrp/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/watk681/anaconda3/envs/mage-mage-mmdetection+lrp/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/home/watk681/anaconda3/envs/mage-mage-mmdetection+lrp/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/watk681/anaconda3/envs/mage-mage-mmdetection+lrp/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 238, in train_step
    losses = self(**data)
  File "/home/watk681/anaconda3/envs/mage-mage-mmdetection+lrp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/watk681/anaconda3/envs/mage-mage-mmdetection+lrp/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/watk681/anaconda3/envs/mage-mage-mmdetection+lrp/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/raid/watk681/Chip_Terminator/Object_Detection/SoftTeacher/ssod/models/soft_teacher.py", line 49, in forward_train
    self.foward_unsup_train(
  File "/raid/watk681/Chip_Terminator/Object_Detection/SoftTeacher/ssod/models/soft_teacher.py", line 77, in foward_unsup_train
    return self.compute_pseudo_label_loss(student_info, teacher_info)
  File "/raid/watk681/Chip_Terminator/Object_Detection/SoftTeacher/ssod/models/soft_teacher.py", line 110, in compute_pseudo_label_loss
    self.unsup_rcnn_cls_loss(
  File "/raid/watk681/Chip_Terminator/Object_Detection/SoftTeacher/ssod/models/soft_teacher.py", line 243, in unsup_rcnn_cls_loss
    loss["loss_cls"] = loss["loss_cls"].sum() / max(bbox_targets[1].sum(), 1.0)
KeyError: 'loss_cls'
```
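Not a confirmed fix, just a sketch of where the failure occurs: the KeyError means the loss dict returned for the pseudo-labeled batch has no 'loss_cls' entry, which in mmdetection can happen when the batch yields no classification scores to compute a loss over. Assuming that diagnosis, a hypothetical defensive guard around the line shown in the traceback (variable names taken from that line) would look roughly like this:

```python
# Hypothetical guard inside unsup_rcnn_cls_loss (sketch only, not the
# repository's official fix): only normalize the classification loss if the
# RCNN head actually returned one for this pseudo-label batch.
if "loss_cls" in loss:
    loss["loss_cls"] = loss["loss_cls"].sum() / max(bbox_targets[1].sum(), 1.0)
```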