open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0
5.2k stars 1.53k forks

Nuimages segmentation task error #1255

Closed konyul closed 2 years ago

konyul commented 2 years ago
  1. When I test Cascade Mask R-CNN with the command python tools/test.py configs/nuimages/cascade_mask_rcnn_r50_fpn_1x_nuim.py cascade_mask_rcnn_r50_fpn_1x_nuim_20201008_195342-1147c036.pth --eval segm, the following error occurs:

     Exception has occurred: IndexError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
     only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
       File "/opt/conda/lib/python3.8/site-packages/mmdet/datasets/coco.py", line 288, in _segm2json
         if isinstance(segms[i]['counts'], bytes):
       File "/opt/conda/lib/python3.8/site-packages/mmdet/datasets/coco.py", line 320, in results2json
         json_results = self._segm2json(results)
       File "/opt/conda/lib/python3.8/site-packages/mmdet/datasets/coco.py", line 383, in format_results
         result_files = self.results2json(results, jsonfile_prefix)
       File "/opt/conda/lib/python3.8/site-packages/mmdet/datasets/coco.py", line 438, in evaluate
         result_files, tmp_dir = self.format_results(results, jsonfile_prefix)
       File "/mnt/sda/kypark/mmdetection3d/tools/test.py", line 234, in main
         print(dataset.evaluate(outputs, **eval_kwargs))
       File "/mnt/sda/kypark/mmdetection3d/tools/test.py", line 238, in <module>
         main()
       File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
         exec(code, run_globals)
       File "/opt/conda/lib/python3.8/runpy.py", line 97, in _run_module_code
         _run_code(code, mod_globals, init_globals,
       File "/opt/conda/lib/python3.8/runpy.py", line 265, in run_path
         return _run_module_code(code, init_globals, run_name,
       File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
         exec(code, run_globals)
       File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main (Current frame)
         return _run_code(code, main_globals, None,

  2. When I train HTC with the following command: CUDA_VISIBLE_DEVICES=0,1 tools/dist_train.sh configs/nuimages/htc_x101_64x4d_fpn_dconv_c3-c5_coco-20e_16x1_20e_nuim.py 2, the training loss becomes NaN after a few hundred iterations (the first NaN values appear in the entry logged at iteration 400):

     2022-02-19 12:52:48,269 - mmdet - INFO - Epoch [1][50/30105] lr: 1.978e-03, eta: 10 days, 5:31:04, time: 1.468, data_time: 0.097, memory: 12798, loss_rpn_cls: 0.0068, loss_rpn_bbox: 0.0114, loss_semantic_seg: 0.5636, s0.loss_cls: 0.1252, s0.acc: 95.0645, s0.loss_bbox: 0.0670, s0.loss_mask: 0.2330, s1.loss_cls: 0.0577, s1.acc: 95.4884, s1.loss_bbox: 0.1056, s1.loss_mask: 0.1148, s2.loss_cls: 0.0297, s2.acc: 95.2546, s2.loss_bbox: 0.0784, s2.loss_mask: 0.0555, loss: 1.4487
     2022-02-19 12:53:59,565 - mmdet - INFO - Epoch [1][100/30105] lr: 3.976e-03, eta: 10 days, 1:58:29, time: 1.426, data_time: 0.032, memory: 13125, loss_rpn_cls: 0.0060, loss_rpn_bbox: 0.0132, loss_semantic_seg: 0.0275, s0.loss_cls: 0.1275, s0.acc: 94.9297, s0.loss_bbox: 0.0736, s0.loss_mask: 0.2282, s1.loss_cls: 0.0604, s1.acc: 95.3703, s1.loss_bbox: 0.1142, s1.loss_mask: 0.1121, s2.loss_cls: 0.0315, s2.acc: 95.0123, s2.loss_bbox: 0.0871, s2.loss_mask: 0.0552, loss: 0.9365
     2022-02-19 12:55:11,869 - mmdet - INFO - Epoch [1][150/30105] lr: 5.974e-03, eta: 10 days, 1:54:03, time: 1.446, data_time: 0.027, memory: 13125, loss_rpn_cls: 0.0109, loss_rpn_bbox: 0.0140, loss_semantic_seg: 0.0277, s0.loss_cls: 0.1552, s0.acc: 94.1270, s0.loss_bbox: 0.0862, s0.loss_mask: 0.2366, s1.loss_cls: 0.0741, s1.acc: 94.4801, s1.loss_bbox: 0.1260, s1.loss_mask: 0.1171, s2.loss_cls: 0.0376, s2.acc: 94.2086, s2.loss_bbox: 0.0870, s2.loss_mask: 0.0571, loss: 1.0296
     2022-02-19 12:56:22,025 - mmdet - INFO - Epoch [1][200/30105] lr: 7.972e-03, eta: 10 days, 0:03:31, time: 1.403, data_time: 0.044, memory: 13272, loss_rpn_cls: 0.0211, loss_rpn_bbox: 0.0258, loss_semantic_seg: 0.0369, s0.loss_cls: 0.1864, s0.acc: 92.9004, s0.loss_bbox: 0.1127, s0.loss_mask: 0.2577, s1.loss_cls: 0.0904, s1.acc: 93.2562, s1.loss_bbox: 0.1449, s1.loss_mask: 0.1242, s2.loss_cls: 0.0454, s2.acc: 92.8702, s2.loss_bbox: 0.0924, s2.loss_mask: 0.0600, loss: 1.1978
     2022-02-19 12:57:30,625 - mmdet - INFO - Epoch [1][250/30105] lr: 9.970e-03, eta: 9 days, 21:54:11, time: 1.372, data_time: 0.035, memory: 13272, loss_rpn_cls: 0.0269, loss_rpn_bbox: 0.0291, loss_semantic_seg: 0.0324, s0.loss_cls: 0.2031, s0.acc: 92.5410, s0.loss_bbox: 0.1045, s0.loss_mask: 0.2751, s1.loss_cls: 0.0991, s1.acc: 92.7682, s1.loss_bbox: 0.1308, s1.loss_mask: 0.1356, s2.loss_cls: 0.0490, s2.acc: 92.5318, s2.loss_bbox: 0.0882, s2.loss_mask: 0.0676, loss: 1.2413
     2022-02-19 12:58:40,022 - mmdet - INFO - Epoch [1][300/30105] lr: 1.197e-02, eta: 9 days, 20:54:20, time: 1.388, data_time: 0.032, memory: 13272, loss_rpn_cls: 0.0291, loss_rpn_bbox: 0.0266, loss_semantic_seg: 0.0356, s0.loss_cls: 0.2330, s0.acc: 92.1406, s0.loss_bbox: 0.1194, s0.loss_mask: 0.3224, s1.loss_cls: 0.1156, s1.acc: 91.9972, s1.loss_bbox: 0.1478, s1.loss_mask: 0.1520, s2.loss_cls: 0.0563, s2.acc: 91.8879, s2.loss_bbox: 0.0879, s2.loss_mask: 0.0738, loss: 1.3997
     2022-02-19 12:59:50,268 - mmdet - INFO - Epoch [1][350/30105] lr: 1.397e-02, eta: 9 days, 20:35:39, time: 1.405, data_time: 0.041, memory: 13272, loss_rpn_cls: 0.0434, loss_rpn_bbox: 0.0288, loss_semantic_seg: 0.0487, s0.loss_cls: 0.2206, s0.acc: 92.7305, s0.loss_bbox: 0.1241, s0.loss_mask: 0.3407, s1.loss_cls: 0.1056, s1.acc: 93.0553, s1.loss_bbox: 0.1470, s1.loss_mask: 0.1621, s2.loss_cls: 0.0488, s2.acc: 93.3164, s2.loss_bbox: 0.0788, s2.loss_mask: 0.0766, loss: 1.4252
     2022-02-19 13:00:56,281 - mmdet - INFO - Epoch [1][400/30105] lr: 1.596e-02, eta: 9 days, 18:35:09, time: 1.320, data_time: 0.030, memory: 13272, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_semantic_seg: nan, s0.loss_cls: nan, s0.acc: 79.7120, s0.loss_bbox: nan, s0.loss_mask: nan, s1.loss_cls: nan, s1.acc: 80.0003, s1.loss_bbox: nan, s1.loss_mask: nan, s2.loss_cls: nan, s2.acc: 79.9697, s2.loss_bbox: nan, s2.loss_mask: nan, loss: nan
     2022-02-19 13:01:56,643 - mmdet - INFO - Epoch [1][450/30105] lr: 1.796e-02, eta: 9 days, 14:55:16, time: 1.207, data_time: 0.039, memory: 13272, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_semantic_seg: nan, s0.loss_cls: nan, s0.acc: 38.5134, s0.loss_bbox: nan, s0.loss_mask: nan, s1.loss_cls: nan, s1.acc: 38.5134, s1.loss_bbox: nan, s1.loss_mask: nan, s2.loss_cls: nan, s2.acc: 38.5134, s2.loss_bbox: nan, s2.loss_mask: nan, loss: nan
     2022-02-19 13:02:53,735 - mmdet - INFO - Epoch [1][500/30105] lr: 1.996e-02, eta: 9 days, 10:53:31, time: 1.142, data_time: 0.042, memory: 13272, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_semantic_seg: nan, s0.loss_cls: nan, s0.acc: 35.8741, s0.loss_bbox: nan, s0.loss_mask: nan, s1.loss_cls: nan, s1.acc: 35.8741, s1.loss_bbox: nan, s1.loss_mask: nan, s2.loss_cls: nan, s2.acc: 35.8741, s2.loss_bbox: nan, s2.loss_mask: nan, loss: nan
     2022-02-19 13:03:53,128 - mmdet - INFO - Epoch [1][550/30105] lr: 2.000e-02, eta: 9 days, 8:17:33, time: 1.188, data_time: 0.038, memory: 13272, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_semantic_seg: nan, s0.loss_cls: nan, s0.acc: 43.2959, s0.loss_bbox: nan, s0.loss_mask: nan, s1.loss_cls: nan, s1.acc: 43.2959, s1.loss_bbox: nan, s1.loss_mask: nan, s2.loss_cls: nan, s2.acc: 43.2959, s2.loss_bbox: nan, s2.loss_mask: nan, loss: nan
     2022-02-19 13:04:51,275 - mmdet - INFO - Epoch [1][600/30105] lr: 2.000e-02, eta: 9 days, 5:46:36, time: 1.163, data_time: 0.031, memory: 13272, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_semantic_seg: nan, s0.loss_cls: nan, s0.acc: 41.6359, s0.loss_bbox: nan, s0.loss_mask: nan, s1.loss_cls: nan, s1.acc: 41.6359, s1.loss_bbox: nan, s1.loss_mask: nan, s2.loss_cls: nan, s2.acc: 41.6359, s2.loss_bbox: nan, s2.loss_mask: nan, loss: nan
     2022-02-19 13:05:46,677 - mmdet - INFO - Epoch [1][650/30105] lr: 2.000e-02, eta: 9 days, 2:56:23, time: 1.108, data_time: 0.037, memory: 13272, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_semantic_seg: nan, s0.loss_cls: nan, s0.acc: 35.3376, s0.loss_bbox: nan, s0.loss_mask: nan, s1.loss_cls: nan, s1.acc: 35.3376, s1.loss_bbox: nan, s1.loss_mask: nan, s2.loss_cls: nan, s2.acc: 35.3376, s2.loss_bbox: nan, s2.loss_mask: nan, loss: nan
     2022-02-19 13:06:44,278 - mmdet - INFO - Epoch [1][700/30105] lr: 2.000e-02, eta: 9 days, 1:01:53, time: 1.152, data_time: 0.037, memory: 13272, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_semantic_seg: nan, s0.loss_cls: nan, s0.acc: 42.5351, s0.loss_bbox: nan, s0.loss_mask: nan, s1.loss_cls: nan, s1.acc: 42.5351, s1.loss_bbox: nan, s1.loss_mask: nan, s2.loss_cls: nan, s2.acc: 42.5351, s2.loss_bbox: nan, s2.loss_mask: nan, loss: nan

Tai-Wang commented 2 years ago

Please show the complete log for the first error. For the second one, the original model was trained on 16 GPUs with a batch size of 1 per GPU, so you may need to reduce the learning rate to 1/8 of the original in your case.
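
The 1/8 factor is consistent with scaling the learning rate linearly with the total batch size. A minimal sketch of that arithmetic, assuming the base learning rate of 0.02 that the training log above reaches after warmup and 1 sample per GPU in both setups; scale_lr is a hypothetical helper name, not an mmdet or mmdet3d function:

```python
def scale_lr(base_lr, base_gpus, base_samples_per_gpu, gpus, samples_per_gpu):
    """Scale a learning rate linearly with the total batch size.

    Hypothetical helper for illustration; not part of mmdet or mmdet3d.
    """
    base_batch = base_gpus * base_samples_per_gpu
    new_batch = gpus * samples_per_gpu
    if base_batch <= 0 or new_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch / base_batch


# Reference setup from the reply: 16 GPUs x 1 sample per GPU.
# Setup in this issue: 2 GPUs (CUDA_VISIBLE_DEVICES=0,1), assumed 1 sample per GPU.
new_lr = scale_lr(0.02, base_gpus=16, base_samples_per_gpu=1,
                  gpus=2, samples_per_gpu=1)
print(new_lr)
```

With these numbers the result is 0.02 x 2 / 16 = 0.0025, which would then replace the optimizer learning rate in the config.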

konyul commented 2 years ago

Please show the complete log for the first error. For the second one, the original model was trained on 16 GPUs with a batch size of 1 per GPU, so you may need to reduce the learning rate to 1/8 of the original in your case.

Thank you for your reply! I will try reducing the lr to 1/8.

For the first error, the full log is:

root@f5a3d94f3145:/mnt/sda/kypark/mmdetection3d# python3 tools/test.py configs/nuimages/cascade_mask_rcnn_r50_fpn_1x_nuim.py cascade_mask_rcnn_r50_fpn_1x_nuim_20201008_195342-1147c036.pth --eval segm

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
load checkpoint from local path: cascade_mask_rcnn_r50_fpn_1x_nuim_20201008_195342-1147c036.pth
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 50/50, 6.7 task/s, elapsed: 7s, ETA: 0s
Traceback (most recent call last):
  File "tools/test.py", line 238, in <module>
    main()
  File "tools/test.py", line 234, in main
    print(dataset.evaluate(outputs, **eval_kwargs))
  File "/opt/conda/lib/python3.8/site-packages/mmdet/datasets/coco.py", line 438, in evaluate
    result_files, tmp_dir = self.format_results(results, jsonfile_prefix)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/datasets/coco.py", line 383, in format_results
    result_files = self.results2json(results, jsonfile_prefix)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/datasets/coco.py", line 320, in results2json
    json_results = self._segm2json(results)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/datasets/coco.py", line 288, in _segm2json
    if isinstance(segms[i]['counts'], bytes):
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

Thank you

Tai-Wang commented 2 years ago

Sorry for the late reply. Is this problem solved now? It's really strange because the error occurs inside mmdet. Could the cause be an invalid index i or an invalid 'counts' key?
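
One way to narrow this down: the message in the traceback is the one NumPy raises when an array is indexed with something like a string, so segms[i] may be a raw mask array rather than an RLE dict with a 'counts' key. That is a guess from the error text, not a confirmed diagnosis. The sketch below uses a hypothetical helper, find_non_rle, to list the entries in one image's mask results that are not such dicts; the nested layout it assumes (a list over classes of lists over instances) is also an assumption.

```python
def find_non_rle(seg_per_class):
    """Return (class_idx, instance_idx, type_name) for every mask entry
    that is not a dict containing a 'counts' key.

    Hypothetical debugging helper; not part of mmdet or mmdet3d.
    Assumes seg_per_class is a list over classes of lists over instances.
    """
    bad = []
    for cls_idx, masks in enumerate(seg_per_class):
        for inst_idx, mask in enumerate(masks):
            if not (isinstance(mask, dict) and "counts" in mask):
                bad.append((cls_idx, inst_idx, type(mask).__name__))
    return bad


# Toy data: class 0 holds one RLE-style dict, class 1 holds a raw nested
# list standing in for an unencoded mask array.
example = [
    [{"size": [2, 2], "counts": b"04"}],
    [[[True, False], [False, True]]],
]
print(find_non_rle(example))  # prints [(1, 0, 'list')]
```

Running a check like this on the outputs before dataset.evaluate is called would show whether the masks were encoded before evaluation.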

Bin-ze commented 1 year ago

Can you tell me how you prepared your nuImages dataset? I would like to reproduce some paper results on it, but the dataset is huge, and I don't know how to prepare it or how to check my numbers against the paper's results.