hasanirtiza / Pedestron

[Pedestron] Generalizable Pedestrian Detection: The Elephant In The Room. @ CVPR2021
https://openaccess.thecvf.com/content/CVPR2021/papers/Hasan_Generalizable_Pedestrian_Detection_The_Elephant_in_the_Room_CVPR_2021_paper.pdf
Apache License 2.0

Error when training htc_ResNeXt101 #129

Closed: 123dddd closed this issue 2 years ago

123dddd commented 2 years ago

Describe the bug: When training the htc_ResNeXt101 detector with the command `python tools/train.py configs/elephant/cityperson/htc_ResNeXt101.py --validate --work_dir ./models_trained/HTC/`, training successfully finishes the first 6 epochs, but during the 7th epoch the following error appears:

2022-01-16 02:00:08,981 - INFO - Epoch [7][1900/2778]   lr: 0.02000, eta: 8:54:06, time: 0.851, data_time: 0.014, memory: 10554, loss_rpn_cls: 537457662808784960.0000, loss_rpn_bbox: 177877539514035232.0000, s0.loss_cls: 8779796694583555072.0000, s0.acc: 79.4747, s0.loss_bbox: 2188881145142797056.0000, s0.loss_mask: 1706358569175435.5000, s1.loss_cls: 149179163544190432.0000, s1.acc: 81.2150, s1.loss_bbox: 307971099801862912.0000, s1.loss_mask: 1895659153800507.2500, s2.loss_cls: 593093651999759104.0000, s2.acc: 80.3081, s2.loss_bbox: 26326124493392948.0000, s2.loss_mask: 1347776918467266.5000, loss: 12765533010470387712.0000
2022-01-16 02:00:52,071 - INFO - Epoch [7][1950/2778]   lr: 0.02000, eta: 8:53:23, time: 0.862, data_time: 0.014, memory: 10554, loss_rpn_cls: 703597399817019434898090671210496.0000, loss_rpn_bbox: 142211833770289042386915505995776.0000, s0.loss_cls: 96590708623828695557353355803099136.0000, s0.acc: 27.1664, s0.loss_bbox: 220626973202716463704370523184037888.0000, s0.loss_mask: 7084849259671005658590220386304.0000, s1.loss_cls: 304813336321379756568929992966144.0000, s1.acc: 71.3955, s1.loss_bbox: 3196226430554024956218411789058048.0000, s1.loss_mask: 24094007064355318124536649482240.0000, s2.loss_cls: 2452373500545427948693333055897600.0000, s2.acc: 79.7458, s2.loss_bbox: 636005480000575789205006416084992.0000, s2.loss_mask: 30171707257473904515461601558528.0000, loss: 324714277385474614732487472367796224.0000
Traceback (most recent call last):
  File "tools/train.py", line 99, in <module>
    main()
  File "tools/train.py", line 95, in main
    logger=logger)
  File "/content/Pedestron/mmdet/apis/train.py", line 63, in train_detector
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/content/Pedestron/mmdet/apis/train.py", line 219, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/usr/local/lib/python3.7/site-packages/mmcv/runner/runner.py", line 358, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.7/site-packages/mmcv/runner/runner.py", line 264, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/content/Pedestron/mmdet/apis/train.py", line 41, in batch_processor
    losses = model(**data)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Pedestron/mmdet/core/fp16/decorators.py", line 49, in new_func
    return old_func(*args, **kwargs)
  File "/content/Pedestron/mmdet/models/detectors/base.py", line 86, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/content/Pedestron/mmdet/models/detectors/htc.py", line 238, in forward_train
    rois, roi_labels, bbox_pred, pos_is_gts, img_meta)
  File "/content/Pedestron/mmdet/core/fp16/decorators.py", line 127, in new_func
    return old_func(*args, **kwargs)
  File "/content/Pedestron/mmdet/models/bbox_heads/bbox_head.py", line 195, in refine_bboxes
    img_meta_)
  File "/content/Pedestron/mmdet/core/fp16/decorators.py", line 127, in new_func
    return old_func(*args, **kwargs)
  File "/content/Pedestron/mmdet/models/bbox_heads/bbox_head.py", line 218, in regress_by_class
    assert rois.size(1) == 4 or rois.size(1) == 5
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
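
For context on the IndexError itself: once the losses diverge, the RPN can stop producing valid proposals, and `regress_by_class` then receives a `rois` tensor with fewer than two dimensions, so `rois.size(1)` fails before the assert can even run. The exact shape of `rois` at failure time is an assumption, but the error message matches a 1-D tensor, as this minimal snippet reproduces:

```python
import torch

# An empty 1-D tensor, e.g. what an empty proposal list can collapse to.
rois = torch.empty(0)

# Asking for the size of dimension 1 on a 1-D tensor raises exactly:
# IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
rois.size(1)
```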

Reproduction

  1. What command or script did you run? `python tools/train.py configs/elephant/cityperson/htc_ResNeXt101.py --validate --work_dir ./models_trained/HTC/`

  2. Did you make any modifications on the code or config? Did you understand what you have modified? In the config file of HTC, I set imgs_per_gpu=1, workers_per_gpu=2 and resume_from = None (see the sketch after this list).
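
For reference, the modified fields would look roughly like this (a sketch in the mmdetection 1.x config style that Pedestron uses; surrounding fields are elided):

```python
# configs/elephant/cityperson/htc_ResNeXt101.py (excerpt, sketched)
data = dict(
    imgs_per_gpu=1,     # batch size per GPU, lowered for a single GPU
    workers_per_gpu=2,  # dataloader worker processes per GPU
    # ... train/val/test dataset definitions unchanged ...
)
resume_from = None      # train from scratch rather than resuming a checkpoint
```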

Environment

hasanirtiza commented 2 years ago

I am afraid this is a tricky problem. You can read about it in more detail here. In short, the problem is due to the losses diverging, blowing up to the huge values visible in your log.

A quick fix is to resume training by loading the last saved checkpoint; hopefully it will work. There is a field at the end of the config, load_from; pass it the path of your last model and see if it works. A sketch of that change follows.
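
A minimal sketch of the maintainer's suggestion, assuming mmdetection 1.x-style checkpoints named `epoch_N.pth` in the work directory (the exact filename below is hypothetical):

```python
# At the end of configs/elephant/cityperson/htc_ResNeXt101.py
load_from = './models_trained/HTC/epoch_6.pth'  # hypothetical path to the last good checkpoint
resume_from = None
```

Note that in mmdetection-style configs, `load_from` initializes only the model weights, while `resume_from` also restores the optimizer state and epoch counter.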

123dddd commented 2 years ago

Thanks! I will have a look at the similar issues from mmdet and try changing some hyper-parameters in the config. I suspect the default LR of 0.02 is a bit large for my single-GPU case, so I will reduce it to 0.005 and see if this error happens again.
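
A sketch of that change in the config's optimizer section (mmdetection 1.x style; a default LR of 0.02 is typically tuned for multi-GPU training, so a single GPU warrants a smaller value). The grad_clip line is a common additional guard against exploding losses and is an assumption here, not a confirmed stock setting:

```python
# Optimizer section of the config (sketch)
optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001)
# Gradient clipping is a common extra safeguard against loss blow-ups;
# enabling it here is an assumption, not necessarily the stock setting.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
```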