open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. #6237

Closed CharlesNJ closed 3 years ago

CharlesNJ commented 3 years ago

Describe the bug

I think this is related to PyTorch, but I am still trying to understand what the problem is. I am just trying to run a pre-trained model and transfer its weights to train a custom model.

Reproduction

  1. What command or script did you run? python tools/train.py 'configs/truck/cascade_mask_rcnn_swin_small_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_truck.py'

  2. Did you make any modifications on the code or config? Did you understand what you have modified? Yes, but the config file does not seem to be the problem.

  3. What dataset did you use? A custom dataset with detection annotations.

Environment

I don't think this is an environment problem.

Error traceback

    main()
  File "tools/train.py", line 177, in main
    train_detector(
  File "/home/w/cbn/CBNetV2/mmdet/apis/train.py", line 185, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/w/cbn/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/w/cbn/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/w/cbn/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/w/cbn/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/w/cbn/CBNetV2/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/home/w/cbn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/w/cbn/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/w/cbn/CBNetV2/mmdet/models/detectors/base.py", line 171, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/w/cbn/CBNetV2/mmdet/models/detectors/two_stage.py", line 263, in forward_train
    roi_losses = self.roi_head.forward_train(x, img_metas, proposal_list,
  File "/home/w/cbn/CBNetV2/mmdet/models/roi_heads/cascade_roi_head.py", line 246, in forward_train
    bbox_results = self._bbox_forward_train(i, x, sampling_results,
  File "/home/w/cbn/CBNetV2/mmdet/models/roi_heads/cascade_roi_head.py", line 146, in _bbox_forward_train
    bbox_results = self._bbox_forward(stage, x, rois)
  File "/home/w/cbn/CBNetV2/mmdet/models/roi_heads/cascade_roi_head.py", line 136, in _bbox_forward
    cls_score, bbox_pred = bbox_head(bbox_feats)
  File "/home/w/cbn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/w/cbn/CBNetV2/mmdet/models/roi_heads/bbox_heads/convfc_bbox_head.py", line 155, in forward
    x = conv(x)
  File "/home/w/cbn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/w/cbn/lib/python3.8/site-packages/mmcv/cnn/bricks/conv_module.py", line 201, in forward
    x = self.norm(x)
  File "/home/w/cbn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/w/cbn/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 731, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/home/w/cbn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 748, in get_world_size
    return _get_group_size(group)
  File "/home/w/cbn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 274, in _get_group_size
    default_pg = _get_default_group()
  File "/home/w/cbn/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
CharlesNJ commented 3 years ago

I think this has something to do with DistributedDataParallel or DataParallel, but I'm not sure what the problem is. I would appreciate it if you could point me to where to look. Thanks!
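For reference, the "default process group" in the error is what torch.distributed.init_process_group creates. MMDetection's tools/train.py only triggers that call when launched in distributed mode (for example via tools/dist_train.sh, which passes --launcher pytorch); a plain python tools/train.py run does not, so any SyncBN layer in the model fails at exactly this point. A minimal sketch of the initialization that is missing in a non-distributed run (address and port values are illustrative only):

```python
import os
import torch.distributed as dist

# What the distributed launch path effectively sets up before training starts.
# Without it, SyncBatchNorm cannot query the world size in its forward pass and
# raises "Default process group has not been initialized".
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')  # illustrative single-machine values
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group(backend='nccl', rank=0, world_size=1)  # use 'gloo' if NCCL is unavailable
```

In practice you would not call this by hand; launching through the distributed script, or dropping SyncBN from the model (see the later comments), is the usual fix.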

RangiLyu commented 3 years ago

Could you provide more details about the modifications you made to the code and config?

CharlesNJ commented 3 years ago

@RangiLyu Thanks for responding! I had been making some silly mistakes and using the wrong config file. Once I have that sorted out, I will reopen the issue if the error still persists!

Rubenman16 commented 2 years ago

Hello @CharlesNJ , how did you solve it? I'm pretty sure I used the correct config file.

CharlesNJ commented 2 years ago

> Hello @CharlesNJ , how did you solve it? I'm pretty sure I used the correct config file.

Yeah, sorry, I should have documented it better, but I think it was just wrong config file usage on my part. Could you post your error with the traceback here?

CharlesNJ commented 2 years ago

Or even better, just open a new issue and tag me; I will see if I recognize it as an error I have faced.

MohitBurkule commented 1 year ago

This can be resolved by replacing SyncBN with BN in model.roi_head.bbox_head (and everywhere else SyncBN appears in the config), or alternatively by launching distributed training on one or more GPUs so that the process group gets initialized. The root cause is that SyncBN does not work in a single-GPU, non-distributed setup (https://github.com/pytorch/pytorch/issues/63662). A sketch of the config fix follows below.
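For anyone who prefers to patch the config programmatically rather than edit it by hand, here is a minimal sketch assuming a standard MMDetection/mmcv config layout; the syncbn_to_bn helper is hypothetical (not part of MMDetection), and the config path is the one from this issue:

```python
from mmcv import Config

def syncbn_to_bn(node):
    """Recursively rewrite every SyncBN norm config to plain BN."""
    if isinstance(node, dict):
        if node.get('type') == 'SyncBN':
            node['type'] = 'BN'
        for value in node.values():
            syncbn_to_bn(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            syncbn_to_bn(item)

cfg = Config.fromfile(
    'configs/truck/cascade_mask_rcnn_swin_small_patch4_window7_'
    'mstrain_480-800_giou_4conv1f_adamw_3x_truck.py')
syncbn_to_bn(cfg.model)  # e.g. turns norm_cfg=dict(type='SyncBN', ...) into BN
cfg.dump('configs/truck/cascade_mask_rcnn_bn_truck.py')  # save the patched config
```

The equivalent manual change is to edit every norm_cfg=dict(type='SyncBN', requires_grad=True) in the config to type='BN'.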