open-mmlab / mmaction2

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
https://mmaction2.readthedocs.io
Apache License 2.0

RuntimeError: Default process group has not been initialized, finetuning r2plus1d on ucf101 #849

Closed richardkxu closed 3 years ago

richardkxu commented 3 years ago

Describe the bug

Hi, I have encountered an error complaining that PyTorch distributed training has not been initialized when finetuning the r2plus1d model on the ucf101 dataset. I followed the "finetuning tutorial" to set up a new config file for ucf101. The only change is the dataset, and the same config works for finetuning irCSN on ucf101. As far as I can tell, r2plus1d and irCSN use the same runtime config and the same mmaction2/tools/train.py. I am really confused about where this error comes from and how to fix it. Thank you!

2021-05-01 15:27:29,626 - mmaction - INFO - workflow: [('train', 1)], max: 90 epochs
Traceback (most recent call last):
  File "/home/richardkxu/Documents/mmaction2/tools/train.py", line 199, in <module>
    main()
  File "/home/richardkxu/Documents/mmaction2/tools/train.py", line 195, in main
    meta=meta)
  File "/home/richardkxu/Documents/mmaction2/mmaction/apis/train.py", line 163, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/richardkxu/Documents/mmaction2/mmaction/models/recognizers/base.py", line 267, in train_step
    losses = self(imgs, label, return_loss=True, **aux_info)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/richardkxu/Documents/mmaction2/mmaction/models/recognizers/base.py", line 229, in forward
    return self.forward_train(imgs, label, **kwargs)
  File "/home/richardkxu/Documents/mmaction2/mmaction/models/recognizers/recognizer3d.py", line 17, in forward_train
    x = self.extract_feat(imgs)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/home/richardkxu/Documents/mmaction2/mmaction/models/recognizers/base.py", line 130, in extract_feat
    x = self.backbone(imgs)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/richardkxu/Documents/mmaction2/mmaction/models/backbones/resnet2plus1d.py", line 42, in forward
    x = self.conv1(x)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/cnn/bricks/conv_module.py", line 200, in forward
    x = self.norm(x)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 532, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 711, in get_world_size
    return _get_group_size(group)
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 263, in _get_group_size
    default_pg = _get_default_group()
  File "/home/richardkxu/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Process finished with exit code 1

Reproduction

  1. What command or script did you run?
CUDA_VISIBLE_DEVICES=0; python mmaction2/tools/train.py configs/recognition/r2plus1d/r2plus1d_r34_8x8x1_180e_ucf101_rgb.py --validate --seed 0 --deterministic

I did not modify any built-in r2plus1d config files; my config file is as follows:

_base_ = [
    '../../_base_/models/r2plus1d_r34.py',
    '../../_base_/default_runtime.py'
]

# dataset settings
dataset_type = 'RawframeDataset'
data_root = 'data/ucf101/rawframes/'
data_root_val = 'data/ucf101/rawframes/'
split = 1  # official train/test splits. valid numbers: 1, 2, 3
ann_file_train = f'data/ucf101/ucf101_train_split_{split}_rawframes.txt'
ann_file_val = f'data/ucf101/ucf101_val_split_{split}_rawframes.txt'
ann_file_test = f'data/ucf101/ucf101_val_split_{split}_rawframes.txt'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)
train_pipeline = [
    dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=1),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
val_pipeline = [
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=8,
        num_clips=1,
        test_mode=True),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
test_pipeline = [
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=8,
        num_clips=10,
        test_mode=True),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='ThreeCrop', crop_size=256),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
data = dict(
    videos_per_gpu=8,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=data_root,
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=data_root_val,
        pipeline=val_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=data_root_val,
        pipeline=test_pipeline))
# optimizer
optimizer = dict(
    type='SGD', lr=0.1, momentum=0.9,
    weight_decay=0.0001)  # this lr is used for 8 gpus
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
# learning policy
lr_config = dict(policy='CosineAnnealing', min_lr=0)
#total_epochs = 180
total_epochs = 90

# runtime settings
checkpoint_config = dict(interval=5)
evaluation = dict(
    interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'])
work_dir = './work_dirs/r2plus1d_r34_8x8x1_180e_ucf101_rgb/'
find_unused_parameters = False
load_from = 'https://download.openmmlab.com/mmaction/recognition/r2plus1d/r2plus1d_r34_256p_8x8x1_180e_kinetics400_rgb/r2plus1d_r34_256p_8x8x1_180e_kinetics400_rgb_20200729-aa94765e.pth'
resume_from = None

dreamerlin commented 3 years ago

It seems your master port is already in use; you should change it as in https://github.com/open-mmlab/mmaction2/blob/master/tools/slurm_train.sh#L3

richardkxu commented 3 years ago

Hi @dreamerlin ,

I got the same error when running the following cmd on 1 GPU on a single machine with 4 GPUs:

MASTER_PORT=$((12000 + $RANDOM % 20000));CUDA_VISIBLE_DEVICES=0; python mmaction2/tools/train.py configs/recognition/r2plus1d/r2plus1d_r34_8x8x1_180e_ucf101_rgb.py --validate --seed 0 --deterministic

I am not using distributed training or slurm. I use the same command to run the irCSN script and it works without setting MASTER_PORT. I am wondering if there is any difference between the r2plus1d and irCSN runtime configs?

congee524 commented 3 years ago

Maybe you can use distributed training with GPUS=1 and a different port.
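
For reference, a sketch of such a launch (assuming the standard tools/dist_train.sh wrapper, which reads the PORT environment variable; the port value here is illustrative):

# Hypothetical single-GPU distributed launch on a non-default port:
PORT=29501 bash tools/dist_train.sh \
    configs/recognition/r2plus1d/r2plus1d_r34_8x8x1_180e_ucf101_rgb.py 1 \
    --validate --seed 0 --deterministic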

richardkxu commented 3 years ago

That does not fix the error either, and I don't think it is related to the port being in use: I used the same command to run irCSN without any problem. I think there might be a bug in the r2plus1d dist implementation or runtime config.

innerlee commented 3 years ago

Hi @richardkxu, this is caused by SyncBN: https://github.com/open-mmlab/mmaction2/blob/master/configs/_base_/models/r2plus1d_r34.py#L11. You can replace it with the usual BN. SyncBN assumes you are in a distributed environment, so PyTorch tries to get the world size, which triggers the error because the distributed process group has not been initialized.
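
For example, a minimal sketch of the override in the derived ucf101 config (assuming mmcv's config inheritance merges the nested model dict from the _base_ r2plus1d_r34.py file; not an official patch):

# Hypothetical override in r2plus1d_r34_8x8x1_180e_ucf101_rgb.py:
# swap SyncBN for plain BN3d so no process group is needed on a single GPU.
model = dict(
    backbone=dict(
        norm_cfg=dict(type='BN3d', requires_grad=True, eps=1e-3)))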

innerlee commented 3 years ago

It is hard to debug because the **new** config system buries the important info somewhere no one ever notices.

richardkxu commented 3 years ago

Thanks @innerlee! The error was fixed after replacing norm_cfg=dict(type='SyncBN', requires_grad=True, eps=1e-3) with norm_cfg=dict(type='BN3d', requires_grad=True, eps=1e-3). Hopefully this fix can be added to the next release.

abhishek0696 commented 2 years ago

@richardkxu I had the same exact error. Thanks a tonne, @innerlee, your solution solved the error!!!