sconlyshootery / FeatDepth

This is the official code for the method described in "Feature-metric Loss for Self-supervised Learning of Depth and Egomotion".

Runtime Data Parallel Error #70

Closed: jiadingfang closed this issue 2 years ago

jiadingfang commented 3 years ago

Similar to https://github.com/sconlyshootery/FeatDepth/issues/16#issue-747520482, I also ran into trouble with distributed training. The command I ran:

python -m torch.distributed.launch --master_port=9900 --nproc_per_node=1 train.py --config config/cfg_kitti_fm.py --work_dir /home/fjd/Data/logs

The error report:

/home/fjd/miniconda3/envs/featdepth/lib/python3.7/site-packages/torch/nn/functional.py:4004: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
  "Default grid_sample and affine_grid behavior has changed "
Traceback (most recent call last):
  File "train.py", line 103, in <module>
    main()
  File "train.py", line 99, in main
    logger=logger)
  File "/home/fjd/FeatDepth/mono/apis/trainer.py", line 68, in train_mono
    _dist_train(model, dataset_train, dataset_val, cfg, validate=validate)
  File "/home/fjd/FeatDepth/mono/apis/trainer.py", line 177, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/fjd/miniconda3/envs/featdepth/lib/python3.7/site-packages/mmcv/runner/runner.py", line 380, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/fjd/miniconda3/envs/featdepth/lib/python3.7/site-packages/mmcv/runner/runner.py", line 278, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/fjd/FeatDepth/mono/apis/trainer.py", line 29, in batch_processor
    model_out, losses = model(data)
  File "/home/fjd/miniconda3/envs/featdepth/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fjd/miniconda3/envs/featdepth/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 159 160 265 266
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 21052) of binary: /home/fjd/miniconda3/envs/featdepth/bin/python
Traceback (most recent call last):
  File "/home/fjd/miniconda3/envs/featdepth/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/fjd/miniconda3/envs/featdepth/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/fjd/miniconda3/envs/featdepth/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/fjd/miniconda3/envs/featdepth/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/fjd/miniconda3/envs/featdepth/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/fjd/miniconda3/envs/featdepth/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/fjd/miniconda3/envs/featdepth/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/fjd/miniconda3/envs/featdepth/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-11-15_21:19:29
  host      : jd01
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 21052)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

The config I use:

POSE_LAYERS = 18  # resnet18
FRAME_IDS = [0, -1, 1, 's']  # 0 refers to the current frame, -1 and 1 to temporally adjacent frames, 's' to the stereo-adjacent frame
IMGS_PER_GPU = 2  # the number of images fed to each GPU
HEIGHT = 320  # input image height
WIDTH = 1024  # input image width

data = dict(
    name = 'kitti',  # dataset name
    split = 'exp',  # training split name
    height = HEIGHT,
    width = WIDTH,
    frame_ids = FRAME_IDS,
    in_path = '/home/fjd/Data/kitti_data',  # path to raw data
    gt_depth_path = '/home/fjd/Data/kitti_data/gt_depths.npz',  # path to gt data
    png = False,  # image format
    stereo_scale = True if 's' in FRAME_IDS else False,
)

model = dict(
    name = 'mono_fm',  # select a model by name
    depth_num_layers = DEPTH_LAYERS,
    pose_num_layers = POSE_LAYERS,
    frame_ids = FRAME_IDS,
    imgs_per_gpu = IMGS_PER_GPU,
    height = HEIGHT,
    width = WIDTH,
    scales = [0, 1, 2, 3],  # output different scales of depth maps
    min_depth = 0.1,  # minimum of predicted depth value
    max_depth = 100.0,  # maximum of predicted depth value
    depth_pretrained_path = '/home/fjd/Data/weights/resnet{}.pth'.format(DEPTH_LAYERS),  # pretrained weights for resnet
    pose_pretrained_path = '/home/fjd/Data/weights/resnet{}.pth'.format(POSE_LAYERS),  # pretrained weights for resnet
    extractor_pretrained_path = '/home/fjd/Data/autoencoder.pth',  # pretrained weights for autoencoder
    automask = False if 's' in FRAME_IDS else True,
    disp_norm = False if 's' in FRAME_IDS else True,
    perception_weight = 1e-3,
    smoothness_weight = 1e-3,
)

# resume_from = '/node01_data5/monodepth2-test/model/ms/ms.pth'  # directly start training from provided weights
resume_from = None
finetune = None
total_epochs = 40
imgs_per_gpu = IMGS_PER_GPU
learning_rate = 1e-4
workers_per_gpu = 4
validate = True

optimizer = dict(type='Adam', lr=learning_rate, weight_decay=0)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
    step=[20,30],
    gamma=0.5,
)

checkpoint_config = dict(interval=1)
log_config = dict(interval=50,
                  hooks=[dict(type='TextLoggerHook'),])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
workflow = [('train', 1)]

I tried it twice: first in a 4-GPU Docker environment, and then on a local machine with 2 GPUs.

I'm willing to provide any other details you need; I'd appreciate any help.

sconlyshootery commented 3 years ago

Did you try setting 'model = MMDistributedDataParallel(model.cuda(), find_unused_parameters=True)'?
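
A minimal sketch of what that change looks like in mono/apis/trainer.py (the surrounding code is paraphrased and may differ from the actual file; in the mmcv version used in this thread, MMDistributedDataParallel forwards keyword arguments to torch's DistributedDataParallel, as the later tracebacks show):

from mmcv.parallel import MMDistributedDataParallel

def _dist_train(model, dataset_train, dataset_val, cfg, validate=False):
    # ... dataset and dataloader setup unchanged ...
    # find_unused_parameters=True lets DDP tolerate parameters that did not
    # take part in producing the loss, which is exactly what the
    # "Expected to have finished reduction" error complains about.
    model = MMDistributedDataParallel(model.cuda(), find_unused_parameters=True)
    # ... build the runner and start training unchanged ...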

jiadingfang commented 3 years ago

Hi, it seems the job is running now, thanks! However, on a single 12 GB Titan Xp GPU with the default settings from "cfg_kitti_fm", it reports an ETA of 18 days. Is that normal?

jiadingfang commented 3 years ago

[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802492 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802456 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802516 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802517 milliseconds before timing out.
Traceback (most recent call last):
  File "train.py", line 103, in <module>
    main()
  File "train.py", line 93, in main
    train_mono(model,
  File "/home/FeatDepth/mono/apis/trainer.py", line 68, in train_mono
    _dist_train(model, dataset_train, dataset_val, cfg, validate=validate)
  File "/home/FeatDepth/mono/apis/trainer.py", line 151, in _dist_train
    model = MMDistributedDataParallel(model.cuda(), find_unused_parameters=True)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [64, 3, 7, 7] appears not to match sizes of the same param in process 0.
Traceback (most recent call last):
  File "train.py", line 103, in <module>
    main()
  File "train.py", line 93, in main
    train_mono(model,
  File "/home/FeatDepth/mono/apis/trainer.py", line 68, in train_mono
    _dist_train(model, dataset_train, dataset_val, cfg, validate=validate)
  File "/home/FeatDepth/mono/apis/trainer.py", line 151, in _dist_train
    model = MMDistributedDataParallel(model.cuda(), find_unused_parameters=True)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [64, 3, 7, 7] appears not to match sizes of the same param in process 0.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802517 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802516 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
Traceback (most recent call last):
  File "train.py", line 103, in <module>
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802492 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3652 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3654 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3655 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 3653) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
train.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-11-19_11:33:44
  host      : a206ffe42d57
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 3653)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3653
=====================================================

On my 4-GPU server, the above happened. Do you, by any chance, have an idea how to solve it?

sconlyshootery commented 3 years ago

Hi, it seems the job is running now, thanks! However, on a single 12 GB Titan Xp GPU with the default settings from "cfg_kitti_fm", it reports an ETA of 18 days. Is that normal?

No, you may increase your batch size.
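
For scale, the per-GPU batch size is the IMGS_PER_GPU value in the config quoted above; the value below is only an illustration, and how high it can go depends on the GPU's memory:

# config/cfg_kitti_fm.py -- a larger per-GPU batch shortens the reported ETA
IMGS_PER_GPU = 4  # was 2 above; raise it until you run out of GPU memory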

sconlyshootery commented 3 years ago

On my 4-GPU server, the above happened. Do you, by any chance, have an idea how to solve it?

Sorry, I don't know how to solve it. It seems your server cannot run distributed training.
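
For anyone debugging a similar failure, a quick way to check whether multi-GPU training works on the machine at all, independently of FeatDepth, is a minimal NCCL smoke test along these lines (ddp_check.py is just an example name; it uses the same torch.distributed.launch entry point as the training command above):

# ddp_check.py -- run with:
#   python -m torch.distributed.launch --nproc_per_node=4 ddp_check.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl')  # reads MASTER_ADDR/PORT etc. from the environment

# Broadcast a tensor from rank 0 to every rank. If this hangs until the NCCL
# watchdog fires, the problem is in the NCCL / driver setup rather than in FeatDepth.
t = torch.full((1,), float(dist.get_rank()), device='cuda')
dist.broadcast(t, src=0)
print('rank {}: got {} from rank 0'.format(dist.get_rank(), t.item()))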