jiadingfang closed this issue 2 years ago
Did you try setting 'model = MMDistributedDataParallel(model.cuda(), find_unused_parameters=True)'?
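For reference, this corresponds to the wrapping step in mono/apis/trainer.py (_dist_train). A minimal sketch, assuming MMDistributedDataParallel is imported from mmcv as in this repo:

```python
# Sketch of the suggested wrapping in _dist_train (mono/apis/trainer.py),
# assuming the mmcv import path used by this repo.
from mmcv.parallel import MMDistributedDataParallel

# find_unused_parameters=True tells DDP to tolerate parameters that receive
# no gradient in a given forward pass, which otherwise raises an error when
# only part of the network is exercised in an iteration.
model = MMDistributedDataParallel(model.cuda(), find_unused_parameters=True)
```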
Hi, it seems like the job is running now, thanks! However, on one 12 GB Titan Xp GPU with the default settings from "cfg_kitti_fm", it says ETA: 18 days. Is that normal?
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802492 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802456 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802516 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802517 milliseconds before timing out.
Traceback (most recent call last):
File "train.py", line 103, in <module>
main()
File "train.py", line 93, in main
train_mono(model,
File "/home/FeatDepth/mono/apis/trainer.py", line 68, in train_mono
_dist_train(model, dataset_train, dataset_val, cfg, validate=validate)
File "/home/FeatDepth/mono/apis/trainer.py", line 151, in _dist_train
model = MMDistributedDataParallel(model.cuda(), find_unused_parameters=True)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 578, in __init__
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [64, 3, 7, 7] appears not to match sizes of the same param in process 0.
Traceback (most recent call last):
File "train.py", line 103, in <module>
main()
File "train.py", line 93, in main
train_mono(model,
File "/home/FeatDepth/mono/apis/trainer.py", line 68, in train_mono
_dist_train(model, dataset_train, dataset_val, cfg, validate=validate)
File "/home/FeatDepth/mono/apis/trainer.py", line 151, in _dist_train
model = MMDistributedDataParallel(model.cuda(), find_unused_parameters=True)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 578, in __init__
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [64, 3, 7, 7] appears not to match sizes of the same param in process 0.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802517 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802516 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
Traceback (most recent call last):
File "train.py", line 103, in <module>
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802492 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3652 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3654 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3655 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 3653) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2021-11-19_11:33:44
host : a206ffe42d57
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 3653)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3653
=====================================================
On my 4-GPU server, the above happened. Do you, by any chance, have an idea how to solve it?
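One way to narrow this down, since the RuntimeError says the very first parameter ([64, 3, 7, 7], i.e. a ResNet conv1 weight) differs between ranks, is to confirm that every rank builds an identical model before the DDP wrap. A small diagnostic sketch (not part of FeatDepth) that could be dropped in just before the MMDistributedDataParallel call:

```python
# Hypothetical diagnostic: print parameter shapes on every rank before wrapping
# the model in MMDistributedDataParallel, to spot the rank that builds a
# different backbone (e.g. because a config or checkpoint resolved differently).
import torch.distributed as dist

rank = dist.get_rank() if dist.is_initialized() else 0
shapes = [tuple(p.shape) for p in model.parameters()]
print(f"[rank {rank}] {len(shapes)} params, first three: {shapes[:3]}", flush=True)
```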
However, on one 12 GB Titan Xp GPU with the default settings from "cfg_kitti_fm", it says ETA: 18 days. Is that normal?
No, that is not normal; you may increase your batch size.
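If it helps, the place to change is the batch-size field in config/cfg_kitti_fm.py. A hedged sketch only; the field names below are an assumption and should be matched to whatever the config actually uses:

```python
# Illustrative only: raise the per-GPU batch size in cfg_kitti_fm.py until the
# 12 GB card is nearly full. The names imgs_per_gpu / workers_per_gpu are an
# assumption -- copy the exact field names from your config file.
imgs_per_gpu = 4      # e.g. doubled; fewer iterations per epoch, shorter ETA
workers_per_gpu = 4   # dataloader worker processes per GPU
```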
On my 4-GPU server, the above happened. Do you, by any chance, have an idea how to solve it?
Sorry, I don't know how to solve it. It seems your server cannot run distributed training.
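One way to test that hypothesis independently of FeatDepth is a minimal NCCL sanity check. The script below is a sketch (the filename nccl_check.py is made up), launched the same way as train.py; setting NCCL_DEBUG=INFO in the environment additionally makes NCCL report why a collective hangs, which may explain the broadcast timeouts in the log above.

```python
# nccl_check.py -- standalone sanity check, not part of FeatDepth.
# Run with:  python -m torch.distributed.launch --nproc_per_node=4 nccl_check.py
# (use --nproc_per_node=2 on the 2-GPU machine). If this hangs or crashes, the
# problem is the machine/driver/NCCL setup rather than FeatDepth itself.
import os
import torch
import torch.distributed as dist


def main():
    # torch.distributed.launch (via the elastic runner) exports LOCAL_RANK.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Each rank contributes its rank id; after all_reduce every rank should
    # see the sum 0 + 1 + ... + (world_size - 1).
    x = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}", flush=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```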
Similar to https://github.com/sconlyshootery/FeatDepth/issues/16#issue-747520482, I also ran into trouble with distributed training. The command I ran:
The error report:
The config I use:
I tried it twice: first in a 4-GPU Docker environment, and a second time on a local machine with 2 GPUs.
I'm willing to provide any other details if you need them; I'd appreciate it if you could help.