Open songzhuoyuan opened 4 months ago
I'm running into the same problem. Have you managed to solve it?
Not yet, it seems.
------------------ Original message ------------------ From: @.>; Sent: Thursday, September 19, 2024, 7:12 PM; To: @.>; Cc: @.>; @.>; Subject: Re: [rshaojimmy/MultiModal-DeepFake] Problem during training, sh train.sh (Issue #39)
Traceback (most recent call last):
  File "train.py", line 557, in <module>
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(args, config))
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/autodl-tmp/code/MultiModal-DeepFake-main/train.py", line 316, in main_worker
    init_dist(args)
  File "/root/autodl-tmp/code/MultiModal-DeepFake-main/tools/env.py", line 13, in init_dist
    _init_dist_pytorch(args)
  File "/root/autodl-tmp/code/MultiModal-DeepFake-main/tools/env.py", line 27, in _init_dist_pytorch
    dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 183, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/root/miniconda3/envs/DGM4/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
    return TCPStore(
RuntimeError: Stop_waiting response is expected
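
The traceback fails while `dist.init_process_group` is creating the `TCPStore`, that is, during the TCP rendezvous at the address given by `args.dist_url`. One thing that may be worth ruling out is that the port in `dist_url` is already taken by another process on the machine. This is an assumption, not something confirmed in this thread. Below is a minimal sketch, using only the Python standard library, that asks the OS for a free local port and builds a `tcp://` URL from it. `find_free_port` and `build_dist_url` are hypothetical helper names and are not part of the repository.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a TCP port on `host` that is unused right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


def build_dist_url(host: str = "127.0.0.1") -> str:
    """Build a tcp:// rendezvous URL for a single-node run (hypothetical helper)."""
    return f"tcp://{host}:{find_free_port(host)}"


if __name__ == "__main__":
    # Assumption: the printed value would be used as the dist_url / init_method.
    print(build_dist_url())
```

The resulting string could be used as the `init_method` value (here `args.dist_url`) before `mp.spawn` is called, so that all workers share the same URL. The port is released before training starts, so another process could still grab it in between, and whether this resolves the error reported here is untested.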