va1shn9v / PromptIR

PromptIR: Prompting for All-in-One Blind Image Restoration [NeurIPS 2023]
https://arxiv.org/abs/2306.13090

Problems with connection timeouts #18

Closed Caofan0 closed 6 months ago

Caofan0 commented 12 months ago

I ran into the following problem during training. I am training on a single machine with one GPU. Can you help me with it? I changed these two arguments in the code:

parser.add_argument('--cuda', type=int, default=0)
parser.add_argument("--num_gpus", type=int, default=1, help="Number of GPUs to use for training")

Here are the results.

File "E:\python\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
File "E:\python\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 938, in _run
    self.strategy.setup_environment()
File "E:\python\lib\site-packages\lightning\pytorch\strategies\ddp.py", line 143, in setup_environment
    self.setup_distributed()
File "E:\python\lib\site-packages\lightning\pytorch\strategies\ddp.py", line 191, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
File "E:\python\lib\site-packages\lightning\fabric\utilities\distributed.py", line 258, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
File "E:\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
File "E:\python\lib\site-packages\torch\distributed\rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
File "E:\python\lib\site-packages\torch\distributed\rendezvous.py", line 177, in _create_c10d_store
    return TCPStore(
TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 51823)
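
The traceback goes through Lightning's `ddp.py`, so the distributed (DDP) strategy is being set up even though only one GPU is requested. A possible workaround, sketched below, is to fall back to Lightning's default single-device strategy when `--num_gpus` is 1. This is only a sketch under assumptions: the helper `trainer_kwargs` is hypothetical and not part of the repository, it assumes the training script currently passes a DDP strategy to the `Trainer` unconditionally, and it has not been confirmed to resolve this timeout.

```python
import argparse


def build_parser():
    # The two arguments quoted in the post above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--cuda", type=int, default=0)
    parser.add_argument("--num_gpus", type=int, default=1,
                        help="Number of GPUs to use for training")
    return parser


def trainer_kwargs(num_gpus):
    """Hypothetical helper: choose Trainer settings from the GPU count.

    With a single GPU, "auto" lets Lightning select a single-device
    strategy, so no process group (and no TCP rendezvous) is created.
    """
    if num_gpus <= 1:
        return {"accelerator": "gpu", "devices": 1, "strategy": "auto"}
    # Multi-GPU case: keep a distributed strategy (assumed to be plain "ddp").
    return {"accelerator": "gpu", "devices": num_gpus, "strategy": "ddp"}


# Parse an empty argument list so the defaults above are used.
opt = build_parser().parse_args([])
kwargs = trainer_kwargs(opt.num_gpus)
print(kwargs)
```

The resulting dictionary would then be passed on as `Trainer(**kwargs, ...)` alongside the other arguments the training script already uses.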