RuntimeError: [/opt/conda/conda-bld/pytorch_1603729006826/work/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete #7
Hi,
Thanks for sharing your work!
I hit this error when training pi-GAN on CelebA, and I don't know how to solve it. It was run on one V100-32GB GPU with PyTorch 1.7.0 and CUDA 10.1.
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/mnt/lustre/gaosicheng/anaconda3/envs/pigan/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/mnt/lustre/gaosicheng/codes/pi-GAN-master/train.py", line 249, in train
scaler.scale(d_loss).backward()
File "/mnt/lustre/gaosicheng/anaconda3/envs/pigan/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/lustre/gaosicheng/anaconda3/envs/pigan/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in
allow_unreachable=True) # allow_unreachable flag
RuntimeError: [/opt/conda/conda-bld/pytorch_1603729006826/work/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete
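For reference, the 1800000 ms in the message works out to 30 minutes, which matches the default `timeout` of `torch.distributed.init_process_group`. Below is a minimal sketch of that arithmetic, plus how an explicit timeout could be passed for the gloo backend. The helper names, the address and the port are assumptions for illustration only, not taken from pi-GAN's `train.py`.

```python
from datetime import timedelta

# Value copied from the error message above.
GLOO_TIMEOUT_MS = 1_800_000


def timeout_from_ms(ms: int) -> timedelta:
    """Convert a millisecond count into a timedelta."""
    return timedelta(milliseconds=ms)


def init_with_timeout(rank: int, world_size: int, timeout: timedelta) -> None:
    """Hypothetical helper: start a gloo process group with an explicit timeout."""
    # Imported lazily so the arithmetic above runs without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",  # assumed address and port
        rank=rank,
        world_size=world_size,
        timeout=timeout,
    )


print(timeout_from_ms(GLOO_TIMEOUT_MS))  # 0:30:00
```

A longer timeout would only delay the error if one process is genuinely stuck, so this is a way to inspect the setting rather than a confirmed fix.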