marcoamonteiro / pi-GAN


RuntimeError: [/opt/conda/conda-bld/pytorch_1603729006826/work/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete #7

Closed. Ree1s closed this issue 3 years ago.

Ree1s commented 3 years ago

Hi, thanks for sharing your work! I ran into this error when training pi-GAN on CelebA, and I don't know how to solve it. It was run on a single V100-32GB GPU with PyTorch 1.7.0 and CUDA 10.1.

Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/mnt/lustre/gaosicheng/anaconda3/envs/pigan/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/mnt/lustre/gaosicheng/codes/pi-GAN-master/train.py", line 249, in train
    scaler.scale(d_loss).backward()
  File "/mnt/lustre/gaosicheng/anaconda3/envs/pigan/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/lustre/gaosicheng/anaconda3/envs/pigan/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: [/opt/conda/conda-bld/pytorch_1603729006826/work/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete
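For context on the number in that message: 1800000 ms is PyTorch's default 30-minute process-group timeout, and the "Process 1" in the spawn traceback indicates a second worker was launched, so one rank waited half an hour for a peer that stopped responding. A rough, hypothetical sketch of where that value comes from (the address, port, and arguments below are illustrative, not the repo's exact setup):

    import datetime
    import torch.distributed as dist

    def setup_process_group(rank, world_size):
        # Gloo's default timeout is 30 minutes = 1800000 ms, the number quoted
        # in the RuntimeError above. Raising it only delays the error; it does
        # not fix whatever made the other rank stop responding (e.g. a crashed
        # or stalled worker).
        dist.init_process_group(
            backend="gloo",
            init_method="tcp://127.0.0.1:12355",     # placeholder address/port
            rank=rank,
            world_size=world_size,
            timeout=datetime.timedelta(minutes=30),  # source of the 1800000 ms
        )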

marcoamonteiro commented 3 years ago

Hi there, can you try halving the batch size or doubling the batch split, and check if you get the same error?
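For concreteness, a minimal sketch of the kind of change being suggested, assuming a curriculum stage shaped like the per-stage dicts in curriculums.py (the key names follow the maintainer's wording; the starting values are placeholders, not the repo's defaults):

    # Hypothetical stage settings: halve 'batch_size' and/or double 'batch_split'.
    stage = {'batch_size': 28, 'batch_split': 2, 'img_size': 64}

    stage['batch_size'] //= 2   # halve the batch size: 28 -> 14
    stage['batch_split'] *= 2   # and/or double the batch split: 2 -> 4

    print(stage)  # {'batch_size': 14, 'batch_split': 4, 'img_size': 64}

As I understand it, a larger batch split processes each batch in smaller chunks, so either change lowers peak GPU memory per step.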

marcoamonteiro commented 3 years ago

Closing due to inactivity. Feel free to reopen if you are still having this issue.