saswat0 closed this issue 1 year ago
This fixed it
export NCCL_P2P_DISABLE=1
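For anyone trying this, a minimal sketch of applying the workaround in the same shell that launches training. `NCCL_DEBUG=INFO` is not from this thread; it is added here on the assumption that NCCL's own logging may help when nothing else is printed. The launch command is left as a placeholder because the original script is not shown.

```shell
# Workaround reported above: disable NCCL peer-to-peer transport.
export NCCL_P2P_DISABLE=1

# Assumption (not from the thread): ask NCCL to print its own debug output.
export NCCL_DEBUG=INFO

# Confirm the variables are visible to child processes before launching.
echo "NCCL_P2P_DISABLE=${NCCL_P2P_DISABLE} NCCL_DEBUG=${NCCL_DEBUG}"

# Then start the training job from this same shell, e.g.:
# <your usual launch command>
```

The variables only take effect if they are exported before the training processes start, so they need to be set in the launching shell or job script rather than afterwards.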
I still see this message. Does anyone know why? 'The client socket has failed to connect to [::ffff:0.0.15.250]:49011 (errno: 110 - Connection timed out)'
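One thing that may be worth checking, although it is not confirmed anywhere in this thread: the address `0.0.15.250` is not a normal host address. Read as a single 32-bit number it equals 4090, which could mean that a bare number (for example a hostname or master address consisting only of digits) was interpreted as an IP address. The sketch below only does that arithmetic; the variable names are illustrative.

```shell
# Address taken from the error message above.
addr="0.0.15.250"

# Split the dotted quad into its four octets.
IFS=. read -r a b c d <<EOF
$addr
EOF

# Recombine the octets into the single integer they encode.
as_int=$(( (a << 24) | (b << 16) | (c << 8) | d ))
echo "$addr as an integer: $as_int"
```

If the result matches something like the machine's hostname or the configured master address, name resolution is a plausible place to look; otherwise this reading can be ruled out.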
@ljw919 Were you able to solve the time out issue?
torch.distributed.DistStoreError: Socket Timeout
I'm trying to train the diffuser from scratch using a custom dataset (FFHQ), but the process gets stuck indefinitely. Here's the script that I'm using to run the job.
There are no logs displayed either.