openai / guided-diffusion

MIT License

Training stuck indefinitely #108

Closed. saswat0 closed this issue 1 year ago.

saswat0 commented 1 year ago

I'm trying to train the diffusion model from scratch on a custom dataset (FFHQ), but the training process hangs indefinitely. Here's the script I'm using to launch the job.

export PYTHONPATH=.:$PYTHONPATH

MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 3"
DIFFUSION_FLAGS="--diffusion_steps 1000000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 64"

mpiexec -n 2 -verbose python scripts/image_train.py --data_dir ./data/padding_025 $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

There are no logs displayed either.
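
In case it helps with debugging: a silent hang like this can usually be localized by rerunning with NCCL and torch.distributed debug logging enabled (NCCL_DEBUG, NCCL_DEBUG_SUBSYS and TORCH_DISTRIBUTED_DEBUG are standard NCCL/PyTorch environment variables, not flags of this repo). A minimal sketch, reusing the flag variables from the script above:

export NCCL_DEBUG=INFO                 # print NCCL init/transport decisions per rank
export NCCL_DEBUG_SUBSYS=INIT,NET      # limit the output to init and networking
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # verbose torch.distributed diagnostics

mpiexec -n 2 python scripts/image_train.py --data_dir ./data/padding_025 $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS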

saswat0 commented 1 year ago

This fixed it

export NCCL_P2P_DISABLE=1
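
For context: NCCL_P2P_DISABLE=1 turns off NCCL's direct GPU-to-GPU (peer-to-peer) transport, so inter-GPU traffic falls back to going through host memory. That works around hangs on machines where the P2P path between the GPUs is broken (nvidia-smi topo -m shows how the GPUs are connected). A minimal sketch of applying it to the launch above:

# Disable the GPU peer-to-peer transport before launching the same job;
# NCCL then falls back to copies through host memory.
export NCCL_P2P_DISABLE=1
mpiexec -n 2 python scripts/image_train.py --data_dir ./data/padding_025 $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS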
ljw919 commented 4 months ago

This fixed it

export NCCL_P2P_DISABLE=1

It still shows this message; does anyone know why? 'The client socket has failed to connect to [::ffff:0.0.15.250]:49011 (errno: 110 - Connection timed out)'
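
0.0.15.250 doesn't look like a valid host address, which suggests the rendezvous address the ranks exchange is resolving to something wrong, rather than the port being blocked. A hedged checklist (eth0 is a placeholder; substitute the interface shown by ip addr):

# What does the local hostname resolve to? A bogus answer here usually
# points at /etc/hosts or DNS, since many launchers derive MASTER_ADDR from it.
python -c "import socket; print(socket.gethostbyname(socket.getfqdn()))"

# Pin the distributed backends to the interface that actually connects the ranks.
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0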

mlkorra commented 3 months ago

@ljw919 Were you able to solve the timeout issue?

torch.distributed.DistStoreError: Socket Timeout
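
A hedged note on that error: DistStoreError: Socket Timeout during init generally means some rank never managed to reach the rank-0 TCP store within the rendezvous timeout, so it is usually the same reachability problem as the connection-timed-out message above. One quick check (address and port below are placeholders; use the ones printed in your own log):

# From the machine of the failing rank, verify the store address/port is reachable.
nc -zv 10.0.0.1 29500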