Closed PanXiebit closed 9 months ago
"Reset by peer" means that the port you are trying to connect to isn't reachable between the machines. Ensure you've opened the port properly and that the machines can ping each other.
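Note that ping only exercises ICMP, so a successful ping does not show that the rendezvous TCP port is reachable. Below is a minimal sketch of a TCP-level check that could be run from a worker node. The function name `can_connect` is an illustrative assumption, not part of accelerate, and the host and port should be the job's MASTER_ADDR and MASTER_PORT.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A False result is only meaningful while some process is actually listening on that port on the main node, for example right after the launch command has been started there.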
@muellerzr thanks, I am sure that the machines can ping each other, and the port is open.
If I downgrade the accelerate package to version 0.19.0 without making any other changes, the multi-node training can run smoothly.
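Since the behavior depends on the installed version, it may help to confirm that every node runs the same package versions. Here is a minimal sketch using only the standard library. The helper name `installed_versions` and the choice of packages to inspect are assumptions made for illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print the versions so the output can be compared across nodes.
print(installed_versions(["accelerate", "transformers", "torch"]))
```

Running this on each machine and comparing the printed dictionaries is one quick way to rule out a version mismatch between nodes.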
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
MASTER_ADDR=xx.xx.xx.xx
MASTER_PORT=11332
JOB_ID=228
NNODES=2
GPUS_PER_NODE=8
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--num_machines $NNODES --num_processes $WORLD_SIZE --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT"
MODEL_DIR=stable-diffusion-v1-5
OUTPUT_DIR=""
accelerate launch --config_file "text_to_image/default_config.yaml" $DISTRIBUTED_ARGS --gpu_ids='all' --mixed_precision="bf16" \
  text_to_image/train_lcm_lora.py \
  --pretrained_teacher_model=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision=bf16 \
  --resolution=512 \
  --learning_rate=1e-6 --loss_type="huber" --adam_weight_decay=0.0 \
  --max_train_steps=1000000 \
  --max_train_samples=4000000 \
  --dataloader_num_workers=8 \
  --validation_steps=200 \
  --checkpointing_steps=1000 --checkpoints_total_limit=10 \
  --train_batch_size=12 \
  --gradient_checkpointing --enable_xformers_memory_efficient_attention \
  --gradient_accumulation_steps=1 \
  --resume_from_checkpoint=latest \
  --report_to=tensorboard \
  --seed=453645634
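The launch command above passes the same arguments on every node and does not include `--machine_rank`. Whether the rank is supplied through default_config.yaml is not shown in the report. As a hedged sketch, assuming the rank has to be given per node on the command line, the arguments could be derived like this. The helper `build_launch_args` is hypothetical and the IP address is a placeholder.

```python
def build_launch_args(nnodes, gpus_per_node, main_ip, main_port, machine_rank):
    """Build accelerate launch flags for one node of a multi-node job."""
    if not 0 <= machine_rank < nnodes:
        raise ValueError("machine_rank must be in [0, nnodes)")
    world_size = nnodes * gpus_per_node  # total processes across all nodes
    return [
        "--num_machines", str(nnodes),
        "--num_processes", str(world_size),
        "--main_process_ip", main_ip,
        "--main_process_port", str(main_port),
        "--machine_rank", str(machine_rank),
    ]


# Example with the values from the report; the IP is a placeholder.
print(" ".join(build_launch_args(2, 8, "xx.xx.xx.xx", 11332, machine_rank=0)))
```

With 2 nodes of 8 GPUs each, `--num_processes` is 16 on both machines, while `--machine_rank` would be 0 on the main node and 1 on the other.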
Expected behavior
When NNODES>=2, the program does not start running after the script is executed, which suggests that the machines are not communicating with each other.
When NNODES=1, the program runs normally.
With accelerate==0.19.0 and transformers==4.26.1, the program runs normally, including on multiple nodes (NNODES>=2).