ConnectionError: Tried to launch distributed communication on port 29401, but another process is utilizing it. Please specify a different port (such as using the --main_process_port flag or specifying a different main_process_port in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to 0.

qinchangchang commented 3 weeks ago

Describe the bug

Reproduction

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export TRAIN_DATA_DIR="/home/qinchang/pro/qc/new_project/newConcept/data/poisoned_images"
export OUTPUT_DIR="/home/qinchang/pro/qc/new_project/newConcept/model/model_first"

CUDA_VISIBLE_DEVICES=1 accelerate launch --config_file="/home/qinchang/.cache/huggingface/accelerate/default_config.yaml" train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DATA_DIR --caption_column="additional_feature" \
  --resolution=512 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir=$OUTPUT_DIR \
  --validation_prompt=None --report_to="wandb"

Logs

Traceback (most recent call last):
  File "/home/qinchang/miniconda3/envs/diagnosis_new/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/qinchang/miniconda3/envs/diagnosis_new/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/qinchang/miniconda3/envs/diagnosis_new/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/home/qinchang/miniconda3/envs/diagnosis_new/lib/python3.9/site-packages/accelerate/commands/launch.py", line 771, in multi_gpu_launcher
    current_env = prepare_multi_gpu_env(args)
  File "/home/qinchang/miniconda3/envs/diagnosis_new/lib/python3.9/site-packages/accelerate/utils/launch.py", line 212, in prepare_multi_gpu_env
    raise ConnectionError(
ConnectionError: Tried to launch distributed communication on port `29400`, but another process is utilizing it. Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.
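
The message above names two remedies: choose a different port, or pass `0` so that an open port is picked automatically. The following is a minimal Python sketch of both ideas using only the standard library. The helper names `port_is_free` and `find_free_port` are hypothetical, and this is not Accelerate's own check; it only illustrates how a bind attempt can reveal whether a port is already taken on the local machine.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another socket already holds the port.
            return False
        return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    # Ports taken from the report: 29400 (log) and 29401 (title and config).
    for candidate in (29400, 29401):
        state = "free" if port_is_free(candidate) else "in use"
        print(f"port {candidate}: {state}")
    print("a currently free port:", find_free_port())
```

A port reported as free here could then be passed through the `--main_process_port` flag mentioned in the error message. The result is only a snapshot: another process may claim the port between the check and the launch.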

System Info

Diffusers version: 0.31.0

Platform: Linux-6.8.0-48-generic-x86_64-with-glibc2.39
Running on Google Colab?: No
Python version: 3.9.18
PyTorch version (GPU?): 2.5.1+cu124 (True)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Huggingface_hub version: 0.26.2
Transformers version: 4.46.1
Accelerate version: 1.1.0
PEFT version: not installed
Bitsandbytes version: not installed
Safetensors version: 0.4.5
xFormers version: not installed
Accelerator: NVIDIA GeForce RTX 3090 Ti, 24564 MiB
             NVIDIA GeForce RTX 3090 Ti, 24564 MiB
Using GPU in script?: yes
Using distributed or parallel set-up in script?: yes, one machine with two GPUs

Who can help?

No response

sayakpaul commented 3 weeks ago

Can you show your accelerate config?

qinchangchang commented 3 weeks ago

> Can you show your accelerate config?

accelerate default.yml:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 29401
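
One thing that may be worth checking in a setup like this: the reproduction command sets `CUDA_VISIBLE_DEVICES=1` (a single visible GPU), while the config requests `num_processes: 2` with `distributed_type: MULTI_GPU`. The sketch below is a hypothetical pre-launch sanity check, not part of Accelerate; `visible_gpu_count` and `check_launch` are illustrative names, and the config is passed as a plain dict to avoid assuming a YAML parser. Whether this mismatch is related to the port error is not established in this thread.

```python
from typing import List, Optional


def visible_gpu_count(cuda_visible_devices: Optional[str]) -> Optional[int]:
    """Count entries in a CUDA_VISIBLE_DEVICES-style string.

    None means the variable is unset, i.e. no restriction is known.
    """
    if cuda_visible_devices is None:
        return None
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def check_launch(config: dict, cuda_visible_devices: Optional[str]) -> List[str]:
    """Return human-readable warnings about inconsistent launch settings."""
    warnings = []
    n_proc = int(config.get("num_processes", 1))
    n_gpu = visible_gpu_count(cuda_visible_devices)

    if n_gpu is not None and n_proc > n_gpu:
        warnings.append(
            f"num_processes={n_proc} but only {n_gpu} GPU(s) are visible"
        )
    if config.get("distributed_type") == "MULTI_GPU" and n_proc < 2:
        warnings.append(
            "distributed_type is MULTI_GPU but num_processes is less than 2"
        )
    return warnings


if __name__ == "__main__":
    # Values copied from the config and command in this report.
    config = {
        "distributed_type": "MULTI_GPU",
        "num_processes": 2,
        "main_process_port": 29401,
    }
    for message in check_launch(config, "1"):
        print("warning:", message)
```

With the values from this report the check flags the process count against the single visible GPU; with `CUDA_VISIBLE_DEVICES=0,1` it reports nothing.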

sayakpaul commented 3 weeks ago

Can you modify num_processes: 2 to num_processes: 1?

qinchangchang commented 3 weeks ago

> Can you modify num_processes: 2 to num_processes: 1?

I tried it, but it doesn't work.

sayakpaul commented 3 weeks ago

Then maybe it's the port or some other configuration problem, because I can run it just fine.

qinchangchang commented 3 weeks ago

Ok fine, thank you.

huggingface / diffusers
