Open qinchangchang opened 3 weeks ago
Can you show your accelerate config?
accelerate default.yml:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 29401
Can you change num_processes: 2 to num_processes: 1?
I tried it, but it doesn't work.
Then maybe it's the port or some other configuration problem, because I can run it just fine.
Ok fine, thank you.
Describe the bug
ConnectionError: Tried to launch distributed communication on port 29401, but another process is utilizing it. Please specify a different port (such as using the --main_process_port flag or specifying a different main_process_port in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to 0.
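The error above says the port is already taken, so one quick check is whether anything is bound to 29401 before launching. The following is a minimal sketch using only Python's standard library. The helper names `port_in_use` and `find_free_port` are hypothetical and not part of accelerate, and the check assumes the conflict is on the loopback interface (127.0.0.1), which may not match the interface the launcher actually binds to.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding to (host, port) fails, i.e. the port looks taken.

    Hypothetical helper for illustration; checks only the given host.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
        return False


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    port = 29401  # main_process_port from the config above
    if port_in_use(port):
        print(f"Port {port} is busy; a free alternative: {find_free_port()}")
    else:
        print(f"Port {port} looks free on 127.0.0.1")
```

If the port turns out to be busy, the error message itself names the two options: pass a different value via the --main_process_port flag (or main_process_port in the config file), or set it to 0 so the next open port is picked automatically on a single node.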
Reproduction
export MODEL_NAME="CompVis/stable-diffusion-v1-4" \
export TRAIN_DATA_DIR="/home/qinchang/pro/qc/new_project/newConcept/data/poisoned_images" \
export OUTPUT_DIR="/home/qinchang/pro/qc/new_project/newConcept/model/model_first" \
CUDA_VISIBLE_DEVICES=1 accelerate launch --config_file="/home/qinchang/.cache/huggingface/accelerate/default_config.yaml" train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DATA_DIR --caption_column="additional_feature" \
  --resolution=512 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir=$OUTPUT_DIR \
  --validation_prompt=None --report_to="wandb"
Logs
System Info
Diffusers version: 0.31.0
Who can help?
No response