Open · jyy-1998 opened this issue 2 weeks ago
Could you enable verbose logging with Accelerate (ref) and paste the logs? As posted, this doesn't contain any information that would help identify what the issue might be.
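As a minimal sketch of what I mean (this assumes your script uses accelerate.logging.get_logger, like the diffusers example scripts do; setting ACCELERATE_LOG_LEVEL=DEBUG in the environment before launching has the same effect):

```python
# Sketch: turn on verbose per-process logging with Accelerate.
import logging
from accelerate.logging import get_logger

logging.basicConfig(level=logging.DEBUG)  # root Python logging for all libraries

# get_logger wraps a standard logger; log_level="DEBUG" (or ACCELERATE_LOG_LEVEL=DEBUG
# in the environment) raises its verbosity.
logger = get_logger(__name__, log_level="DEBUG")
logger.debug("debug logging enabled", main_process_only=False)  # log from every rank
```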
Maybe it's because you don't have enough CPU memory.
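Since your config offloads the optimizer and parameters to CPU, you could check this with a rough sketch like the one below (log_host_memory is just a hypothetical helper for illustration; assumes psutil is installed, and you would call it every few steps from the training loop):

```python
# Rough sketch: log host (CPU) memory so you can see whether RAM runs out
# around the step where the crash happens. Assumes psutil is installed.
import os
import psutil

def log_host_memory(step: int) -> None:
    vm = psutil.virtual_memory()
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(
        f"step {step}: process RSS {rss_gb:.1f} GB, "
        f"system {vm.used / 1e9:.1f}/{vm.total / 1e9:.1f} GB used ({vm.percent:.0f}%)"
    )
```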
I have investigated this before and can confirm it works. See: https://github.com/huggingface/diffusers/issues/9278#issuecomment-2410113103
Can you try #9829? I was able to save memory by implementing it :)
Describe the bug
I tried to use Accelerate + DeepSpeed to train FLUX, but every time, after a dozen or so steps, an error occurs and the program crashes. Can anyone provide some help?
Reproduction
accelerate launch --config_file config.yaml train_flux.py \
  --pretrained_model_name_or_path="./FLUX.1-dev" \
  --resolution=1024 \
  --train_batch_size=1 \
  --output_dir="output0" \
  --num_train_epochs=10 \
  --checkpointing_steps=5000 \
  --validation_steps=100 \
  --max_train_steps=40001 \
  --learning_rate=4e-05 \
  --seed=12345 \
  --mixed_precision="fp16" \
  --revision="fp16" \
  --use_8bit_adam \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing
config.yaml:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: 0,1
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Logs
System Info
deepspeed==0.14.4
accelerate==0.33.0
transformers==4.41.2
Who can help?
No response