huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

FLUX dreambooth train on multigpu with deepspeed #9484

Open zhangvia opened 1 week ago

zhangvia commented 1 week ago

Describe the bug

I'm using train_dreambooth_flux.py to fine-tune FLUX. I get OOM on 4x A100 80 GB with DeepSpeed stage 2, gradient checkpointing, bf16 mixed precision, 1024px × 1024px input, the Adafactor optimizer, and batch size 1. It only runs with DeepSpeed stage 3, but that is too slow, about 16 s/it.

Reproduction

Just use train_dreambooth_flux.py from the repo; a representative launch is sketched below.
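
For reference, a launch along these lines matches the reported setup. This is only a sketch: the model id, data path, prompt, and hyperparameter values are placeholders, and the exact flags (including how Adafactor is selected) should be checked against the script's --help.

accelerate launch --config_file accelerate_config.yaml train_dreambooth_flux.py \
  --pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev \
  --instance_data_dir=./instance_images \
  --instance_prompt="a photo of sks dog" \
  --output_dir=./flux-dreambooth \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --mixed_precision=bf16 \
  --learning_rate=1e-5 \
  --max_train_steps=1000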

Logs

No response

System Info

Who can help?

@linoytsaban

kopyl commented 6 days ago

Were you able to finish the training?

Try different accelerate configs. I use 2x H100s with 95 GB VRAM each, with this config:

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
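
For context: FULL_SHARD shards parameters, gradients, and optimizer state across ranks (the FSDP counterpart of DeepSpeed ZeRO stage 3), and fsdp_offload_params: true additionally offloads them to CPU, trading step time for memory.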

But for your GPU setup it might give you CUDA OOM errors.

zhangvia commented 5 days ago

Were you able to finish the training?

I tried DeepSpeed stage 3 and can finish the training, but the speed is too slow. If I reduce the input resolution, I can train with DeepSpeed stage 2.

kopyl commented 4 days ago

@zhangvia could you please share your accelerate config?

zhangvia commented 4 days ago

Nothing special in the accelerate config, just DeepSpeed enabled. The DeepSpeed JSON looks like this:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 3,
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true,
        "round_robin_gradients": true
    },

    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "steps_per_print": 2000,
    "train_batch_size": 16,
    "train_micro_batch_size_per_gpu": 4,
    "wall_clock_breakdown": false
}
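
For reference, DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu × number of GPUs × gradient_accumulation_steps, which holds here: 4 × 4 × 1 = 16, matching num_processes: 4 in the accelerate config below.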

Besides, precompute the image latents and text embeddings so you can offload the VAE, CLIP, and T5.
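
A minimal sketch of that precompute idea (not the actual training script; the model id, prompt, file name, and dummy batch are placeholders): encode prompts and images once with the pipeline's encoders, cache the results to disk, then free the encoders so only the transformer needs GPU memory during training.

import torch
from diffusers import FluxPipeline

# Load every component except the transformer; only the encoders are needed here.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=None, torch_dtype=torch.bfloat16
).to("cuda")

# Dummy batch; in practice this comes from your dataloader, normalized to [-1, 1].
pixel_values = torch.randn(1, 3, 1024, 1024, dtype=torch.bfloat16, device="cuda")

with torch.no_grad():
    # T5 sequence embeddings + CLIP pooled embedding, as FLUX consumes them.
    prompt_embeds, pooled_prompt_embeds, _ = pipe.encode_prompt(
        prompt="a photo of sks dog", prompt_2=None
    )
    # VAE latents, shifted and scaled the way train_dreambooth_flux.py expects.
    latents = pipe.vae.encode(pixel_values).latent_dist.sample()
    latents = (latents - pipe.vae.config.shift_factor) * pipe.vae.config.scaling_factor

torch.save(
    {
        "latents": latents.cpu(),
        "prompt_embeds": prompt_embeds.cpu(),
        "pooled_prompt_embeds": pooled_prompt_embeds.cpu(),
    },
    "cache.pt",
)

# Free the encoders before loading the transformer for training.
del pipe
torch.cuda.empty_cache()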

kopyl commented 4 days ago

@zhangvia thank you very much :)

Could you please share the entire accelerate config, not just the DeepSpeed part?

zhangvia commented 3 days ago

Like I said, nothing special:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: ./deepspeed.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 29501
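
To use it, point accelerate launch --config_file at this YAML; the deepspeed_config_file path inside resolves relative to the directory you launch from.
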
kopyl commented 2 days ago

@zhangvia thanks :)

kopyl commented 2 days ago

@zhangvia by the way, do you have an updated training script that pre-computes the image latents and text embeddings? A while ago I made one for SD 1.5 text-to-image training.

If you could share it, I'd really appreciate it :)

zhangvia commented 1 day ago

@zhangvia by the way, do you have an updated training script that pre-computes the image latents and text embeddings? A while ago I made one for SD 1.5 text-to-image training.

Sorry, I only tested the scripts and didn't actually implement the precompute code. But if you just want to fine-tune FLUX without making any changes to the FLUX model, I suggest using sd-scripts (kohya-ss/sd-scripts) to train it.