huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

FLUX dreambooth train on multigpu with deepspeed #9484

Open zhangvia opened 1 week ago

zhangvia commented 1 week ago

Describe the bug

I'm using train_dreambooth_flux.py to fine-tune FLUX. I get OOM on 4x A100 80 GB with DeepSpeed stage 2, gradient checkpointing, bf16 mixed precision, 1024px × 1024px input, the Adafactor optimizer, and batch size 1. It only runs with DeepSpeed stage 3, but that is too slow, about 16 s/it.

Reproduction

Just use train_dreambooth_flux.py from the repo; a representative launch is sketched below.
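
For reference, a launch along these lines matches the reported setup. This is only a sketch: the model id, data path, prompt, and hyperparameter values are placeholders, and the exact flags (including how Adafactor is selected) should be checked against the script's --help.

accelerate launch --config_file accelerate_config.yaml train_dreambooth_flux.py \
  --pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev \
  --instance_data_dir=./instance_images \
  --instance_prompt="a photo of sks dog" \
  --output_dir=./flux-dreambooth \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --mixed_precision=bf16 \
  --learning_rate=1e-5 \
  --max_train_steps=1000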

Logs

No response

System Info

Who can help?

@linoytsaban

kopyl commented 6 days ago

Were you able to finish the training?

Try different accelerate configs. I use 2x H100s with 95 GB VRAM each, with this config:

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
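
For context: FULL_SHARD shards parameters, gradients, and optimizer state across ranks (the FSDP counterpart of DeepSpeed ZeRO stage 3), and fsdp_offload_params: true additionally offloads them to CPU, trading step time for memory.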

But for your GPU setup it might give you CUDA OOM errors.

zhangvia commented 5 days ago

Were you able to finish the training?

I tried DeepSpeed stage 3 and can finish the training, but the speed is too slow. If I reduce the input resolution, I can train with DeepSpeed stage 2.

kopyl commented 4 days ago

@zhangvia could you please share your accelerate config?

zhangvia commented 4 days ago

Nothing special in the accelerate config, just DeepSpeed enabled. The DeepSpeed JSON looks like this:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 3,
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true,
        "round_robin_gradients": true
    },

    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "steps_per_print": 2000,
    "train_batch_size": 16,
    "train_micro_batch_size_per_gpu": 4,
    "wall_clock_breakdown": false
}
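
For reference, DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu × number of GPUs × gradient_accumulation_steps, which holds here: 4 × 4 × 1 = 16, matching num_processes: 4 in the accelerate config below.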

Besides, precompute the image latents and text embeddings so you can offload the VAE, CLIP, and T5.
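
A minimal sketch of that precompute idea (not the actual training script; the model id, prompt, file name, and dummy batch are placeholders): encode prompts and images once with the pipeline's encoders, cache the results to disk, then free the encoders so only the transformer needs GPU memory during training.

import torch
from diffusers import FluxPipeline

# Load every component except the transformer; only the encoders are needed here.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=None, torch_dtype=torch.bfloat16
).to("cuda")

# Dummy batch; in practice this comes from your dataloader, normalized to [-1, 1].
pixel_values = torch.randn(1, 3, 1024, 1024, dtype=torch.bfloat16, device="cuda")

with torch.no_grad():
    # T5 sequence embeddings + CLIP pooled embedding, as FLUX consumes them.
    prompt_embeds, pooled_prompt_embeds, _ = pipe.encode_prompt(
        prompt="a photo of sks dog", prompt_2=None
    )
    # VAE latents, shifted and scaled the way train_dreambooth_flux.py expects.
    latents = pipe.vae.encode(pixel_values).latent_dist.sample()
    latents = (latents - pipe.vae.config.shift_factor) * pipe.vae.config.scaling_factor

torch.save(
    {
        "latents": latents.cpu(),
        "prompt_embeds": prompt_embeds.cpu(),
        "pooled_prompt_embeds": pooled_prompt_embeds.cpu(),
    },
    "cache.pt",
)

# Free the encoders before loading the transformer for training.
del pipe
torch.cuda.empty_cache()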

kopyl commented 4 days ago

@zhangvia thank you very much :)

Could you please share the entire accelerate config, not just the DeepSpeed part?

zhangvia commented 3 days ago

Like I said, nothing special:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: ./deepspeed.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 29501
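
To use it, point accelerate launch --config_file at this YAML; the deepspeed_config_file path inside resolves relative to the directory you launch from.
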
kopyl commented 2 days ago

@zhangvia thanks :)

kopyl commented 2 days ago

@zhangvia by the way, do you have an updated training script that pre-computes the image latents and text embeddings? A while ago I made one for SD 1.5 text-to-image training.

If you could share it, I'd really appreciate it :)

zhangvia commented 1 day ago

@zhangvia by the way, do you have an updated training script that pre-computes the image latents and text embeddings? A while ago I made one for SD 1.5 text-to-image training.

Sorry, I only tested the scripts and didn't actually implement the precompute code. But if you just want to fine-tune FLUX without making any changes to the FLUX model, I suggest using sd-scripts (kohya-ss/sd-scripts) to train it.