Open zhangvia opened 1 week ago
Were you able to finish the training?
Try different accelerate configs. I use 2x H100 with 95 GB VRAM each with this config:
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
But for your GPU setup it might give you CUDA OOM errors.
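For reference, with the YAML above saved as, say, fsdp.yaml (the filename is arbitrary), the run would be launched as: accelerate launch --config_file fsdp.yaml train_dreambooth_flux.py <your training args>.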
Were you able to finish the training?
I tried DeepSpeed stage 3 and I can finish the training, but the speed is too slow. If I reduce the input resolution, I can train with DeepSpeed stage 2.
@zhangvia could you please share your accelerate config?
Nothing special in the accelerate config, just enable DeepSpeed. The DeepSpeed JSON is like:
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  },
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1.0,
  "steps_per_print": 2000,
  "train_batch_size": 16,
  "train_micro_batch_size_per_gpu": 4,
  "wall_clock_breakdown": false
}
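(These numbers are consistent for a 4-GPU run: train_batch_size 16 = train_micro_batch_size_per_gpu 4 × gradient_accumulation_steps 1 × 4 processes.)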
Besides, precompute the image latents and text embeddings, so you can offload the VAE, CLIP, and T5.
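Roughly, the precompute-and-offload idea looks like this (an untested sketch, not the actual script code; the model id and the exact encode_prompt signature are assumptions and may differ across diffusers versions):

import torch
from diffusers import AutoencoderKL, FluxPipeline

device, dtype = "cuda", torch.bfloat16

# Load only the text encoders (transformer and VAE dropped from the pipeline),
# plus the VAE separately, just for a one-time encoding pass.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=None, vae=None, torch_dtype=dtype
).to(device)
vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="vae", torch_dtype=dtype
).to(device)

@torch.no_grad()
def precompute(pixel_values, prompt):
    # pixel_values: (B, 3, H, W) tensor scaled to [-1, 1]
    latents = vae.encode(pixel_values.to(device, dtype)).latent_dist.sample()
    # Flux VAE latents are shifted and scaled before the transformer sees them
    latents = (latents - vae.config.shift_factor) * vae.config.scaling_factor
    # encode_prompt runs both CLIP and T5 in one call
    prompt_embeds, pooled_embeds, _ = pipe.encode_prompt(
        prompt=prompt, prompt_2=prompt, max_sequence_length=512
    )
    return latents.cpu(), prompt_embeds.cpu(), pooled_embeds.cpu()

# Cache every sample once, e.g. torch.save(precompute(img, caption), f"cache/{i}.pt"),
# then free the encoders so training keeps only the transformer in GPU memory:
del pipe, vae
torch.cuda.empty_cache()

The training loop then loads the cached tensors instead of running the VAE and text encoders every step.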
@zhangvia thank you very much :)
Could you please share the entire accelerate config, not just the DeepSpeed part?
Like I said, nothing special:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: ./deepspeed.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 29501
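(The deepspeed_config_file path is relative, so the JSON above presumably sits as ./deepspeed.json in the directory you launch from; accelerate launch --config_file <this yaml> train_dreambooth_flux.py <args> then picks both up.)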
@zhangvia thanks :)
@zhangvia by the way, do you have an updated training script which pre-computes the image latents and text embeddings? A while ago I made one for SD 1.5 text-to-image training.
If you could share it, I'd really appreciate it :)
@zhangvia by the way, do you have an updated training script which pre-computes the image latents and text embeddings? A while ago I made one for SD 1.5 text-to-image training.
Sorry, I just tested the scripts and didn't actually implement the precompute code. But if you just want to finetune Flux and don't want to add any changes to the Flux model, I suggest using sd-scripts to train Flux.
Describe the bug
I'm using train_dreambooth_flux.py to finetune Flux. I get OOM on 4x A100 80 GB with DeepSpeed stage 2, gradient checkpointing, bf16 mixed precision, 1024×1024 input, Adafactor optimizer, batch size 1. It can only run with DeepSpeed stage 3, but that is too slow, about 16 sec/it.
Reproduction
Just use train_dreambooth_flux.py from the repo.
Logs
No response
System Info
Who can help?
@linoytsaban