zhangvia opened this issue 2 months ago
Were you able to finish the training?
Try different accelerate configs. I use 2x H100 with 95 GB VRAM each, with this config:
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
fsdp_activation_checkpointing: true
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
fsdp_forward_prefetch: true
fsdp_offload_params: true
fsdp_sharding_strategy: FULL_SHARD
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
But for your GPU setup it might give you CUDA OOM errors.
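If it helps, the usual way to use a config like this is to save it to a file (the name fsdp_config.yaml here is just an example) and point the launcher at it, e.g. accelerate launch --config_file fsdp_config.yaml train_dreambooth_flux.py with your usual training arguments.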
Were you able to finish the training?
I tried DeepSpeed stage 3 and I can finish the training, but the speed is too slow. And if I reduce the input resolution, I can train with DeepSpeed stage 2.
@zhangvia could you please share your accelerate config?
Nothing special in the accelerate config, just enable DeepSpeed. The DeepSpeed JSON is like:
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 3,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients": true,
"round_robin_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": 1.0,
"steps_per_print": 2000,
"train_batch_size": 16,
"train_micro_batch_size_per_gpu": 4,
"wall_clock_breakdown": false
}
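(One consistency note, assuming the 4-GPU setup from this issue: DeepSpeed expects train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs, i.e. 4 × 1 × 4 = 16 here, so these fields have to stay in sync with num_processes in the accelerate config.)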
Besides, precompute the image latents and text embeddings, so you can offload the VAE, CLIP, and T5.
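A rough sketch of what that precompute step could look like (a sketch only: it assumes FLUX.1-dev as the base model and an existing dataloader yielding pixel_values and prompt; this is not the actual train_dreambooth_flux.py code):

```python
import torch
from diffusers import AutoencoderKL, FluxPipeline

device = "cuda"
dtype = torch.bfloat16
model_id = "black-forest-labs/FLUX.1-dev"  # assumed base model

# Load only the encoders; the transformer being trained is loaded elsewhere.
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae", torch_dtype=dtype).to(device)
pipe = FluxPipeline.from_pretrained(
    model_id, transformer=None, vae=None, torch_dtype=dtype
).to(device)

cache = []
with torch.no_grad():
    for batch in dataloader:  # assumed to yield {"pixel_values": Tensor, "prompt": list[str]}
        pixel_values = batch["pixel_values"].to(device, dtype=dtype)
        # Encode images to latents and apply the Flux VAE shift/scale.
        latents = vae.encode(pixel_values).latent_dist.sample()
        latents = (latents - vae.config.shift_factor) * vae.config.scaling_factor
        # Encode prompts with CLIP + T5; the third return value (text_ids)
        # is positional only and can be regenerated at train time.
        prompt_embeds, pooled_prompt_embeds, _ = pipe.encode_prompt(
            prompt=batch["prompt"], prompt_2=batch["prompt"], max_sequence_length=512
        )
        cache.append({
            "latents": latents.cpu(),
            "prompt_embeds": prompt_embeds.cpu(),
            "pooled_prompt_embeds": pooled_prompt_embeds.cpu(),
        })

# The encoders are no longer needed; free the VRAM before training starts.
del vae, pipe
torch.cuda.empty_cache()
```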
@zhangvia thank you very much :)
Could you please share the entire accelerate config, not just the deepspeed?
Like I said, nothing special:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
deepspeed_config_file: ./deepspeed.json
zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 29501
@zhangvia thanks :)
@zhangvia by the way, do you have an updated training script which pre-computes the image latents and text embeddings? A while ago I made one for SD 1.5 text-to-image training.
If you could share it, I'd really appreciate it :)
@zhangvia by the way, do you have an updated training script which pre-computes the image latents and text embeddings? A while ago I made one for SD 1.5 text-to-image training.
Sorry, I just tested the scripts; I didn't actually implement the precompute code. But if you just want to finetune Flux without adding any changes to the Flux model, I suggest using sd-scripts to train Flux.
@zhangvia is it available on Linux?
@zhangvia is it available on Linux?
Of course, it can only be used on Linux.
@zhangvia thanks. I successfully trained it with kohya_ss
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
I'm using train_dreambooth_flux.py to finetune Flux. I get OOM on 4x A100 80GB with DeepSpeed stage 2, gradient checkpointing, bf16 mixed precision, 1024px × 1024px input, Adafactor optimizer, batch size 1. It can only run with DeepSpeed stage 3, but that is too slow, about 16 sec/it.
Reproduction
Just use train_dreambooth_flux.py in the repo.
Logs
No response
System Info
Who can help?
@linoytsaban