huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

accelerate + FSDP + T2I train saving ckpt error #6705

Open Forainest opened 8 months ago

Forainest commented 8 months ago

Describe the bug

I used /examples/text_to_image/train_text_to_image_sdxl.py to fine-tune SDXL with accelerate 0.25.0 + FSDP. When saving a checkpoint, training gets stuck and a complete checkpoint is never written. I also tried DeepSpeed and it gets stuck as well. I didn't change any code in train_text_to_image_sdxl.py.

Reproduction

accelerate config is

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
    fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
    fsdp_backward_prefetch_policy: BACKWARD_PRE
    fsdp_cpu_ram_efficient_loading: true
    fsdp_forward_prefetch: true
    fsdp_offload_params: true
    fsdp_sharding_strategy: 1
    fsdp_state_dict_type: FULL_STATE_DICT
    fsdp_sync_module_states: true
    fsdp_transformer_layer_cls_to_wrap: UNet2DConditionModel, DownBlock2D, CrossAttnDownBlock2D, UpBlock2D, CrossAttnUpBlock2D
    fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false 

code: /examples/text_to_image/train_text_to_image_sdxl.py; pretrained model and dataset: exactly as in the README

Logs

No response

System Info

Linux localhost.localdomain 4.14.0-115.el7a.0.1.aarch64

Who can help?

@yiyixuxu @sayakpaul

Forainest commented 8 months ago

When using DeepSpeed, if I delete the `accelerator.is_main_process` guard, it can save the checkpoint normally.
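Removing the guard helping is the classic symptom of a collective checkpoint call placed behind `is_main_process`: every rank must participate in the gather, so the lone rank that enters it waits forever. A toy simulation of the two patterns, using only the standard library (the `threading.Barrier` stands in for the distributed all-gather; the ranks, timeout, and function names are illustrative, not diffusers/accelerate code):

```python
import threading


def run_workers(guard_on_main_only, num_workers=4, timeout=1.0):
    """Simulate ranks saving a checkpoint that requires a collective gather.

    The Barrier stands in for FSDP's FULL_STATE_DICT all-gather: every rank
    must reach it, or the ranks that did reach it wait until the timeout
    trips (in real distributed training: forever, i.e. the hang).
    """
    barrier = threading.Barrier(num_workers)
    results = []
    lock = threading.Lock()

    def save_checkpoint(rank):
        try:
            # Collective step: blocks until ALL num_workers ranks arrive.
            barrier.wait(timeout=timeout)
            with lock:
                results.append((rank, "saved"))
        except threading.BrokenBarrierError:
            with lock:
                results.append((rank, "hang"))

    def worker(rank):
        is_main_process = (rank == 0)
        if guard_on_main_only:
            # Buggy pattern: only rank 0 enters the collective -> deadlock.
            if is_main_process:
                save_checkpoint(rank)
        else:
            # Correct pattern: every rank enters the collective; only
            # rank 0 would write the gathered file to disk afterwards.
            save_checkpoint(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


print(run_workers(guard_on_main_only=True))   # rank 0 times out: [(0, 'hang')]
print(run_workers(guard_on_main_only=False))  # every rank reports 'saved'
```

The same logic explains why the unguarded DeepSpeed save works: the save call itself is collective, and only the final write to disk needs to be restricted to the main process.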

sayakpaul commented 8 months ago

For DeepSpeed, you need to follow what's done in: https://github.com/huggingface/diffusers/pull/6628. Could you try that and see if it works?

Forainest commented 8 months ago

Thanks for the DeepSpeed fix! How about FSDP? Is there any way to save the checkpoint successfully?

sayakpaul commented 8 months ago

I am no FSDP expert. Can you post the error trace?

Forainest commented 8 months ago

There is no error trace. It just hangs when using accelerate + FSDP to save a checkpoint or when training finishes, like this issue: https://github.com/huggingface/diffusers/issues/2816
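For reference, a hedged sketch of the pattern PyTorch documents for saving a consolidated FSDP checkpoint: all ranks must enter the `FULL_STATE_DICT` gather, and only rank 0 writes the result. The `save_full_checkpoint` helper and the `path` argument are illustrative, not part of the training script; this assumes a model already wrapped in `FullyShardedDataParallel` and an initialized process group:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)


def save_full_checkpoint(model: FSDP, path: str) -> None:
    # Gathering FULL_STATE_DICT is a collective operation: EVERY rank must
    # enter this block, otherwise the ranks that did enter it block forever,
    # which matches the "hangs while saving" symptom described above.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state = model.state_dict()  # collective all-gather on all ranks
    # With rank0_only=True, only rank 0 holds the full state dict; it alone
    # writes to disk, while the other ranks simply fall through.
    if dist.get_rank() == 0:
        torch.save(state, path)
```

With accelerate the analogous rule is to call `accelerator.save_state(...)` on every process rather than inside an `if accelerator.is_main_process:` guard.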

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

AoqunJin commented 6 months ago

@Forainest When using DeepSpeed, you can install apex, which DeepSpeed will use automatically. That works for me.

git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 741bdf50825a97664db08574981962d66436d16a
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
    --config-settings "--build-option=--cpp_ext" \
    --config-settings "--build-option=--cuda_ext" ./