When using DeepSpeed, I removed the accelerator.is_main_process check, and the checkpoint then saves normally.
For DeepSpeed, you need to follow what's done in: https://github.com/huggingface/diffusers/pull/6628. Could you try that and see if it works?
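For readers following along, here is a minimal sketch of the general idea behind that advice: under DeepSpeed, every process may hold part of the training state, so every process has to reach the save call instead of only the main one. This is not a copy of the linked PR. The helper name `should_save_checkpoint` and the plain-string `distributed_type` argument are assumptions made for illustration; in a real script you would read `accelerator.is_main_process` and `accelerator.distributed_type` instead.

```python
def should_save_checkpoint(is_main_process: bool, distributed_type: str) -> bool:
    """Decide whether this process should enter the checkpoint-saving branch.

    Hypothetical helper: `distributed_type` is a plain string here so the
    sketch runs without accelerate installed. In a training script it would
    correspond to accelerator.distributed_type.
    """
    # Under DeepSpeed, all ranks take part in saving, so no rank is skipped.
    if distributed_type == "DEEPSPEED":
        return True
    # Otherwise keep the usual behaviour: only the main process saves.
    return is_main_process


# Example: a non-main rank under DeepSpeed still enters the save branch,
# while a non-main rank in a plain multi-GPU run does not.
print(should_save_checkpoint(False, "DEEPSPEED"))
print(should_save_checkpoint(False, "MULTI_GPU"))
```

The training script would then guard its save call with this condition instead of a bare main-process check.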
Thanks for the DeepSpeed pointer! How about FSDP? Is there any way to save a checkpoint successfully?
I am no FSDP expert. Can you post the error trace?
There is no error trace. The job hangs when using accelerate + FSDP to save a checkpoint or when training finishes, like in this issue: https://github.com/huggingface/diffusers/issues/2816
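A silent hang with no traceback is what you would typically see if a collective operation is entered by only some of the processes. That is one plausible cause here, not something confirmed in this thread. The sketch below is only an analogy using Python threads and `threading.Barrier`; it does not use FSDP or accelerate, and `run_ranks` is a hypothetical name. A timeout stands in for the real hang so the example terminates.

```python
import threading


def run_ranks(world_size, ranks_that_save, timeout=0.5):
    """Simulate ranks entering a collective save step.

    Returns the sorted list of ranks that got through the step.
    The barrier stands in for a collective that needs all ranks.
    """
    barrier = threading.Barrier(world_size)
    completed = []
    lock = threading.Lock()

    def worker(rank):
        # Ranks outside `ranks_that_save` skip the step entirely,
        # like code guarded by a main-process-only check.
        if rank not in ranks_that_save:
            return
        try:
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # A real distributed job would block here indefinitely.
            return
        with lock:
            completed.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(completed)


# Only rank 0 enters the step: nobody completes.
print(run_ranks(4, {0}))
# All ranks enter the step: everybody completes.
print(run_ranks(4, {0, 1, 2, 3}))
```

If this is what is happening, running `py-spy dump` on each stuck process would show where each rank is waiting, which can stand in for the missing error trace.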
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@Forainest When using DeepSpeed, you can install apex, which DeepSpeed will then use automatically. That works for me.
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 741bdf50825a97664db08574981962d66436d16a
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"
Closing due to inactivity. Feel free to re-open.
Describe the bug
I used /examples/text_to_image/train_text_to_image_sdxl.py to fine-tune SDXL with accelerate 0.25.0 + FSDP. When saving a checkpoint, training gets stuck and the checkpoint is never fully written. I also tried DeepSpeed, and it gets stuck too. I did not change any code in train_text_to_image_sdxl.py.
Reproduction
accelerate config is
Code: /examples/text_to_image/train_text_to_image_sdxl.py
Pretrained model and dataset: exactly as described in the README
Logs
No response
System Info
Linux localhost.localdomain 4.14.0-115.el7a.0.1.aarch64
Who can help?
@yiyixuxu @sayakpaul