Closed · uygnef closed this issue 1 year ago
@stas00 could you please take a look at this issue?
See https://github.com/huggingface/diffusers/pull/3076
Please carefully read the OP of the PR for details.
@uygnef Have you solved this problem?
@luochuwei Yes, it works for training one model, but there seems to be an issue with training multiple models. I have submitted an issue at https://github.com/microsoft/DeepSpeed/issues/3472
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Describe the bug

An error is reported when using DeepSpeed ZeRO stage 3 to fine-tune with the diffusers/examples/text_to_image/train_text_to_image.py script. My machine has 2×A100 GPUs and runs DeepSpeed ZeRO stage 3.

The error log is:
I read https://github.com/huggingface/diffusers/issues/1865, https://www.deepspeed.ai/tutorials/zero/#allocating-massive-megatron-lm-models, and https://deepspeed.readthedocs.io/en/latest/zero3.html#deepspeed.zero.GatheredParameters, and modified /usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py as follows:
but it does not work.
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
I am experiencing the same issue as described in https://github.com/huggingface/diffusers/issues/1865, so I have copied the reproduction steps from the original post.
/home/kas/zero_stage3_offload_config.json
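The contents of /home/kas/zero_stage3_offload_config.json are not included in the report. A typical DeepSpeed ZeRO stage 3 config with CPU offload uses the following keys (the values here are illustrative assumptions, not the reporter's actual file):

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```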
```shell
pip install deepspeed
export MODEL_NAME="stabilityai/stable-diffusion-2"
export dataset_name="lambdalabs/pokemon-blip-captions"
```
```shell
accelerate launch --config_file ./accelerate.yaml --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=224 --center_crop --random_flip \
  --train_batch_size=16 \
  --gradient_accumulation_steps=2 \
  --gradient_checkpointing \
  --max_train_steps=500 \
  --learning_rate=6e-5 \
  --max_grad_norm=1 \
  --lr_scheduler="constant_with_warmup" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model"
```
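The accelerate.yaml passed via --config_file is also not shown. A minimal Accelerate config that points at the /home/kas/zero_stage3_offload_config.json path above for a 2-GPU machine might look like this (illustrative assumption, not the reporter's actual file):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: /home/kas/zero_stage3_offload_config.json
  zero3_init_flag: true
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
```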
Expected behavior
The goal is to be able to train with ZeRO stage 3 normally.