Open dann2333 opened 3 months ago
However, when I tried to restart training, I only got the error below. I tried other checkpoints and removed those two params, but nothing works.
What does this mean in code?
@linoytsaban could you check if you're able to reproduce this with text encoder training?
accelerate launch train_dreambooth_lora_sd3.py --pretrained_model_name_or_path="stabilityai/stable-diffusion-3-medium-diffusers" --output_dir=sd3-lora --instance_data_dir="sample-imgs" --instance_prompt="xxx" --resolution=900 --train_batch_size=1 --train_text_encoder --gradient_accumulation_steps=16 --optimizer="adamw" --learning_rate=1e-6 --text_encoder_lr=1e-6 --lr_scheduler="cosine" --lr_warmup_steps=500 --max_train_steps=4000 --rank=32 --seed="42" --gradient_checkpointing --resume_from_checkpoint latest --center_crop --report_to="wandb" --checkpointing_steps 20 --checkpoints_total_limit 3 --validation_prompt="xxx" --validation_epochs=1
The prompt is a bit long, so I cut it. If the exact prompt is needed, just reply. Thanks
Hey @dann2333 thanks for reporting!
The issue occurs when resuming from a checkpoint (I was able to reproduce it). I'm not sure yet why it happens, but it shows up when training with checkpointing flags such as:
--gradient_checkpointing
--resume_from_checkpoint="latest"
specifically, this call errors
Since both the text_encoder_one and text_encoder_two classes are CLIPTextModelWithProjection, there is an issue when saving the models in the save_model_hook:
isinstance(model, type(unwrap_model(text_encoder_one)))
isinstance(model, type(unwrap_model(text_encoder_two)))
They are of the same type, so text_encoder_two will overwrite text_encoder_one. A small workaround is:
elif isinstance(model, type(unwrap_model(text_encoder_one))) and model.config.hidden_size == 768:
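To illustrate why the type check alone cannot tell the two encoders apart, here is a minimal sketch. It does not use the real training script or transformers classes: `StandInTextEncoder`, `route_by_type`, and `route_by_type_and_size` are hypothetical names made up for this example, and the 1280 hidden size for the second encoder is an assumption (the workaround above only states 768 for the first).

```python
from types import SimpleNamespace


class StandInTextEncoder:
    """Hypothetical stand-in for CLIPTextModelWithProjection (not the real class)."""

    def __init__(self, hidden_size):
        self.config = SimpleNamespace(hidden_size=hidden_size)


def route_by_type(models, encoder_one, encoder_two):
    """Assign each model to a slot using isinstance alone."""
    slots = {"text_encoder_one": None, "text_encoder_two": None}
    for model in models:
        if isinstance(model, type(encoder_one)):
            # Both encoders share a class, so both end up in this branch
            # and the later one overwrites the earlier one.
            slots["text_encoder_one"] = model
        elif isinstance(model, type(encoder_two)):
            slots["text_encoder_two"] = model  # never reached
    return slots


def route_by_type_and_size(models, encoder_one, encoder_two):
    """Same routing, but also check hidden_size as in the workaround."""
    slots = {"text_encoder_one": None, "text_encoder_two": None}
    for model in models:
        if isinstance(model, type(encoder_one)) and model.config.hidden_size == 768:
            slots["text_encoder_one"] = model
        elif isinstance(model, type(encoder_two)):
            slots["text_encoder_two"] = model
    return slots


encoder_one = StandInTextEncoder(hidden_size=768)
encoder_two = StandInTextEncoder(hidden_size=1280)  # assumed value, for illustration only

broken = route_by_type([encoder_one, encoder_two], encoder_one, encoder_two)
fixed = route_by_type_and_size([encoder_one, encoder_two], encoder_one, encoder_two)
```

With type-only routing, the first slot ends up holding the second encoder and the second slot stays empty, which matches the overwrite described above. The extra `hidden_size` check only works as long as the two encoders really differ in that config value.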
Really? You're so smart
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
Hi, I'm trying to fine-tune stabilityai/stable-diffusion-3-medium-diffusers using the official diffusers scripts. Training ran normally, except that the loss did not decrease. I wanted to add a validation prompt to check that everything works, so I stopped training with Ctrl-C and then added the --validation_prompt and --validation_epochs params. However, when I tried to restart training, I only got the error below. I tried other checkpoints and removed those two params, but nothing works.
Reproduction
Here are the checkpoint links:
https://drive.google.com/drive/folders/16RbJa_W4H7aQiGf7QhTXVEJV53LPuS8n?usp=sharing
https://drive.google.com/drive/folders/1zT3LmB7SNtavHP3tgbodTgE13VT0cYvb?usp=sharing
The train command is:
accelerate launch train_dreambooth_lora_sd3.py --pretrained_model_name_or_path="stabilityai/stable-diffusion-3-medium-diffusers" --output_dir=sd3-lora --instance_data_dir="sample-imgs" --instance_prompt="xxx" --resolution=900 --train_batch_size=1 --train_text_encoder --gradient_accumulation_steps=16 --optimizer="adamw" --learning_rate=1e-6 --text_encoder_lr=1e-6 --lr_scheduler="cosine" --lr_warmup_steps=500 --max_train_steps=4000 --rank=32 --seed="42" --gradient_checkpointing --resume_from_checkpoint latest --center_crop --report_to="wandb" --checkpointing_steps 20 --checkpoints_total_limit 3 --validation_prompt="xxx" --validation_epochs=1
Logs
System Info
🤗 Diffusers version: 0.30.0.dev0
Platform: Linux-6.5.0-25-generic-x86_64-with-glibc2.35
Running on a notebook?: No
Running on Google Colab?: No
Python version: 3.10.13
PyTorch version (GPU?): 2.2.1 (True)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Huggingface_hub version: 0.24.0
Transformers version: 4.42.4
Accelerate version: 0.32.1
PEFT version: 0.11.1
Bitsandbytes version: not installed
Safetensors version: 0.4.3
xFormers version: not installed
Accelerator: NVIDIA L40s, 49152 MiB VRAM
Using GPU in script?: NVIDIA L40s, 49152 MiB VRAM
Using distributed or parallel set-up in script?: No
Who can help?
@sayakpaul