This PR should resolve the issue: https://github.com/huggingface/diffusers/pull/5388. Could you please check?
hi, thanks for your kind response, @sayakpaul. However, I tried the PR in https://github.com/huggingface/diffusers/pull/5388 and the results are not satisfactory, same as on the main branch (the output does not look like the training dog at all, even after 1500 training steps). All training settings are based on the default configurations in https://github.com/younesbelkada/diffusers/blob/b21064f68ffad648455da116ba4b6bb669d1a223/examples/dreambooth/README_sdxl.md?plain=1#L79. It would be really nice if you could help debug this on the main branch, thanks. 😊
Cc: @younesbelkada for the configs he tried.
@yuxu915 do you by any chance use --use-gradient-checkpointing? Can you share the full command so I can try to repro?
hi, @younesbelkada, thanks for helping, but I cannot find --use-gradient-checkpointing in train_dreambooth_lora_sdxl.py 🤔️. I trained on https://github.com/younesbelkada/diffusers . Am I still using the wrong repo?
My training command is as follows:
export MODEL_NAME="stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="datasets/image_instance/dog_1"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="model/stable-diffusion-xl-base-1.0/sdxl-vae-fp16-fix"
accelerate launch train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--pretrained_vae_model_name_or_path=$VAE_PATH \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a photo of sks dog" \
--resolution=256 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-5 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=1500 \
--validation_prompt="A photo of sks dog in a bucket" \
--validation_epochs=500 \
--checkpointing_steps=100 \
--seed="0" \
--mixed_precision="fp16"
hi, @younesbelkada, do you mean --gradient_checkpointing? I turned it on but it seems to give the same results as before.
Hi @yuxu915 ,
I had a look at the training scripts in detail. It appears that in PEFT we initialize LoRA layers differently than in diffusers. In PEFT we use kaiming with a=sqrt(5): https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py#L158 and in diffusers we use torch.nn.init.normal with std=1/rank: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/lora.py#L223
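To make the difference concrete, here is a minimal sketch (my own illustration, not the actual library code) of the two initialization schemes for a rank-4 LoRA pair; the layer sizes are arbitrary placeholders:

import math
import torch.nn as nn

in_features, out_features, rank = 320, 320, 4

# PEFT-style init: lora_A ~ kaiming_uniform_(a=sqrt(5)), lora_B zeroed
lora_A = nn.Linear(in_features, rank, bias=False)
lora_B = nn.Linear(rank, out_features, bias=False)
nn.init.kaiming_uniform_(lora_A.weight, a=math.sqrt(5))
nn.init.zeros_(lora_B.weight)

# diffusers-style init: down ~ normal_(std=1/rank), up zeroed
down = nn.Linear(in_features, rank, bias=False)
up = nn.Linear(rank, out_features, bias=False)
nn.init.normal_(down.weight, std=1 / rank)
nn.init.zeros_(up.weight)

# In both cases the effective update (B @ A, or up @ down) starts at zero because
# the second matrix is zeroed, but the variance of the first matrix differs,
# which changes early training dynamics.
print(lora_A.weight.std().item(), down.weight.std().item())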
If one uses the default hyper-parameters, the model indeed struggles to converge after 500 steps; I managed to get nice convergence by using a higher LR (2e-4) and cosine as the LR scheduler. Below is the full command that I used:
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
export CUDA_VISIBLE_DEVICES="2"
accelerate launch train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--pretrained_vae_model_name_or_path=$VAE_PATH \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--instance_prompt="a photo of sks dog" \
--resolution=1024 \
--train_batch_size=2 \
--gradient_accumulation_steps=4 \
--learning_rate=2e-4 \
--report_to="wandb" \
--lr_scheduler="cosine" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="A photo of sks dog in a bucket" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub
And the images that I get after ~150 steps:
Results after ~470 steps:
Also confirmed that it works even when using gradient_checkpointing with the same config:
hi, @younesbelkada, thanks for your kind response. I tried your training command and got results like:
The results do not look entirely like the images in the training set. I will try more hyperparameter combinations to see whether I can get better results.
Another problem is that I tried to save intermediate LoRAs during training by setting --checkpointing_steps to 25; however, at inference time I sequentially load each LoRA and generate images. These images are different from the ones generated during validation (see wandb) and are not similar to the images in the training set. The inference script is:
from diffusers import DiffusionPipeline
import torch

base_model_id = '/model/stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0'
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)
pipe.to("cuda")  # move to GPU; fp16 inference on CPU is not supported

for step in range(25, 501, 25):
    # load the LoRA weights saved at this checkpoint
    pipe.load_lora_weights(f"/diffusers/examples/dreambooth/lora-trained-xl/checkpoint-{step}")
    image = pipe("A picture of a sks dog in a bucket", num_inference_steps=25).images[0]
    image.save(f"sks_dog_{step}.png")
    # unload before the next iteration so the LoRAs do not stack
    pipe.unload_lora_weights()
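One thing worth ruling out (my assumption, not something established above) is the random seed: with --seed set, the training script's validation loop generates with a fixed generator, while the loop above samples with a fresh seed for every checkpoint, so the images are not directly comparable. A minimal tweak to the call above that pins the noise:

# assumption: seed the generator so every checkpoint is sampled with the same noise
generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(
    "A picture of a sks dog in a bucket",
    num_inference_steps=25,
    generator=generator,
).images[0]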
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
When resuming training from an intermediate LoRA checkpoint, it stops updating the model (i.e., subsequent checkpoints remain identical to the intermediate checkpoint). To reproduce the bug, just turn on the
--resume_from_checkpoint
flag. All experimental settings are based on the default configurations, using the latest version of the Diffusers library. Thanks for the help. @patrickvonplaten @sayakpaul @yiyixuxu @DN6 Maybe related to https://github.com/huggingface/diffusers/issues/5004
Reproduction
https://colab.research.google.com/drive/17zNvqJZ8ChJaYZr6XIfsJBduKtb5FbOT#scrollTo=N14_vgURsNMY
Logs
No response
System Info
diffusers version: 0.24.0.dev0
Who can help?
No response