train_dreambooth_lora_sdxl.py cannot resume training from checkpoint: model is frozen

yuxu915 commented 11 months ago

Describe the bug

When resuming training from an intermediate LoRA checkpoint, the model stops updating (i.e., later checkpoints remain identical to the checkpoint that training resumed from). To reproduce the bug, just pass the --resume_from_checkpoint flag. All experimental settings use the default configuration and the latest version of the Diffusers library. Thanks for the help. @patrickvonplaten @sayakpaul @yiyixuxu @DN6 Maybe related to https://github.com/huggingface/diffusers/issues/5004
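
One way to check the symptom described above is to hash the weight file saved in each checkpoint directory and look for consecutive checkpoints that are byte-identical. The following is only a minimal sketch: the helper names are made up for illustration, and the weight file name `pytorch_lora_weights.safetensors` inside each `checkpoint-<step>` directory is an assumption that may not match every version of the script. Identical bytes imply identical weights; differing bytes alone do not prove that the weights changed.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unchanged_checkpoints(output_dir, weight_name="pytorch_lora_weights.safetensors"):
    """List consecutive checkpoint pairs whose weight files are byte-identical.

    `weight_name` is an assumed file name, not confirmed by the issue.
    """
    # Keep only checkpoint-<step> directories that contain the weight file.
    dirs = [
        d
        for d in Path(output_dir).glob("checkpoint-*")
        if d.name.split("-")[-1].isdigit() and (d / weight_name).is_file()
    ]
    # Sort numerically by step so that checkpoint-200 comes before checkpoint-1000.
    dirs.sort(key=lambda d: int(d.name.split("-")[-1]))
    hashes = [(d.name, file_sha256(d / weight_name)) for d in dirs]
    # Compare each checkpoint with the next one.
    return [(a, b) for (a, ha), (b, hb) in zip(hashes, hashes[1:]) if ha == hb]
```

For example, `unchanged_checkpoints("lora-trained-xl")` would return pairs such as `("checkpoint-500", "checkpoint-600")` if the weights stopped changing after resuming from step 500.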

Reproduction

https://colab.research.google.com/drive/17zNvqJZ8ChJaYZr6XIfsJBduKtb5FbOT#scrollTo=N14_vgURsNMY

Logs

No response

System Info

diffusers version: 0.24.0.dev0
Platform: Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.10
Python version: 3.8.5
PyTorch version (GPU?): 2.0.0+cu117 (True)
Huggingface_hub version: 0.16.4
Transformers version: 4.33.0
Accelerate version: 0.20.3
xFormers version: 0.0.18
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

No response

sayakpaul commented 11 months ago

This PR should resolve these issues: https://github.com/huggingface/diffusers/pull/5388. Could you please check that?

yuxu915 commented 11 months ago

Hi @sayakpaul, thanks for your kind response. However, I tried the PR in https://github.com/huggingface/diffusers/pull/5388, and the results do not seem as satisfactory as on the main branch (the output does not look like the training dog at all, even after 1500 training steps). All training settings use the default configuration in https://github.com/younesbelkada/diffusers/blob/b21064f68ffad648455da116ba4b6bb669d1a223/examples/dreambooth/README_sdxl.md?plain=1#L79. It would be really nice if you could help debug on the main branch, thanks. 😊

sayakpaul commented 11 months ago

Cc: @younesbelkada for the configs he tried.

younesbelkada commented 11 months ago

@yuxu915 are you by any chance using --use-gradient-checkpointing? Can you share the full command so I can try to repro?

yuxu915 commented 11 months ago

Hi @younesbelkada, thanks for helping, but I cannot find --use-gradient-checkpointing in train_dreambooth_lora_sdxl.py 🤔. I trained with https://github.com/younesbelkada/diffusers. Am I still using the wrong repo? My training command is as follows:

export MODEL_NAME="stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="datasets/image_instance/dog_1"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="model/stable-diffusion-xl-base-1.0/sdxl-vae-fp16-fix"

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=256 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-5 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=500 \
  --checkpointing_steps=100 \
  --seed="0" \
  --mixed_precision="fp16"

yuxu915 commented 11 months ago

Hi @younesbelkada, do you mean --gradient_checkpointing? I turned it on, but the results seem to be the same as before.

younesbelkada commented 11 months ago

Hi @yuxu915, I had a look at the training scripts in detail. It appears that PEFT initializes LoRA layers differently than diffusers. In PEFT we use Kaiming initialization with a=sqrt(5): https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py#L158 and in diffusers we use torch.nn.init.normal with std = 1 / rank: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/lora.py#L223 With the default hyper-parameters, the model indeed struggles to converge after 500 steps; I managed to get nice convergence by using a higher LR (2e-4) and a cosine LR scheduler. Below is the full command that I used:

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
export CUDA_VISIBLE_DEVICES="2"

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=2 \
  --gradient_accumulation_steps=4 \
  --learning_rate=2e-4 \
  --report_to="wandb" \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub

And the images that I get after ~150 steps:

Screenshot 2023-11-20 at 15 12 13
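
To get a feel for how different the two initializations mentioned above are, here is a rough sketch that computes the standard deviation each one produces for a LoRA down-projection weight. It relies on PyTorch's documented formula for `kaiming_uniform_` (samples from U(-bound, bound) with bound = gain * sqrt(3 / fan_in), where gain = sqrt(2 / (1 + a^2))). The values `rank = 4` and `in_features = 640` are hypothetical, chosen only for illustration, and the helper names are not part of either library. Whether this scale gap fully explains the slower convergence is not established in this thread.

```python
import math


def kaiming_uniform_std(fan_in, a=math.sqrt(5)):
    """Std produced by torch.nn.init.kaiming_uniform_(w, a=a) for a given fan_in."""
    gain = math.sqrt(2.0 / (1.0 + a * a))   # leaky_relu gain, the default nonlinearity
    bound = gain * math.sqrt(3.0 / fan_in)  # samples are drawn from U(-bound, bound)
    return bound / math.sqrt(3.0)           # std of a uniform distribution on [-bound, bound]


def normal_std(rank):
    """Std produced by a normal init with std = 1 / rank."""
    return 1.0 / rank


# Hypothetical layer shape, for illustration only.
rank, in_features = 4, 640

peft_std = kaiming_uniform_std(in_features)  # PEFT-style init of the down matrix
diffusers_std = normal_std(rank)             # diffusers-style init of the down matrix
ratio = diffusers_std / peft_std

print(f"PEFT-style std:      {peft_std:.4f}")
print(f"diffusers-style std: {diffusers_std:.4f}")
print(f"ratio:               {ratio:.2f}")
```

With these assumed shapes the PEFT-style weights start out roughly an order of magnitude smaller, which may be one reason the same learning rate behaves differently under the two code paths.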

younesbelkada commented 11 months ago

Results after ~470 steps:

Screenshot 2023-11-20 at 15 37 37

younesbelkada commented 11 months ago

I also confirmed that it works even when using gradient_checkpointing with the same config:

Screenshot 2023-11-20 at 16 21 55

yuxu915 commented 11 months ago

Hi @younesbelkada, thanks for your kind response. I tried your training command and got results like:

The results do not look entirely similar to the images in the training set. I will try more hyperparameter combinations in an attempt to get better results. Another problem: I saved intermediate LoRAs during training by setting --checkpointing_steps to 25, and at inference time I load each LoRA in turn and generate images. These images are different from those generated during validation (see wandb) and do not resemble the images in the training set. The inference script is:

from huggingface_hub.repocard import RepoCard
from diffusers import DiffusionPipeline
import torch

# Load the SDXL base model in half precision.
base_model_id = '/model/stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0'
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)

# Load each intermediate LoRA checkpoint (steps 25, 50, ..., 500) in turn
# and generate one image per checkpoint.
for step in range(25, 501, 25):
    pipe.load_lora_weights(f"/diffusers/examples/dreambooth/lora-trained-xl/checkpoint-{step}")
    image = pipe("A picture of a sks dog in a bucket", num_inference_steps=25).images[0]
    image.save(f"sks_dog_{step}.png")

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
