Dreambooth results are not reproducible

aurotripathy commented 1 year ago

Describe the bug

I'm unable to reproduce the expected DreamBooth results (others report the same). In my case, the output image is nearly identical to one of the input images.

My steps:

pip install --upgrade diffusers[torch]
git clone https://github.com/huggingface/diffusers.git

cd diffusers/examples/dreambooth/
pip install -U -r requirements.txt
accelerate config
huggingface-cli login

Get the seed dog images as instructed.

Install bitsandbytes. LD_LIBRARY_PATH needs to point to where CUDA is installed.

pip install bitsandbytes
python -m bitsandbytes
export LD_LIBRARY_PATH="/opt/conda/lib"
python -m bitsandbytes

Run the finetune script

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="/root/images/dog/"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --train_text_encoder \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks dog" \
  --class_prompt="a photo of dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --use_8bit_adam \
  --gradient_checkpointing \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800

Run the inference script

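from diffusers import StableDiffusionPipeline
import torch

# Same directory that was passed as OUTPUT_DIR to the training script
model_id = "path-to-save-model"
# Load the fine-tuned weights in half precision and move the pipeline to the GPU
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# The prompt reuses the "sks dog" identifier from --instance_prompt
prompt = "A photo of sks dog in a bucket"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]

image.save("dog-bucket.png")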
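Editor's sketch: when comparing training settings it helps to fix the sampling seed, so that differences between images come from the fine-tuned weights rather than from the initial noise. The helper below is a minimal illustration, not part of the original report. The name `generate_with_seed` is hypothetical; it assumes `pipe` is a `StableDiffusionPipeline` loaded as in the inference script, and relies on the pipeline's `generator` argument. Results should repeat on the same hardware and library versions, but may still differ across machines.

```python
def generate_with_seed(pipe, prompt, seed=0, device="cuda"):
    """Generate one image from `prompt` using a fixed random seed."""
    import torch  # imported lazily so the helper can be defined without torch loaded

    # A seeded generator fixes the initial latent noise for this call.
    generator = torch.Generator(device=device).manual_seed(seed)
    result = pipe(
        prompt,
        num_inference_steps=50,
        guidance_scale=7.5,
        generator=generator,
    )
    return result.images[0]


# Usage, assuming `pipe` was loaded as in the inference script:
# image = generate_with_seed(pipe, "A photo of sks dog in a bucket", seed=0)
# image.save("dog-bucket-seed0.png")
```

This only makes inference repeatable; it does not by itself address the output collapsing onto an input image.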

Reproduction

No response

Logs

--

System Info

--

Revist commented 1 year ago

Duplicate of https://github.com/huggingface/diffusers/issues/1062

patil-suraj commented 1 year ago

Answered here https://github.com/huggingface/diffusers/issues/1062#issuecomment-1307128786

aurotripathy commented 1 year ago

@patil-suraj, @Revist: Thanks for the guidance. If I use just the prior-preservation loss script as-is, I'm able to reproduce the result (or something close to it, for the prompt "A photo of sks dog in a bucket").


If I extend the fine-tuning to the text encoder, the output is essentially the same as one of the input images.

Is that expected?

I would love an answer here but I can close this and open a new one, if that's better.

Thank you again.

-Auro

entrpn commented 1 year ago

I'm able to get decent results if I fine-tune the text encoder, even using low-quality images (all images look pretty similar).

I'm using 128 class images with 8 instance images, 800 steps. I had to try multiple times with different settings to get something like this:

[image: "a photo of sks man wearing an ironman suit"]

aurotripathy commented 1 year ago

@entrpn, thank you. Would you be willing to share your settings for fine-tuning the text encoder (which ones)? That would be helpful.

patil-suraj commented 1 year ago

@aurotripathy as I said in the comment I linked, for DreamBooth we need to tune hyperparameters to get the best results. Also, training the text encoder is usually helpful for more complex concepts like faces. I don't think this is an issue with the training script. I would suggest trying different settings and picking what works best for your use case.
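Editor's sketch: one way to "try different settings" systematically is to generate one training command per combination of hyperparameters. The flags below are copied from the command in the original report; the grid values, the `build_sweep` helper, and the `runs/...` output naming are illustrative assumptions, not recommended settings.

```python
import itertools
import shlex

# Fixed arguments, taken from the command in the original report.
BASE_ARGS = [
    "accelerate", "launch", "train_dreambooth.py",
    "--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4",
    "--instance_data_dir=/root/images/dog/",
    "--class_data_dir=path-to-class-images",
    "--with_prior_preservation", "--prior_loss_weight=1.0",
    "--instance_prompt=a photo of sks dog",
    "--class_prompt=a photo of dog",
    "--resolution=512",
    "--train_batch_size=1",
    "--use_8bit_adam",
    "--gradient_checkpointing",
    "--lr_scheduler=constant",
    "--lr_warmup_steps=0",
    "--num_class_images=200",
]


def build_sweep(learning_rates, step_counts, text_encoder_options):
    """Return one command (as an argument list) per combination of settings."""
    commands = []
    for lr, steps, train_te in itertools.product(
        learning_rates, step_counts, text_encoder_options
    ):
        # Hypothetical naming scheme so each run saves to its own directory.
        run_name = f"lr{lr}-steps{steps}-te{int(train_te)}"
        args = BASE_ARGS + [
            f"--learning_rate={lr}",
            f"--max_train_steps={steps}",
            f"--output_dir=runs/{run_name}",
        ]
        if train_te:
            args.append("--train_text_encoder")
        commands.append(args)
    return commands


if __name__ == "__main__":
    # Print shell-ready commands; the grid values here are placeholders.
    for cmd in build_sweep([1e-6, 2e-6, 5e-6], [400, 800], [False, True]):
        print(shlex.join(cmd))
```

Each printed line can be run as-is; comparing the resulting models on the same prompt then shows which combination avoids overfitting to the instance images.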

aurotripathy commented 1 year ago

Thank you, @patil-suraj. Lots of good thoughts (for a newbie like me). Closing this.

camenduru commented 1 year ago

Would you be willing to share your setting for fine-tuning the text encoder(which ones)? That would be helpful.

@aurotripathy maybe this helps: https://wandb.ai/psuraj/dreambooth/reports/Dreambooth-Training-Analysis--VmlldzoyNzk0NDc3

entrpn commented 1 year ago

@aurotripathy take a look at this project, which is what I used for the results above: https://github.com/entrpn/serving-model-cards/tree/main/training-dreambooth
