
Basic Dreambooth Guide

This is a short guide on the process of collecting a dataset and basic dreambooth settings.

My Setup

How to fine-tune Stable Diffusion 2.0

Right now I suggest training on the 512x512 base model of Stable Diffusion 2.0, as training with a v-prediction model (like the 768-v) has not been tested yet and might need a few modifications to the code. Diffusers-based repositories like Shivam's will be able to train on the 2.0 model with only a few updates to the dependencies. I recommend installing xformers, as I have not tested training without it and can't tell whether it is actually needed. A Windows branch can be found here. Using the pip installers from the guide, you should have the following dependencies installed:

pip install --upgrade git+https://github.com/huggingface/diffusers.git transformers accelerate scipy
pip install ftfy
pip install albumentations
pip install tensorboard
pip install modelcards
pip install OmegaConf
pip install pytorch_lightning
pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

Make sure to download the Stable Diffusion 2.0 Base model and not the 768-v or any other model. After you've installed the dependencies and loaded the correct model, you should be able to train just like before.
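
Before kicking off a long run, a quick sanity check can confirm the environment (a minimal sketch; stabilityai/stable-diffusion-2-base is the Hugging Face Hub ID for the 512x512 base model, and the second command downloads and caches its weights):

python -c "import xformers; print(xformers.__version__)"
python -c "from diffusers import StableDiffusionPipeline; StableDiffusionPipeline.from_pretrained('stabilityai/stable-diffusion-2-base')"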

The Dataset

Dataset creation is the most important part of getting good, consistent results from Dreambooth training. Be sure to use high-quality samples, as artifacts such as motion blur or low resolution will get picked up by the training and appear in the images you generate with your model. When training for a specific style, pick samples with good consistency. Ideally, only pick images from the show or artist you're training on. Avoid fan art or anything with a different style, unless you're aiming for something like a style fusion.

In my trainings, the sample or instance images consist of around 10-100 images from the show, movie, or style I want to train.

Once you've collected photos for a dataset, crop and resize all the images into 512x512 squares and remove any watermarks, logos, people/limbs cut off by the edge of the picture, or anything else you don't want in your final model. For subjects, I've found that including samples with either black or white backgrounds helps immensely. Transparent backgrounds may also work, but can sometimes leave a fringe or border around the subject as can be seen here: PNG Fringe

So for now I can't recommend using transparent backgrounds. Make sure to include images with a normal background (for example, of your subject in a scene). In my testing, using only plain backgrounds gave a poorer result.

Save the images in the PNG format and put all of them in a "train" folder.
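
For a large batch of source images, ImageMagick can do the center-crop and resize in one pass (a sketch assuming your originals sit in a hypothetical raw/ folder; still curate the results by hand afterwards):

mkdir -p train
for f in raw/*; do
  # fill a 512x512 box, then center-crop to an exact square PNG
  convert "$f" -resize 512x512^ -gravity center -extent 512x512 "train/$(basename "${f%.*}").png"
done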

The Reg Images

Something I've read on the internet, and that I do NOT recommend, is using stock photos or other real images as reg images.

I don't know where this theory originates, but I find it to be misinformation. In Dreambooth training, reg images serve as examples of what the model can already generate for that class, and they prevent the training from bleeding into other classes. For example, when training the class "man" you don't want the class "woman" to be affected as well. Using reg images that weren't created by the model prevents this "prior preservation" from working and instead trains the whole model on your dataset. You can achieve that effect more easily by just training without the "prior_preservation_loss".

I usually use 1000 regularization or class images for my style trainings. The paper suggests 200 times the number of samples, but I've never used more than 2000 reg images. Generate the images beforehand or let the script do it before the training process. I use the stable-diffusion-v1-5 model to render the images with the DDIM sampler, 30 steps, and 512x512 resolution. For the prompt, you want to use the class you intend to train. When training a style I use "artwork style" as the prompt. You can have a look at my reg images here, or use them for your own training:

Reg Images by Nitrosocke (the intended class prompt for each set is its folder name).
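
If you'd rather pre-render your own reg images as described above, a short diffusers script can batch them out (a sketch assuming the runwayml/stable-diffusion-v1-5 checkpoint and an output folder named after the class prompt; adjust the prompt and count to your class):

mkdir -p "reg_images/artwork style"
python - <<'EOF'
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# SD 1.5 with the DDIM sampler, 30 steps, 512x512 -- as described above
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

for i in range(1000):
    image = pipe("artwork style", num_inference_steps=30,
                 height=512, width=512).images[0]
    image.save(f"reg_images/artwork style/{i:04d}.png")
EOF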

The Settings

I roughly aim for 100 steps per sample image, so a dataset with 10 images should train well with 1000 steps. The effects of undertraining and overtraining can be seen here:

Training Stages
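
That rule of thumb is easy to compute from your train folder (a trivial sketch):

N=$(ls train/*.png | wc -l)
echo "suggested max_train_steps: $((N * 100))"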

These are example settings for my training (a full launch command is sketched after the list):

instance_prompt= "archer style"
class_prompt= "illustration style"
train_batch_size= 1
gradient_accumulation_steps= 1
learning_rate= 1e-6
lr_scheduler= "constant"
lr_warmup_steps= 0
max_train_steps= 4000
train_text_encoder 
gradient_checkpointing
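
Put together, those settings map onto the script's command-line flags roughly like this (a sketch against Shivam's train_dreambooth.py with hypothetical local paths; --num_class_images tells the script how many reg images to use, generating any that are missing):

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-base" \
  --instance_data_dir="./train" \
  --class_data_dir="./reg_images" \
  --output_dir="./model_out" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="archer style" \
  --class_prompt="illustration style" \
  --num_class_images=1000 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=4000 \
  --train_text_encoder \
  --gradient_checkpointing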

FAQ:

How many Images do you use for your Dataset?

How long does Dreambooth Training take for you with Shivam's repo?

Do your models take you a lot of attempts or is your first dreambooth always successful?

What settings do you use for training steps and repeats?

Which base ckpt did you use for the mo-di model?

How do I run your models?

I've been searching for models to use lately and all I find are missing the .ckpt file. Is there a reason for this or did they just forget?

Do you have a video guide for dreambooth?

How do the artworks in the examples have perfect faces and no extra limbs or heads?

Can I merge an existing model with a model I trained on my face to make myself in that style?

What does your dataset look like?

How much VRAM do you need for Dreambooth?

Is there an easy way to take an existing image and apply this style so that the subjects still resemble the original?

Can we use only images of the same style, or everything that is available for the topic?

Do training images need to be 512 by 512 pixels?

Guide written by Nitrosocke, edited by wavymulder ❤