gmongaras / Diffusion_models_from_scratch

Creating a diffusion model from scratch in PyTorch to learn exactly how they work.
MIT License

Summary

This repo contains DDPM, DDIM, and classifier-free guided diffusion models trained on ImageNet 64x64. More information can be found below.

To go along with this repo, I also wrote an article explaining the algorithms behind it.

Contents

Current Additions

This repo has the following Diffusion features:

Instead of going into each of the parts here, you can read an article I wrote which explains each part in detail.

Environment Setup

First, clone this repo using the following commands on the command line:

git clone https://github.com/gmongaras/Diffusion_models_from_scratch.git
cd Diffusion_models_from_scratch/

(Optional) If you don't want to change your environment, you can first create a virtual environment:

python -m venv MyEnv/

Activate the virtual environment: https://docs.python.org/3/library/venv.html#how-venvs-work

Windows: MyEnv\Scripts\activate.bat

Linux: source MyEnv/bin/activate

Before running any scripts, make sure to download the correct packages and package versions. You can do so by running the following commands to upgrade pip and install the necessary package versions:

pip install pip -U
pip install -U -r requirements.txt

Note: PyTorch should be installed with CUDA enabled if you plan to train, and CUDA is recommended (though not required) for generating images. The CUDA build installed by default may differ from the one your system needs. The available CUDA versions and the commands to install them can be found here:

https://pytorch.org/get-started/locally/
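After installing, a quick way to confirm that PyTorch can see a GPU is a short check like the following (just a sanity check, not part of the repo):

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against (None for CPU-only builds)
print(torch.cuda.is_available())  # True if a usable GPU is visible to PyTorch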

Now the environment should be set up properly.

Downloading Pre-Trained Models

Pre-Trained Model Notes

I have several pre-trained models available to download, of varying model architecture types. There are 5 model types based on the U-Net block construction.

The above notation comes from the Train A Model section under the blk_types parameter.

Each model was trained with the following parameters unless otherwise specified:

  1. Image Resolution: 64x64
  2. Initial embedding channel: 128
  3. Channel multiplier: 1
  4. Number of U-net blocks: 3
  5. Timesteps: 1000
  6. VLB weighting Lambda: 0.001
  7. Beta Scheduler: Cosine (see the sketch after this list)
  8. Batch Size: 128 per GPU (across 8 GPUs, for an effective batch size of 1024)
  9. Gradient Accumulation Steps: 1
  10. Number of steps: 600,000 (Note: this is not epochs; a step is a single gradient update to the model)
  11. Learning Rate: 3*10^-4 = 0.0003
  12. Time embedding dimension size: 512
  13. Class embedding dimension size: 512
  14. Probability of null class for classifier-free guidance: 0.2
  15. Attention resolution: 16
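For reference, the cosine schedule is the one proposed in Improved DDPM (reference 3 below). The snippet here is a minimal sketch of that formula for illustration, not a copy of the repo's Variance_Scheduler.py:

import math
import torch

def cosine_beta_schedule(T: int = 1000, s: float = 0.008) -> torch.Tensor:
    # alpha_bar(t) = cos^2(((t/T + s) / (1 + s)) * pi/2), from Nichol & Dhariwal.
    t = torch.linspace(0, T, T + 1)
    alpha_bar = torch.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    # beta_t = 1 - alpha_bar(t) / alpha_bar(t - 1), clipped for numerical stability near t = T.
    betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])
    return torch.clamp(betas, 0, 0.999)

betas = cosine_beta_schedule()  # 1000 betas, one per training timestep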

Below are some training notes:

Picking a Model

To pick a model, I suggest looking at the results. The lower the FID score, the better the outputs of the model are. The best models according to the results are:

Downloading A Model

Once you have picked a model, you can download it at the following link:

Google Drive link

For training from a checkpoint you need to download three files for a model: the model weights file (model_*.pkl), the optimizer state file (optim_*.pkl), and the model parameter file (model_params_*.json).

For inference/generation you only need to download two files for the model: the model weights file (model_*.pkl) and the model parameter file (model_params_*.json).

Put these files in the models/ directory to easily load them in when training/generating.

Downloading Training Data

Imagenet data can be downloaded from the following link: https://image-net.org/download-images.php

To get the data, you must first request access and be accepted to download the ImageNet data. I trained my models on ImageNet 64x64.


Once downloaded, put both Imagenet64_train_part1.zip and Imagenet64_train_part2.zip in the data/ directory.

Once the zip files are in the correct directory, run the following script to load the data into the necessary format:

python data/loadImagenet64.py

If you wish to load the data into memory before training, run the script below. Otherwise, the data will be extracted from disk as needed.

python data/make_massive_tensor.py

When all data is downloaded, the directory should look like the tree shown in the Directory Structure section below.

Directory Structure

If you download both pretrained models and the training data, your directory should look like the following tree.

.
├── data
│   ├── Imagenet64
|   |   ├── 0.pkl
|   |   ├── ...
|   |   ├── metadata.pkl
│   ├── Imagenet64_train_part1.zip
│   ├── Imagenet64_train_part2.zip
│   ├── README.md
│   ├── archive.zip
│   ├── loadImagenet64.py
│   ├── make_massive_tensor.py
├── eval
|   ├── __init__.py
|   ├── compute_FID.py
|   ├── compute_imagenet_stats.py
|   ├── compute_model_stats.py
|   ├── compute_model_stats_multiple.py
├── models
|   ├── README.md
|   ├── [model_param_name].json
|   ├── [model_name].pkl
├── src
|   ├── blocks
|   |   ├── BigGAN_Res.py
|   |   ├── BigGAN_ResDown.py
|   |   ├── BigGAN_ResUp.py
|   |   ├── ConditionalBatchNorm2D.py
|   |   ├── Efficient_Channel_Attention.py
|   |   ├── Multihead_Attn.py
|   |   ├── Non_local.py
|   |   ├── Non_local_MH.py
|   |   ├── PositionalEncoding.py
|   |   ├── Spatial_Channel_Attention.py
|   |   ├── __init__.py
|   |   ├── clsAttn.py
|   |   ├── convNext.py
|   |   ├── resBlock.py
|   |   ├── wideResNet.py
|   ├── helpers
|   |   ├── PixelCNN_PP_helper_functions.py
|   |   ├── PixelCNN_PP_loss.py
|   |   ├── image_rescale.py
|   |   ├── multi_gpu_helpers.py
|   ├── models
|   |   ├── PixelCNN.py
|   |   ├── PixelCNN_PP.py
|   |   ├── U_Net.py
|   |   ├── Variance_Scheduler.py
|   |   ├── diff_model.py
|   ├── CustomDataset.py
|   ├── __init__.py
|   ├── infer.py
|   ├── model_trainer.py
|   ├── train.py
├── tests
|   ├── BigGAN_Res_test.py
|   ├── U_Net_test.py
|   ├── __init__.py
|   ├── diff_model_noise_test.py
├── .gitattributes
├── .gitignore
├── README.md

Train A Model

Before training a model, make sure you have set up the environment and downloaded the data.

After the above is complete, you can run the training script as follows from the root directory of this repo:

torchrun --nproc_per_node=[num_gpus] src/train.py --[params]

For example:

torchrun --nproc_per_node=8 src/train.py --blk_types res,res,clsAtn,chnAtn --batchSize 32

The above example runs the code with the following parameters:

torchrun --nproc_per_node=1 src/train.py --loadModel True --loadDir models/models_res --loadFile model_479e_600000s.pkl --optimFile optim_479e_600000s.pkl --loadDefFile model_params_479e_600000s.json --gradAccSteps 2

The above example resumes training from a pre-trained model checkpoint:

The parameters of the script are as follows:

Data Parameters

Model Parameters

Training Parameters

Saving Parameters

Model loading Parameters

Data loading parameters

Generate Images With Pretrained Models

Before generating images, make sure you have set up the environment and downloaded the pre-trained models.

After the above is done, you can run the script as follows from the root directory of this repo:

python -m src.infer --loadDir [Directory location of models] --loadFile [Filename of the .pkl model file] --loadDefFile [Filename of the .json model parameter file] --[other params]

For example, if I downloaded the model_358e_450000s file for the models_res_res_atn model and I want to use my CPU with a step size of 20, I would use the following on the command line:

python -m src.infer --loadDir models/models_res_res_atn --loadFile model_358e_450000s.pkl --loadDefFile model_params_358e_450000s.json --device cpu --step_size 20

The parameters of the inference script are as follows:

Required:

Generation parameters

Output parameters

Note: The class values and labels are zero-indexed and can be found in this document.

Calculating FID for a pretrained model

Once you have trained your models, you can evaluate them using the scripts described in this section.

Note: All scripts for this section are located in the eval/ directory.

Calculating FID requires three steps:

1: Compute statistics for the ImageNet Data

For this step, run compute_imagenet_stats.py to compute the FID statistics for the ImageNet dataset.

python -m eval.compute_imagenet_stats

This script has the following parameters:

2: Compute statistics for pretrained models

This step has two alternatives. If you wish to generate FID statistics for a single pre-trained model, use the compute_model_stats.py script like so:

python -m eval.compute_model_stats

This script has the following parameters (which can be accessed by editing the file):

If you want to generate FID statistics for multiple models and have access to multiple GPUs, you can parallelize the process. The compute_model_stats_multiple.py script allows for this parallelization and can be run with the following command:

python -m eval.compute_model_stats_multiple

Note: The number of items in each of the lists should be at most equal to the number of GPUs you wish to use.

This script has the following parameters which can be changed inside the script file:

Note: Compared to the first step, this step is much more computationally heavy as it requires generating images. Since this is a diffusion model, it has the downside of requiring T (1000) denoising steps before a single image is produced.
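To put that in perspective, here is a rough back-of-the-envelope count of U-Net forward passes, assuming the defaults mentioned in this README (estimates only, nothing measured):

# Rough cost estimate for FID sample generation; numbers come from this README.
images = 10_000         # generated images used for the FID scores below
T = 1000                # denoising steps per image at a step size of 1
step_size = 10          # a step size of 10 reduces this to 100 steps per image

full_cost = images * T                      # 10,000,000 forward passes
reduced_cost = images * (T // step_size)    # 1,000,000 forward passes
print(full_cost, reduced_cost)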

3: Compute the FID between ImageNet and the model(s)

Once you have generated both the model and ImageNet statistics, you can compute the FID scores using the compute_FID.py script as follows:

python -m eval.compute_FID

This script has the following parameters:

Once the script is run, the FID will be printed to the screen.
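For context, FID compares the mean and covariance of Inception features extracted from real and generated images. Below is a minimal sketch of the distance itself, assuming you already have the two sets of statistics; this is an illustration, not the repo's compute_FID.py:

import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    # FID = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2))
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop small imaginary parts caused by numerical error
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2 * np.trace(covmean))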

Note: I have computed the FID for all the pretrained models. The results can be found in the same Google Drive folder as the pre-trained models (see Downloading Pre-Trained Models), in the file saved_stats.7z. You can use 7-zip to open this file.

My Results

As stated in Downloading Pre-Trained Models, there are 5 different models I tried out:

Although I trained with classifier-free guidance, I calculated FID scores without guidance, as adding guidance would have required testing too many parameter combinations. Additionally, I only collected 10,000 generated images to calculate my FID scores, as that already took long enough to generate.
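For context, applying classifier-free guidance at sampling time means mixing the conditional and unconditional (null-class) noise predictions (reference 5 below). Here is a minimal sketch of that mixing step, where model, cls, and null_cls are hypothetical placeholders rather than the repo's actual API:

import torch

def guided_eps(model, x_t, t, cls, null_cls, w: float = 4.0):
    # Classifier-free guidance: eps = (1 + w) * eps_cond - w * eps_uncond.
    # w = 0 falls back to the plain conditional prediction; the samples shown later use w = 4.
    eps_cond = model(x_t, t, cls)         # noise prediction conditioned on the class
    eps_uncond = model(x_t, t, null_cls)  # noise prediction with the null class
    return (1 + w) * eps_cond - w * eps_uncond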

By the way, long FID generation times are one of the problems with diffusion models: generation takes forever, and unlike GANs, you are not generating images during training, so you can't continuously collect FID scores as the model learns.

Although I keep the classifier guidance value constant, I wanted to test variations between DDIM and DDPM, so I took a look at the step size and the DDIM scale. Note that a DDIM scale of 1 means DDPM, and a scale of 0 means DDIM. A step size of 1 means use all 1000 steps to generate images and a step size of 10 means use 100 steps to generate images:

Let's check out the FIDs for each of these models:

It's a little hard to look at in this form. Let's look at a reduced graph with the minimum FID for each model type and u-net construction.

I calculate the FID score every 50,000 steps. I am only showing the minimum FID score over all 600,000 steps to reduce clutter.

Clearly, the models with two residual blocks performed the best. As for the attention addition, it doesn’t look like it made much of a difference as it was about the same as the model without attention.

Also, using DDIM (0 scale) with a step size of 10 outperformed all other DDPM/DDIM methods of generation. I find this fact interesting since the model was explicitly trained for DDPM (1 scale) generation on 1000 steps, but performs better with DDIM on 100 steps.
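Concretely, the step size simply subsamples the 1000 training timesteps, and the DDIM scale plays the role of eta in the DDIM paper, interpolating the per-step noise between deterministic DDIM (0) and DDPM (1). A rough sketch of that idea, with alpha_bar standing in for the cumulative alpha products (this is an illustration, not the repo's exact sampler code):

import torch

T = 1000
step_size = 10
timesteps = list(range(T - 1, -1, -step_size))  # step size 10 -> 100 denoising steps

def ddim_sigma(alpha_bar, t, t_prev, ddim_scale):
    # From the DDIM paper: scale (eta) = 0 gives deterministic DDIM, 1 recovers DDPM-style noise.
    sigma = torch.sqrt((1 - alpha_bar[t_prev]) / (1 - alpha_bar[t])) \
            * torch.sqrt(1 - alpha_bar[t] / alpha_bar[t_prev])
    return ddim_scale * sigma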

Let's see some sample images using a DDIM scale of 0, a classifier-free guidance scale of 4, and classes sampled randomly from the list of classes:

Overall, the results look pretty good, though if I trained it for longer and tried to find better hyperparameters, the results could be better!

References

  1. Diffusion Models Beat GANs on Image Synthesis (with classifier guidance): https://arxiv.org/abs/2105.05233

  2. Denoising Diffusion Probabilistic Models (DDPMs): https://arxiv.org/abs/2006.11239

  3. Improved DDPMs (Improved Denoising Diffusion Probabilistic Models): https://arxiv.org/abs/2102.09672

  4. Denoising Diffusion Implicit Models (DDIM): https://arxiv.org/abs/2010.02502

  5. Classifier-Free Guidance: https://arxiv.org/abs/2207.12598

  6. U-net (Convolutional Networks for Biomedical Image Segmentation): https://arxiv.org/abs/1505.04597

  7. ConvNext (A ConvNet for the 2020s): https://arxiv.org/abs/2201.03545

  8. Attention block (Attention Is All You Need): https://arxiv.org/abs/1706.03762

  9. Attention/ViT block (An Image is Worth 16x16 Words): https://arxiv.org/abs/2010.11929

  10. Channel Attention block (ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks): https://arxiv.org/abs/1910.03151

Thanks to the following article for helping me set up multi-GPU training for this project! https://theaisummer.com/distributed-training-pytorch/

Thanks to Huggingface for the Residual Blocks! https://huggingface.co/blog/annotated-diffusion#resnet-block