This repo is composed of DDPM, DDIM, and Classifier-Free guided models trained on ImageNet 64x64. More information can be found below.
To go along with this repo, I also wrote an article explaining the algorithms behind it.
This repo has the following Diffusion features:
Instead of going into each of the parts here, you can read an article I wrote which explains each part in detail.
First, clone this repo from the command line:
git clone https://github.com/gmongaras/Diffusion_models_from_scratch.git
cd Diffusion_models_from_scratch/
(Optional) If you don't want to change your environment, you can first create a virtual environment:
pip install virtualenv
python -m venv MyEnv/
Activate the virtual environment: https://docs.python.org/3/library/venv.html#how-venvs-work
Windows: MyEnv\Scripts\activate.bat
Linux: source MyEnv/bin/activate
Before running any scripts, make sure to download the correct packages and package versions. You can do so by running the following commands to upgrade pip and install the necessary package versions:
pip install pip -U
pip install -U -r requirements.txt
Note: PyTorch should be installed with CUDA enabled if training, and should probably have CUDA enabled if generating images, though it is not required. The CUDA version installed by default may be different from the one you need. The available CUDA versions and how to install them can be found below:
https://pytorch.org/get-started/locally/
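To confirm which device PyTorch will actually use, a quick check like the one below can help. This is a minimal sketch, not part of the repo: the helper name describe_torch_device is hypothetical, and it only relies on the standard torch.cuda.is_available() and torch.cuda.get_device_name() calls.

```python
def describe_torch_device():
    """Report whether PyTorch is installed and whether it can see a CUDA GPU."""
    try:
        import torch  # imported lazily so the check also works before installation
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Device 0 is the first visible GPU
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "CUDA not available; PyTorch will run on the CPU"


if __name__ == "__main__":
    print(describe_torch_device())
```

If the message reports the CPU on a machine with a GPU, the installed PyTorch build probably does not match the local CUDA version.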
Now the environment should be set up properly.
I have several pre-trained models available to download, with varying model architecture types. There are 5 model types based on the U-Net block construction.
The above notation comes from the Train A Model section under the blk_types parameter.
Each model was trained with the following parameters unless otherwise specified:
Below are some training notes:
To pick a model, I suggest looking at the results. The lower the FID score, the better the outputs of the model are. The best models according to the results are:
res-res-atn:
model_358e_450000s.pkl
model_params_358e_450000s.json
optim_358e_450000s.pkl
res-res:
model_438e_550000s.pkl
model_params_438e_550000s.json
optim_438e_550000s.pkl
Once the model has been picked, you can download a model at the following link:
For training from a checkpoint you need to download three files for a model:
For inference/generation you only need to download two files for the model:
Put these files in the models/ directory to easily load them in when training/generating.
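The checkpoint files listed above share a naming pattern (model_358e_450000s.pkl, model_params_358e_450000s.json, optim_358e_450000s.pkl). The sketch below builds those names from the epoch and step counts and reports which ones are missing from a directory. The helper names are hypothetical and the pattern is inferred from the filenames shown here, so treat it as an assumption rather than the repo's own loading logic.

```python
from pathlib import Path


def checkpoint_files(epochs, steps):
    """Build the three checkpoint filenames for a given epoch/step count."""
    tag = f"{epochs}e_{steps}s"
    return {
        "model": f"model_{tag}.pkl",
        "params": f"model_params_{tag}.json",
        "optim": f"optim_{tag}.pkl",
    }


def missing_checkpoint_files(directory, epochs, steps, for_training=True):
    """List checkpoint files that are not present in `directory`.

    Training from a checkpoint needs all three files; inference only needs
    the model weights and the parameter file, so the optimizer is skipped.
    """
    names = checkpoint_files(epochs, steps)
    if not for_training:
        names.pop("optim")
    return [name for name in names.values()
            if not (Path(directory) / name).exists()]


if __name__ == "__main__":
    print(missing_checkpoint_files("models", 358, 450000, for_training=False))
```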
Imagenet data can be downloaded from the following link: https://image-net.org/download-images.php
To get the data, you must first request access and be accepted to download the ImageNet data. I trained my models on ImageNet 64x64.
Once downloaded, you should put both Imagenet64_train_part1.zip and Imagenet64_train_part2.zip in the data/ directory.
Once the zip files are in the correct directory, run the following script to load the data into the necessary format:
python data/loadImagenet64.py
If you wish to load the data into memory before training, run the script below. Otherwise, the data will be extracted from disk as needed.
python data/make_massive_tensor.py
The directory should look as follows when all data is downloaded: Directory Structure
If you download both pretrained models and the training data, your directory should look like the following tree.
.
├── data
│ ├── Imagenet64
| | ├── 0.pkl
| | ├── ...
| | ├── metadata.pkl
│ ├── Imagenet64_train_part1.zip
│ ├── Imagenet64_train_part2.zip
│ ├── README.md
│ ├── archive.zip
│ ├── loadImagenet64.py
│ ├── make_massive_tensor.py
├── eval
| ├── __init__.py
| ├── compute_FID.py
| ├── compute_imagenet_stats.py
| ├── compute_model_stats.py
| ├── compute_model_stats_multiple.py
├── models
| ├── README.md
| ├── [model_param_name].json
| ├── [model_name].pkl
├── src
| ├── blocks
| | ├── BigGAN_Res.py
| | ├── BigGAN_ResDown.py
| | ├── BigGAN_ResUp.py
| | ├── ConditionalBatchNorm2D.py
| | ├── Efficient_Channel_Attention.py
| | ├── Multihead_Attn.py
| | ├── Non_local.py
| | ├── Non_local_MH.py
| | ├── PositionalEncoding.py
| | ├── Spatial_Channel_Attention.py
| | ├── __init__.py
| | ├── clsAttn.py
| | ├── convNext.py
| | ├── resBlock.py
| | ├── wideResNet.py
| ├── helpers
| | ├── PixelCNN_PP_helper_functions.py
| | ├── PixelCNN_PP_loss.py
| | ├── image_rescale.py
| | ├── multi_gpu_helpers.py
| ├── models
| | ├── PixelCNN.py
| | ├── PixelCNN_PP.py
| | ├── U_Net.py
| | ├── Variance_Scheduler.py
| | ├── diff_model.py
| ├── CustomDataset.py
| ├── __init__.py
| ├── infer.py
| ├── model_trainer.py
| ├── train.py
├── tests
| ├── BigGAN_Res_test.py
| ├── U_Net_test.py
| ├── __init__.py
| ├── diff_model_noise_test.py
├── .gitattributes
├── .gitignore
├── README.md
Before training a model, make sure you have set up the environment and downloaded the data.
After the above is complete, you can run the training script as follows from the root directory of this repo:
torchrun --nproc_per_node=[num_gpus] src/train.py --[params]
For example:
torchrun --nproc_per_node=8 src/train.py --blk_types res,res,clsAtn,chnAtn --batchSize 32
The above example runs the code with the following parameters:
torchrun --nproc_per_node=1 src/train.py --loadModel True --loadDir models/models_res --loadFile model_479e_600000s.pkl --optimFile optim_479e_600000s.pkl --loadDefFile model_params_479e_600000s.json --gradAccSteps 2
The above example loads a pre-trained model from a checkpoint using the following files:
model_479e_600000s.pkl
optim_479e_600000s.pkl
model_params_479e_600000s.json
The parameters of the script are as follows:
Data Parameters
Model Parameters
res, conv, clsAtn, atn, and/or chnAtn.
res,res,conv,clsAtn,chnAtn
linear or cosine
Training Parameters
Saving Parameters
Model loading Parameters
Data loading parameters
Before generating images, make sure you have set up the environment and downloaded pre-trained models.
After the above is done, you can run the script as follows from the root directory of this repo:
python -m src.infer --loadDir [Directory location of models] --loadFile [Filename of the .pkl model file] --loadDefFile [Filename of the .json model parameter file] --[other params]
For example, if I downloaded the model_358e_450000s file for the models_res_res_atn model and I want to use my CPU with a step size of 20, I would use the following on the command line:
python -m src.infer --loadDir models/models_res_res_atn --loadFile model_358e_450000s.pkl --loadDefFile model_params_358e_450000s.json --device cpu --step_size 20
The parameters of the inference scripts are as follows:
Required:
Generation parameters
Output parameters
Note: The class values and labels are zero-indexed and can be found in this document.
Once you have trained your models, you can evaluate them here using these scripts.
Note: All scripts for this section are located in the eval/ directory.
Calculating FID requires three steps:
1: Compute statistics for the ImageNet Data
For this step, run the compute_imagenet_stats.py script to compute the statistics needed for FID on the ImageNet dataset.
python -m eval.compute_imagenet_stats
This script has the following parameters:
2: Compute statistics for pretrained models
This step has two alternatives. If you wish to generate FID statistics for a single pre-trained model, use the compute_model_stats.py script like so:
python -m eval.compute_model_stats
This script has the following parameters (which can be changed by editing the file):
step_size: If the model requires 1000 steps to generate a single image, but has a step size of 4, then it will take 1000/4 = 250 steps to generate one image. Note that a higher step size means faster generation, but also lower quality images.
If you want to generate FID statistics for multiple models and have access to multiple GPUs, you can parallelize the process. The compute_model_stats_multiple.py script allows for this parallelization and can be run with the following command:
python -m eval.compute_model_stats_multiple
Note: The number of items in each of the lists should be at most equal to the number of GPUs you wish to use.
This script has the following parameters which can be changed inside the script file:
step_size: If the model requires 1000 steps to generate a single image, but has a step size of 4, then it will take 1000/4 = 250 steps to generate one image. Note that a higher step size means faster generation, but also lower quality images.
Note: Compared to the first step, this step is much more computationally heavy as it requires the generation of images. Since it's a diffusion model, it has the downside of having to generate T (1000) intermediate images before a single final image is produced.
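As a rough illustration of the step_size arithmetic described above, the following Python sketch lists the timesteps a sampler would visit. The function name sampling_timesteps and the evenly strided subsampling are assumptions for illustration; the repo's own sampler may select timesteps differently.

```python
def sampling_timesteps(total_steps=1000, step_size=4):
    """Return the timesteps visited during generation, noisiest first.

    Assumes the sampler simply skips `step_size` timesteps at a time.
    """
    return list(range(total_steps, 0, -step_size))


if __name__ == "__main__":
    for size in (1, 4, 10):
        steps = sampling_timesteps(1000, size)
        print(f"step_size={size}: {len(steps)} denoising steps")
```

With 1000 total steps, a step size of 4 gives 250 denoising steps and a step size of 10 gives 100.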
3: Compute the FID between ImageNet and the model(s)
Once you have generated both the model and ImageNet statistics, you can compute the FID scores using the compute_FID.py script as follows:
python -m eval.compute_FID
This script has the following parameters:
Once the script is run, the FID will be printed to the screen.
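For reference, FID is the Fréchet distance between two Gaussians fitted to feature statistics: the squared distance between the means plus Tr(S1 + S2 - 2(S1 S2)^(1/2)) for the covariances S1 and S2. The sketch below computes that quantity from precomputed means and covariances using only NumPy. It is an illustration of the standard formula, not the code in compute_FID.py, and the function name frechet_distance is hypothetical.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    # Squared Euclidean distance between the two means
    mean_term = np.sum((mu1 - mu2) ** 2)

    # Tr((sigma1 sigma2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2; clip small negative values caused by
    # floating-point error before taking the root.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    sqrt_trace = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))

    return float(mean_term + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * sqrt_trace)


if __name__ == "__main__":
    identity = np.eye(2)
    print(frechet_distance([0, 0], identity, [3, 4], identity))
```

Identical statistics give a distance of 0, and the value grows as the generated-image statistics drift away from the ImageNet statistics.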
Note: I have computed the FID for all the pretrained models. The results can be found in the same Google Drive folder described in Downloading Pre-Trained Models, in the file saved_stats.7z. You can use 7-zip to open this file.
As stated in Downloading Pre-Trained Models, there are 5 different models I tried out:
Although I trained with classifier-free guidance, I calculated FID scores without guidance as adding guidance requires me to test too many parameters. Additionally, I only collected 10,000 generated images to calculate my FID scores as that already took long enough to generate.
By the way, long FID generation times are one of the problems with diffusion: generation takes a long time and, unlike with GANs, you are not generating images during training. So, you can’t continuously collect FID scores as the model is learning.
Although I keep the classifier guidance value constant, I wanted to test variations between DDIM and DDPM, so I took a look at the step size and the DDIM scale. Note that a DDIM scale of 1 means DDPM, and a scale of 0 means DDIM. A step size of 1 means use all 1000 steps to generate images and a step size of 10 means use 100 steps to generate images:
Let's check out the FIDs for each of these models:
It's a little hard to look at in this form. Let's look at a reduced graph with the minimum FID for each model type and u-net construction.
I calculate the FID score every 50,000 steps. I am only showing the minimum FID score over all 600,000 steps to reduce clutter.
Clearly, the models with two residual blocks performed the best. As for the attention addition, it doesn’t look like it made much of a difference as it was about the same as the model without attention.
Also, using DDIM (0 scale) with a step size of 10 outperformed all other DDPM/DDIM methods of generation. I find this fact interesting since the model was explicitly trained for DDPM (1 scale) generation on 1000 steps, but performs better with DDIM on 100 steps.
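To make the DDIM scale concrete: in the DDIM paper, the scale (called eta there) multiplies the standard deviation of the noise added at each reverse step, so a scale of 0 gives deterministic DDIM sampling and a scale of 1 recovers the DDPM posterior variance. The sketch below follows that formula. The function and argument names are hypothetical, and whether this repo implements the scale in exactly this way is an assumption.

```python
import math


def ddim_sigma(alpha_bar_t, alpha_bar_prev, ddim_scale):
    """Std of the noise added when stepping from timestep t to the previous one.

    alpha_bar_t and alpha_bar_prev are the cumulative products of (1 - beta)
    at the current and previous timesteps; ddim_scale is eta in the DDIM paper.
    """
    return (ddim_scale
            * math.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t))
            * math.sqrt(1.0 - alpha_bar_t / alpha_bar_prev))


if __name__ == "__main__":
    # Example cumulative alpha values chosen only for illustration
    print(ddim_sigma(0.5, 0.6, ddim_scale=0.0))  # no noise: deterministic DDIM
    print(ddim_sigma(0.5, 0.6, ddim_scale=1.0))  # DDPM-style noise level
```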
Let's see some sample images using a DDIM scale of 0, classifier-free guidance scale of 4 and classes sampled randomly from the list of classes:
Overall, the results look pretty good, though if I trained it for longer and tried to find better hyperparameters, the results could be better!
Diffusion Models Beat GANs on Image Synthesis (with classifier guidance): https://arxiv.org/abs/2105.05233
Denoising Diffusion Probabilistic Models (DDPMs): https://arxiv.org/abs/2006.11239
Improved DDPMs (Improved Denoising Diffusion Probabilistic Models): https://arxiv.org/abs/2102.09672
Denoising Diffusion Implicit Models (DDIM): https://arxiv.org/abs/2010.02502
Classifier-Free Guidance: https://arxiv.org/abs/2207.12598
U-net (Convolutional Networks for Biomedical Image Segmentation): https://arxiv.org/abs/1505.04597
ConvNext (A ConvNet for the 2020s): https://arxiv.org/abs/2201.03545
Attention block (Attention Is All You Need): https://arxiv.org/abs/1706.03762
Attention/ViT block (An Image is Worth 16x16 Words): https://arxiv.org/abs/2010.11929
Channel Attention block (ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks): https://arxiv.org/abs/1910.03151
Thanks to the following link for helping me multi-gpu the project! https://theaisummer.com/distributed-training-pytorch/
Thanks to Huggingface for the Residual Blocks! https://huggingface.co/blog/annotated-diffusion#resnet-block