The codebase for the evaluation of deep generative models, as presented in "Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models", accepted to NeurIPS 2023.
We studied 41 generative models across a diverse range of image datasets and found:
Here we provide code to compute the following 15 generative evaluation metrics using 8 different encoder networks:
Metrics:
Encoders:
Our multifaceted investigation of generative evaluation shows that diffusion models are unfairly punished by the Inception network: they synthesize more realistic images as judged by humans and their diversity more closely resembles the training data, yet are consistently ranked worse than GANs on metrics computed in Inception-V3 representation space.
First clone this repository, then navigate to its directory and run pip install to install all required packages.
git clone git@github.com:layer6ai-labs/dgm-eval
cd dgm-eval
pip install -e .
We recommend you do this in a conda environment:
conda create --name dgm-eval pip python==3.10
conda activate dgm-eval
git clone git@github.com:layer6ai-labs/dgm-eval
cd dgm-eval
pip install -e .
Computing metrics only requires the paths to either locally hosted image datasets or torchvision.datasets. Encoders are automatically downloaded. For example, the following will compute the Fréchet distance (fd), kernel distance (kd), precision/recall/density/coverage (prdc), and the CT score (ct) using DINOv2 (default) as the encoder.
python -m dgm_eval path/to/training_dataset path/to/generated_dataset \
--test_path path/to/test_dataset \
--model dinov2 \
--metrics fd kd prdc ct
See scripts/run_experiments.sh or run python -m dgm_eval -h for further details on command-line parameters. As we suggest in the paper, metrics should be reported using the default model size (DINOv2-ViT-L/14) for final leaderboard values, but tracking progress during training is a factor of 4 more efficient with DINOv2-ViT-B/14. To use this architecture instead, simply add --arch vitb14 as a command-line parameter.
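For intuition about what the fd metric measures, here is a minimal NumPy sketch of the Fréchet distance between two Gaussians fit to feature arrays. It is an illustration under stated assumptions, not the implementation used in dgm_eval: the function name `frechet_distance` is hypothetical, and the inputs are assumed to be already-extracted encoder features of shape (n_samples, dim).

```python
import numpy as np


def frechet_distance(real_feats, gen_feats):
    """Frechet distance between Gaussians fit to two (n_samples, dim) arrays.

    Hypothetical helper for illustration; not part of the dgm_eval API.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    # Tr(sqrt(cov_r @ cov_g)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative when both
    # covariances are positive semi-definite. Clip tiny negative round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(500, 4))          # stand-in for encoder features
    print(frechet_distance(feats, feats + 1.0))  # mean shift only
```

When the two feature sets differ only by a constant shift, the covariance terms cancel and the distance reduces to the squared norm of the mean difference, which is a convenient sanity check.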
Local datasets should either be unconditional:
local/path/
000000.png
000001.png
...
or conditional:
local/path/
0/
000000.png
000001.png
...
1/
000000.png
000001.png
...
...
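The two layouts above can be checked before launching a long run. The following is a small sketch, assuming a hypothetical helper `describe_layout` and an assumed set of image extensions (neither is taken from dgm_eval), that reports which layout a directory follows and rejects directories containing non-image files.

```python
from pathlib import Path

# Assumed set of extensions for illustration; dgm_eval may accept others.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}


def describe_layout(root):
    """Return 'unconditional' or 'conditional'; raise ValueError otherwise.

    Hypothetical helper for illustration; not part of the dgm_eval API.
    """
    root = Path(root)
    entries = sorted(root.iterdir())
    files = [p for p in entries if p.is_file()]
    dirs = [p for p in entries if p.is_dir()]

    if not entries:
        raise ValueError(f"{root} is empty")
    if files and dirs:
        raise ValueError(f"{root} mixes image files and class folders")

    # Unconditional: images sit directly in root.
    # Conditional: images sit one level down, inside one folder per class.
    to_check = files if files else [p for d in dirs for p in sorted(d.iterdir())]
    bad = [p for p in to_check
           if not p.is_file() or p.suffix.lower() not in IMAGE_EXTS]
    if bad:
        raise ValueError(f"non-image entries found, e.g. {bad[0]}")

    return "unconditional" if files else "conditional"
```

For example, `describe_layout("local/path")` would return `"conditional"` for a tree with `0/` and `1/` subfolders that hold only images.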
The directory should only include image files. To download and use a dataset from torchvision.datasets, just specify the dataset and train/test string:
python -m dgm_eval CIFAR10:train CIFAR10:test
A full example is as follows:
python -m dgm_eval CIFAR10:train CIFAR10:test \
--model dinov2 \
--metrics fd kd prdc \
--device cuda \
--batch_size 256 \
--nsample 512
>>> ....
>>> Num real: 512 Num fake: 512
>>> fd: 862.53745
>>> kd_value: 0.01095
>>> kd_variance: 0.00000
>>> precision: 0.90430
>>> recall: 0.91797
>>> density: 0.97969
>>> coverage: 0.94141
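To make the precision, recall, density, and coverage values above easier to interpret, here is a compact NumPy sketch of the standard k-nearest-neighbour definitions of these four quantities. It is an illustration only: the function names and the default `k=5` are assumptions, not taken from dgm_eval, and the brute-force distance computation is only practical for small sample counts.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def knn_radii(x, k):
    """Distance from each row of x to its k-th nearest other row."""
    d = pairwise_dist(x, x)
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest other point.
    return np.sort(d, axis=1)[:, k]


def prdc(real, fake, k=5):
    """k-NN precision/recall/density/coverage (hypothetical helper)."""
    real_r = knn_radii(real, k)
    fake_r = knn_radii(fake, k)
    d = pairwise_dist(real, fake)            # shape (n_real, n_fake)

    in_real_ball = d < real_r[:, None]       # fake j inside ball of real i
    in_fake_ball = d < fake_r[None, :]       # real i inside ball of fake j

    return {
        # fraction of fake samples covered by at least one real ball
        "precision": float(in_real_ball.any(axis=0).mean()),
        # fraction of real samples covered by at least one fake ball
        "recall": float(in_fake_ball.any(axis=1).mean()),
        # average number of real balls containing each fake sample, over k
        "density": float(in_real_ball.sum(axis=0).mean() / k),
        # fraction of real samples whose nearest fake lies inside their ball
        "coverage": float((d.min(axis=1) < real_r).mean()),
    }
```

Precision and density describe fidelity (do generated samples fall near real ones?), while recall and coverage describe diversity (are the real samples represented among the generated ones?).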
All generated data shown in this work can be accessed at the following link:
drive.google.com/drive/folders/1X0MFaUta90d3zF9xG4KchjR-8SE0cT_7?usp=sharing
Including: CIFAR10/, imagenet256/, LSUN Bedroom/, FFHQ256/, and toy-datasets/.
Data for human evaluation of image realism can be found at data/human-evaluation-realism/
DINOv2 is the best suited model for generative evaluation. Our extensive quantitative and qualitative assessments showed that it distills a generalized representation space suitable for evaluation of diverse image datasets. Metrics computed in DINOv2 space show much better alignment with human evaluation than those in Inception-V3 space.
We have included leaderboard values on paperswithcode (links), and list all metrics in a table below:
Heatmaps can be visualized for each model on any given image dataset with the following command; examples are shown after it:
python -m dgm_eval CIFAR10:train CIFAR10:test \
--model inception \
--metrics fd \
--device cuda \
--batch_size 256 \
--nsample 50000 \
--heatmaps
Images | Inception | DINOv2 |
---|---|---|
If you use any part of this repository in your research, please cite the associated paper with the following BibTeX entry:
Authors: George Stein, Jesse C. Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Leigh Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L. Caterini, J. Eric T. Taylor, Gabriel Loaiza-Ganem
@inproceedings{stein2023exposing,
title={Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models},
author={Stein, George and Cresswell, Jesse and Hosseinzadeh, Rasa and Sui, Yi and Ross, Brendan and Villecroze, Valentin and Liu, Zhaoyan and Caterini, Anthony L and Taylor, Eric and Loaiza-Ganem, Gabriel},
booktitle={Advances in Neural Information Processing Systems},
volume={36},
year={2023}
}
This data and code are licensed under the MIT License, copyright by Layer 6 AI.