layer6ai-labs / dgm-eval


Evaluation of Deep Generative Models

The codebase for evaluation of deep generative models as presented in "Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models", accepted to NeurIPS 2023.

We studied 41 generative models across a diverse range of image datasets and found:

Here we provide code to compute the following 15 generative evaluation metrics using 8 different encoder networks:

Metrics:

Encoders:

Our multifaceted investigation of generative evaluation shows that diffusion models are unfairly punished by the Inception network: they synthesize more realistic images as judged by humans and their diversity more closely resembles the training data, yet are consistently ranked worse than GANs on metrics computed in Inception-V3 representation space.

Installation & Usage

Installation

First clone this repository, then navigate into the directory and run pip install to install all required packages.

git clone git@github.com:layer6ai-labs/dgm-eval
cd dgm-eval
pip install -e .

We recommend you do this in a conda environment:

conda create --name dgm-eval pip python==3.10
conda activate dgm-eval
git clone git@github.com:layer6ai-labs/dgm-eval
cd dgm-eval
pip install -e .

Usage

Computing metrics only requires the paths to either locally hosted image datasets or torchvision.datasets. Encoders are automatically downloaded. For example, the following will compute the Fréchet distance (fd), kernel distance (kd), precision/recall/density/coverage (prdc), and the CT score (ct) using DINOv2 (default) as the encoder.

python -m dgm_eval path/to/training_dataset path/to/generated_dataset \
                --test_path path/to/test_dataset \
                --model dinov2 \
                --metrics fd kd prdc ct

See scripts/run_experiments.sh or run python -m dgm_eval -h for further details on command-line parameters. As we suggest in the paper, final leaderboard values should be reported using the default model size (DINOv2-ViT-L/14), but tracking progress during training is a factor of 4 more efficient with DINOv2-ViT-B/14. To use this architecture instead, add --arch vitb14 as a command-line parameter.
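For readers unfamiliar with the Fréchet distance (fd) mentioned above, the following is a minimal NumPy sketch of the standard formula: fit a Gaussian to each set of encoder features and compare means and covariances. It is not the implementation used in this repository; the function name `frechet_distance` and the random stand-in features are assumptions made for illustration.

```python
import numpy as np


def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Both inputs have shape (num_samples, feature_dim).
    Hypothetical helper, not part of dgm_eval.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    # Tr(sqrt(cov_r @ cov_g)) equals the sum of the square roots of the
    # eigenvalues of the product. For positive semi-definite inputs these
    # eigenvalues are real and non-negative, up to numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Stand-in features; in practice these would come from an encoder.
rng = np.random.default_rng(0)
real = rng.normal(size=(512, 8))
shifted = real + 2.0  # same covariance, mean moved by 2 in each of 8 dims

fd_same = frechet_distance(real, real)        # close to 0
fd_shifted = frechet_distance(real, shifted)  # close to 8 * 2**2 = 32
print(fd_same, fd_shifted)
```

With identical covariances the trace terms cancel, so only the squared distance between the means remains, which makes the shifted case a convenient sanity check.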

Local datasets should be either unconditional:

local/path/
    000000.png
    000001.png
    ...

or conditional:

local/path/
    0/
        000000.png
        000001.png
        ...
    1/
        000000.png
        000001.png
        ...
    ...     
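As a rough illustration of how the two layouts above could be told apart when reading a local dataset, here is a small sketch using only the standard library. The helper `list_images` and the set of accepted file extensions are assumptions for this example and do not describe how this repository loads data.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed extensions, adjust as needed


def list_images(root):
    """Return (path, label) pairs; label is None for unconditional layouts.

    Hypothetical helper, not part of dgm_eval.
    """
    samples = []
    for entry in sorted(Path(root).iterdir()):
        if entry.is_dir():
            # Conditional layout: the sub-directory name is the class label.
            for img in sorted(entry.iterdir()):
                if img.suffix.lower() in IMAGE_SUFFIXES:
                    samples.append((img, entry.name))
        elif entry.suffix.lower() in IMAGE_SUFFIXES:
            # Unconditional layout: images sit directly under the root.
            samples.append((entry, None))
    return samples


# Demo on a throwaway conditional layout with two class folders.
with tempfile.TemporaryDirectory() as tmp:
    layout = {"0": ["000000.png"], "1": ["000000.png", "000001.png"]}
    for label, names in layout.items():
        (Path(tmp) / label).mkdir()
        for name in names:
            (Path(tmp) / label / name).touch()
    labels = [label for _, label in list_images(tmp)]

print(labels)  # ['0', '1', '1']
```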

The directory should include only image files. To download and use a dataset from torchvision.datasets, specify the dataset name and the split (train or test), separated by a colon:

python -m dgm_eval CIFAR10:train CIFAR10:test
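The dataset:split convention used in the command above can be pictured with a short sketch. The function `parse_dataset_spec` and the dictionary it returns are hypothetical and only illustrate one way to distinguish a torchvision dataset string from a local path; the parser in this repository may behave differently.

```python
from pathlib import Path


def parse_dataset_spec(spec):
    """Interpret 'NAME:split' as a torchvision dataset, anything else as a path.

    Hypothetical helper for illustration, not the parser used by dgm_eval.
    """
    name, sep, split = spec.partition(":")
    if sep and split in {"train", "test"}:
        return {"kind": "torchvision", "name": name, "split": split}
    return {"kind": "local", "path": Path(spec)}


print(parse_dataset_spec("CIFAR10:train"))
print(parse_dataset_spec("path/to/generated_dataset"))
```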

A full example is as follows:

python -m dgm_eval CIFAR10:train CIFAR10:test \
                    --model dinov2 \
                    --metrics fd kd prdc \
                    --device cuda \
                    --batch_size 256 \
                    --nsample 512 

>>> ....
>>> Num real: 512 Num fake: 512
>>> fd: 862.53745
>>> kd_value: 0.01095
>>> kd_variance: 0.00000
>>> precision: 0.90430
>>> recall: 0.91797
>>> density: 0.97969
>>> coverage: 0.94141
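To give a sense of what the precision, recall, density, and coverage values above measure, here is a minimal NumPy sketch of the usual k-nearest-neighbour formulations of these metrics. The function names, the choice of k=5, and the random stand-in features are assumptions for this example; the repository's own implementation and defaults may differ.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.clip(sq, 0.0, None))


def knn_radii(feats, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    d = pairwise_dist(feats, feats)
    # The smallest entry in each row is the point itself (distance ~0),
    # so position k after partitioning is the k-th nearest neighbour.
    return np.partition(d, k, axis=1)[:, k]


def prdc(real, fake, k=5):
    """Sketch of k-NN precision/recall/density/coverage (k=5 is an assumption)."""
    real_radii = knn_radii(real, k)
    fake_radii = knn_radii(fake, k)
    d = pairwise_dist(real, fake)  # shape (num_real, num_fake)

    in_real_ball = d < real_radii[:, None]  # fake j lies inside real i's ball
    in_fake_ball = d < fake_radii[None, :]  # real i lies inside fake j's ball

    return {
        "precision": float(in_real_ball.any(axis=0).mean()),
        "recall": float(in_fake_ball.any(axis=1).mean()),
        "density": float(in_real_ball.sum(axis=0).mean() / k),
        "coverage": float((d.min(axis=1) < real_radii).mean()),
    }


# Stand-in features; in practice these would come from an encoder.
rng = np.random.default_rng(0)
real = rng.normal(size=(64, 4))
same = prdc(real, real.copy())
far = prdc(real, real + 100.0)
print(same)
print(far)
```

Identical sets give precision, recall, and coverage of 1, while a generated set far away from the real data gives 0 for all four metrics, which is a quick way to check that the set-membership logic points in the right direction.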

Data Access

Images

All generated data shown in this work can be accessed at the following link:

drive.google.com/drive/folders/1X0MFaUta90d3zF9xG4KchjR-8SE0cT_7?usp=sharing

Including:

Human Evaluation

Data for human evaluation of image realism can be found at data/human-evaluation-realism/

DINOv2 Leaderboard

DINOv2 is the best suited model for generative evaluation. Our extensive quantitative and qualitative assessments showed that it distills a generalized representation space suitable for evaluation of diverse image datasets. Metrics computed in DINOv2 space show much better alignment with human evaluation than those in Inception-V3 space.

We have included leaderboard values on paperswithcode (links), and list all metrics in a table below:

[Image: leaderboard table listing all metrics]

Visualizing Heatmaps

Heatmaps can be visualized for each model on any given image dataset with the following command; examples follow:

python -m dgm_eval CIFAR10:train CIFAR10:test \
                     --model inception \
                     --metrics fd \
                     --device cuda \
                     --batch_size 256 \
                     --nsample 50000 \
                     --heatmaps

[Figure: example images alongside the corresponding Inception and DINOv2 heatmaps]

Citing

If you use any part of this repository in your research, please cite the associated paper with the following BibTeX entry:

Authors: George Stein, Jesse C. Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Leigh Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L. Caterini, J. Eric T. Taylor, Gabriel Loaiza-Ganem

@inproceedings{stein2023exposing,
  title={Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models},
  author={Stein, George and Cresswell, Jesse and Hosseinzadeh, Rasa and Sui, Yi and Ross, Brendan and Villecroze, Valentin and Liu, Zhaoyan and Caterini, Anthony L and Taylor, Eric and Loaiza-Ganem, Gabriel},
  booktitle={Advances in Neural Information Processing Systems},
  volume={36},
  year={2023}
}

License

The data and code are licensed under the MIT License, copyright Layer 6 AI.