layer6ai-labs / dgm-eval

Codebase for evaluation of deep generative models, as presented in "Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models"
MIT License

Evaluation differences between open_clip and clip #1

Closed iroilomamnou closed 1 year ago

iroilomamnou commented 1 year ago

Thanks for sharing this great repo with the community. I would like to point out one issue I'm encountering with the CLIP model you are using: its output differs from that of the model provided by OpenAI, which impacts the results of the metrics you implemented. I'm getting lower values for some metrics with the OpenAI version.

It would be great if you could double-check.

Thanks in advance.

georgestein commented 1 year ago

Thanks for using the repo!

You are correct that the CLIP model we set as default (specified with the command-line parameter --model clip) uses OpenCLIP ViT-L/14 trained on DataComp-1B, not the OpenAI version, and hence gives different values.

Beyond model type (e.g. CLIP) we also considered additional components: model implementation (OpenCLIP vs. CLIP), model architecture (ViT-B/16, ViT-L/14, ...), and training set (OpenAI's, LAION, DataComp). Each of these components results in a model with different characteristics and a different alignment with human evaluation. Please see Appendix B.4.1 of the paper for an analysis, specifically Figure 9 and the surrounding discussion, which lays out our reasoning for not using the base OpenAI CLIP model (in summary: the OpenAI models perform worse, are not available in architectures as large as those provided by OpenCLIP, and are not "open").

But if you want to specify a different CLIP architecture, or weights pre-trained on a different dataset than the default, please see the usage in clip.py here https://github.com/layer6ai-labs/dgm-eval/blob/master/dgm_eval/models/clip.py#L10-L15; they can be set with the command-line arguments --arch and --pretrained_weights, for example as sketched below.
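For completeness, here is a minimal sketch of loading the two ViT-L/14 variants under discussion directly through the open_clip package (bypassing our CLI). The DataComp-1B pretrained tag below is to the best of my recollection; check open_clip.list_pretrained() for the exact tags available in your installed version, and example.jpg is just a placeholder path:

```python
import torch
from PIL import Image
import open_clip

# Print all (architecture, pretrained-tag) pairs known to the installed open_clip.
print(open_clip.list_pretrained())

# OpenCLIP ViT-L/14 trained on DataComp-1B (the repo's default CLIP encoder);
# the pretrained tag may differ across open_clip versions.
model_dc, _, preprocess_dc = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="datacomp_xl_s13b_b90k"
)

# The original OpenAI ViT-L/14 weights, also loadable through open_clip.
model_oa, _, preprocess_oa = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)

image = Image.open("example.jpg")  # placeholder image path
with torch.no_grad():
    feat_dc = model_dc.encode_image(preprocess_dc(image).unsqueeze(0))
    feat_oa = model_oa.encode_image(preprocess_oa(image).unsqueeze(0))

# The two representations differ, so any metric computed on top of them
# (FD, precision/recall, ...) will differ as well.
print(torch.nn.functional.cosine_similarity(feat_dc, feat_oa))
```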

iroilomamnou commented 1 year ago

Thanks @georgestein for your quick reply. My bad, I did not check your paper in detail. I've tried different architectures and got the same result: the OpenAI versions produced lower Fréchet distances when applied to my medical dataset, which contradicts your findings on natural images. The only explanation I can see for this difference is that OpenAI may have used medical datasets from the internet to train their models, but there is nothing in their paper that supports this claim.

Thanks again for sharing this work.

georgestein commented 1 year ago

The reason I bring up "alignment with human evaluation" as the measure of performance is that a specific value of the Fréchet distance computed with a given encoder is meaningless in isolation; it should only be used to compare/rank sets of data generated by two different generative models (or two different checkpoints, epochs, hyperparameter settings, etc.). The actual FD value you get depends on the encoder, representation size, normalizations, etc. A lower value does not mean a better model unless you are comparing FDs computed with the exact same encoder.

This is why the plots in our paper do not actually display FD values (except for the final DINOv2 table at the end of the appendix, since we found that encoder was best for natural image datasets), but instead show trends, rankings, correlations, etc. when evaluating different generative models. If you use MAE as your encoder the FD might be 0.001, CLIP might give 1, and DINO might give 1000. This does not mean one should use MAE instead of CLIP or DINO; all it means is that MAE has a different representation space, and the smaller number does not mean it is better for determining image quality. As an extreme example, imagine an encoder that outputs a constant value no matter what image you show it: the FD will be 0, but that does not mean the encoder is good. Determining which encoder is best for an application requires a study like ours.
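To make the encoder-dependence concrete, here is a minimal numpy/scipy sketch of the Fréchet distance computed directly from two feature arrays (not our implementation, just the standard formula): rescaling the features rescales the FD, even though the underlying "images" have not changed.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feat1: np.ndarray, feat2: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two (n_samples, dim) feature sets."""
    mu1, mu2 = feat1.mean(axis=0), feat2.mean(axis=0)
    sigma1 = np.cov(feat1, rowvar=False)
    sigma2 = np.cov(feat2, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2.0 * covmean))

rng = np.random.default_rng(0)
feats_real = rng.normal(size=(1000, 64))                         # "real" features from some encoder
feats_gen = feats_real + rng.normal(scale=0.1, size=(1000, 64))  # slightly perturbed "generated" features

print(frechet_distance(feats_real, feats_gen))            # some value
print(frechet_distance(10 * feats_real, 10 * feats_gen))  # ~100x larger for the exact same data

# A constant encoder gives identical means and zero covariances,
# so the FD is exactly 0 even though the encoder is useless.
```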

There is a potential that a different encoder may be ideal for medical datasets. This would be an interesting line of research! While I have no quantitative results on the topic, from my experience playing around with all of these models for quite some time, I would guess that a CLIP model pre-trained on natural images would perform very poorly in this domain. CLIP representations extract information relevant to the text-image pairs in the training set, and therefore focus on things that appeared in those simple captions, such as "airplane", "bed", "person", etc. I doubt that such features are applicable to medical images, or that medical images with sufficiently complex text captions made up a large enough portion of the training set for CLIP to learn anything useful beyond e.g. "this is an x-ray" or "that looks like a hospital".

iroilomamnou commented 1 year ago

Got it. To sum up: there is a difference between "alignment with human evaluation" and the evaluation of different generative models. FD (and, I think, all the other metrics) can be used to compare different generative models as long as the exact same encoder is used. In my case, I was assessing whether the medical images (specifically malignant cases) generated by a diffusion model were of sufficient quality to be used in a target task. The aim was a more balanced medical dataset, since malignant cases tend to be less prevalent than normal cases in the medical domain. Theoretically, the performance of the target task should correlate with human evaluation of the generated images, although we have not conducted such an evaluation yet. Alternatively, we could explore the correlation between the performance of the target task and the FD values as a first step, since your approach in the paper requires human evaluation, which takes additional time and effort.

More work is needed before drawing any conclusive findings.

Thanks again.

georgestein commented 1 year ago

> We could explore the correlation between the performance of the target task and the FD values as a first step, since your approach in the paper requires human evaluation, which takes additional time and effort.

This would be a great proxy. I expect that downstream classification performance and FD will be highly correlated if the encoder is appropriate for such medical images. If not, that may be evidence that a more medically oriented encoder would suit your task better than a generalized SSL encoder such as DINOv2. Either way, I'd be interested to see the results!
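Here is a minimal sketch of that proxy check, assuming you already have an FD value (computed with one fixed encoder) and a downstream-task score for each generative-model checkpoint; the numbers below are placeholders:

```python
import numpy as np
from scipy import stats

# Placeholder per-checkpoint results: FD from one fixed encoder, and the
# downstream-task score (e.g. classifier AUC) when that checkpoint's generated
# malignant images are used to balance the training set.
fd_values = np.array([42.1, 35.7, 31.0, 29.3, 27.8])
task_scores = np.array([0.71, 0.74, 0.77, 0.79, 0.80])

# If the encoder is appropriate, lower FD should track higher task performance,
# i.e. a strongly negative rank correlation.
rho, pval = stats.spearmanr(fd_values, task_scores)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```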