mlfoundations / open-diffusion

Simple large-scale training of stable diffusion with multi-node support.

Evaluation metrics #4

Open mehdidc opened 1 year ago

mehdidc commented 1 year ago

Would be great to have (optional) model evaluation. Possibilities include FID on a reference caption set, CLIP score between prompts and generated images, and learned preference models such as ImageReward.

vkramanuj commented 1 year ago

Good points. The default captions used by validate_and_save_model are from https://github.com/j-min/DallEval, with the intent of eventually adding automatic validation to this repo. There are some options for merging it into this repo:

  1. Introduce these metrics inside the validate_and_save function. I am partially against this because CLIP score and FID both require loading additional models/datasets, which would complicate the config, increase GPU memory consumption, and add complexity to the main train script.
  2. Set up an asynchronous job that "watches" the output examples folder that validate_and_save writes, then computes FID/CLIP score when there's an update. This would run on a separate node/set of GPUs from the training job and would be launched by the user separately from the train script (roughly sketched below).
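
For concreteness, option 2 could be as simple as a polling loop. A minimal sketch (the per-checkpoint folder layout and the `compute_metrics` callback are placeholders, not actual repo code):

```python
import time
from pathlib import Path

def watch_samples(sample_root, compute_metrics, poll_seconds=60):
    """Poll the folder that validate_and_save writes to and score new dumps.

    `compute_metrics` is a placeholder callback that runs FID / CLIP score on
    one directory of generated images and logs the results (e.g. to wandb).
    """
    seen = set()
    root = Path(sample_root)
    while True:
        # Each checkpoint dump is assumed to land in its own subdirectory.
        for ckpt_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            if ckpt_dir.name not in seen:
                compute_metrics(ckpt_dir)
                seen.add(ckpt_dir.name)
        time.sleep(poll_seconds)
```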

Which do you think is a better option? Also thanks for the pointer to ImageReward, I will look into it!

mehdidc commented 1 year ago

I would also go for option 2, at least for now, because of what you said, plus the need to distribute the metric computation across GPUs as well; otherwise only rank zero would be used while the other GPUs wait.
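
To make that concrete, here is a sketch of how the metric computation could be sharded across ranks with torch.distributed (`score_one` and the prompt list are placeholders; it assumes an NCCL process group, hence the CUDA tensor):

```python
import torch
import torch.distributed as dist

def distributed_mean_score(prompts, score_one):
    """Shard per-prompt scoring across ranks and reduce to a global mean.

    `score_one` is a placeholder returning a scalar score for one prompt
    (e.g. CLIP score of the image generated for that prompt).
    """
    rank, world_size = dist.get_rank(), dist.get_world_size()
    local = [score_one(p) for p in prompts[rank::world_size]]  # round-robin shard
    # Sum of scores and sample count, reduced over all ranks.
    totals = torch.tensor([sum(local), len(local)], dtype=torch.float64, device="cuda")
    dist.all_reduce(totals)
    return (totals[0] / totals[1]).item()
```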

vkramanuj commented 1 year ago

Sounds good to me; it will take me some time to implement this. Let me know if you'd like to take on part of the PR. I see three direct parts:

  1. Integration of FID score evaluation (with https://github.com/j-min/DallEval).
  2. CLIPScore evaluation, and possibly ImageReward.
  3. A watcher that runs a given evaluation pipeline and syncs its results to the same wandb run as the training job.

I have partial implementations of all of these (except ImageReward), which I will push to a working branch soon so we can use it as a starting point.
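
For reference, the core CLIPScore computation might look roughly like this (a sketch using the Hugging Face transformers CLIP model; the checkpoint name is only an example, and the actual code in the working branch may differ):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Example checkpoint; the working branch may use a different CLIP backbone.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(images, captions):
    """Mean cosine similarity (x100) between image and caption embeddings."""
    inputs = processor(text=captions, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return 100.0 * (img * txt).sum(dim=-1).mean().item()
```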

mehdidc commented 1 year ago

I can take care of ImageReward and help with the others, so please go ahead and push the working branch so that I can extend it. Maybe you can do FID and I do CLIPScore, or the other way around.

mehdidc commented 1 year ago

Another work to consider: https://arxiv.org/abs/2305.01569 (PickScore), which is similar to ImageReward (they also compare against it in the paper). Code: https://github.com/yuvalkirstain/PickScore

vkramanuj commented 1 year ago

Hi Mehdi, I have added some starting code in the evaluation branch. It's rough, but it includes an implementation that computes CLIP score directly from a tar file without extracting it, as well as an example of how it would be used in evaluation/quality_metrics_watcher.py. It also has a starting point for FID score. Thanks for the pointer to that paper :-) Heads up: I will be a bit slow to reply due to the NeurIPS deadline.
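
As a rough illustration of the tar-streaming idea, the standard tarfile module can yield image/caption pairs without writing anything to disk (the paired `.jpg`/`.txt` member naming is an assumption about webdataset-style shards; the branch's actual code may differ):

```python
import io
import tarfile
from PIL import Image

def iter_tar_samples(tar_path):
    """Yield (PIL image, caption) pairs from a webdataset-style tar shard
    without extracting it. Assumes each sample is stored as a
    `<key>.jpg` / `<key>.txt` pair of members.
    """
    pending = {}  # key -> partially assembled sample
    with tarfile.open(tar_path) as tf:
        for member in tf:
            if not member.isfile():
                continue
            key, _, ext = member.name.rpartition(".")
            data = tf.extractfile(member).read()
            entry = pending.setdefault(key, {})
            if ext == "jpg":
                entry["image"] = Image.open(io.BytesIO(data)).convert("RGB")
            elif ext == "txt":
                entry["caption"] = data.decode("utf-8")
            if "image" in entry and "caption" in entry:
                yield entry["image"], entry["caption"]
                del pending[key]
```

The streamed pairs can then be batched and fed to something like the `clip_score` sketch above.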