mlfoundations / open-diffusion

Simple large-scale training of stable diffusion with multi-node support.

Evaluation metrics #4

Open mehdidc opened 1 year ago

mehdidc commented 1 year ago

Would be great to have (optional) model evaluation. Possibilities include FID on a reference caption set, CLIP score between prompts and generated images, and learned preference models such as ImageReward.

vkramanuj commented 1 year ago

Good points. The default captions used by validate_and_save_model are from https://github.com/j-min/DallEval, with the intent of eventually adding automatic validation to this repo. There are some options for merging it into this repo:

  1. Introduce these metrics inside the validate_and_save function. I am partially against this because CLIP score and FID both require loading additional models/datasets, which would complicate the config, increase GPU memory consumption, and add complexity to the main train script.
  2. Set up an asynchronous job that "watches" the output examples folder that validate_and_save writes, then computes FID/CLIP score when there's an update. This would run on a separate node/set of GPUs from the training job and would be launched by the user separately from the train script (roughly sketched below).
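
For concreteness, option 2 could be as simple as a polling loop. A minimal sketch (the per-checkpoint folder layout and the `compute_metrics` callback are placeholders, not actual repo code):

```python
import time
from pathlib import Path

def watch_samples(sample_root, compute_metrics, poll_seconds=60):
    """Poll the folder that validate_and_save writes to and score new dumps.

    `compute_metrics` is a placeholder callback that runs FID / CLIP score on
    one directory of generated images and logs the results (e.g. to wandb).
    """
    seen = set()
    root = Path(sample_root)
    while True:
        # Each checkpoint dump is assumed to land in its own subdirectory.
        for ckpt_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            if ckpt_dir.name not in seen:
                compute_metrics(ckpt_dir)
                seen.add(ckpt_dir.name)
        time.sleep(poll_seconds)
```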

Which do you think is a better option? Also thanks for the pointer to ImageReward, I will look into it!

mehdidc commented 1 year ago

I would also go for option 2, at least for now, because of what you said, plus the need to distribute the metric computation across GPUs as well; otherwise only rank zero would be used while the other GPUs wait.
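
To make that concrete, here is a sketch of how the metric computation could be sharded across ranks with torch.distributed (`score_one` and the prompt list are placeholders; it assumes an NCCL process group, hence the CUDA tensor):

```python
import torch
import torch.distributed as dist

def distributed_mean_score(prompts, score_one):
    """Shard per-prompt scoring across ranks and reduce to a global mean.

    `score_one` is a placeholder returning a scalar score for one prompt
    (e.g. CLIP score of the image generated for that prompt).
    """
    rank, world_size = dist.get_rank(), dist.get_world_size()
    local = [score_one(p) for p in prompts[rank::world_size]]  # round-robin shard
    # Sum of scores and sample count, reduced over all ranks.
    totals = torch.tensor([sum(local), len(local)], dtype=torch.float64, device="cuda")
    dist.all_reduce(totals)
    return (totals[0] / totals[1]).item()
```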

vkramanuj commented 1 year ago

Sounds good to me; it will take me some time to implement this. Let me know if you'd like to take on part of the PR. I see three direct parts:

  1. Integration of FID score evaluation (with https://github.com/j-min/DallEval).
  2. CLIPScore evaluation, and possibly ImageReward.
  3. A watcher that runs a given evaluation pipeline and syncs its results to the same wandb run as the training job.

I have partial implementations of all of these (except ImageReward), which I will push to a working branch soon so we can use it as a starting point.
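
For reference, the core CLIPScore computation might look roughly like this (a sketch using the Hugging Face transformers CLIP model; the checkpoint name is only an example, and the actual code in the working branch may differ):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Example checkpoint; the working branch may use a different CLIP backbone.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(images, captions):
    """Mean cosine similarity (x100) between image and caption embeddings."""
    inputs = processor(text=captions, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return 100.0 * (img * txt).sum(dim=-1).mean().item()
```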

mehdidc commented 1 year ago

I can take care of ImageReward and help with the others, so please go ahead and push the working branch so that I can extend it. Maybe you can do FID and I do CLIPScore, or the other way around.

mehdidc commented 1 year ago

Another work to consider: https://arxiv.org/abs/2305.01569 (PickScore), which is similar to ImageReward (they also compare against it in the paper). Code: https://github.com/yuvalkirstain/PickScore

vkramanuj commented 1 year ago

Hi Mehdi, I have added some starting code in the evaluation branch. It's rough, but it includes an implementation that computes CLIP score directly from a tar file without extracting it, as well as an example of how it would be used in evaluation/quality_metrics_watcher.py. It also has a starting point for FID score. Thanks for the pointer to that paper :-) Heads up: I will be a bit slow to reply due to the NeurIPS deadline.
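
As a rough illustration of the tar-streaming idea, the standard tarfile module can yield image/caption pairs without writing anything to disk (the paired `.jpg`/`.txt` member naming is an assumption about webdataset-style shards; the branch's actual code may differ):

```python
import io
import tarfile
from PIL import Image

def iter_tar_samples(tar_path):
    """Yield (PIL image, caption) pairs from a webdataset-style tar shard
    without extracting it. Assumes each sample is stored as a
    `<key>.jpg` / `<key>.txt` pair of members.
    """
    pending = {}  # key -> partially assembled sample
    with tarfile.open(tar_path) as tf:
        for member in tf:
            if not member.isfile():
                continue
            key, _, ext = member.name.rpartition(".")
            data = tf.extractfile(member).read()
            entry = pending.setdefault(key, {})
            if ext == "jpg":
                entry["image"] = Image.open(io.BytesIO(data)).convert("RGB")
            elif ext == "txt":
                entry["caption"] = data.decode("utf-8")
            if "image" in entry and "caption" in entry:
                yield entry["image"], entry["caption"]
                del pending[key]
```

The streamed pairs can then be batched and fed to something like the `clip_score` sketch above.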