lucidrains / DALLE2-pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
MIT License

Build a fair evaluation of the prior #29

Open rom1504 opened 2 years ago

rom1504 commented 2 years ago

We're starting to have our first prior now. (PR of the training script coming soon)

Time to evaluate. Ideas:

If you have more ideas, please share; I may be missing some obvious things.

We have more volunteers who want to help, so I'll point some here :)

lucidrains commented 2 years ago

Already?! 🙏💯🎉

lucidrains commented 2 years ago

I was going to add training scripts for all the components tomorrow morning ... 😂

lucidrains commented 2 years ago

I think for the clip guided generation, you will still need a decoder conditioned on the clip image embedding, tho you can probably get away with a small resolution net for starters, just to validate
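For reference, a minimal sketch of such a small-resolution decoder conditioned on the CLIP image embedding, using this repo's CLIP, Unet and Decoder classes roughly as in the README; the hyperparameters here are purely illustrative, and the exact keyword arguments may differ between versions:

```python
# a minimal sketch, not the paper's configuration; kwargs follow the README of the time
import torch
from dalle2_pytorch import Unet, Decoder, CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 6,
    visual_image_size = 64,     # small resolution, just to validate
    visual_patch_size = 8,
    visual_heads = 8
)

unet = Unet(
    dim = 64,                   # small net for a first sanity check
    image_embed_dim = 512,      # conditioned on the CLIP image embedding
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4)
)

decoder = Decoder(
    unet = unet,
    clip = clip,
    timesteps = 100
)

images = torch.randn(4, 3, 64, 64)
loss = decoder(images)          # train the decoder to invert CLIP image embeddings
loss.backward()
```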

xiankgx commented 2 years ago

Thank you for your work! I've modified your code to work with the original CLIP and got the training scripts. What is a good small dataset to test this on?

lucidrains commented 2 years ago

@xiankgx 👋 are you working with Laion? you should, because they have humongous (and smaller test) datasets

lucidrains commented 2 years ago

> Thank you for your work! I've modified your code to work with the original CLIP and got the training scripts. What is a good small dataset to test this on?

Do you have a link to the CLIP that you used? I can try to incorporate it tomorrow using https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L95

xiankgx commented 2 years ago

> Thank you for your work! I've modified your code to work with the original CLIP and got the training scripts. What is a good small dataset to test this on?
>
> Do you have a link to the CLIP that you used? I can try to incorporate it tomorrow using https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L95

I am using the CLIP from this link: https://github.com/openai/CLIP

lucidrains commented 2 years ago

ohh got it, looks like they finally got it to be pip installable. I'll take a look tomorrow at an adapter!

lucidrains commented 2 years ago

ok, the plan will be to automatically use the openai clip by setting a use_openai_clip on both the Decoder and DiffusionPrior, which will allow researchers to skip the first step in the whole process

xiankgx commented 2 years ago

> @xiankgx 👋 are you working with Laion? you should, because they have humongous (and smaller test) datasets

How can I help? I'd be glad to.

christophschuhmann commented 2 years ago

Here is another benchmark we should definitely use:

https://github.com/cat-state/clip-retrieval/blob/main/clip_retrieval/clip_benchmark.py

The purpose of this benchmark should be to evaluate the ability of a CLIP model to retrieve correct, or at least semantically close, samples from a given dataset.

At first, a basic version of this script should be focused on image-text pairs.

Later, it would be nice to have a general version of this benchmark that could be used for any pair of modalities like audio, video, text and images.

Let's say every sample has a component A (e.g. image) and a component B (e.g. text).

- Take sample A-B and look with A in the B-kNN index for the closest neighbor A'-B'. Check if B' = B, or even better, use a similarity encoder for modality B to estimate how similar B and B' are (for text e.g. with https://huggingface.co/sentence-transformers/all-mpnet-base-v2).
- Then take B from sample A-B and look in the A-kNN index for the closest neighbor sample A'-B'. Check if A' = A, or alternatively use a similarity encoder for modality A to estimate how similar A and A' are. If there is no good single-modality encoder to measure the semantic similarity of A and A' (like with image-image pairs at the moment), take the similarity of B and B' as a proxy (e.g. if B is text, check if the text B' that belongs to the retrieved sample A'-B' is semantically close to the text B we used to query the image index).
- Calculate the mean and the standard deviation of the similarities for the A->B and B->A kNN queries over all samples in the evaluation set / n samples.

https://github.com/rom1504/clip-retrieval
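A rough sketch of the A->B direction of the procedure above (image -> text retrieval), assuming CLIP embeddings for the eval pairs are already computed and using faiss for the kNN index plus sentence-transformers for the text-text similarity. File names and layout here are made up for illustration; this is not the clip_benchmark.py API:

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# assumed precomputed, L2-normalized CLIP embeddings for N aligned image-text pairs
image_embeds = np.load('image_embeds.npy').astype('float32')   # (N, d)
text_embeds  = np.load('text_embeds.npy').astype('float32')    # (N, d)
captions     = open('captions.txt').read().splitlines()         # N captions, same order

# kNN index over the text side (modality B); inner product == cosine on normalized vectors
index = faiss.IndexFlatIP(text_embeds.shape[1])
index.add(text_embeds)

# query with the image side (modality A): nearest caption for each image
_, nn_idx = index.search(image_embeds, 1)
nn_idx = nn_idx[:, 0]

# exact-match accuracy: did we get the paired caption back?
exact = (nn_idx == np.arange(len(captions))).mean()

# semantic similarity between the true caption and the retrieved one
st = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
true_emb = st.encode(captions, normalize_embeddings=True)
retr_emb = st.encode([captions[i] for i in nn_idx], normalize_embeddings=True)
sims = (true_emb * retr_emb).sum(axis=1)

print(f'exact match: {exact:.3f}, caption similarity: {sims.mean():.3f} +/- {sims.std():.3f}')
```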

lucidrains commented 2 years ago

> Thank you for your work! I've modified your code to work with the original CLIP and got the training scripts. What is a good small dataset to test this on?
>
> Do you have a link to the CLIP that you used? I can try to incorporate it tomorrow using https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L95
>
> I am using the CLIP from this link: https://github.com/openai/CLIP

https://github.com/lucidrains/DALLE2-pytorch/tree/0.0.67#openai-clip ok, should be OpenAI clip compatible now, at some point I'll make it OpenCLIP compatible as well
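For anyone following along, the linked README section shows usage roughly like the sketch below (check the README of your installed version for the exact arguments; hyperparameters here are illustrative):

```python
import torch
from dalle2_pytorch import OpenAIClipAdapter, DiffusionPriorNetwork, DiffusionPrior

# wrap the pip-installable OpenAI CLIP so it can be passed wherever the repo expects a CLIP
clip = OpenAIClipAdapter()   # defaults to ViT-B/32

prior_network = DiffusionPriorNetwork(
    dim = 512,
    depth = 6,
    dim_head = 64,
    heads = 8
)

diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2
)

# mock data; ViT-B/32 uses a 77-token context and 224x224 inputs
text   = torch.randint(0, 49408, (4, 77))
images = torch.randn(4, 3, 224, 224)

loss = diffusion_prior(text, images)
loss.backward()
```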

rom1504 commented 2 years ago

you can depend on clip-anytorch if you want openai clip from pypi (that's my pypi deployment of it)

lucidrains commented 2 years ago

nice! I'll refactor to use it maybe next week :)

rom1504 commented 2 years ago

Oh, I mean there is no change to be made except pip install clip-anytorch. Everything else is the same, including the imports.
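In other words, a drop-in swap; the clip-anytorch package mirrors the openai/CLIP API, so the usual calls look the same:

```python
# pip install clip-anytorch   (instead of installing from the openai/CLIP git repo)
import torch
import clip  # same module name and API as openai/CLIP

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

tokens = clip.tokenize(['a photo of a cat']).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)
```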

lucidrains commented 2 years ago

@rom1504 it works! :pray: https://github.com/lucidrains/DALLE2-pytorch/releases/tag/0.0.73

rom1504 commented 2 years ago

https://huggingface.co/rom1504/dalle2-diffusion-prior/resolve/main/1651432174.5708027_saved_model.pth here's a first checkpoint for the prior

let's start evaluation work!

rom1504 commented 2 years ago

https://colab.research.google.com/drive/1kUYIvWje6CVO9llqY_9bYYk6zMNh1sSh?usp=sharing first eval from Theo, comparing the image embedding predicted by the prior with the real image embedding; not amazing

let's include that kind of metric in the training to see if things improve with time
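Since this check is cheap, a sketch of what such an in-training metric could look like, assuming an OpenAI-CLIP-based prior and that DiffusionPrior.sample returns predicted image embeddings for a batch of tokenized captions (method names and signatures may differ by version):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prior_eval_similarity(diffusion_prior, clip_model, images, tokens):
    # images: a batch already preprocessed for CLIP; tokens: output of clip.tokenize
    real_image_embed = F.normalize(clip_model.encode_image(images).float(), dim=-1)
    text_embed       = F.normalize(clip_model.encode_text(tokens).float(), dim=-1)

    # image embeddings predicted by the prior from the text
    pred_image_embed = F.normalize(diffusion_prior.sample(tokens).float(), dim=-1)

    return {
        'text_to_real_image_sim': (text_embed * real_image_embed).sum(-1).mean().item(),   # ~0.27 baseline
        'text_to_pred_image_sim': (text_embed * pred_image_embed).sum(-1).mean().item(),
        'pred_to_real_image_sim': (pred_image_embed * real_image_embed).sum(-1).mean().item(),
    }
```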

TheoCoombes commented 2 years ago

I should note there's a good chance this is some form of normalization issue on my end causing the results to be worse, as this is my first time playing around with the DALLE2-pytorch code. I'm going to play around with it some more and see.

lucidrains commented 2 years ago

@TheoCoombes so one thing to note is that in the paper, they actually sampled a couple of image embeddings (well, just 2 I guess), and then selected the one with the highest similarity to the text embedding, so it seems they must have encountered the same difficulties. The logic is in this function here if you need it! https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L831
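The reranking trick itself is simple enough to sketch independently of that function: sample a few candidate image embeddings per caption and keep the one most similar to the text embedding. Illustrative code only, not the repo's implementation; text_embed_fn is an assumed helper returning normalized CLIP text embeddings:

```python
import torch
import torch.nn.functional as F

def sample_best_image_embed(diffusion_prior, text_embed_fn, tokens, num_candidates=2):
    text_embed = text_embed_fn(tokens)                                   # (b, d), L2-normalized

    # draw several candidate image embeddings from the prior for the same captions
    candidates = torch.stack([
        F.normalize(diffusion_prior.sample(tokens), dim=-1)
        for _ in range(num_candidates)
    ], dim=1)                                                            # (b, num_candidates, d)

    sims = (candidates * text_embed.unsqueeze(1)).sum(-1)                # (b, num_candidates)
    best = sims.argmax(dim=1)                                            # best candidate per caption
    return candidates[torch.arange(len(best)), best]                     # (b, d)
```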

rom1504 commented 2 years ago

https://colab.research.google.com/drive/10P81dVS7YKCMUHF3FA7WD3Q_mp-cCWIA#scrollTo=VVElbFFcb5T7 new eval with the new checkpoint https://huggingface.co/krish240574/Dalle2-Diffusion-Prior/blob/main/1651473037.600823_saved_model.pth

now it works, and the similarity improves! With the previous checkpoint it was 0.27 -> 0.09; now it's 0.27 -> 0.28

we're going to make a PR to evaluate that automatically during training since it's cheap

lucidrains commented 2 years ago

@rom1504 very nice! :D

nousr commented 2 years ago

I have a new model/eval set. This run tries out optimization parameters that are more in line with what the paper specifies.

It seems to show better performance, resulting in a similarity of ~0.78 (up from 0.28, on the previously used image). However, more work should be done on benchmarking since early testing in the discord shows that unrelated prompts can also score relatively high similarities.

Here is the model repository (hugging face link). And a W&B report of a 25M datapoint run.

nousr commented 2 years ago

I've got a 300M point run going with the improved norm (re: #60).

I've also attempted to add a way to track the similarity with an unrelated text embedding. In short I shuffle the text embeddings in an effort to simulate "unrelated" prompts...

I'll PR the code when the run finishes and have you guys take a look at it to make sure the code & run results match what we would expect.

You can keep an eye on the run here (wandb report)
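The shuffled-text baseline is easy to reproduce: permute the text embeddings within a batch so each predicted image embedding gets scored against a (very likely) unrelated caption. A sketch, with pred_image_embed and text_embed assumed precomputed and L2-normalized:

```python
import torch

def related_vs_unrelated_sim(pred_image_embed, text_embed):
    # pred_image_embed, text_embed: (b, d), L2-normalized, row i of each belongs to the same pair
    related = (pred_image_embed * text_embed).sum(-1).mean()

    # shuffle captions within the batch to simulate unrelated prompts
    perm = torch.randperm(text_embed.shape[0])
    unrelated = (pred_image_embed * text_embed[perm]).sum(-1).mean()

    # a healthy prior should keep a clear gap between these two numbers
    return related.item(), unrelated.item()
```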

NasirKhalid24 commented 2 years ago

I trained a prior with ViT-B/32 text and image embeddings using the train_diffusion_prior.py script. Additionally, I tracked the score for a fixed text embedding vs. the predicted image embedding; ideally it should decrease over time, since the text and images will be unrelated. It does decrease, but not by much, and toward the end my weights still give a high score for unrelated text and predicted image embeddings.

CosineSim(Unrelated_Text_Embed, Prior(Text)) is upper graph

CosineSim(Related_Image_Embed, Prior(Text)) is lower graph

(Screenshot: the two cosine-similarity curves described above.)

xiankgx commented 2 years ago

Based on experience with CLIP, many texts can land in the same cosine-similarity ballpark even if some texts are better than others. Perhaps we could instead use softmax accuracy between the generated image embeddings and the input text embeddings.
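Something like CLIP's own contrastive accuracy: treat each batch as a retrieval problem and check whether each predicted image embedding is closest to its own caption. A sketch, assuming normalized embeddings:

```python
import torch

def softmax_top1_accuracy(pred_image_embed, text_embed, temperature=0.01):
    # pred_image_embed, text_embed: (b, d), L2-normalized; row i of each belongs to the same pair
    logits = pred_image_embed @ text_embed.t() / temperature   # (b, b) similarity matrix
    probs = logits.softmax(dim=-1)
    target = torch.arange(len(logits), device=logits.device)
    return (probs.argmax(dim=-1) == target).float().mean().item()
```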

nousr commented 2 years ago

I just launched a run to test out a new metric, using deep-image-prior (#59), to generate images from the diffusion prior at 10k step intervals during training.

If we get to ~50k steps and it looks like it works, then we can create a much more diverse prompt-set to evaluate the prior on.
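For anyone curious what that metric roughly involves, here is a very rough sketch of reconstructing an image from a prior-predicted embedding with a deep-image-prior-style generator. This is not the code from #59, just the general idea, assuming the openai CLIP package; the target embedding is a random stand-in:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'
clip_model, _ = clip.load('ViT-B/32', device=device)
clip_model = clip_model.float().eval().requires_grad_(False)

# target: an image embedding predicted by the diffusion prior (random stand-in here)
target_embed = F.normalize(torch.randn(1, 512, device=device), dim=-1)

# a tiny "deep image prior": a randomly-initialized conv net mapping fixed noise to an RGB image
dip = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
).to(device)

noise = torch.randn(1, 32, 56, 56, device=device)       # 56 -> 112 -> 224
opt = torch.optim.Adam(dip.parameters(), lr=1e-3)

# CLIP's input normalization constants
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std  = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(500):
    image = dip(noise)                                   # (1, 3, 224, 224) in [0, 1]
    image_embed = F.normalize(clip_model.encode_image((image - mean) / std), dim=-1)
    loss = 1 - (image_embed * target_embed).sum()        # maximize cosine similarity to the target
    opt.zero_grad(); loss.backward(); opt.step()

# dip(noise) is now a (crude) visualization of the embedding the prior produced
```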

lucidrains commented 2 years ago

https://github.com/lucidrains/DALLE2-pytorch/issues/23#issuecomment-1127011855 we can share the preprint once it gets released on arxiv

rom1504 commented 2 years ago

This is almost done; zero-shot eval might be the last thing here, and some people are on it.