Open · rom1504 opened this issue 2 years ago
Already?!
I was going to add training scripts for all the components tomorrow morning ...
I think for the clip guided generation, you will still need a decoder conditioned on the clip image embedding, though you can probably get away with a small-resolution net for starters, just to validate
Thank you for your work! I've modified your code to work with the original CLIP and got the training scripts. What is a good small dataset to test this on?
@xiankgx 👋 are you working with LAION? You should, because they have humongous (and smaller test) datasets
Do you have a link to the CLIP that you used? I can try to incorporate it tomorrow using https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L95
I am using the CLIP from this link: https://github.com/openai/CLIP
ohh got it, looks like they finally got it to be pip installable. I'll take a look tomorrow at an adapter!
ok, the plan will be to automatically use the openai clip by setting a use_openai_clip flag on both the Decoder and the DiffusionPrior, which will allow researchers to skip the first step in the whole process
How can I help? Would be glad to help.
Here is another benchmark we should definitely use:
https://github.com/cat-state/clip-retrieval/blob/main/clip_retrieval/clip_benchmark.py
The purpose of this benchmark should be to evaluate the ability of a CLIP model to retrieve correct, or at least semantically close, samples from a given dataset.
At first, a basic version of this script should focus on image-text pairs.
Later, it would be nice to have a general version of this benchmark that could be used for any pair of modalities like audio, video, text and images.
Letβs say every sample has a component A (e.g. image) and a component B (e.g. text).
Take sample A-B and look with A in the B-kNN index for the closest neighbor A'-B'. Check if B' = B, or even better, use a similarity encoder for modality B to estimate how similar B and B' are (for text e.g. with this: https://huggingface.co/sentence-transformers/all-mpnet-base-v2).
Then take B from sample A-B and look in the A-kNN index for the closest neighbor sample A'-B'. Check if A' = A, or alternatively use a similarity encoder for modality A to estimate how similar A and A' are.
If there is no good single-modality encoder to measure the semantic similarity of A and A' (like with image-image pairs at the moment), take the similarity of B and B' as a proxy (e.g. if B is text, check whether the text B' that belongs to the retrieved sample A'-B' is semantically close to the text B we used to query the image index).
Calculate the mean and the standard deviation of the similarities for the A->B and B->A kNN queries over all samples in the evaluation set / n samples.
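To make the A -> B direction concrete, here is a minimal sketch assuming precomputed CLIP image/text embeddings as numpy arrays and the mpnet model above as the text similarity encoder (function and argument names are illustrative, not the clip_benchmark.py API):

```python
import numpy as np
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

def eval_a_to_b(image_embs, text_embs, texts,
                sim_model="sentence-transformers/all-mpnet-base-v2"):
    """A -> B direction: query the text (B) index with each image (A) embedding,
    retrieve the nearest text B', then score how close B' is to the sample's own text B."""
    encoder = SentenceTransformer(sim_model)

    # normalize so a dot product equals cosine similarity
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    nearest = (img @ txt.T).argmax(axis=1)              # index of B' for every sample

    b = encoder.encode(texts, convert_to_tensor=True)   # similarity-encoder embeddings of B
    b_prime = b[torch.from_numpy(nearest)]              # ... and of the retrieved B'
    sims = F.cosine_similarity(b, b_prime, dim=-1)

    return sims.mean().item(), sims.std().item()
```

The B -> A direction is symmetric, swapping the roles of the two indexes.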
https://github.com/lucidrains/DALLE2-pytorch/tree/0.0.67#openai-clip ok, should be OpenAI clip compatible now, at some point I'll make it OpenCLIP compatible as well
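For anyone following along, usage at that point looked roughly like the snippet below, based on the README section linked above (argument names, shapes, and defaults are approximate and may differ between versions):

```python
import torch
from dalle2_pytorch import DiffusionPriorNetwork, DiffusionPrior, OpenAIClipAdapter

# pretrained OpenAI CLIP (ViT-B/32), plugged in instead of training a CLIP from scratch
clip = OpenAIClipAdapter()

prior_network = DiffusionPriorNetwork(
    dim = 512,
    depth = 6,
    dim_head = 64,
    heads = 8
)

diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2
)

# dummy data just to check the forward pass; ViT-B/32 expects 224x224 images
text = torch.randint(0, 49408, (4, 77))
images = torch.randn(4, 3, 224, 224)

loss = diffusion_prior(text, images)
loss.backward()
```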
you can depend on clip-anytorch if you want openai clip from pypi (that's my pypi deployment of it)
nice! I'll refactor to use it maybe next week :)
Oh I mean there is no change to be made except pip install clip-anytorch Everything else is the same, including the imports
@rom1504 it works! :pray: https://github.com/lucidrains/DALLE2-pytorch/releases/tag/0.0.73
https://huggingface.co/rom1504/dalle2-diffusion-prior/resolve/main/1651432174.5708027_saved_model.pth here's a first checkpoint for the prior
let's start evaluation work!
https://colab.research.google.com/drive/1kUYIvWje6CVO9llqY_9bYYk6zMNh1sSh?usp=sharing first eval from Theo, comparing the image embedding predicted by the prior with the real image embedding; not amazing
let's include that kind of metric in the training to see if things improve with time
I should note here, there's a good chance this is some form of normalization issue on my end causing the worse results, as this is my first time playing around with the DALLE2-pytorch code. I'm going to play around with it some more and see
@TheoCoombes so one thing to note is that in the paper, they actually sampled a couple of image embeddings (well, just 2, I guess) and then selected the one with the highest similarity to the text embedding, so it seems they must have encountered the same difficulties. The logic is in this function here if you need it! https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L831
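For reference, the reranking trick amounts to something like this minimal sketch, assuming a hypothetical prior.sample(text_embed) that draws one image embedding per call (the library's actual implementation is at the line linked above):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_best_image_embed(prior, text_embed, num_candidates=2):
    """Sample several candidate image embeddings and keep, per sample, the one
    most similar to its conditioning text embedding."""
    # (num_candidates, batch, dim) candidate image embeddings
    candidates = torch.stack([prior.sample(text_embed) for _ in range(num_candidates)])

    # cosine similarity of every candidate to its text embedding -> (num_candidates, batch)
    sims = F.cosine_similarity(candidates, text_embed.unsqueeze(0), dim=-1)

    best = sims.argmax(dim=0)                              # (batch,)
    return candidates[best, torch.arange(text_embed.shape[0])]
```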
https://colab.research.google.com/drive/10P81dVS7YKCMUHF3FA7WD3Q_mp-cCWIA#scrollTo=VVElbFFcb5T7 new eval with the new checkpoint https://huggingface.co/krish240574/Dalle2-Diffusion-Prior/blob/main/1651473037.600823_saved_model.pth
now it works, and the similarity improves! With the previous checkpoint it went 0.27 -> 0.09; now it goes 0.27 -> 0.28
we're going to make a PR to evaluate that automatically during training since it's cheap
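Roughly, the metric being wired into training is just the following (a sketch, not the PR code; names are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prior_eval_similarity(pred_image_embed, real_image_embed, text_embed):
    """Cheap per-step eval: cosine similarity of the prior's predicted image
    embedding to the real CLIP image embedding, next to the text->image baseline."""
    pred_sim = F.cosine_similarity(pred_image_embed, real_image_embed, dim=-1).mean().item()
    base_sim = F.cosine_similarity(text_embed, real_image_embed, dim=-1).mean().item()
    return {"sim(pred, image)": pred_sim,   # 0.09 for the old checkpoint, 0.28 for the new one
            "sim(text, image)": base_sim}   # the ~0.27 baseline mentioned above
```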
@rom1504 very nice! :D
I have a new model/eval set. This run tries out optimization parameters that are more in line with what the paper specifies.
It seems to show better performance, resulting in a similarity of ~0.78 (up from 0.28, on the previously used image). However, more work should be done on benchmarking, since early testing in the Discord shows that unrelated prompts can also score relatively high similarities.
Here is the model repository (Hugging Face link), along with a W&B report of a 25M datapoint run.
I've got a 300M point run going with the improved norm (re: #60).
I've also attempted to add a way to track the similarity with an unrelated text embedding. In short I shuffle the text embeddings in an effort to simulate "unrelated" prompts...
I'll PR the code when the run finishes and have you guys take a look at it to make sure the code & run results match what we would expect.
You can keep an eye on the run here (wandb report)
I trained a prior with ViT-B/32 text and image embeddings using the train_diffusion_prior.py script. Additionally, I tracked the score for a fixed text embedding vs. the predicted image embedding. Ideally it should decrease over time, since the text and images will be unrelated. It does, but not by as much as expected, and my weights toward the end still give a high score for unrelated text and predicted image embeds
CosineSim(Unrelated_Text_Embed, Prior(Text)) is upper graph
CosineSim(Related_Image_Embed, Prior(Text)) is lower graph
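In code, those two curves track roughly the following, where the unrelated text embedding can be simulated by shuffling the batch as described earlier (names illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def related_vs_unrelated(pred_image_embed, real_image_embed, text_embed):
    """Lower graph: prior output vs. the matching image embedding.
    Upper graph: prior output vs. an unrelated (shuffled) text embedding."""
    related = F.cosine_similarity(pred_image_embed, real_image_embed, dim=-1).mean().item()

    shuffled_text = text_embed[torch.randperm(text_embed.shape[0])]
    unrelated = F.cosine_similarity(pred_image_embed, shuffled_text, dim=-1).mean().item()

    # ideally `related` stays high while `unrelated` drops as training progresses
    return related, unrelated
```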
Based on experience with CLIP, many texts can land in the same cosine-similarity ballpark even if some texts are better than others. Perhaps we can instead use softmax accuracy between the generated image embeddings and the input text embeddings.
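A sketch of what that could look like, treating each generated image embedding's own text as the correct class within the batch (top-1 over the similarity logits is the same as top-1 over their softmax):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def top1_softmax_accuracy(generated_image_embeds, text_embeds):
    """For each generated image embedding, check whether its own text embedding
    scores highest among all text embeddings in the batch."""
    img = F.normalize(generated_image_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)

    logits = img @ txt.t()                                  # (batch, batch) similarity matrix
    targets = torch.arange(img.shape[0], device=img.device)

    return (logits.argmax(dim=-1) == targets).float().mean().item()
```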
I just launched a run to test out a new metric, using deep-image-prior (#59), to generate images from the diffusion prior at 10k step intervals during training.
If we get to ~50k steps and it looks like it works, then we can create a much more diverse prompt-set to evaluate the prior on.
https://github.com/lucidrains/DALLE2-pytorch/issues/23#issuecomment-1127011855 we can share the preprint once it gets released on arxiv
this is almost done; zero-shot eval might be the last thing here, and some people are on it
We're starting to have our first prior now. (PR of the training script coming soon)
Time to evaluate. Ideas:
If you have more ideas please share, I may be missing some obvious things.
We have more volunteers that want to help so I'll point some here :)