mehdidc / feed_forward_vqgan_clip

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt
MIT License

Goal? #1

Closed. afiaka87 closed this issue 3 years ago.

afiaka87 commented 3 years ago

Hey!

Is the idea here to use CLIP embeds through a transformer similar to alstroemeria's CLIP Decision Transformer?

edit: https://github.com/crowsonkb/cond_transformer_2

mehdidc commented 3 years ago

Hey! I wasn't aware of this repo at all, I will have a look. Yes, I think it fits your description: basically, input = CLIP embedding of the text, output = VQGAN's latent space. I am using an MLP-Mixer, but it could also be a transformer. The loss function minimizes the distance between the CLIP embedding of the generated image and the CLIP embedding of the text. No images are used at all, only captions. I have started to get it working; by the way, I am using your blog captions dataset for testing :D it's very helpful. Additionally, the input also contains a noise vector (sampled from a random normal) so that the images are diverse; for that to work I use a diversity loss, otherwise I get the same image for the same input text embedding.
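
For concreteness, here is a rough sketch of that setup (a hedged illustration only: the names `TextToLatent`, `vqgan_decode`, and `clip_encode_image`, and the plain MLP standing in for the MLP-Mixer, are placeholders, not the repo's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToLatent(nn.Module):
    """Maps a CLIP text embedding plus a noise vector to a VQGAN latent.
    A plain MLP is used here as a stand-in for the MLP-Mixer."""
    def __init__(self, clip_dim=512, noise_dim=64, latent_shape=(256, 16, 16)):
        super().__init__()
        self.latent_shape = latent_shape
        out_dim = latent_shape[0] * latent_shape[1] * latent_shape[2]
        self.net = nn.Sequential(
            nn.Linear(clip_dim + noise_dim, 1024), nn.GELU(),
            nn.Linear(1024, 1024), nn.GELU(),
            nn.Linear(1024, out_dim),
        )

    def forward(self, text_emb, noise):
        z = self.net(torch.cat([text_emb, noise], dim=-1))
        return z.view(-1, *self.latent_shape)

def training_step(text_emb, model, vqgan_decode, clip_encode_image, noise_dim=64):
    # text_emb: (B, clip_dim) CLIP embeddings of the captions; no images are needed.
    noise = torch.randn(text_emb.size(0), noise_dim, device=text_emb.device)
    z = model(text_emb, noise)            # predicted VQGAN latent
    images = vqgan_decode(z)              # decode the latent to RGB (placeholder)
    img_emb = clip_encode_image(images)   # re-embed the images with CLIP (placeholder)
    # Minimize the cosine distance between image and text embeddings.
    return 1 - F.cosine_similarity(img_emb, text_emb, dim=-1).mean()
```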

mehdidc commented 3 years ago

For the goal in general: to avoid optimizing the latent space for every single caption, similar to what has been done in the style transfer literature, e.g. https://arxiv.org/pdf/1703.01664.pdf

afiaka87 commented 3 years ago

Here are alstroemeria's ideas. She tends to work in Colab notebooks and doesn't have a repository. I cleaned things up a little and renamed it to CLIP-E, which seemed fitting.

https://gist.github.com/afiaka87/ebdb95635f837a35acb1f11a1d7b28d7

edit: here's a github repo actually https://github.com/crowsonkb/cond_transformer_2

mehdidc commented 3 years ago

Thanks!

afiaka87 commented 3 years ago

I think your description is maybe subtly different. Nevertheless, it definitely has code in there for getting CLIP embeddings into the appropriate format (digits) for a transformer or an MLP-Mixer type of architecture.

edit: Good call on using the blog captions! They've been very helpful in training DALLE for me.

Unrelated:

I'm currently experimenting with the task of generating fonts/visuals of text from captions. I'm using augly from FB to add a 64-pixel caption band to the top of every image, and the image itself gets the remaining 192 pixels. It seems to be going well, actually. The weights from taming-transformers were failing me, so I switched back to the released DALL-E dVAE weights, which seem to have more/better codes for English text.
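
(Not the actual augly call, just a minimal PIL sketch of the layout described above: a 64 px caption band on top and the image in the remaining 192 px of a 256 px canvas. augly's text-overlay helpers handle fonts and wrapping properly.)

```python
from PIL import Image, ImageDraw

def add_caption_band(image, caption, band_height=64, size=256):
    """Place a white caption band on top and the resized image below it."""
    canvas = Image.new("RGB", (size, size), "white")
    canvas.paste(image.resize((size, size - band_height)), (0, band_height))
    draw = ImageDraw.Draw(canvas)
    draw.text((4, 4), caption, fill="black")  # default bitmap font, no wrapping
    return canvas
```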

afiaka87 commented 3 years ago

@mehdidc It appears alstroemeria (@crowsonkb on github) has a proper github repository now.

https://github.com/crowsonkb/cond_transformer_2

crowsonkb commented 3 years ago

I saw this because my repo was mentioned... If you're trying to output directly to the VQGAN latent space in one step (instead of autoregressively, like my transformer model), then I've tried similar things before and can offer some advice. I made a StyleGAN-like generator (output was RGB) conditioned on CLIP embeddings and tried to train it with a spherical distance loss, and it just learned to generate a "mean" image that varied slightly for each prompt, with almost no diversity. What worked better was a contrastive loss between an ordinary-sized batch of images and a large (10-20k) batch of text prompts: basically, I modified the CLIP training loss into an InfoNCE loss so I could use unequal batch sizes. This actually started to produce outputs that looked very different for different prompts. But the results were still blurry, so I added a discriminator to make the outputs look more like some set of reals. Here are some CLIP prompt interpolation videos from it: https://twitter.com/RiversHaveWings/status/1401653716495704068 https://twitter.com/RiversHaveWings/status/1416008149731971073. The results were never super good, so I moved on to other stuff, but you might have more luck. Hope this helps! :)
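
A hedged sketch of that unequal-batch contrastive idea (the names, shapes, and temperature are assumptions for illustration, not crowsonkb's actual training code):

```python
import torch
import torch.nn.functional as F

def image_to_many_texts_infonce(img_emb, txt_emb, positive_idx, temperature=0.07):
    """
    img_emb:      (B, D) CLIP embeddings of the B generated images
    txt_emb:      (N, D) CLIP embeddings of N >> B text prompts (e.g. 10-20k)
    positive_idx: (B,)   index into txt_emb of each image's own prompt
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, N) similarities
    # Each generated image has to pick out its own prompt among all N texts.
    return F.cross_entropy(logits, positive_idx)
```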

mehdidc commented 3 years ago

@afiaka87 Cool, I would be curious to see how it works on unseen words.

mehdidc commented 3 years ago

@crowsonkb "If you're trying to output directly to the VQGAN latent space in one step (instead of autoregressive like my transformer model)" Yes, that is exactly what I am experimenting with currently. Thanks a lot for the tips; a contrastive loss + discriminator would be nice to try, as I am looking for solutions to the diversity problem. For now I am maximizing the distance between the VGG16 features of the generated images, as in https://arxiv.org/pdf/1703.01664.pdf, but I am not fully satisfied. Interpolations are indeed a really nice application of this kind of model. Another thing I would be interested to look at is zero-shot generalization, as @afiaka87 mentioned, to see if the model can generalize outside the training prompt distribution.
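
(For reference, a minimal sketch of that kind of VGG16 feature-space diversity term; the layer choice, normalization, and sign are assumptions, not the repo's exact loss.)

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

vgg_features = vgg16(pretrained=True).features[:16].eval()  # up to relu3_3, for example
for p in vgg_features.parameters():
    p.requires_grad_(False)

def diversity_loss(images):
    """images: (B, 3, H, W), generated from the same caption with different noise."""
    f = vgg_features(images).flatten(1)  # (B, F) VGG16 feature maps, flattened
    f = F.normalize(f, dim=-1)
    dist = torch.cdist(f, f)             # (B, B) pairwise feature distances
    # Negative mean pairwise distance: minimizing it pushes samples apart.
    return -dist.mean()
```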

Here is a snapshot of the current state in my case: https://imgur.com/a/jHvzn9K

from left to right:

a female mannequin dressed in a brown pullover sweater and white wrap skirt
a photo of the national animal of guatemala
a chalk drawing of a turtle sitting in a forest at night
a charcoal drawing of a cougar sitting on a mountain during winter
a female mannequin dressed in an olive turtleneck sweater and orange pleated skirt
a photo of buena vista , san francisco , from a street in the morning
a photo of san francisco 's castro theatre
a professional high quality emoji of a chicken cat chimera . a chicken imitating a cat . a chicken made of cat . a professional emoji .

afiaka87 commented 3 years ago

@mehdidc moving this to the discussions in the DALLE-pytorch repo: https://github.com/lucidrains/DALLE-pytorch/discussions/339