mehdidc / feed_forward_vqgan_clip

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt
MIT License

How to condition the model output z so that it looks like it came from a standard normal distribution? #20

Closed: xiankgx closed this issue 2 years ago

xiankgx commented 2 years ago

Hi, this is a nice repo and I'm trying to reimplement something similar for StyleGAN2. Using a list of texts, I'm trying to map CLIP text embeddings to StyleGAN2 latent vectors, which are fed to the StyleGAN2 generator to produce images, and then optimize this MLP mapper with a CLIP loss. However, I quickly get blown-out images for entire batches. I suspect this is because the output of the MLP is not constrained to look like it came from a standard normal distribution. I wonder if you could point me in the right direction on how to do this.
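For context, a minimal sketch of the pipeline being described, assuming PyTorch; `generator` and `clip_model` are hypothetical handles (not from this thread), and the dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical handles, not from this thread: `generator` is a frozen,
# pre-trained StyleGAN2 generator and `clip_model` is a CLIP encoder.
mapper = nn.Sequential(                 # CLIP text embedding -> latent
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512),                # StyleGAN2 z-space is typically 512-dim
)

def clip_loss(text_emb):
    z = mapper(text_emb)                        # predict latents from text
    images = generator(z)                       # render images from latents
    img_emb = clip_model.encode_image(images)   # embed images with CLIP
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # minimize cosine distance between image and text embeddings
    return (1 - (img_emb * txt_emb).sum(dim=-1)).mean()
```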

mehdidc commented 2 years ago

Hi, it would indeed be really cool to apply the same idea to StyleGAN2 or StyleGAN3! Have a look at the Wasserstein auto-encoders paper (https://arxiv.org/pdf/1711.01558.pdf). While your setup is not an auto-encoder per se, the regularization part, where they match the feature space to the normal prior, is exactly what you are looking for. They propose two ways of doing it: one using a GAN on the feature space (Algorithm 1 in the paper), and another using MMD (Maximum Mean Discrepancy) (Algorithm 2 in the paper).
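For reference, a minimal sketch of the MMD route (in the spirit of Algorithm 2), assuming PyTorch; `mapper`, `clip_loss_value`, and `lambda_mmd` are placeholder names, and the RBF kernel here is just one common choice of kernel:

```python
import torch

def mmd_rbf(z, z_prior, sigma=1.0):
    """Biased estimate of MMD^2 between two batches with an RBF kernel.

    z: mapper outputs, z_prior: samples from N(0, I);
    both of shape (batch, latent_dim). sigma is a bandwidth hyperparameter.
    """
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2          # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(z, z).mean() + k(z_prior, z_prior).mean() - 2 * k(z, z_prior).mean()

# Usage: penalize the mapper outputs for drifting away from the prior.
z = mapper(text_emb)                     # MLP output
z_prior = torch.randn_like(z)            # standard normal prior samples
loss = clip_loss_value + lambda_mmd * mmd_rbf(z, z_prior)
```

The MMD term is differentiable, so it can simply be added to the CLIP loss and weighted with a hyperparameter.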

xiankgx commented 2 years ago

Nice, let me have a look. Thanks for your suggestion!