mehdidc / feed_forward_vqgan_clip

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt
MIT License

How to improve so we could get results closer to the "regular" VQGAN+CLIP? #14

Open · apolinario opened this issue 3 years ago

apolinario commented 3 years ago

Hi! I really love this idea and think that this concept solves the main bottleneck of the current VQGAN+CLIP approach, which is the per-prompt optimisation of the latent space. I love how instantaneous this approach is at generating new images. However, results with the different pretrained models (CC12M or blog captions) fall short compared to the most recent VQGAN+CLIP optimisation approaches.

I am wondering where it could potentially be improved. One idea could be to incorporate the MSE-regularised and z+quantize techniques from the most recent VQGAN+CLIP approaches. The other is whether a bigger training dataset would improve the quality: would it make sense to train it on ImageNet captions, or maybe even a bigger 100M+ caption dataset? (maybe C@H?)

As you can see, I can't contribute much technically (though I could help with a bigger dataset training effort), but I'm cheering for this project and hope it doesn't die!

mehdidc commented 2 years ago

Hi @apolinario, thanks for your interest! Indeed, the quality does not match the optimization approaches yet; the problem could come from the model architecture and/or the loss function. There is an issue by @afiaka87, #8 "Positional Stickiness", which describes one problem that seems to be persistent (it happens regardless of model size or dataset size), and we are still not certain why it happens.
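For context on what "loss function" means here, below is a minimal sketch of the kind of feed-forward objective involved: a mapper network turns a CLIP text embedding into a VQGAN latent grid, the grid is decoded to an image, and the image is re-encoded with CLIP so its embedding can be pushed towards the text embedding. This is illustrative only, not the repo's actual code; the module names, shapes, and the `vqgan_decode` / `clip_encode_image` callables are placeholders standing in for pretrained, frozen models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToLatentMapper(nn.Module):
    """Maps a CLIP text embedding to a flattened VQGAN latent grid (illustrative)."""
    def __init__(self, clip_dim=512, z_channels=256, grid_size=16):
        super().__init__()
        self.z_channels = z_channels
        self.grid_size = grid_size
        self.net = nn.Sequential(
            nn.Linear(clip_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, z_channels * grid_size * grid_size),
        )

    def forward(self, text_emb):
        z = self.net(text_emb)
        return z.view(-1, self.z_channels, self.grid_size, self.grid_size)

def training_step(mapper, text_emb, vqgan_decode, clip_encode_image):
    """One step on a batch of CLIP text embeddings.

    `vqgan_decode` and `clip_encode_image` are assumed to wrap pretrained,
    frozen VQGAN / CLIP models (placeholders, not actual library calls).
    """
    z = mapper(text_emb)                   # predicted latent grid
    images = vqgan_decode(z)               # decode latents to RGB images
    image_emb = clip_encode_image(images)  # re-embed the generated images
    # maximise cosine similarity between image and text embeddings
    loss = -F.cosine_similarity(image_emb, text_emb, dim=-1).mean()
    return loss
```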

"I guess one thing could be trying to embed the MSE regularised and z+quantize most recent VQGAN+CLIP approaches."

Could you please give more details about this approach, or a reference? I could try it out.

"ImageNet captions" I wasn't aware there are captions for ImageNet, do you have a link or repo?

Thanks

apolinario commented 2 years ago

Hi @mehdidc, thanks for getting back on this. So this is an "MSE regularised and z+quantize VQGAN+CLIP" notebook. There is debate about whether it actually improves quality, but it seems to be preferred and widely adopted by some digital artists and the EAI community.
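Roughly, as I understand it (a sketch under my own assumptions, not the notebook's exact code), one optimisation step looks like the following: the continuous latent z is optimised directly, quantized with a straight-through estimator before decoding (the "z+quantize" part), and pulled back towards a reference latent by an MSE penalty (the "MSE regularised" part). The `vqgan_quantize` / `vqgan_decode` / `clip_encode_image` callables are placeholders for pretrained models.

```python
import torch
import torch.nn.functional as F

def optimisation_step(z, z_ref, optimizer, vqgan_quantize, vqgan_decode,
                      clip_encode_image, text_emb, mse_weight=0.1):
    """z, z_ref: latent grids; the callables wrap frozen VQGAN / CLIP models."""
    optimizer.zero_grad()
    z_q = vqgan_quantize(z)              # snap to nearest codebook entries
    z_q = z + (z_q - z).detach()         # straight-through gradient estimator
    image = vqgan_decode(z_q)
    image_emb = clip_encode_image(image)
    clip_loss = -F.cosine_similarity(image_emb, text_emb, dim=-1).mean()
    mse_loss = F.mse_loss(z, z_ref)      # keep z close to the reference latent
    loss = clip_loss + mse_weight * mse_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```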

And yeah, actually "ImageNet captions" don't indeed exist, I just had the naive thought of trying to train it in similar captions of the dataset VQGAN itself was trained without putting more thought into. However, with the release of the first big dataset output from the crawl@home project, I think the LAION-400M or a subset of it could suit very well for training

And thanks for letting me know about the persistent #8 "Positional Stickiness" issue. I noticed similar behavior while using the model. I will try to look into it and help bring some attention to it as well.