mehdidc / feed_forward_vqgan_clip

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt
MIT License

Not an issue - richer datasets #6

Open johndpope opened 3 years ago

johndpope commented 3 years ago

Are you familiar with this? https://twitter.com/e08477/status/1418440857578098691?s=21

I want to do cityscape shots. Are you familiar with any relevant datasets? Can this repo help output higher quality images? Or does it help with the prompting?

mehdidc commented 3 years ago

Hi, I was not aware of these, they are very beautiful! The repo is not meant to output higher-quality images (quality should be the same as the VQGAN-CLIP examples) or to help with prompting; it is meant to do the same thing without needing an optimization loop for each prompt, and it can also generalize to new prompts that were not seen in the training set. All you need is to collect/build a dataset of prompts and train the model on it; once that is done, you can generate images for new prompts in a single step (no optimization loop). I will also shortly upload pre-trained model(s) based on Conceptual Captions 12M prompts (https://github.com/google-research-datasets/conceptual-12m), if you would like to give it a try without re-training from scratch. Also, since you obtain a model at the end, you can additionally interpolate between the generated images of different prompts. I hope the goal of the repo is clearer.
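A rough sketch of what single-step inference looks like, for illustration only (the repo's actual entry point is the main.py test command used further down in this thread). It assumes the saved model.th deserialises to a callable module that maps a CLIP text embedding to VQGAN latent codes; decoding the latents to pixels with the VQGAN decoder is omitted.

```python
# Illustrative sketch, not the repo's actual API. Assumes model.th stores the whole
# feed-forward network as a callable module (CLIP text embedding -> VQGAN latents).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
perceptor, _ = clip.load("ViT-B/32", device=device, jit=False)

net = torch.load("pretrained_models/cc12m_32x1024/model.th", map_location=device)
net.eval()

prompt = "deviantart, volcano"
with torch.no_grad():
    tokens = clip.tokenize([prompt]).to(device)
    text_embed = perceptor.encode_text(tokens).float()
    z = net(text_embed)  # single forward pass: no per-prompt optimization loop
# z would then be decoded to an image by the VQGAN decoder.
```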

johndpope commented 3 years ago

"so no optimization loop" - does that mean there's no 500x iterations to get a good looking image?

fyi - @nerdyRodent

mehdidc commented 3 years ago

" does that mean there's no 500x iterations to get a good looking image?" Yes

mehdidc commented 3 years ago

Following the tweet you mentioned above, here is an example with "deviantart, volcano": https://imgur.com/a/cYMsNo5 with a model currently being trained on conceptual captions 12m.

mehdidc commented 3 years ago

@johndpope I added a bunch of pre-trained models if you want to give it a try

johndpope commented 3 years ago

I had a play with the 1.7 GB cc12m_32x1024 model. I couldn't get the high quality I was getting with VQGAN-CLIP; I'll keep trying, bumping the dimensions. Maybe the docs could use some pointers on resolution (256x256 / 512x512, etc.). One thing is clear: this runs very quickly. Perhaps there could be an effort to provide hot serving, where you could give it a new prompt while running it as a service, almost in real time, without turning off the engine so to speak. We talk about FPS, frames per second; could we see a VQPS?
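A tiny, purely illustrative sketch of the "VQPS" idea: keep the trained network resident in memory and measure prompts generated per second. The generate argument is a placeholder for one forward pass of an already-loaded feed-forward model; nothing here is the repo's actual API.

```python
# Illustrative only: `generate` stands in for a single forward pass of an
# already-loaded feed-forward model; this is not the repo's actual API.
import time

def measure_vqps(generate, prompts):
    """Generate one image per prompt with the model kept loaded, and report throughput."""
    start = time.perf_counter()
    images = [generate(p) for p in prompts]
    elapsed = time.perf_counter() - start
    print(f"{len(prompts) / elapsed:.2f} prompts per second ('VQPS')")
    return images
```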

Here are some images I turned out over the weekend: https://github.com/nerdyrodent/VQGAN-CLIP/issues/13

Observations: when I threw in a parameter, it was clearly identifiable. "Los Angeles | 35mm", e.g. https://twitter.com/johndpope/status/1419352229031518209/photo/1

Los Angeles Album Cover https://twitter.com/johndpope/status/1419354082192412679/photo/1

This didn't quite cut it:

python -u main.py test pretrained_models/cc12m_32x1024/model.th "los angeles album cover"

Another improvement for newbies: you could consider integrating these model downloads into the README: https://github.com/nerdyrodent/VQGAN-CLIP/blob/5edb6a133944ee735025b8a92f6432d6c5fbf5eb/download_models.sh

afiaka87 commented 3 years ago

@johndpope have you considered re-embedding the outputs from the trained vitgan as CLIP image embeds, and then using those as prompts for a "normal" VQGAN-CLIP optimization with a much higher learning rate than usual and fewer steps? That would allow you to use non-square dimensions.
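For reference, a minimal sketch of the re-embedding step using the OpenAI clip package. The image path is a placeholder, and how the resulting embedding is fed into a VQGAN-CLIP run depends on that code base's prompt handling.

```python
# Minimal sketch: re-embed a generated image with CLIP. The image path is a
# placeholder; passing the embedding to a VQGAN-CLIP run depends on that repo.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("vitgan_output.png")).unsqueeze(0).to(device)
with torch.no_grad():
    image_embed = model.encode_image(image)
    image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)  # L2-normalise, as CLIP similarity expects
# image_embed can then act as the target for a short, high-learning-rate VQGAN-CLIP optimization.
```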

Also, one of the other primary benefits of this approach is that fine-tuning from one of the checkpoints, or even training your own from scratch, can be relatively simple, since all you need are some captions, which can be generated or typed out. You'll want to cover a large-ish corpus, but using something like the provided MIT States captions as a base should be a good start.

Thanks for the extra info. I'm a little busy today but I think the README might need one or two more things and possibly a colab notebook specific to training (if we don't have that already) that would make it easy to customize MIT states.

Edit: real-time updates to your captions, displaying the rate of generation, etc., may be outside the scope of the project.