mehdidc / feed_forward_vqgan_clip

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt
MIT License

Observations training with different modifying words/phrases #7

Closed afiaka87 closed 3 years ago

afiaka87 commented 3 years ago

While searching for more photo-realistic output, I've found that training on certain words is likely to bias the output heavily.

"illustration"/"cartoon" biases heavily towards a complete lack of photorealism in favor of very abstract interpretations that are often too simple in fact.

Here is an example from training on the blog post captions with the word "minimalist" prepended to each caption (and all mannequin captions removed; they make up about 1/16 of all the captions):

(image: progress_0000019700)

In the EleutherAI Discord, a user @kingdomakrillic posted a very useful link https://imgur.com/a/SnSIQRu showing the effect a starting caption/modifier caption has on various other words when generating an image using the VQGAN + CLIP method.

With those captions, I decided to randomly prepend to the blog post captions all the modifying words/phrases which produced a (subjectively) photo-realistic output (a rough sketch of this prepending step follows the list below):

        "8k resolution",
        "Flickr",
        "Ambient occlusion",
        "filmic",
        "global illumination",
        "Photo taken with Nikon D750",
        "DSLR",
        "20 megapixels",
        "photo taken with Ektachrome",
        "photo taken with Fugifilm Superia",
        "photo taken with Provia",
        "criterion collection",
        "National Geographic photo ",
        "Associated Press photo",
        "detailed",
        "shot on 70mm",
        "3840x2160",
        "ISO 200",
        "Tri-X 400 TX",
        "Ilford HPS",
        "matte photo",
        "Kodak Gold 200",
        "Kodak Ektar",
        "Kodak Portra",
        "geometric",

With this in place, outputs tend to be much more photorealistic (similar caption to above, less than 1 epoch trained). The decoded prompt was:

`<|startoftext|>2 0 megapixels photo of richmond district , san francisco , from a tall vantage point in the morning <|endoftext|>` (trailing padding tokens omitted)

(image: progress_0000005100)

None of this is very principled, however, and my next attempts were indeed going to be either "add noise to the captions" or "train on image-text pairs as well", both of which seem to be in the codebase already! So I'm going to have a try with that.

In the meantime, here is a checkpoint from the first round of captions ("minimalist" prepended to every blog caption, with all captions containing "mannequin" removed). I trained it using the vitgan for 8 epochs: 128 dim, 8 depth, ViT-B/16, 32 cutn. The loss was perhaps still going down at this point, but with very diminished returns.

model.th.zip

afiaka87 commented 3 years ago

@mehdidc I recommend having a look at this link in particular: https://imgur.com/a/SnSIQRu

It's quite revealing as to the zero-shot style transfer capabilities of CLIP, in my opinion; those capabilities are definitely there but may require some "prompt engineering" to achieve.

mehdidc commented 3 years ago

Wow, very cool! Thanks for sharing the details and the model; we can already put it in the README. By the way, did you get truncate=True in clip.tokenize working in your experiments? It seems like the argument does not exist; not sure if I was using the wrong version.

"None of this is very principled, however, and my next attempts were indeed going to be either "add noise to the captions" or "train on image-text pairs as well", both of which seem to be in the codebase already! So I'm going to have a try with that."

Could you please elaborate on the noise + image-text pairs? I haven't thought about this possibility.

"@mehdidc I recommend having a look at this link in particular: https://imgur.com/a/SnSIQRu"

Yes, I have seen these! Really nice, it would be cool to reproduce them.

mehdidc commented 3 years ago

I think what you started to do here could lead to a new kind of prompting that has an effect at the model level rather than on a single generated instance; that is, how to construct a dataset of prompts (rather than a single one) such that, after training, the model has some wanted/desired properties.

afiaka87 commented 3 years ago

> Wow, very cool! Thanks for sharing the details and the model; we can already put it in the README. By the way, did you get truncate=True in clip.tokenize working in your experiments? It seems like the argument does not exist; not sure if I was using the wrong version.
>
> "None of this is very principled, however, and my next attempts were indeed going to be either "add noise to the captions" or "train on image-text pairs as well", both of which seem to be in the codebase already! So I'm going to have a try with that."
>
> Could you please elaborate on the noise + image-text pairs? I haven't thought about this possibility.
>
> "@mehdidc I recommend having a look at this link in particular: https://imgur.com/a/SnSIQRu"
>
> Yes, I have seen these! Really nice, it would be cool to reproduce them.

It wasn't so much "noise + image-text pairs" as it was "add noise to the text" or "train on image-text pairs".

Edit: @mehdidc CLIP allows you to embed both images and text and compare the two with a cosine similarity. You can also compare image-to-image with the cosine similarity. As such, you could actually train on image-text pairs to try to improve the results. To do so, you would compare an image embed of a dataset image with an image embed from the cutouts, in more or less the same way you currently do with the caption embeds.
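To illustrate, here is a minimal sketch of such an image-to-image loss term. The `generator` (mapping CLIP text embeddings to images) and `make_cutouts` (assumed to resize cutouts to CLIP's input resolution) are placeholders, not the repo's actual API.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
perceptor, preprocess = clip.load("ViT-B/16", device=device, jit=False)

def image_pair_loss(dataset_image, caption, generator, make_cutouts):
    """Hypothetical loss: cosine distance between CLIP embeds of the
    generated image's cutouts and the CLIP embed of the real paired image."""
    # Embed the ground-truth dataset image (batch of 1); no gradient needed here.
    with torch.no_grad():
        target = perceptor.encode_image(preprocess(dataset_image).unsqueeze(0).to(device))
    target = F.normalize(target.float(), dim=-1)                # (1, dim)

    # Generate an image from the caption's CLIP text embedding.
    tokens = clip.tokenize([caption], truncate=True).to(device)
    text_embed = perceptor.encode_text(tokens).float()
    generated = generator(text_embed)                           # (1, 3, H, W)

    # Embed cutouts of the generated image, as is done for the caption loss.
    cutout_embeds = perceptor.encode_image(make_cutouts(generated))
    cutout_embeds = F.normalize(cutout_embeds.float(), dim=-1)  # (cutn, dim)

    # 1 - cosine similarity, averaged over cutouts (target broadcasts over them).
    return (1 - (cutout_embeds * target).sum(dim=-1)).mean()
```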

The noise suggestion was in reference to what (I believe) is currently in the codebase as the "noise vector" or something similar.
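For what it's worth, one simple form of "adding noise to the text" at the embedding level might look like the sketch below; the noise scale and the injection point are assumptions on my part, not what the codebase's noise vector actually does.

```python
import torch

def add_embed_noise(text_embeds: torch.Tensor, noise_scale: float = 0.1) -> torch.Tensor:
    """Perturb CLIP text embeddings with Gaussian noise before they are fed
    to the generator (hypothetical sketch)."""
    return text_embeds + noise_scale * torch.randn_like(text_embeds)
```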

afiaka87 commented 3 years ago

> Wow, very cool! Thanks for sharing the details and the model; we can already put it in the README. By the way, did you get truncate=True in clip.tokenize working in your experiments? It seems like the argument does not exist; not sure if I was using the wrong version.

For the truncate bit, it's perhaps a good idea to uninstall the clip package at the global, system, and local levels, just to make sure you're not using an old one. Aside from that, I think I may have had to use the sys.path.append trick. @rom1504's branch of CLIP is pip-installable (pip install clip-anytorch), however, and they recently merged the upstream additions from OpenAI. I'll try to work on a revert PR for that today.
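For reference, a quick way to check which clip package is actually being imported and whether its tokenize supports truncate (the printed path is of course environment-specific):

```python
import inspect
import clip

# Show where the clip package is being imported from.
print(clip.__file__)

# Check whether this version of clip.tokenize accepts the truncate argument.
print("truncate" in inspect.signature(clip.tokenize).parameters)

# If it does, captions longer than CLIP's 77-token context are cut off
# instead of raising an error.
tokens = clip.tokenize(["a very long caption " * 50], truncate=True)
print(tokens.shape)  # torch.Size([1, 77])
```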

mehdidc commented 3 years ago

Great, thanks for checking this out