Closed afiaka87 closed 3 years ago
@mehdidc I recommend having a look at this link in particular: https://imgur.com/a/SnSIQRu
It's quite revealing as to the zero-shot style-transfer capabilities of CLIP, in my opinion: the capabilities are definitely there, but they may require some "prompt engineering" to achieve.
Wow, very cool! Thanks for sharing the details and the model; we can already put it in the README.
By the way, did you get `truncate=True` in `clip.tokenize` working in your experiments? It seems like the argument does not exist; I'm not sure if I was using the wrong version.
"None of this is very principled however and my next attempts were indeed going to be either "add noise to the captions" or "train on image-text pairs as well" - both of which seem to be in the codebase already! So I'm going to have a try with that."
Could you please elaborate on the noise + image-text pairs? I hadn't thought about this possibility.
"@mehdidc I recommend having a look at this link in particular: https://imgur.com/a/SnSIQRu"
Yes, I have seen these! Really nice; it would be cool to reproduce them.
I think what you've started here could lead to a new kind of prompting that acts at the model level rather than on a single generated instance: constructing a dataset of prompts (rather than a single prompt) so that the model trained on it ends up with some wanted/desired properties.
It wasn't so much "noise + image-text pairs" as it was "add noise to the text" or "train on image-text pairs".
Edit: @mehdidc CLIP allows you to embed both images and text and compare the two with a cosine similarity. You can also compare image-to-image with cosine similarity. As such, you could actually train on image-text pairs to try to improve the results. To do so, you would compare an image embed of a dataset image with an image embed of the cutouts, in more or less the same way you currently do with the caption embeds.
The noise suggestion was in reference to what (I believe) is currently in the codebase as the "noise vector" or something similar.
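The image-pair idea can be sketched as follows. This is a hedged illustration, not the actual codebase: `image_pair_loss` and its arguments are made-up names, and it assumes a CLIP-style model exposing `encode_image` plus a torchvision-style `preprocess` transform.

```python
import torch

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity along the last dimension."""
    a = a / a.norm(dim=-1, keepdim=True)
    b = b / b.norm(dim=-1, keepdim=True)
    return (a * b).sum(dim=-1)

def image_pair_loss(clip_model, preprocess, dataset_image, cutouts):
    """Hypothetical loss: compare a CLIP image embed of a dataset image
    with image embeds of the generator's cutouts, mirroring how the
    caption-embed loss is computed."""
    with torch.no_grad():
        target = clip_model.encode_image(preprocess(dataset_image).unsqueeze(0))
    cutout_embeds = clip_model.encode_image(cutouts)  # cutouts assumed preprocessed
    return (1 - cosine_sim(cutout_embeds, target)).mean()
```

The only change from the text path is which encoder produces the target embed; the cosine-similarity objective stays the same.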
For the `truncate` bit: it's perhaps a good idea to uninstall the `clip` package at the global, system, and local level, just to make sure you're not using an old one. Aside from that, I think I may have had to use the `sys.path.append` trick. @rom1504's branch of CLIP is pip installable (`pip install clip-anytorch`), however, and they recently merged the upstream additions from OpenAI. I'll try to work on a revert PR for that today.
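If the installed CLIP predates the `truncate` argument, the behavior is straightforward to emulate on the token list yourself. A minimal pure-Python sketch, assuming the standard 77-token context window and that the final kept position should hold the end-of-text token (the helper name is made up):

```python
CONTEXT_LENGTH = 77  # CLIP's standard text context length

def truncate_tokens(tokens, eot_token, context_length=CONTEXT_LENGTH):
    """Emulate clip.tokenize(..., truncate=True) for over-long captions:
    keep the first context_length tokens and force the last to EOT."""
    if len(tokens) <= context_length:
        return list(tokens)
    truncated = list(tokens[:context_length])
    truncated[-1] = eot_token
    return truncated
```

Without this (or `truncate=True`), `clip.tokenize` raises an error on captions longer than the context window rather than silently clipping them.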
Great, thanks for checking this out
Searching for a more photo-realistic output, I've found that training on certain words is likely to bias the output heavily.
"illustration"/"cartoon" bias heavily towards a complete lack of photorealism, in favor of very abstract interpretations that are often in fact too simple.
Here is an example from training on the blog-post captions with the word "minimalist" prepended to each caption (and all mannequin captions, about 1/16 of the total, removed).
In the EleutherAI Discord, a user @kingdomakrillic posted a very useful link, https://imgur.com/a/SnSIQRu, showing the effect a starting/modifier caption has on various other words when generating an image using the VQGAN + CLIP method.
With those captions in hand, I decided to randomly prepend the modifying words/phrases which produced a (subjectively) photo-realistic output to the blog-post captions.
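The augmentation step can be sketched like this. The modifier list below is an illustrative sample, not the actual set pulled from the imgur album, and the function name is made up:

```python
import random

# Example modifier phrases; the real set was hand-picked from
# the VQGAN+CLIP modifier comparison album linked above.
PHOTO_MODIFIERS = [
    "a photo of",
    "8k resolution",
    "35mm photograph of",
]

def prepend_random_modifier(captions, modifiers, rng=None):
    """Return new captions, each with a randomly chosen modifier prepended."""
    rng = rng or random.Random()
    return [f"{rng.choice(modifiers)} {caption}" for caption in captions]
```

Randomizing the modifier per caption, rather than fixing one, keeps the model from overfitting to a single style phrase.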
With this in place, outputs tend to be much more photorealistic (similar caption to above, less than 1 epoch trained):
<|startoftext|>2 0 megapixels photo of richmond district , san francisco , from a tall vantage point in the morning <|endoftext|>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
None of this is very principled however and my next attempts were indeed going to be either "add noise to the captions" or "train on image-text pairs as well" - both of which seem to be in the codebase already! So I'm going to have a try with that.
In the meantime, here is a checkpoint from the first round of captions (prepend "minimalist" to every blog caption, removing all captions containing "mannequin"). I trained it using the `vitgan` for 8 epochs, 128 dim, 8 depth, ViT-B/16, 32 cutn. The loss was perhaps still going down at this point, but with very diminished returns.
model.th.zip