lucidrains / deep-daze

Simple command-line tool for text-to-image generation using OpenAI's CLIP and Siren (an implicit neural representation network). The technique was originally created by https://twitter.com/advadnoun

Are there any inference possibilities? #112

Open miladavaz opened 3 years ago

miladavaz commented 3 years ago

It's currently taking 3-4 hours to run an instance on a Tesla V100 card. Is there any way of applying inference?

NotNANtoN commented 3 years ago

It should not take that long. You can reduce the number of epochs, but it will still not be fast, because a whole network is trained from scratch for every prompt.
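If you want to try that, here is a rough sketch via the Python API; the prompt is a placeholder, and the `epochs`, `iterations`, and `save_every` keyword names are based on the version I have used, so check the `Imagine` signature in your install:

```python
# Hedged sketch: shorten a deep-daze run by lowering epochs and iterations.
# Keyword names may differ between deep-daze versions.
from deep_daze import Imagine

imagine = Imagine(
    text="a cityscape at dusk",  # placeholder prompt
    epochs=5,                    # fewer passes than the default
    iterations=500,              # optimisation steps per epoch
    save_every=100,              # write an intermediate image every 100 steps
)
imagine()
```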

It's possible to pretrain a SIREN network using meta-learning on a text corpus; then the whole training process would be much quicker. I experimented with the simple Reptile approach in one of my projects, AuViMi, to make deep-daze quickly adapt to new input from a webcam, but that's still experimental.

miladavaz commented 3 years ago

Makes sense. I'm discovering that it has a very similar generation pattern. I was wondering whether it is possible to experiment with training it to produce certain styles, so that I could, for example, load in a folder of Picasso paintings, give it a string, and see whether it could render it in a "Picasso" style. I know a single-image blend is possible, but I'd like to go deeper than that.

NotNANtoN commented 3 years ago

Well, for one, you could look up StyleCLIP. There you basically train a StyleGAN2 on Picasso images and then steer the generation using CLIP.

If you really want to make this work with Picasso images and deep-daze, you have several options. For one, CLIP knows Picasso, so you can simply append "... in the style of Picasso" to your text prompt.
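Roughly, following the usage pattern from the README (the prompt itself is just an example):

```python
# Steering the style purely through the prompt: CLIP already "knows" Picasso.
from deep_daze import Imagine

Imagine(text="a portrait of a woman in the style of Picasso")()
```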

If you want to generalize this to concepts CLIP does not know by name, you could extract the CLIP image features of all images in a folder and average them into one feature vector. That averaged image feature vector can then be merged (averaged) with the feature vector of your text prompt and fed into the Imagine class as the clip_encoding. This will probably work, but inference will still not be quick.
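A rough sketch of that idea, assuming the openai/clip package and a clip_encoding keyword on Imagine as described above; the folder path, the prompt, the ViT-B/32 backbone choice, and the 50/50 blend are my own placeholders, and the exact constructor signature may differ between deep-daze versions:

```python
# Hedged sketch: average the CLIP image features of a folder of Picasso images,
# blend them with the text features of a prompt, and hand the result to
# deep-daze as the optimisation target.
import glob

import clip
import torch
from PIL import Image
from deep_daze import Imagine

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode and average every image in the style folder.
image_features = []
for path in glob.glob("picasso/*.jpg"):
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features.append(model.encode_image(image).float())
style_vector = torch.cat(image_features).mean(dim=0, keepdim=True)
style_vector = style_vector / style_vector.norm(dim=-1, keepdim=True)

# Encode the text prompt and merge (average) the two feature vectors.
with torch.no_grad():
    text_vector = model.encode_text(clip.tokenize(["a city skyline"]).to(device)).float()
text_vector = text_vector / text_vector.norm(dim=-1, keepdim=True)
target = (style_vector + text_vector) / 2

# Feed the merged vector to deep-daze; depending on your version you may also
# need to pass text= so the output files get a sensible name.
imagine = Imagine(clip_encoding=target)
imagine()
```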

Lastly, you could use a simple meta-learning approach such as Reptile (by OpenAI) to pretrain a deep-daze network on the images in a folder. The weights of that network would then be optimized to converge to Picasso images in N steps. You can then store the weights of that network and reload them for your text prompt.
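To make the outer loop concrete, here is a rough Reptile sketch. The inner task below is plain pixel regression onto each Picasso image, which is enough to show the meta-update; inside deep-daze the inner objective would be the CLIP similarity loss instead. Everything here (the SIREN builder, folder path, hyperparameters, file name) is a placeholder of my own:

```python
# Hedged sketch of Reptile-style pretraining of a SIREN image network on a
# folder of Picasso images. The resulting weights are meant to be an
# initialisation that converges to a Picasso-like image in few steps.
import copy
import glob

import torch
import torch.nn as nn
from PIL import Image
from torchvision import transforms


class Sine(nn.Module):
    def forward(self, x):
        return torch.sin(30.0 * x)


def make_siren(hidden=256, layers=5):
    # (x, y) coordinate in, RGB value out.
    blocks = [nn.Linear(2, hidden), Sine()]
    for _ in range(layers - 1):
        blocks += [nn.Linear(hidden, hidden), Sine()]
    blocks += [nn.Linear(hidden, 3), nn.Sigmoid()]
    return nn.Sequential(*blocks)


def image_to_task(path, size=64):
    # Turn one image into (coordinates, target colours) for the inner loop.
    img = transforms.ToTensor()(Image.open(path).convert("RGB").resize((size, size)))
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, size), torch.linspace(-1, 1, size), indexing="ij"
    )
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    targets = img.permute(1, 2, 0).reshape(-1, 3)
    return coords, targets


meta_model = make_siren()
meta_lr, inner_lr, inner_steps = 0.1, 1e-4, 50
paths = glob.glob("picasso/*.jpg")

for epoch in range(100):
    for path in paths:
        coords, targets = image_to_task(path)
        # Inner loop: adapt a copy of the meta-weights to this one image.
        task_model = copy.deepcopy(meta_model)
        opt = torch.optim.Adam(task_model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            loss = nn.functional.mse_loss(task_model(coords), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Reptile outer update: nudge the meta-weights toward the adapted weights.
        with torch.no_grad():
            for meta_p, task_p in zip(meta_model.parameters(), task_model.parameters()):
                meta_p += meta_lr * (task_p - meta_p)

torch.save(meta_model.state_dict(), "picasso_siren_init.pt")
```

Loading those weights into deep-daze's own SIREN before optimising against your text prompt depends on the internals of the installed version, so treat that last step as the part you still have to wire up yourself.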