lucidrains / deep-daze

Simple command line tool for text to image generation using OpenAI's CLIP and Siren (Implicit neural representation network). Technique was originally created by https://twitter.com/advadnoun
MIT License
4.37k stars 326 forks

Scheduled size sampling #3

Closed dginev closed 3 years ago

dginev commented 3 years ago

Thanks for this great command-line setup for CLIP+SIREN, very cool repository!

I've been playing with the original notebook and thought I would bring my somewhat experimental code over in a PR. I can't prove it's a general improvement over the uniformly random sampling by Ryan/advadnoun, but it seems to be helpful for my runs.

So, in my own experiments, especially ones going up to 10,000 steps, the random sampling performs quite poorly in the later iterations and seems a bit wasteful (a rough sketch of the scheduling idea follows the notes below). Three small patches that seemed to increase quality (only anecdotally for me, using 32 layers and >4000 steps):

Minor notes in the PR:


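For context on what is being scheduled: each training step crops patches of the SIREN output at randomly drawn sizes before handing them to CLIP for scoring. Below is a minimal sketch of the contrast between uniform and scheduled size sampling; the function names and the particular schedule are illustrative guesses on my part, not the PR's actual code:

```python
import random

def uniform_size(width, lower=0.1, upper=1.0):
    # original notebook behaviour: the crop size is drawn uniformly
    # at random on every step, independent of training progress
    return int(random.uniform(lower, upper) * width)

def scheduled_size(width, step, total_steps, lower=0.1, upper=1.0):
    # hypothetical schedule: raise the minimum crop fraction as training
    # progresses, so late iterations spend less compute on tiny patches
    floor = lower + (upper - lower) * 0.5 * (step / total_steps)
    return int(random.uniform(floor, upper) * width)
```

The exact schedule is a free choice; the point is only that the size distribution shifts with training progress instead of staying uniform throughout.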
I'm also a bit baffled by the fixed initial seed, though it made it easy to cross-compare from the same starting point. Here is a "red apples in a bowl" run at 4000 steps, 32 layers, from main:

red apples in a bowl

And here is the same "red apples in a bowl" 4000 steps, 32 layers, from this PR:

red apples in a bowl, PR

Anecdotal, but the PR-generated image file is 8% larger, and the file size tends to keep increasing as the model saturates. Even if I'm completely off base, which is possible, maybe someone else has related observations from such longer and deeper training runs, or could use this to experiment a little and suggest something better. I was just trying to squeeze a little more out of the compute. Thanks again for the convenient CLI tool!

dginev commented 3 years ago

Since the main branch has moved on since I opened the PR, here's another generation of the input above (again 32 layers, 4000 steps) with the latest main (at 94a60d4dcc78a8bbcdb35b815cd0ec2b63d5adf0). The image file size halved, and training seemed to hit a roadblock a bit earlier than usual, but this is still anecdotal territory; it's hard to draw conclusions from a single example run.

lucidrains commented 3 years ago

@dginev Amazing Deyan! Let me try out your regularizations later tonight and see if there is a difference

For now, I've increased the depth, because I noticed an indisputable improvement just going from 16 to 32 layers, as you are doing in your PR!

lucidrains commented 3 years ago

@dginev What are your thoughts on perhaps bringing in a discriminator from a well-trained GAN (e.g. BigGAN) and biasing the learning toward realism as well? (even though the surreal outputs at the moment are already quite enjoyable)

dginev commented 3 years ago

You mean similarly to advadnoun's recent attempts? I'm a fan of the goal and hope it works, but I'm truly just learning the ropes in the visualization domain, coming from the NLP side of things.

lucidrains commented 3 years ago

@dginev I guess my idea is different in that I'll be using both CLIP and a discriminator on the output of the SIREN. The discriminator will be an extra critic for realism.
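If I'm reading the proposal right, it would amount to summing two losses over the same rendered image, roughly as sketched below. Here `disc` stands in for a frozen pretrained discriminator and `lambda_real` is a made-up weighting knob; the per-model input resizing is omitted for brevity:

```python
import torch.nn.functional as F

def combined_loss(siren, clip_model, text_embed, disc, lambda_real=0.1):
    # render the current image from the implicit representation
    image = siren()  # assumed shape: (1, 3, H, W)

    # CLIP critic: maximize similarity between image and text embeddings
    image_embed = clip_model.encode_image(image)
    clip_loss = -F.cosine_similarity(image_embed, text_embed, dim=-1).mean()

    # realism critic: softplus(-logit) is the standard non-saturating
    # generator loss against a discriminator's real/fake logit
    realism_loss = F.softplus(-disc(image)).mean()

    return clip_loss + lambda_real * realism_loss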

dginev commented 3 years ago

Oh I see, interesting; that I haven't seen or thought about. You would need the latent spaces to be "compatible" in some loose sense, so that the realism doesn't appear as out-of-distribution (OOD) junk to CLIP. If the prompt is "a photo of a newspaper cartoon about Calvin and Hobbes", one has to wonder how the two models will align, since that prompt genuinely straddles both realms.

The big issue currently bothering me is the released CLIP models' fixed 224x224 input resolution. It's hard for me to estimate how much of the trouble in generation fidelity actually comes from scaling into and out of that form factor, both in the generation we're doing here and in the training data given to CLIP in the first place. I'm also not sure whether this will add extra artifacts when sitting next to a BigGAN co-guide...
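To make the concern concrete: no matter what size patch gets sampled, it ends up resampled into CLIP's fixed input grid. A rough sketch of that step, under my own naming rather than the repository's exact code:

```python
import random
import torch.nn.functional as F

CLIP_RES = 224  # the released CLIP models only accept 224x224 inputs

def cutout_for_clip(image, size):
    # image: (1, 3, H, W) SIREN output; take a random square crop of
    # `size` pixels, then resample it to CLIP's fixed input resolution
    _, _, h, w = image.shape
    y = random.randint(0, h - size)
    x = random.randint(0, w - size)
    crop = image[:, :, y:y + size, x:x + size]
    # every crop, large or small, passes through the same 224x224
    # bottleneck: this is the scaling step in question
    return F.interpolate(crop, size=CLIP_RES, mode='bilinear',
                         align_corners=False)
```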

If everything were trained over the same data and resolution, it would be a little saner. The comment I love giving is that "the NLP prompts are really addictive to play with", so one can definitely sink a lot of time experimenting...