Since the main branch has moved on since I made the PR, here's another generation of the input above (again 32 layers, 4000 steps) with the latest main (at 94a60d4dcc78a8bbcdb35b815cd0ec2b63d5adf0). The image size halved, and training seemed to hit a roadblock a bit earlier than usual, but this is still anecdotal land - hard to draw conclusions from one example run.
@dginev Amazing Deyan! Let me try out your regularizations later tonight and see if there is a difference
For now, I've increased the depth, because I notice an indisputable improvement just going from 16 to 32 layers, as you are doing in your PR!
@dginev What are your thoughts on perhaps bringing in a discriminator from a well-trained GAN (e.g. BigGAN) and biasing the learning toward realism as well? (even though the surreal outputs at the moment are already quite enjoyable)
You mean similarly to advadnoun's recent attempts? I'm a fan of the goal and hope it works, but am truly just learning the ropes in the visualization domain, coming from the NLP side of things.
@dginev I guess my idea is different in that I'll be using both CLIP and a discriminator on the output of the SIREN. The discriminator will be an extra critic on realism.
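Roughly, I'm imagining something like this (a minimal sketch only; `clip_model`, `discriminator`, `siren_image`, and `text_embed` are placeholders for whatever pretrained networks and prompt embedding get plugged in, not code from either repo):

```python
# Minimal sketch only (placeholder names, not code from either repo): combine
# a CLIP similarity loss with a discriminator acting as an extra realism critic
# on the SIREN output.
import torch
import torch.nn.functional as F

def combined_loss(siren_image, text_embed, clip_model, discriminator, realism_weight=0.1):
    # CLIP expects 224x224 inputs, so resize the generated image first
    clip_input = F.interpolate(siren_image, size=(224, 224), mode='bilinear', align_corners=False)
    image_embed = clip_model.encode_image(clip_input)

    # maximize cosine similarity between the image and the text prompt
    clip_loss = -F.cosine_similarity(image_embed, text_embed, dim=-1).mean()

    # the discriminator scores realism; higher score = more realistic
    realism_loss = -discriminator(siren_image).mean()

    return clip_loss + realism_weight * realism_loss
```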
Oh I see, interesting - that I haven't seen or thought about. You would need the latent spaces to be "compatible" in some loose sense, so that the realism doesn't appear as OOD junk to CLIP. If the prompt is "a photo of a newspaper cartoon about Calvin and Hobbes", one has to wonder how the two models will align - since that is genuinely intersecting the realms.
The big issue currently bothering me is the released CLIP's fixed 224x224 input resolution. It's hard for me to estimate how much of the trouble in generation fidelity actually comes from scaling into/out of that form factor, both in the generation we're doing here and in the training data given to CLIP in the first place. I'm also not sure whether this will add extra artifacts next to a BigGAN co-guide...
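For concreteness, here are the two usual ways a non-square generation gets forced into that 224x224 input (a toy illustration with placeholder tensors, not this repo's code) - each one distorts or discards information in its own way:

```python
# Toy illustration (placeholder tensors, not this repo's code): two ways of
# forcing a non-square generation into CLIP's fixed 224x224 input.
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

image = torch.rand(1, 3, 512, 384)  # hypothetical non-square SIREN output

# Option 1: squash directly to 224x224, distorting the aspect ratio
squashed = F.interpolate(image, size=(224, 224), mode='bilinear', align_corners=False)

# Option 2: center-crop to a square first, then resize, discarding the borders
square = TF.center_crop(image, output_size=min(image.shape[-2:]))
resized = F.interpolate(square, size=(224, 224), mode='bilinear', align_corners=False)
```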
If everything was trained over the same data and resolution it would be a little saner. The comment I love giving is that "the NLP prompts are really addictive to play with", so one can definitely sink a lot of time experimenting...
Thanks for this great command-line setup for CLIP+SIREN, very cool repository!
I've been playing with the original notebook and thought I would bring my somewhat experimental code over in a PR. I can't prove it's an actual general improvement over the uniformly random sampling by Ryan/advadnoun, but it seems to be helpful for my runs.
So, in my own experiments, especially ones going up to 10,000 steps, the random sampling performs quite poorly in the later iterations, and seems a bit wasteful. Three small patches that seemed to increase quality (only anecdotally for me, using 32 layers and >4000 steps) adjust how the sampling sizes are chosen for each step.
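I won't paste the diffs themselves here, but the general flavor is something like this rough sketch (made-up numbers, not the actual code in this PR): bias the sampled cutout sizes by training progress, rather than drawing them uniformly for the entire run.

```python
# Rough sketch of the idea only (made-up schedule, not this PR's code):
# instead of drawing cutout sizes uniformly at random for the whole run,
# shift the size distribution as training progresses, so late iterations
# waste less compute on tiny, uninformative crops.
import random

def sample_cutout_size(step, total_steps, image_width, min_size=32):
    progress = step / total_steps
    # early on: anything from min_size up; later: favor sizes closer to the full image
    lower = int(min_size + 0.5 * progress * (image_width - min_size))
    return random.randint(lower, image_width)
```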
Minor notes in the PR:
I'm also a bit baffled by the fixed initial seed, though it made it easy to cross-compare from the same initial point. Here is a "red apples in a bowl" 4000 steps, 32 layers, from main:
And here is the same "red apples in a bowl" 4000 steps, 32 layers, from this PR:
Anecdotal, but the PR-generated image is 8% larger (in file size), and the size tends to increase as the model saturates. Even if I'm completely off base here, which is possible - maybe someone else has related observations from such longer+deeper training runs, or could find this useful as a starting point to experiment a little and suggest something better. I was just trying to squeeze a little more out of the compute. Thanks again for the convenient CLI tool!
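For what it's worth, the 8% figure is just from comparing the on-disk PNG sizes of the two outputs, along these lines (hypothetical filenames):

```python
# Crude "detail proxy" (hypothetical filenames): a busier, more saturated image
# tends to compress worse as PNG, so it comes out larger on disk.
import os

main_size = os.path.getsize("apples_main.png")
pr_size = os.path.getsize("apples_pr.png")
print(f"PR image is {100 * (pr_size - main_size) / main_size:.1f}% larger than main")
```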