lucidrains / big-sleep

A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN. Technique was originally created by https://twitter.com/advadnoun

A few observations #15

Open Mut1nyJD opened 3 years ago

Mut1nyJD commented 3 years ago

Not so much an issue, maybe, but a few observations I made while playing a bit with this repo (pretty awesome stuff, btw).

1.) The default iterations and number of epochs are way too high. Usually I noticed that after about 500 iterations, at least with the default learning rate, it tends to collapse or suddenly runs off into some weird state. That's still within the first epoch, so 1 epoch instead of 20 is probably enough.
2.) The default learning rate is probably too high too; I saw more stable convergence with 0.03.
3.) It's still not quite clear what the best prompting strategy is. I tried "a photo", "photo of", "picture of", or just the object description, and they tend to produce completely different results. So maybe it is the number of tokens that has a bigger influence than the tokens themselves.
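
For reference, a rough sketch of what overriding those defaults looks like through the Python API (keyword names as in the project README; treat the exact values and supported arguments as assumptions on my part):

```python
# A minimal sketch, assuming the Imagine class accepts these keyword arguments
# as shown in the README: lower learning rate and a single epoch, per the
# observations above.
from big_sleep import Imagine

dream = Imagine(
    text = "a sailing boat in the sea",  # any prompt
    lr = 0.03,                           # more stable than the higher default
    epochs = 1,                          # one epoch instead of the default 20
    iterations = 500,                    # collapse tended to set in after ~500 steps
    save_every = 25,
    save_progress = True                 # keep intermediate images so the best one can be picked
)
dream()
```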

And now a few questions if you don't mind

What's the benefit of having so many region crops to feed into CLIP? It seems a bit excessive, and I wonder if that's the reason why a lot of results tend to look like collages after a while.

Is the gradient accumulation really necessary or does it make a difference?
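
For context, here is roughly what those region crops amount to, as I understand it (my own conceptual sketch, not the repo's implementation; `clip_model` and `text_embed` are assumed to come from OpenAI's CLIP):

```python
# Random crops of the generated image are resized to CLIP's input resolution and
# their average similarity to the text embedding is what gets maximized.
import torch
import torch.nn.functional as F

def cutout_loss(image, text_embed, clip_model, num_cutouts=64):
    # image: (1, 3, H, W) tensor from BigGAN; text_embed: (1, D) from clip_model.encode_text
    _, _, height, width = image.shape
    crops = []
    for _ in range(num_cutouts):
        size = int(torch.randint(height // 4, height, ()))   # random crop size
        y = int(torch.randint(0, height - size + 1, ()))
        x = int(torch.randint(0, width - size + 1, ()))
        crop = image[:, :, y:y + size, x:x + size]
        crops.append(F.interpolate(crop, size=224, mode='bilinear', align_corners=False))
    image_embeds = clip_model.encode_image(torch.cat(crops))  # (num_cutouts, D)
    sims = torch.cosine_similarity(image_embeds, text_embed)  # broadcasts over the cutouts
    return -sims.mean()  # averaging over many small crops may explain the collage-like look
```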

enricoros commented 3 years ago

@Mut1nyJD great questions. I've experimented a bit with this project and found that for good artistic control, a human-in-the-loop approach is the best. To answer some parts of your questions:

1) the steps performed are iterations × epochs. Beyond that product, the epochs variable doesn't seem to matter: whether I try 10 epochs × 10 iterations or 1 epoch × 100 iterations, I get the same output. For greater quality (assuming convergence), you want to run the algorithm for longer; however, I've seen great results in just 500 steps. It's a parameter anyway, so it's very configurable, or you can simply stop the command-line executable whenever you want.
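
For what it's worth, the nesting is effectively just this (a conceptual sketch, not the repo's actual training loop):

```python
# Only the product epochs * iterations matters for the amount of work done.
def run(epochs, iterations, step):
    total_steps = 0
    for _ in range(epochs):
        for _ in range(iterations):
            step()              # one gradient update on the latents
            total_steps += 1
    return total_steps

assert run(10, 10, lambda: None) == run(1, 100, lambda: None) == 100
```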

2) does it depend on the input text? Sometimes you can get good results with a low rate, sometimes with a high rate. I wonder how an outsider would frame the "learning rate", and how we would pick a default value for the community?

3) I think it defaults to a photo (real-world) render, so I would omit "a photo of" (maybe the picture itself will contain "a photo of..."). I love the output of "An illustration of...", which tends to turn out well. Also, try "x made of y". Sometimes I even use the DALL-E strategy of repeating the text multiple times (see the DALL-E web page to understand what I mean). If there are prompts that work well for you, please share.
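
A few example phrasings in the spirit of the above (purely illustrative prompt strings):

```python
prompts = [
    "a sailing boat in the sea",                     # plain object description
    "an illustration of a sailing boat in the sea",  # style hint instead of "a photo of"
    "a sailing boat made of glass",                  # the "x made of y" pattern
    "a sailing boat in the sea. " * 3,               # DALL-E style repetition of the text
]
```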

Mut1nyJD commented 3 years ago

@enricoros

On 1: yes, having epochs really makes no sense. I've noticed that a high number of iterations usually does nothing for me. If you use the progress output, I see stable results between 50 and 400 iterations, and then it tends to become weird most of the time. It's as if it suddenly flips and then heads down a new alley. Maybe I should monitor the loss and see if there is a correlation.

On 2: could be, I haven't really investigated that deeply. What I do notice though is that it seems to be heavily biased towards dogs. It nearly always starts off with a dog in the image. I wonder if that's because it is one of the biggest classes in ImageNet and the highly unbalanced class distribution of ImageNet is to blame here.

On 3: hmm, I was assuming that CLIP would basically always classify something as "a photo of ...", but yes, maybe you are right and having "photo" in the prompt might cause confusion. I do notice it is very responsive to colors though; once you have a color token in your input text it really focuses on it, and it becomes the most dominant feature.

Mut1nyJD commented 3 years ago

So I made a little video to show some of my experimentation with different learning rates and cutouts, and I am more and more convinced that the defaults for both are too high.

https://www.youtube.com/watch?v=0apLPHoUy3c

All runs use the same seed and the same phrase, "a sailing boat in the sea", running for 2000 iterations; each frame represents the result after 5 iterations. As you can see, most of the time it more or less stabilizes about halfway through, and the changes become minimal at nearly all learning rates except the higher ones. This tells me that more than 1000 iterations seems to be a total waste.

Also, by far the worst results are with 256 cutouts. I think the sweet spot is somewhere between 32 and 64.
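
For anyone who wants to reproduce the sweep, something along these lines should do it (keyword names as in the README; exact argument support is an assumption, and each combination is a full run):

```python
# Sketch of the sweep behind the video: same seed and phrase, varying only lr and cutouts.
from big_sleep import Imagine

for lr in (0.03, 0.05, 0.07, 0.1):
    for num_cutouts in (32, 64, 128, 256):
        dream = Imagine(
            text = "a sailing boat in the sea",
            lr = lr,
            num_cutouts = num_cutouts,
            seed = 1,               # fixed seed so only lr and cutouts change between runs
            epochs = 1,
            iterations = 2000,
            save_every = 5,         # one saved frame every 5 iterations, as in the video
            save_progress = True
        )
        dream()
```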

walmsley commented 3 years ago

@lucidrains Seems like people agree with @Mut1nyJD ... can the num_cutouts default be permanently set to 32? Or do we not want to change too many hyperparameters lest we break someone's existing workflow? I can make a PR if needed, if this is deemed an appropriate thing to do.

htoyryla commented 3 years ago

> Could be, I haven't really investigated that deeply. What I do notice though is that it seems to be heavily biased towards dogs. It nearly always starts off with a dog in the image. I wonder if that's because it is one of the biggest classes in ImageNet and the highly unbalanced class distribution of ImageNet is to blame here.

I think BigGAN was meant to be used to produce images mainly in one class at a time. Perhaps even mixing a few classes. Anyway, all training examples belong to a single class, if I am not mistaken. In other words, the training samples occupy rather small but isolated areas in the 1000-dimensional class space, along a single axis away from the origin.

What happens then if we activate all classes randomly and normalize the class vector? Our class vectors will all be clustered very close to the origin. I guess that, given the absence of training examples in that area, it is simply a byproduct of training that there are dogs there.

This changes by the way when we limit the number of active classes, which we can do with the option max_classes. Then we will get class vectors further away from the origin, and consequently more variation as to the initial images we get. Just try it.
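
A rough numerical illustration of that point (my own sketch, treating the class vector as non-negative weights normalized to sum to one):

```python
import torch

full = torch.rand(1000)
full = full / full.sum()                  # all 1000 classes randomly active, then normalized
sparse = torch.zeros(1000)
sparse[torch.randperm(1000)[:15]] = torch.rand(15)
sparse = sparse / sparse.sum()            # only 15 active classes, cf. max_classes = 15

print(full.norm().item())    # ~0.04: tiny compared to a one-hot training class (norm 1.0)
print(sparse.norm().item())  # ~0.3: noticeably further out from the origin
```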

PS. Ryan Murdock did mention on Twitter the idea of dropping the use of the one-hot class vector altogether and using the 128-element embedding instead. I am talking about skipping the first of these two lines altogether: https://github.com/lucidrains/big-sleep/blob/6afb308eed92ccf748df5c2c608308bf72f7128d/big_sleep/biggan.py#L575-L576 Instead of the 1000-element one-hot vector for the class, one would use a 128-element embedding directly. I did try it quickly, but using it well would require some work as to how to initialise and bound it properly. So far I think using max_classes already does a good job.
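
A rough sketch of how I read that idea (the concatenation mirrors what the linked lines build, but this is not working code for the repo, and the initialization scales are guesses):

```python
import torch

class DirectEmbedLatents(torch.nn.Module):
    # Optimize a 128-dim embedding directly, instead of a 1000-dim class vector
    # that gets mapped through BigGAN's linear embedding layer.
    def __init__(self, z_dim=128, embed_dim=128):
        super().__init__()
        self.z = torch.nn.Parameter(torch.zeros(1, z_dim).normal_(std=1.0))          # noise vector
        self.embed = torch.nn.Parameter(torch.zeros(1, embed_dim).normal_(std=0.1))  # guessed init scale

    def forward(self):
        # the conditioning vector the generator sees, with the embedding lookup skipped
        return torch.cat((self.z, self.embed), dim=1)
```

How to initialise and bound self.embed to the region the generator actually saw during training is exactly the open question mentioned above.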

LtqxWYEG commented 3 years ago

I'm having a bit of a conundrum here. It seems the BigSleep class in /big_sleep/big_sleep.py is not being executed: changing the value of num_cutouts there does nothing ... and neither does deleting the FILE!? What am I doing wrong here? I'm using the notebook. Intentionally produced error messages point to the file /usr/local/lib/python3.7/dist-packages/big_sleep/big_sleep.py, which I deleted, and it still works ... what?!

Edit: I needed to restart the runtime... Oops

gregloryus commented 3 years ago

> So I made a little video to show some of my experimentation with different learning rates and cutouts, and I am more and more convinced that the defaults for both are too high.
>
> https://www.youtube.com/watch?v=0apLPHoUy3c
>
> All runs use the same seed and the same phrase, "a sailing boat in the sea", running for 2000 iterations; each frame represents the result after 5 iterations. As you can see, most of the time it more or less stabilizes about halfway through, and the changes become minimal at nearly all learning rates except the higher ones. This tells me that more than 1000 iterations seems to be a total waste.
>
> Also, by far the worst results are with 256 cutouts. I think the sweet spot is somewhere between 32 and 64.

These tips and takeaways are super useful! Have you played around with the number of classes? I know more classes = more "creativity" but I'm still kinda unclear what's happening. With max_classes = 1000, it seems to keep evolving and rarely converges to a stable result... I've seen 15 recommended for accuracy, but would love to hear other people's thoughts.