lucidrains / deep-daze

Simple command line tool for text to image generation using OpenAI's CLIP and Siren (Implicit neural representation network). Technique was originally created by https://twitter.com/advadnoun

More layers = Worse Result? #96

Open mallorbc opened 3 years ago

mallorbc commented 3 years ago

I have been blessed to have been able to get an RTX 3090, and thus I can run this model with many layers and large batch size.

I have tried 64 layers, 44 layers, 32 layers, 16 layers, etc. In the runs I have done, it seems that, at least for 64 and 44 layers, the produced results are actually worse than with fewer layers. By worse, I mean less colorful and more blurry.

Is there a reason for this? Maybe it's due to the batch size? Any insight would be great.

afiaka87 commented 3 years ago

@mallorbc it would be helpful if you could post some examples. Short of that, all I can say is that you'll need to decrease your learning rate by very small amounts until the image stabilizes. Having said that, a learning rate of 1e-5 (the default, I believe) has worked just fine for me at 44 layers, so that's a bit strange.

I'm not sure the model is really capable of converging at 64 layers either. I don't know; I've tried to find a stable learning rate for that many layers and failed.

If you could post examples of the exact same prompt at 16, 24, 32 and, finally, 44 layers, it would be very helpful. I personally think 64 layers is past the point of diminishing returns, for whatever reason.
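If it helps, a comparison like that can be scripted with the Python API instead of the CLI. A minimal sketch, assuming the `Imagine` class accepts `num_layers`, `lr`, `epochs` and `save_progress` keyword arguments (names may differ between versions):

```python
# Hypothetical layer sweep for the same prompt; keyword names are assumed from
# the deep_daze.Imagine API and may vary between releases.
import gc
import torch
from deep_daze import Imagine

PROMPT = "shattered plates on the grass"   # placeholder prompt

for layers in (16, 24, 32, 44):
    model = Imagine(
        text = PROMPT,
        num_layers = layers,
        lr = 1e-5,              # the default; lower it slightly if the image never stabilizes
        epochs = 1,
        save_progress = True,   # keep intermediate frames for a side-by-side comparison
    )
    model()                     # runs the full training loop for this configuration

    # free GPU memory before the next, deeper run
    del model
    gc.collect()
    torch.cuda.empty_cache()
```

Note that, as far as I know, outputs are named after the prompt, so the files from each run should be moved or renamed before the next one starts.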

mroosen commented 3 years ago

Check out this table I've made, where you can see a quick example of layer/learning-rate combinations:

If you mouse over the final image, the video of the training plays.

Generated images are 416x416, which leaves some headroom on a 24 GB 3090 (lucky to have one too).

https://mroosen.github.io/deep-daze-dreams/
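For reference, a single run roughly matching that setup might look like the sketch below; `image_width`, `num_layers`, `batch_size` and `lr` are assumed keyword names of the `Imagine` class and may differ by version:

```python
# One 416x416 run similar to the grid linked above; keyword names are assumed
# from the deep_daze.Imagine API and may differ between versions.
from deep_daze import Imagine

Imagine(
    text = "a house in the forest",   # placeholder prompt
    image_width = 416,   # 416x416 leaves some headroom on a 24 GB RTX 3090
    num_layers = 32,
    batch_size = 16,
    lr = 1e-5,
)()
```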

NotNANtoN commented 3 years ago

That's a very nice overview! Regarding the question of how to make optimal use of large amounts of GPU memory: at some point it is better to go wider rather than deeper. In #103 I just added the option to set the hidden_size in the CLI; it was previously fixed at 256. The results with larger hidden sizes are much more colorful and converge quicker, but they also diverge quickly - I think that when increasing the hidden_size from 256 to 512, the learning rate should probably be halved, but I have not experimented extensively with it.
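A rough sketch of such a wider-rather-than-deeper configuration with the Python API; note that `hidden_size` only exists in versions that include #103, and halving the learning rate at `hidden_size = 512` is just the heuristic above, not a tested recommendation:

```python
# Width-over-depth configuration; hidden_size requires the change from #103.
# All keyword names are assumed from the Imagine API and may vary between releases.
from deep_daze import Imagine

Imagine(
    text = "cosmic love and attention",
    num_layers = 44,
    batch_size = 16,
    hidden_size = 512,   # default is 256; wider layers are more colorful but diverge sooner
    lr = 5e-6,           # roughly half the 1e-5 default, per the heuristic above
    epochs = 3,
)()
```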

Here you can see a shift in hidden size from 64 to 512, doubling with each row. I trained for 3 epochs, with 44 layers and a batch size of 16. For a hidden size of 512, the pictures shown are from only 1/6 of the training duration - notice how they already look quite converged. If I showed the final images, they would have diverged, a bit like in the lower right of the matrix above by @mroosen.

(image: deepdaze_hidden_size_64_to_512)