Closed NotNANtoN closed 3 years ago
This has been a longstanding issue! Good work.
Hm, this is also failing to build for me on Python 3.6: `RuntimeError: Python version >= 3.7 required`. Did you add a new dependency?
Edit: Ah, nevermind. This is because of the `einops>=0.3` requirement. @lucidrains You may want to bump the Python version number in setup.py to 3.7
Sorry for the very cryptic and ad-hoc size scheduling @NotNANtoN , I'm responsible for that and it was just an incremental patchwork over the original notebook from someone with no real experience. Very encouraging to see it switched to some solid footing, thanks!
Now that I have admitted that I'm a novice and am just tweaking knobs to get as good an outcome as possible, let me inquire about some details for usability here. Arguably usability is complex w.r.t. training schedules, since based on your compute budget and saturation goals, you end up requiring rather different schedules.
Up till now, I am usually running with 32 layers, and am happy if 1 epoch (1050 steps, taking ~12min on a 3090) gets a reasonably recognizable image out, and then decide when to interrupt the training. Not to say this is a "Right Use" of such a library, it's just what I've done so far.
So a typical run would be:
$ imagine --num-layers 32 --save-progress "A llama wearing a scarf and glasses, reading a book in a cozy cafe."
On a freshly installed deep-daze-0.6.2 from pip one epoch gets me:
loss: ~ -65
meanwhile, installing from this branch and running the same exact invocation for the same step gets me:
loss: ~ -55
Your claim that after 5 epochs your refactor will produce a much better image is likely true - I had also noticed that the last/latest phase of my patchy schedule was rather poor in outcomes, so that needed to be improved for sure. But I wonder if the PR adds some penalty in the minimum compute needed to saturate? Not that I have any great solutions besides adding yet-another-customization-flag for different schedules based on compute targets... But adding a flag just sounds like feature creep to me. shrug Whatever works for everyone I guess, but I quite enjoy getting a decent image in ~1000 steps.
General comment: I think the only sane way to compare such PRs to main is running both branches with the same parameters and juxtaposing images. Otherwise it's a bit hard to "see" what was contributed by the PR, and what by e.g. using 44 layers for 5 epochs.
Hi @dginev, thanks for your insightful comment. I agree, my comparison is far from ideal. I always played around with 44 layers and 5-10 epochs and could notice a large improvement for this setting. I agree that a more direct comparison would be better.
A major difference between the old scheduling and the new one is that now, for the default settings, the cutout size is always randomly sampled between 0.1 and 1.0. In the older scheduling, it only went as low as 0.49. So what you could try is to adjust `lower_bound_cutout` to 0.5 and check whether the results are better for fewer epochs in that case. Another option is my attempt at a simple scheduling system using `saturate_bound`. If you set it to True, the lower bound is increased during training (dependent on the total number of batches), which should ideally make the system more robust to different training lengths. I have not tested it extensively yet!
We might need to adjust the scheduling invoked by `saturate_bound`. I see that in the old system there was some kind of cyclical change in the cutout bounds - that could also be implemented. I think the relatively good performance of the old scheduling might be due to its use of a batch size of 4? The pieces_per_group were hardcoded to 4, so I assume it was tuned for that batch size. For larger batch sizes (I used 96 a lot - it barely fits in an 8 GB GPU) the old scheduling would then use many groups per partition, which, as far as I can reason, would lead to some sort of averaging over the different cutout scales used per partition.
I'm curious to hear what you say and would be glad to see any experiments/results.
Side note: I'm first of all surprised and a bit jealous that you have access to a 3090, but even more so about the fact that for me one epoch of 1050 iterations takes 3.5 minutes (with a batch size of 32 and 44 layers). Ah, I just realized, you probably use an `image_width` of 512? Then it all makes sense.
Haha, I was wondering how you're so patient as to do 44 layers for 5 epochs with width 512 and not complain about it! Now things make sense. My honest reaction was "wait, deep-daze can reduce the image width now?" -- I never investigated if that's available. Btw, just double-checked: `--num-layers 32` with the defaults also runs on a 1080 Ti, just under 10 GB allocated.
So, if the image widths we are discussing are variable, and you're using a lower one than me, can we even theoretically share a non-trivial training schedule? We can clearly share a uniformly sampling one, but any dynamic trickery won't really make sense across image dimensions if it is based on step counts. Maybe if it is based on loss...
My partition tricks are really fine-tuned to the saturation speeds one usually sees at width 512. If you halve the width they may produce something entirely different. But that's even true if you vary the number of layers... So yes, my hack has to be removed from the repo, it's only "good"-ish for 512 width 32 layers, and even then I don't like the last two stages of it.
I never went under 0.49 size because the original Colab notebook author had some scary language about really bad training results if small sizes are included. He definitely understood SIREN better than me, but also wasn't proficient in it. In fact, he didn't have the `.sort()` trick in originally, and the training was rather bumpy - that may have been all he was seeing. I'd really have to learn more about SIREN to make any actually useful comments here, but I'm starting to wonder whether:
Hm, hm...
Alright, here's a comparison more in the spirit of the PR. I did 44 layers with a reduced image size of 128. Indeed an epoch for that setup takes 3.5 mins. I should preface with a big thanks! for alerting me to that, it unlocked a whole new miniature art form for me to put llamas in.
I had to triple-check I was running the right executables, since in both the losses and the images produced the PR really reminded me of the notebook training runs from before I made my patches. The results should be as described here, same command:
$ imagine --save-progress --num-layers 44 --image-width 128 \
"A llama wearing a scarf and glasses, reading a book in a cozy cafe."
epoch | main loss | PR loss |
---|---|---|
1 | -57.4 | -47.65 |
2 | -68.8 | -48.7 |
3 | -64.3 | -53.6 |
4 | -72.1 | -51.9 |
5 | -66.8 | -49.48 |
main branch, epochs 1 and 5:
PR branch, epochs 1 and 5:
I can confirm the images in the main branch are very close to what I am used to in the 512 width - this is a common prompt I experiment with, and I've gotten somewhat accustomed to what CLIP fishes out. Meanwhile, the PR branch results seem to have issues saturating, and also have multiple fragments (which is something you encounter if you sample too many small window sizes for too long).
It's just a single run each, so it could also be an outlier, but it raises some questions...
You have a point, I can see some issues there. I just launched some experiments with your llama sentence, varying `lower_bound_cutout` and `saturate_bound` between runs. I was hoping that if someone wants a less fractured image, then 0.5 might be a good value for the lower bound, and the saturation issues might be fixable by using the saturation schedule.
I do see your point about possibly just making this loss-dependent. I feel like it could make sense to have the lower bound become smaller as the loss gets better, as that makes the net focus on more details - at the same time it fractures the image more. I went with the approach of first focusing on details/textures by having a low lower_bound and then more on the larger picture; I'm curious how the experiments turn out. I kind of like the simplicity of my approach, but maybe it is too simple - I am basically just shifting the lower bound of a uniform distribution. We could easily extend it though: 1. Sample from a Gaussian around a mean that increases as the training goes on or as the loss decreases. 2. Add some cyclical increase/decrease as there was in the old schedule.
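For concreteness, the two extensions floated above could be sketched roughly like this. All names here are hypothetical, not part of the deep-daze API; it just combines a Gaussian around a rising mean with a small cyclical modulation:

```python
import math
import random

def sample_size_gaussian(progress, mean_start=0.3, mean_end=0.8,
                         std=0.1, cycle_amp=0.05, cycles=4):
    """Sample a cutout size fraction at a given training progress in [0, 1]."""
    # Mean rises linearly over training (could equally be driven by the loss).
    mean = mean_start + (mean_end - mean_start) * progress
    # Cyclical component, loosely in the spirit of the old schedule.
    mean += cycle_amp * math.sin(2 * math.pi * cycles * progress)
    # Clamp to a valid fraction of the image side.
    return min(1.0, max(0.05, random.gauss(mean, std)))
```

This keeps the one-line simplicity of the uniform scheme while letting both the center and the shape of the distribution evolve over training.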
To be honest, I feel like we are going in a bit blind without a proper plot of the loss and other diagnostics. Maybe we should add a debug mode that generates diagnostic plots?
Agree completely with your latest comments, especially since we're forced into a corner - if the library is to support all these different features, we need to establish and diagnose some reliable aspects of SIREN+CLIP training for the various use cases. And as you say, be able to cross-compare and diagnose failures.
I was really trying not to get sucked into this myself, since my computer vision experience is a very round zero, but if I find the time I'll read the SIREN paper and do some math over the weekend...
What I did try overnight was your new `create_story` feature on the main branch (i.e. with my patchy sizing) - fun, thanks for that as well! And there I can really painfully see how after the first couple of epochs the learning gets stuck in this wavy SIREN-y background substrate. That should be partially related to the sizing partitions we are discussing, and partially to the learning method getting quite stubbornly stuck once it saturates. If one is to get high-quality images on every story epoch, I guess we ought to add some noise between epochs, or start training the next epoch from a midway checkpoint of the previous one, or...
It would be great if you could dig further into it. I wonder how much of this is SIREN-related and how much is related to the gradient-ascent training using CLIP. As far as I have seen, they do not use any random cutouts in the SIREN paper, but I just skimmed it.
Have you tried `create_story` on the PR branch? It works really nicely. Your scheduling seems to work well to make the image converge and saturate - but it also hinders it from evolving into something new.
I did some experimental runs with your query. Settings: batch size 32, `image_width` 256, `num_layers` 44, PR branch. The pictures represent epochs 1, 5, and 8, respectively:
lower_bound_cutout=0.1: lower_bound_cutout=0.5: lower_bound_cutout=0.6:
lower_bound_cutout=0.1: lower_bound_cutout=0.5:
To be honest, I struggle to really see which one is better and which one is worse. What I can clearly observe is that larger lower bounds lead to the appearance of more text (a bad sign IMO) and to slightly less "realistic" images. The `saturate_bound` setting approximates your schedule a bit, as far as I can tell from observing the images.
@NotNANtoN What about the number of iterations it takes to get to a reasonable result? That's really my only concern.
@dginev I think it's safe to say the outputs are just gonna be different now that we're normalizing properly. Take the "shattered plates on the grass" example. The example on the README.md (currently) is just way too dark.
@afiaka87 Aren't the README samples rather outdated at this point? That's part of why we're doing all of this back-and-forth, it takes a lot of work just to figure out who's comparing what. Here's a version of that prompt executed today for 1 epoch:
$ imagine --num-layers=44 --image-width=200 "shattered plates on the grass" --save-progress
with the current v0.6.2 main, loss -46.6:
with this PR, loss -38.9:
I like the PR's output more here as well, but the loss difference worries me...
Another difference from the notebook where I had the scheduling is the way deep-daze is using that code: it has a `batch_size` that only takes some of the sizes, rather than all of them together. My code was expecting 64 pieces passed into `torch.cat`, while deep-daze is by default passing them 4 at a time, so it will take 16 steps to see the entire size batch? Unsure how that influences the training. Only one way to find out: repeating the tests with `--batch-size=64`:
$ imagine --batch-size=64 --num-layers=44 --image-width=200 "shattered plates on the grass" --save-progress
with main v0.6.2, loss -59.5
with this PR, loss -50.6
An epoch takes 4 minutes with the default batch size and 8 minutes with it increased to 64, so there's a good amount of compute hidden in that.
Both runs look "good?", the images I've attached are hard for me to contrast with each other. I'm not sure I learned anything useful...
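For reference, the batch-size mismatch mentioned above can be pictured as simple chunking. This is illustration only (the values are stand-ins, not the real sampled sizes):

```python
# A "size batch" of 64 sampled cutout sizes, as the original notebook
# would feed to torch.cat in one go.
sizes = [0.1 + 0.9 * i / 63 for i in range(64)]

# deep-daze's default batch_size consumes them 4 at a time instead,
# so one full size batch is spread over 64 / 4 = 16 optimizer steps.
batch_size = 4
chunks = [sizes[i:i + batch_size] for i in range(0, len(sizes), batch_size)]
print(len(chunks))  # 16
```

Each step therefore sees only a narrow slice of the size distribution, which may well change the training dynamics compared to averaging the loss over all 64 cutouts at once.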
According to @NotNANtoN's last comment, `lower_bound_cutout=0.5` would get me the llama setup I was looking for with my original comment, already in the first epoch, as in main. If that's the case, feel free to merge the PR, and I'll have to remember to use that option going forward. Thanks!
https://github.com/NotNANtoN/deep-daze/pull/1/files @NotNANtoN Made the CLI changes at least.
@dginev Hm, you're right the output has gotten a good deal better even in 0.6.2.
> Both runs look "good?", the images I've attached are hard for me to contrast with each other. I'm not sure I learned anything useful...
Yeah I'm lost. We could feasibly come up with some sort of diverse set of phrases to evaluate against "visually" but that sounds like a pain.
@afiaka87 Thanks for the CLI additions! I'll merge them in a bit.
I ran the same experiments with a maximum of 1 epoch. That changes the `saturate_bound` experiments, as now the lower bound is increased much more quickly. I think the results are decent and show that one can play around with the parameters depending on what kind of result is wanted.
I think I would switch on `saturate_bound` by default. At the moment it increases the lower bound to 0.8 during training, but I noticed that the last 10% of training led to a loss in sharpness - that means the training went fine until a lower bound of (0.8 - 0.1) * 0.9 + 0.1 = 0.73. So I think I'd drop the upper saturation limit from 0.8 to 0.75. I'll run some experiments for this new setting with the llama. If that goes well I'll recreate the current README examples using 1 epoch. If they look "good" I'll replace the old examples with them and we should be ready to merge.
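A quick check of the arithmetic above. `lower_bound_at` is a hypothetical helper, not the deep-daze API; it just evaluates the linear schedule at a given training fraction:

```python
def lower_bound_at(progress, start=0.1, limit=0.8):
    """Lower cutout bound at training fraction `progress` in [0, 1]."""
    return start + (limit - start) * progress

# Sharpness held up until 90% of training, i.e. until the bound reached:
print(round(lower_bound_at(0.9), 2))  # 0.73
```

So capping the limit at 0.75 keeps the whole run below the point where sharpness was observed to degrade.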
@NotNANtoN We're good to merge this, correct? @lucidrains
I still wanted to replace the readme images. I'm on it
@lucidrains Ready to merge now. Maybe you want to check the README, I replaced the old generations with new ones from this branch. I added two generations for "a man painting a completely red image" and "a psychedelic experience on LSD" because I like them, but feel free to remove some if it is too crowded now.
I added sections for:
`create_story`
@NotNANtoN thanks for the contribution! sorry I don't have much time to look at this, but I'll trust you know what you are doing ;)
Main changes
I cleaned up the cryptic size scheduling that was used for the sampling of random cut-out sizes. Before, there was a weird scheme that adapted neither to the batch size nor to the total number of epochs. I inspected it in detail and found that the scheme was sampling in intervals of width 0.1, ranging from 0.49 up to 1.09 (depending on the schedule). A comment in the code says that the context should increase as the model saturates - which means the sampling should be closer to 1.
The new approach is simple: the random sizes are uniformly sampled between a lower bound (default 0.1) and an upper bound (default 1.0). Both are customizable by the user in the `Imagine` class. I emulated some scheduling by adding the `saturate_bound` parameter. If set to True, it linearly increases the lower bound from the starting value to a limit during training. I set the limit to 0.8 because from 0.8 and above the generations become washed out and unstable. I also noticed that this scheduling does not really bring about any benefits, but I have not experimented extensively with it.
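A minimal sketch of the sampling just described. The names are illustrative and simplified relative to the actual `Imagine` class:

```python
import random

def sample_cutout_size(lower=0.1, upper=1.0, progress=0.0,
                       saturate_bound=False, saturate_limit=0.8):
    """Draw one random cutout size fraction for the current training step."""
    if saturate_bound:
        # Linearly raise the lower bound toward the limit as training proceeds,
        # so late-training cutouts cover more of the image (more "context").
        lower = lower + (saturate_limit - lower) * progress
    return random.uniform(lower, upper)
```

With `saturate_bound=True`, sizes at the start of training are drawn from the full [0.1, 1.0] range, while at the end of training they come from roughly [0.8, 1.0].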
Minor changes
Results
Examples from old README
Anyway, the performance is (from my visual inspection) MUCH better now. I recreated the examples that are currently in the README (num_layers=44, batch_size=32, gradient_accumulate_every=1, 5 epochs - needs less than 8 GB of RAM and about 20 mins):
https://user-images.githubusercontent.com/19983153/109133200-0bdeac00-7755-11eb-8c87-bd18ab38bad6.mp4
https://user-images.githubusercontent.com/19983153/109133230-1305ba00-7755-11eb-840e-d424bb6cbd75.mp4
https://user-images.githubusercontent.com/19983153/109133247-17ca6e00-7755-11eb-944d-775b46da1d61.mp4
https://user-images.githubusercontent.com/19983153/109133352-3892c380-7755-11eb-9ac3-4ce9cf031c27.mp4
A very fancy one is "A psychedelic experience on LSD":
@lucidrains Feel free to replace the images with the new ones. I can also do it, if you consent.
Generations from img and img+text
Some more hot-dog images to show that this still works: Generations using "A dog in a hotdog costume":
Now given this starting image:
We can generate:
Adding "A psychedelic experience" as text to the image:
Adding the text "A dog in a hotdog costume" to the image does not work too nicely:
Story creation
Lastly, I can show the story creation feature of the last PR (although with few generations per epoch, so the dream kind of happens too quickly):
"I dreamed that I was with my coworkers having a splendid party in someone's house. Even though I had many people surrounding me, I felt so lonely and I just wanted to cry. I went to the bathroom and something hit me, and I woke up."
https://user-images.githubusercontent.com/19983153/109135224-1f8b1200-7757-11eb-9ba7-ae7540cd0401.mp4
"I dreamt the house across the street from me was on fire. The people who live there were not there. It was a friend of my family and her daughter. I was looking out the window and saw all the smoke so I called 911 but it was busy."
https://user-images.githubusercontent.com/19983153/109135243-26b22000-7757-11eb-954d-6c0d54e8c34d.mp4