Closed NotNANtoN closed 3 years ago
This has been a longstanding issue! Good work.
Hm, this is also failing to build for me on Python 3.6: `RuntimeError: Python version >= 3.7 required`. Did you add a new dependency?
Edit: Ah, nevermind. This is because of the `einops>=0.3` requirement. @lucidrains You may want to bump the Python version number in setup.py to 3.7
Sorry for the very cryptic and ad-hoc size scheduling @NotNANtoN , I'm responsible for that and it was just an incremental patchwork over the original notebook from someone with no real experience. Very encouraging to see it switched to some solid footing, thanks!
Now that I have admitted that I'm a novice and am just tweaking knobs to get as good an outcome as possible, let me inquire about some details for usability here. Arguably usability is complex w.r.t. training schedules, since based on your compute budget and saturation goals, you end up requiring rather different schedules.
Up till now, I am usually running with 32 layers, and am happy if 1 epoch (1050 steps, taking ~12min on a 3090) gets a reasonably recognizable image out, and then decide when to interrupt the training. Not to say this is a "Right Use" of such a library, it's just what I've done so far.
So a typical run would be:
$ imagine --num-layers 32 --save-progress "A llama wearing a scarf and glasses, reading a book in a cozy cafe."
On a freshly installed deep-daze-0.6.2 from pip one epoch gets me:
loss: ~ -65
meanwhile, installing from this branch and running the same exact invocation for the same step gets me:
loss: ~ -55
Your claim that after 5 epochs your refactor will produce a much better image is likely true - I had also noticed that the last/latest phase of my patchy schedule was rather poor in outcomes, so that needed to be improved for sure. But I wonder if the PR adds some penalty in the minimum compute needed to saturate? Not that I have any great solutions besides adding yet-another-customization-flag for different schedules based on compute targets... But adding a flag just sounds like feature creep to me. shrug Whatever works for everyone I guess, but I quite enjoy getting a decent image in ~1000 steps.
General comment: I think the only sane way to compare such PRs to main is running both branches with the same parameters and juxtaposing images. Otherwise it's a bit hard to "see" what was contributed by the PR, and what by e.g. using 44 layers for 5 epochs.
Hi @dginev, thanks for your insightful comment. I agree, my comparison is far from ideal. I always played around with 44 layers and 5-10 epochs and could notice a large improvement for this setting. I agree that a more direct comparison would be better.
A major difference between the old scheduling and the new one is that now, for the default settings, the cutout size is always randomly sampled between 0.1 and 1.0. In the older scheduling, it only went as low as 0.49. So what you could try is to adjust `lower_bound_cutout` to 0.5 and check whether the results are better for fewer epochs in that case. Another option is my attempt at a simple scheduling system using `saturate_bound`. If you set it to True, the lower bound is increased during training (dependent on the total number of batches), which should ideally make the system more robust to different training lengths. I have not tested it extensively yet!
We might need to adjust the scheduling invoked by `saturate_bound`. I see that in the old system there was some kind of cyclical change in the cutout bounds - that could also be implemented. I think the relatively good performance of the old scheduling might be due to its use of a batch size of 4? The pieces_per_group were hardcoded to 4, so I assume it was tuned for that batch size. For larger batch sizes (I used 96 a lot - it barely fits in an 8 GB GPU) the old scheduling would then use many groups per partition, which, as far as I can reason, would lead to some sort of averaging over the different cutout scales used per partition.
I'm curious to hear what you say and would be glad to see any experiments/results.
Side note: I'm first of all surprised and a bit jealous that you have access to a 3090, but even more so about the fact that for me one epoch of 1050 iterations takes 3.5 minutes (with a batch size of 32 and 44 layers). Ah, I just realized, you probably use an `image_width` of 512? Then it all makes sense.
Haha, I was wondering how you're so patient as to do 44 layers for 5 epochs with width 512 and not complain about it! Now things make sense. My honest reaction was "wait, deep-daze can reduce the image width now?" -- I never investigated if that's available. Btw, just double-checked: `--num-layers 32` with the defaults also runs on a 1080 Ti, just under 10 GB allocated.
So, if the image widths we are discussing are variable, and you're using a lower one than me, can we even theoretically share a non-trivial training schedule? We can clearly share a uniformly sampling one, but any dynamic trickery won't really make sense across image dimensions if it is based on step counts. Maybe if it is based on loss...
My partition tricks are really fine-tuned to the saturation speeds one usually sees at width 512. If you halve the width they may produce something entirely different. But that's even true if you vary the number of layers... So yes, my hack has to be removed from the repo, it's only "good"-ish for 512 width 32 layers, and even then I don't like the last two stages of it.
I never went under 0.49 size because the original Colab notebook author had some scary language about really bad training results if small sizes are included. He definitely understood SIREN better than me, but also wasn't proficient in it. In fact, he didn't have the `.sort()` trick in originally, and the training was rather bumpy - that may have been all he was seeing. I'd really have to learn more about SIREN to make any actually useful comments here, but I'm starting to wonder whether:
Hm, hm...
Alright, here's a comparison more in the spirit of the PR. I did 44 layers with a reduced image size of 128. Indeed an epoch for that setup takes 3.5 mins. I should preface with a big thanks! for alerting me to that, it unlocked a whole new miniature art form for me to put llamas in.
I had to triple-check I was running the right executables, since in both the losses and the images produced the PR really reminded me of the notebook training runs from before I made my patches. The results should be as described here, same command:
$ imagine --save-progress --num-layers 44 --image-width 128 \
"A llama wearing a scarf and glasses, reading a book in a cozy cafe."
epoch | main loss | PR loss |
---|---|---|
1 | -57.4 | -47.65 |
2 | -68.8 | -48.7 |
3 | -64.3 | -53.6 |
4 | -72.1 | -51.9 |
5 | -66.8 | -49.48 |
main branch, epochs 1 and 5:
PR branch, epochs 1 and 5:
I can confirm the images in the main branch are very close to what I am used to in the 512 width - this is a common prompt I experiment with, and I've gotten somewhat accustomed to what CLIP fishes out. Meanwhile, the PR branch results seem to have issues saturating, and also have multiple fragments (which is something you encounter if you sample too many small window sizes for too long).
It's just a single run each, so it could also be an outlier, but it raises some questions...
You have a point, I can see some issues there. I just launched some experiments with your llama sentence, varying `lower_bound_cutout` and `saturate_bound` between runs. I was hoping that if someone wants a less fractured image, then 0.5 might be a good value for the lower bound, and the saturation issues might be fixable by using the saturation schedule.
I do see your point about possibly just making this loss-dependent. I feel like it could make sense to have the lower bound become smaller as the loss gets better, as that makes the net focus on more details - at the same time it fractures the image more. I went with the approach of first focusing on details/textures by having a low lower_bound and then more on the larger picture; I'm curious how the experiments turn out. I kind of like the simplicity of my approach, but maybe it is too simple - I am basically just shifting the lower bound of a uniform distribution. We could easily extend it though: 1. Sample from a Gaussian around a mean that increases as the training goes on or as the loss decreases. 2. Add some cyclical increase/decrease as there was in the old schedule.
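For concreteness, the two extensions floated above could be sketched roughly like this. All names here are hypothetical, not part of the deep-daze API; it just combines a Gaussian around a rising mean with a small cyclical modulation:

```python
import math
import random

def sample_size_gaussian(progress, mean_start=0.3, mean_end=0.8,
                         std=0.1, cycle_amp=0.05, cycles=4):
    """Sample a cutout size fraction at a given training progress in [0, 1]."""
    # Mean rises linearly over training (could equally be driven by the loss).
    mean = mean_start + (mean_end - mean_start) * progress
    # Cyclical component, loosely in the spirit of the old schedule.
    mean += cycle_amp * math.sin(2 * math.pi * cycles * progress)
    # Clamp to a valid fraction of the image side.
    return min(1.0, max(0.05, random.gauss(mean, std)))
```

This keeps the one-line simplicity of the uniform scheme while letting both the center and the shape of the distribution evolve over training.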
To be honest, I feel like we are going in a bit blind without a proper plot of the loss and other diagnostics. Maybe we should add a debug mode that generates diagnostic plots?
Agree completely with your latest comments, especially since we're forced into a corner - if the library is to support all these different features, we need to establish and diagnose some reliable aspects of SIREN+CLIP training for the various use cases. And as you say, be able to cross-compare and diagnose failures.
I was really trying not to get sucked into this myself, since my computer vision experience is a very round zero, but if I find the time I'll read the SIREN paper and do some math over the weekend...
What I did try overnight was your new `create_story` feature on the main branch (i.e. with my patchy sizing) - fun, thanks for that as well! And there I can really painfully see how after the first couple of epochs the learning gets stuck in this wavy SIREN-y background substrate. That should be partially related to the sizing partitions we are discussing, and partially to the learning method getting quite stubbornly stuck once it saturates. If one is to get high-quality images on every story epoch, I guess we ought to add some noise between epochs, or start training the next epoch from a midway checkpoint of the previous one, or...
It would be great if you could dig further into it. I wonder how much of this is SIREN-related and how much is related to the gradient-ascent training using CLIP. As far as I have seen, they do not use any random cutouts in the SIREN paper, but I just skimmed it.
Have you tried `create_story` on the PR branch? It works really nicely. Your scheduling seems to work well to make the image converge and saturate - but it also hinders it from evolving into something new.
I did some experimental runs with your query. Settings: batch size 32, `image_width` 256, `num_layers` 44, PR branch. The pictures represent epochs 1, 5, and 8, respectively:
lower_bound_cutout=0.1: lower_bound_cutout=0.5: lower_bound_cutout=0.6:
lower_bound_cutout=0.1: lower_bound_cutout=0.5:
To be honest, I struggle to really see which one is better and which one is worse. What I can clearly observe is that larger lower bounds lead to the appearance of more text (a bad sign IMO) and to slightly less "realistic" images. The `saturate_bound` setting approximates your schedule a bit, as far as I can tell from observing the images.
@NotNANtoN What about the number of iterations it takes to get to a reasonable result? That's really my only concern.
@dginev I think it's safe to say the outputs are just gonna be different now that we're normalizing properly. Take the "shattered plates on the grass" example. The example on the README.md (currently) is just way too dark.
@afiaka87 Aren't the README samples rather outdated at this point? That's part of why we're doing all of this back-and-forth, it takes a lot of work just to figure out who's comparing what. Here's a version of that prompt executed today for 1 epoch:
$ imagine --num-layers=44 --image-width=200 "shattered plates on the grass" --save-progress
with the current v0.6.2 main, loss -46.6:
with this PR, loss -38.9:
I like the PR's output more here as well, but the loss difference worries me...
Another difference from the notebook where I had the scheduling is the way deep-daze is using that code: it has a `batch_size` that only takes some of the sizes, rather than all of them together. My code was expecting 64 pieces passed into `torch.cat`, while deep-daze is by default passing them 4 at a time, so it will take 16 steps to see the entire size batch? Unsure how that influences the training. Only one way to find out: repeating the tests with `--batch-size=64`:
$ imagine --batch-size=64 --num-layers=44 --image-width=200 "shattered plates on the grass" --save-progress
with main v0.6.2, loss -59.5
with this PR, loss -50.6
An epoch takes 4 minutes with the default batch size and 8 minutes with it increased to 64, so there's a good amount of compute hidden in that.
Both runs look "good?", the images I've attached are hard for me to contrast with each other. I'm not sure I learned anything useful...
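For reference, the batch-size mismatch mentioned above can be pictured as simple chunking. This is illustration only (the values are stand-ins, not the real sampled sizes):

```python
# A "size batch" of 64 sampled cutout sizes, as the original notebook
# would feed to torch.cat in one go.
sizes = [0.1 + 0.9 * i / 63 for i in range(64)]

# deep-daze's default batch_size consumes them 4 at a time instead,
# so one full size batch is spread over 64 / 4 = 16 optimizer steps.
batch_size = 4
chunks = [sizes[i:i + batch_size] for i in range(0, len(sizes), batch_size)]
print(len(chunks))  # 16
```

Each step therefore sees only a narrow slice of the size distribution, which may well change the training dynamics compared to averaging the loss over all 64 cutouts at once.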
According to @NotNANtoN's last comment, `lower_bound_cutout=0.5` would get me the llama setup I was looking for with my original comment, already in the first epoch, as in main. If that's the case, feel free to merge the PR, and I'll have to remember to use that option going forward. Thanks!
https://github.com/NotNANtoN/deep-daze/pull/1/files @NotNANtoN Made the CLI changes at least.
@dginev Hm, you're right the output has gotten a good deal better even in 0.6.2.
> Both runs look "good?", the images I've attached are hard for me to contrast with each other. I'm not sure I learned anything useful...
Yeah I'm lost. We could feasibly come up with some sort of diverse set of phrases to evaluate against "visually" but that sounds like a pain.
@afiaka87 Thanks for the CLI additions! I'll merge them in a bit.
I ran the same experiments with a maximum of 1 epoch. That changes the `saturate_bound` experiments, as now the lower bound is increased much more quickly. I think the results are decent and show that one can play around with the parameters depending on what kind of result is wanted.
I think I would switch on `saturate_bound` by default. At the moment it increases the lower bound to 0.8 during training, but I noticed that the last 10% of training led to a loss in sharpness - that means the training went fine until a lower bound of (0.8 - 0.1) * 0.9 + 0.1 = 0.73. So I think I'd drop the upper saturation limit from 0.8 to 0.75. I'll run some experiments for this new setting with the llama. If that goes well I'll recreate the current README examples using 1 epoch. If they look "good" I'll replace the old examples with them and we should be ready to merge.
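A quick check of the arithmetic above. `lower_bound_at` is a hypothetical helper, not the deep-daze API; it just evaluates the linear schedule at a given training fraction:

```python
def lower_bound_at(progress, start=0.1, limit=0.8):
    """Lower cutout bound at training fraction `progress` in [0, 1]."""
    return start + (limit - start) * progress

# Sharpness held up until 90% of training, i.e. until the bound reached:
print(round(lower_bound_at(0.9), 2))  # 0.73
```

So capping the limit at 0.75 keeps the whole run below the point where sharpness was observed to degrade.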
@NotNANtoN We're good to merge this, correct? @lucidrains
I still wanted to replace the readme images. I'm on it
@lucidrains Ready to merge now. Maybe you want to check the README, I replaced the old generations with new ones from this branch. I added two generations for "a man painting a completely red image" and "a psychedelic experience on LSD" because I like them, but feel free to remove some if it is too crowded now.
I added sections for:
`create_story`
@NotNANtoN thanks for the contribution! sorry I don't have much time to look at this, but I'll trust you know what you are doing ;)
Main changes
I cleaned up the cryptic size scheduling that was used for the sampling of random cut-out sizes. Before, there was a weird scheme that adapted neither to the batch size nor to the total number of epochs. I inspected it in detail and found that the scheme was sampling in intervals of width 0.1, ranging from 0.49 up to 1.09 (depending on the schedule). A comment in the code says that the context should increase as the model saturates - which means the sampling should be closer to 1.
The new approach is simple: the random sizes are uniformly sampled between a lower bound (default 0.1) and an upper bound (default 1.0). Both are customizable by the user in the `Imagine` class. I emulated some scheduling by adding the `saturate_bound` parameter. If set to True, it linearly increases the lower bound from the starting value to a limit during training. I set the limit to 0.8 because from 0.8 and above the generations become washed out and unstable. I also noticed that this scheduling does not really bring about any benefits, but I have not experimented extensively with it.
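A minimal sketch of the sampling just described. The names are illustrative and simplified relative to the actual `Imagine` class:

```python
import random

def sample_cutout_size(lower=0.1, upper=1.0, progress=0.0,
                       saturate_bound=False, saturate_limit=0.8):
    """Draw one random cutout size fraction for the current training step."""
    if saturate_bound:
        # Linearly raise the lower bound toward the limit as training proceeds,
        # so late-training cutouts cover more of the image (more "context").
        lower = lower + (saturate_limit - lower) * progress
    return random.uniform(lower, upper)
```

With `saturate_bound=True`, sizes at the start of training are drawn from the full [0.1, 1.0] range, while at the end of training they come from roughly [0.8, 1.0].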
Minor changes
Results
Examples from old README
Anyway, the performance is (from my visual inspection) MUCH better now. I recreated the examples that are currently in the README (num_layers=44, batch_size=32, gradient_accumulate_every=1, 5 epochs - needs less than 8 GB of RAM and about 20 mins):
https://user-images.githubusercontent.com/19983153/109133200-0bdeac00-7755-11eb-8c87-bd18ab38bad6.mp4
https://user-images.githubusercontent.com/19983153/109133230-1305ba00-7755-11eb-840e-d424bb6cbd75.mp4
https://user-images.githubusercontent.com/19983153/109133247-17ca6e00-7755-11eb-944d-775b46da1d61.mp4
https://user-images.githubusercontent.com/19983153/109133352-3892c380-7755-11eb-9ac3-4ce9cf031c27.mp4
A very fancy one is "A psychedelic experience on LSD":
@lucidrains Feel free to replace the images with the new ones. I can also do it, if you consent.
Generations from img and img+text
Some more hot-dog images to show that this still works: Generations using "A dog in a hotdog costume":
Now given this starting image:
We can generate:
Adding "A psychedelic experience" as text to the image:
Adding the text "A dog in a hotdog costume" to the image does not work too nicely:
Story creation
Lastly, I can show the story creation feature of the last PR (although with few generations per epoch, so the dream kind of happens too quickly):
"I dreamed that I was with my coworkers having a splendid party in someone's house. Even though I had many people surrounding me, I felt so lonely and I just wanted to cry. I went to the bathroom and something hit me, and I woke up."
https://user-images.githubusercontent.com/19983153/109135224-1f8b1200-7757-11eb-9ba7-ae7540cd0401.mp4
"I dreamt the house across the street from me was on fire. The people who live there were not there. It was a friend of my family and her daughter. I was looking out the window and saw all the smoke so I called 911 but it was busy."
https://user-images.githubusercontent.com/19983153/109135243-26b22000-7757-11eb-954d-6c0d54e8c34d.mp4