What's the easiest way to use your PR locally? I tried pip installing your branch directly like so:
pip install git+https://github.com/NotNANtoN/deep-daze.git@new_augmentations
It seemed like it had worked, although when I tried the --avg_feats argument, I got an error at the very end.
Here was my command:
imagine --num_layers=44 --batch_size=32 --epochs=2 --save_every=15 --save_date_time --avg_feats --center_bias --gauss_sampling --text="here is my text prompt"
And the error at the very end of the generation:
ERROR: Could not consume arg: --avg_feats
Usage: imagine --num_layers=44 --batch_size=32 --epochs=2 --save_every=15 --save_date_time --avg_feats -
For detailed information on this command, run:
imagine --num_layers=44 --batch_size=32 --epochs=2 --save_every=15 --save_date_time --avg_feats - --help
Hi! You installed it correctly, I just (again) forgot to explicitly add --avg_feats etc. to the CLI. Let me fix this...
I fixed the issue, your command should work now. It might not run on a machine with 8GB VRAM because your image size is 512, so you might need to reduce it to 256.
I also added CPU support and switched the CLIP model to use jit-compiling by default (can be turned off by setting jit to False).
Perfect, thanks! Got it updated on my end. This epoch hasn't completed yet, so I can't confirm the arg error is gone, but I can already tell just by looking at it that center_bias and avg_feats are doing magic. (I removed the --gauss_sampling arg from the command I quoted in my previous comment.)
This prompt is a bit weird, so I'll share some results from a different prompt for comparison soon.
> It might not run on a machine with 8GB VRAM because your image size is 512, so you might need to reduce it to 256.
I'm fortunate enough to have grabbed a 3090! So I've been staring at the VRAM usage and noticed that with these settings it sits around 16GB. Is increasing the batch_size param the best thing I can do to utilize the remaining 8GB?
@mroosen's experiments here might contain some good hints: https://github.com/lucidrains/deep-daze/issues/96#issuecomment-802298560
> I also added CPU support and switched the CLIP model to use jit-compiling by default (can be turned off by setting jit to False).
Is there any advantage to using the non-jit version?
@russelldc Good to hear that it runs and gives nice results. I'm also not convinced by the Gaussian sampling (it's used in BigSleep too, but it might be better to switch to uniform sampling there as well).
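For anyone following along, here is a rough sketch of the difference being discussed. The function name and exact distributions are illustrative assumptions, not the PR code: Gaussian sampling clusters cutout sizes around a preferred fraction of the image, while uniform sampling spreads them evenly over the allowed range.

```python
import torch

# Illustrative sketch only - not the actual deep-daze implementation.
# Gaussian sampling draws cutout sizes clustered around half the image width,
# uniform sampling draws them evenly from the allowed range.
def sample_cutout_size(image_width, gauss_sampling=False):
    if gauss_sampling:
        size_frac = torch.randn(1).mul(0.2).add(0.5).clamp(0.1, 1.0)
    else:
        size_frac = torch.empty(1).uniform_(0.1, 1.0)
    return int(size_frac.item() * image_width)
```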
Very cool overview of num_layers and the learning rate! I used a batch size of 96 for 44 layers for a long time, but then switched to a batch size of 32. The results are pretty much the same and it runs faster. Not sure how you can utilize the rest of your VRAM efficiently otherwise - in the overview that you linked, around 44 layers seems to be a sweet spot that trades off image quality with learning stability. So maybe that's the limit of the current architecture.
What would be interesting is to increase the width of the linear layers instead. It is fixed at 256 at the moment, but I also wanted to check at some point whether 512 gives better results. Of course the lr would need to be tuned then.
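To make concrete what "width of the linear layers" refers to, here is a minimal SIREN-style sketch. It is illustrative only; the real deep-daze network differs in details such as initialization and the output head.

```python
import torch
from torch import nn

# Minimal SIREN-style sketch (illustrative, not the actual deep-daze model):
# hidden_size is the width of every hidden linear layer, num_layers their count.
class TinySiren(nn.Module):
    def __init__(self, num_layers=44, hidden_size=256, w0=30.0):
        super().__init__()
        dims = [2] + [hidden_size] * num_layers  # input is an (x, y) coordinate
        self.hidden = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(num_layers)
        )
        self.to_rgb = nn.Linear(hidden_size, 3)
        self.w0 = w0

    def forward(self, coords):
        x = coords
        for layer in self.hidden:
            x = torch.sin(self.w0 * layer(x))  # sine activations, as in SIREN
        return torch.sigmoid(self.to_rgb(x))
```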
As for non-jit: I'm not sure why, but @lucidrains switched to that model in one commit, even though it runs slower than the jit model. So I switched back to the jit model but left the option open to disable it.
Are you talking about "perceptor, normalize_image = load('ViT-B/32', jit = False)" ? That's due to an issue with CLIP + PyTorch 1.8.0. 1.7.1 is OK.
As the smaller batch sizes are much faster I mostly use "--num_layers=24 --batch_size=8 --gradient_accumulate_every=1" which gives a fairly reasonable 7.6it/s on a 3090. Haven't tried the PR yet though, I'll have to take a look - thanks for the updates! :)
Thanks for the input @nerdyrodent! That means that instead of adding jit to the CLI, it could be better to just check if torch.__version__ == "1.7.1" and only in that case set jit to True?
Curious to hear what you get out of the new augmentations.
@russelldc check my comment in #96 for some advice on using the VRAM in a better way. I just pushed the option to change hidden_size in this PR. You can increase it to something like 512 or even 1024 - but for very large sizes you might need to reduce the learning rate (maybe halve it or more).
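A usage sketch of what that could look like, assuming the Imagine class on this branch exposes the new hidden_size keyword (the --hidden_size CLI flag appears later in this thread) and takes the learning rate as lr; the exact names and values are illustrative.

```python
from deep_daze import Imagine

# Sketch only: hidden_size and lr keywords are assumptions based on the
# discussion above, not a confirmed API.
imagine = Imagine(
    text="a lighthouse at dawn",  # hypothetical prompt
    num_layers=32,
    batch_size=16,
    hidden_size=512,  # widened from the default of 256
    lr=5e-6,          # reduced learning rate, per the advice above
)
imagine()
```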
I tried installing it using:
pip3 install git+https://github.com/NotNANtoN/deep-daze.git@new_augmentations
After that, running imagine gave:
100%|███████████████████████████████████████| 354M/354M [05:33<00:00, 1.06MiB/s]
Traceback (most recent call last):
File "/usr/local/bin/imagine", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/site-packages/deep_daze/cli.py", line 138, in main
fire.Fire(train)
File "/usr/local/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/usr/local/lib/python3.9/site-packages/deep_daze/cli.py", line 91, in train
imagine = Imagine(
File "/usr/local/lib/python3.9/site-packages/deep_daze/deep_daze.py", line 321, in __init__
clip_perceptor, norm = load(model_name, jit=jit, device=self.device)
File "/usr/local/lib/python3.9/site-packages/deep_daze/clip.py", line 127, in load
model.apply(patch_device)
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 473, in apply
module.apply(fn)
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 473, in apply
module.apply(fn)
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 473, in apply
module.apply(fn)
[Previous line repeated 3 more times]
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 474, in apply
fn(self)
File "/usr/local/lib/python3.9/site-packages/deep_daze/clip.py", line 118, in patch_device
graphs = [module.graph] if hasattr(module, "graph") else []
File "/usr/local/lib/python3.9/site-packages/torch/jit/_script.py", line 449, in graph
return self._c._get_method("forward").graph
RuntimeError: Method 'forward' is not defined.
@mehdibo Hi, what PyTorch version are you using? You can check with torch.__version__. It looks like you are not using 1.7.1 but 1.8 instead, so you need to set jit to False in the Imagine classes. I might just set it to False for all non-1.7 versions later when I get to it.
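For reference, a minimal sketch of that workaround, assuming the Imagine class accepts the jit keyword described above (other arguments are illustrative):

```python
from deep_daze import Imagine

# Workaround sketch for PyTorch 1.8: disable the jit-compiled CLIP model.
# The jit keyword is taken from the discussion above; the prompt is made up.
imagine = Imagine(text="a mysterious cabin in the woods", jit=False)
imagine()
```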
Thanks @NotNANtoN, I found that I have version 1.8; I downgraded and it worked!
I think the conditional for disabling jit is a bit too specific right now, leading to false negatives.
I was seeing it display the log about forcing jit to false, so I printed out my torch version to double-check:
torch version: 1.7.1+cu110
Setting jit to False because torch version is not 1.7.1.
Starting up...
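A sketch of a less strict check that would avoid this false negative (illustrative only; the actual fix in the PR may look different): strip the local build suffix such as "+cu110" before comparing, or just match on the 1.7 prefix.

```python
import torch

# Illustrative sketch, not the actual deep-daze code: accept any 1.7.x build,
# including local variants like "1.7.1+cu110", instead of requiring an exact
# "1.7.1" string match.
def torch_version_supports_jit_clip():
    version = torch.__version__.split("+")[0]  # drop build suffix, e.g. "+cu110"
    return version.startswith("1.7")
```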
Okay, so I:
Furthermore, I was thinking of setting avg_feats to True by default, as the images get quite detailed. The issue is that the averaging leads to a parcellation of the image, where each part focuses more on the specific "fraction of the meaning" (of the feature vector) it is supposed to represent.
Therefore, I switched from a binary setting to a smooth interpolation between both approaches using a new averaging_weight. An example of this can be seen here for the prompts ["A wizard in blue robes is painting a completely red image in a castle", "Consciousness.", "Depression."]. I trained with a batch size of 32, for 10 epochs, a hidden_size of 512 and 32 layers. The first row represents an averaging_weight of 0.2, the second 0.4, the third 0.6 and the fourth 0.8:
I observe that for higher averaging weights the images seem to have more small details, and the optimization does not stay fixed; it keeps improving for longer training durations. But the meaning also seems to be more fractured for high values, leading to highly detailed but disjoint scenes. Therefore, I suggest a medium value of 0.3 that sharpens up the image while not parcellating everything.
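A rough sketch of the interpolation being described (illustrative; the actual loss code in the PR may differ in details such as signs and normalization): the final loss blends the loss on the cutout-averaged CLIP features with the mean of the per-cutout losses, weighted by averaging_weight.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the averaging_weight interpolation, not the exact
# deep-daze loss. image_feats: [num_cutouts, dim] CLIP features of the random
# cutouts; text_feats: [1, dim] CLIP features of the text prompt.
def blended_clip_loss(image_feats, text_feats, averaging_weight=0.3):
    # loss on the feature vector averaged over all cutouts (old avg_feats=True)
    averaged_loss = -F.cosine_similarity(
        image_feats.mean(dim=0, keepdim=True), text_feats
    ).mean()
    # mean of the per-cutout losses (old avg_feats=False, i.e. pre-PR behavior)
    per_cutout_loss = -F.cosine_similarity(image_feats, text_feats).mean()
    # smooth interpolation between the two regimes
    return averaging_weight * averaged_loss + (1 - averaging_weight) * per_cutout_loss
```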
Awesome job, @NotNANtoN! Can't wait to try this out after work...
So after these most recent changes, if you wanted to generate an image as it was pre-PR, you would instead set averaging_weight=0 (rather than avg_feats=False)?
Are there any hints as to why the feature averaging seems to contribute some small random logos, as you've mentioned before?
@russelldc hopefully you like it! Try out the hidden_size parameter if you haven't already - I really like the faster convergence and more colorful results if it is increased to 512.
Yes, set averaging_weight to 0 to get pre-PR behavior.
As for the appearance of the logos, I can only hypothesize... I assume it's related to CLIP being trained on a large number of logos, so it "knows" them and their meaning quite well. If we now optimize the averaged meaning of the random cutouts to match the meaning of our text prompt, CLIP might place specific logos it knows well in certain locations to push the averaged meaning in a specific direction.
> Try out the hidden_size parameter if you haven't already - I really like the faster convergence and more colorful results if it is increased to 512.
I had a chance to try it out late Friday night, and was getting interesting results. I agree, it quickly reaches a colorful image, rather than being stuck in a blurry brown/gray zone for a while.
I was doing some random tests trying to raise hidden_size super high, while balancing the other params so my 24GB of VRAM gets as close to 100% usage as possible. I found I was able to use a hidden_size of 4000+ for 128px images, with 32 layers and a batch size of 16. I've been too busy since then to put together any sort of analysis.
I couldn't tell whether it became more colorful with further increases beyond 512, but I got this feeling that the resolution of the image being painted was quite high/detailed, before "getting shrunk down" into the 128px canvas. Might just be late-night delusional thinking on my part.
Just ran this a few minutes ago with those same settings, averaging_weight=0.3 and DiffGrad:
imagine --text="an overhead drone photo of the Suez canal blocked by a container ship" --num_layers=32 --batch_size=16 --learning_rate=0.0000008 --gradient_accumulate_every=1 --iterations=1050 --epochs=1 --save_every=1 --save_date_time --open_folder=False --image_width=128 --hidden_size=4000 --optimizer="DiffGrad" --averaging_weight=0.3
averaging_weight | Image | Video
---|---|---
0 | |
0.3 | |
0.8 | |
On my local copy, I've added optimizer and model_name to the CLI. I'll try to suggest those changes directly here on GitHub.
I took your lead with opening up the options for more optimizers, and I've been mostly blindly trying some others, like Ranger and AdaBelief. Those two were causing some interesting color fluctuations frame to frame... I'm not sure if this would be considered diverging, since the content remained mostly stable while the colors rotated.
Here's an example. It was several different generations of the same prompt (should be easy to guess!). If my bash history is correct, these were all using AdaBelief with varying center_focus and learning_rate:
https://user-images.githubusercontent.com/5100126/113137409-59798900-91d9-11eb-9456-93aaa678cf41.mp4
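For anyone curious how a string optimizer name might be wired up, here's a rough sketch. It assumes the torch_optimizer package and only covers a couple of the names mentioned in this thread; the actual CLI changes may differ.

```python
import torch
import torch_optimizer

# Rough sketch, not the actual PR code: map an optimizer name passed on the
# CLI to an optimizer instance over the SIREN parameters.
def build_optimizer(name, params, lr):
    if name == "DiffGrad":
        return torch_optimizer.DiffGrad(params, lr=lr)
    if name == "AdamP":
        return torch_optimizer.AdamP(params, lr=lr)
    return torch.optim.Adam(params, lr=lr)  # fall back to plain Adam
```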
I've been meaning to ask: what's the effect of >1 epochs vs just more iterations?
Hi, I added the optimizer, the model name and a new save_video option to the PR.
Interesting ship generations! Do you mind trying out averaging_weight=0.3? I feel like that might be a good trade-off.
The optimizer definitely looks interesting - quite a different optimization behavior; DiffGrad, AdamP and Adam do not differ too wildly from each other. It looks like the learning rate is too high in some parts of the video.
> Hi, I added the optimizer, the model name and a new save_video option to the PR.
Thanks!
> Interesting ship generations! Do you mind trying out averaging_weight=0.3? I feel like that might be a good trade-off.
Yeah, that was the first one I had in the original comment. I updated that comment again to organize them into a table per averaging_weight.
> It looks like the learning rate is too high in some parts of the video.
You're probably right
I also just finished some experiments showing the promise of hidden_size=512, as well as the problem of a too-large averaging_weight. I used the wizard and consciousness prompts from above, next to "The sun setting spectacolously over the beautiful ocean", "A painting of a sunset" and "A painting of a sunrise". The first two rows use averaging_weight=0 (or avg_feats=False), the last two use averaging_weight=1. The first and third rows have hidden_size=256, the second and fourth hidden_size=512.
Here are the results for the first epoch:
And here for the ninth epoch:
So you can see clearly how hidden_size=512 leads to much quicker convergence (epoch 1 already looks promising) and how averaging_weight=1 takes longer to optimize but produces sharper, more parcellated images. So for a large averaging_weight, more epochs make sense.
I think it's quite funny that in the very lower right of the ninth epoch the network generates a casino machine named "sunrise" for "A painting of a sunrise". Kind of accurate, but it shows how global coherence gets lost.
On an RTX 3090, I'm able to reach as high as hidden_size=600 for 512px images before running out of VRAM. Mildly interesting fact: the generations are exactly 2x faster compared to using hidden_size=4000 for 128px images.
This is with the same other params I reported using previously: num_layers=32 and batch_size=16.
> I think it's quite funny that in the very lower right of the ninth epoch the network generates a casino machine named "sunrise" for "A painting of a sunrise".
Ha, yeah, I love those moments. Sometimes the associations being made are so "creative", or at least something the average human wouldn't think up.
Unless there are any other concerns, I think this PR can be merged @lucidrains.
I might work on trying to enforce more detail while keeping global coherence, but for now the averaging_weight seems like an acceptable solution to me.
@NotNANtoN thank you as always for your work :D
avg_feats (leads to more concrete scenes) and center_bias (leads to the object in question - if there is an object mentioned in the sentence - being centered in the middle of the image) are interesting.