lucidrains / deep-daze

Simple command line tool for text to image generation using OpenAI's CLIP and Siren (Implicit neural representation network). Technique was originally created by https://twitter.com/advadnoun

Some new augmentations #103

Closed NotNANtoN closed 3 years ago

NotNANtoN commented 3 years ago
russelldc commented 3 years ago

What's the easiest way to use your PR locally? I tried pip installing your branch directly like so:

pip install git+https://github.com/NotNANtoN/deep-daze.git@new_augmentations

It seemed like it had worked, although when I try the --avg_feats argument, I got an error at the very end. Here was my command:

imagine --num_layers=44 --batch_size=32 --epochs=2 --save_every=15 --save_date_time --avg_feats --center_bias --gauss_sampling --text="here is my text prompt"

And the error at the very end of the generation:

ERROR: Could not consume arg: --avg_feats
Usage: imagine --num_layers=44 --batch_size=32 --epochs=2 --save_every=15 --save_date_time --avg_feats -

For detailed information on this command, run:
  imagine --num_layers=44 --batch_size=32 --epochs=2 --save_every=15 --save_date_time --avg_feats - --help
NotNANtoN commented 3 years ago

Hi! You installed it correctly, I just (again) forgot to explicitly add --avg_feats etc. to the CLI. Let me fix this...
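
For context: the CLI wraps a train() function with python-fire, so a flag is only consumable once train() accepts it as a keyword argument. A simplified, hypothetical sketch of that wiring (not the actual cli.py):

import fire

# Hypothetical, simplified sketch of the CLI wiring: python-fire turns the
# keyword arguments of the wrapped function into flags, so --avg_feats only
# becomes consumable once train() takes an avg_feats argument.
def train(text="a placeholder prompt", num_layers=44, batch_size=32,
          avg_feats=False, center_bias=False, gauss_sampling=False):
    ...

if __name__ == "__main__":
    fire.Fire(train)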

NotNANtoN commented 3 years ago

I fixed the issue, your command should work now. It might not run on a machine with 8GB VRAM because your image size is 512, so you might need to reduce it to 256.

I also added CPU support and switched the CLIP model to use jit-compiling by default (can be turned off by setting jit to False).

russelldc commented 3 years ago

Perfect, thanks! Got it updated on my end. This epoch hasn't completed yet, so I can't confirm whether the arg error is gone, but I can already tell just by looking at it that center_bias and avg_feats are doing magic. (I removed the --gauss_sampling arg from the command I quoted in my previous comment.)

This prompt is a bit weird, so I'll share some results from a different prompt for comparison soon.

It might not run on a machine with 8GB VRAM because your image size is 512, so you might need to reduce it to 256.

I'm fortunate enough to have grabbed a 3090! So I've been staring at the VRAM usage and noticed that with these settings it sits around 16GB. Is increasing the batch_size param the best thing I can do to utilize the remaining 8GB? @mroosen's experiments here might contain some good hints: https://github.com/lucidrains/deep-daze/issues/96#issuecomment-802298560

I also added CPU support and switched the CLIP model to use jit-compiling by default (can be turned off by setting jit to False).

Is there any advantage to using the non-jit version?

NotNANtoN commented 3 years ago

@russelldc Good to hear that it runs and gives nice results. I'm also not convinced by the Gaussian sampling (it's used in BigSleep too, but it might be better to switch to uniform sampling there as well).

Very cool overview of num_layers and the learning rate! I used a batch size of 96 with 44 layers for a long time, but then switched to a batch size of 32; the results are pretty much the same and it runs faster. I'm not sure how else you could utilize the rest of your VRAM efficiently - in the overview you linked, around 44 layers seems to be the sweet spot that trades off image quality against learning stability, so maybe that's the limit of the current architecture.

What would be interesting is to increase the width of the linear layers instead. It is fixed at 256 at the moment, but I also want to check at some point whether 512 gives better results. Of course the learning rate would need to be tuned then.
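
For illustration, the width in question is the hidden dimension of the SIREN network (deep-daze builds on siren-pytorch; this is only a rough sketch, the actual wiring inside deep-daze differs):

from siren_pytorch import SirenNet

# Rough sketch, assuming the siren-pytorch API; not deep-daze's actual setup.
net = SirenNet(
    dim_in=2,        # (x, y) pixel coordinates
    dim_hidden=512,  # widened from the currently fixed 256
    dim_out=3,       # RGB value per coordinate
    num_layers=44,
)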

As for non-jit: I'm not sure why, but @lucidrains switched to that model in one commit, even though it runs slower than the jit model. So I switched back to the jit model but left the option open to disable it.

nerdyrodent commented 3 years ago

Are you talking about "perceptor, normalize_image = load('ViT-B/32', jit = False)"? That's due to an issue with CLIP + PyTorch 1.8.0; 1.7.1 is OK.

As the smaller batch sizes are much faster I mostly use "--num_layers=24 --batch_size=8 --gradient_accumulate_every=1" which gives a fairly reasonable 7.6it/s on a 3090. Haven't tried the PR yet though, I'll have to take a look - thanks for the updates! :)

NotNANtoN commented 3 years ago

Thanks for the input @nerdyrodent! That means that instead of adding jit to the CLI, it could be better to just check whether torch.__version__ == "1.7.1" and only in that case set jit to True?
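
Something like this, as a sketch of the idea (not the exact code that ended up in the PR):

import torch

# Only default to the jit-compiled CLIP model on torch 1.7.1.
jit = torch.__version__ == "1.7.1"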

Curious to hear what you get out of the new augmentations.

NotNANtoN commented 3 years ago

@russelldc check my comment in #96 for some advice on using the VRAM more efficiently. I just pushed the option to change hidden_size in this PR. You can increase it to something like 512 or even 1024 - but for very large sizes you might need to reduce the learning rate (maybe halve it or more).
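
As a rough usage sketch (the prompt and values are placeholders, and the class may name the learning-rate parameter differently from the CLI's --learning_rate):

from deep_daze import Imagine

imagine = Imagine(
    text="a lighthouse in a storm",  # placeholder prompt
    num_layers=32,
    hidden_size=512,  # widened SIREN layers, added in this PR (default 256)
    lr=5e-6,          # reduced learning rate, as suggested for larger widths
)
imagine()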

mehdibo commented 3 years ago

I tried installing it using: pip3 install git+https://github.com/NotNANtoN/deep-daze.git@new_augmentations

After that running imagine gave:

100%|███████████████████████████████████████| 354M/354M [05:33<00:00, 1.06MiB/s]
Traceback (most recent call last):
  File "/usr/local/bin/imagine", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/deep_daze/cli.py", line 138, in main
    fire.Fire(train)
  File "/usr/local/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/deep_daze/cli.py", line 91, in train
    imagine = Imagine(
  File "/usr/local/lib/python3.9/site-packages/deep_daze/deep_daze.py", line 321, in __init__
    clip_perceptor, norm = load(model_name, jit=jit, device=self.device)
  File "/usr/local/lib/python3.9/site-packages/deep_daze/clip.py", line 127, in load
    model.apply(patch_device)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 473, in apply
    module.apply(fn)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 473, in apply
    module.apply(fn)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 473, in apply
    module.apply(fn)
  [Previous line repeated 3 more times]
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 474, in apply
    fn(self)
  File "/usr/local/lib/python3.9/site-packages/deep_daze/clip.py", line 118, in patch_device
    graphs = [module.graph] if hasattr(module, "graph") else []
  File "/usr/local/lib/python3.9/site-packages/torch/jit/_script.py", line 449, in graph
    return self._c._get_method("forward").graph
RuntimeError: Method 'forward' is not defined.
NotNANtoN commented 3 years ago

@mehdibo Hi, what PyTorch version are you using? You can check with torch.__version__. It looks like you are not using 1.7.1 but 1.8 instead, so you need to set jit to False in the Imagine class.
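
In code, that looks roughly like this (a sketch; the prompt is just a placeholder):

from deep_daze import Imagine

# Workaround sketch for PyTorch 1.8: disable jit-compilation of the CLIP model.
imagine = Imagine(
    text="a placeholder prompt",
    jit=False,
)
imagine()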

I might just set it to False for all non-1.7 versions later when I get to it.

mehdibo commented 3 years ago

Thanks @NotNANtoN, I found that I have version 1.8; I downgraded and it worked!

russelldc commented 3 years ago

I think the conditional for disabling jit is a bit too specific right now, leading to false negatives.

I was seeing it display the log about forcing jit to false, so I printed out my torch version to double-check:

torch version: 1.7.1+cu110
Setting jit to False because torch version is not 1.7.1.
Starting up...
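
A looser check along these lines would also accept CUDA-suffixed builds (an illustrative sketch only; the actual fix in the PR may differ):

import torch

# Treat any 1.7.x build, including "+cu110"-style suffixes, as jit-compatible
# instead of requiring an exact match with "1.7.1".
jit = torch.__version__.split("+")[0].startswith("1.7")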
NotNANtoN commented 3 years ago

Okay, so I:

Furthermore, I was thinking of setting avg_feats to True by default, as the images get quite detailed. The issue is that the averaging leads to a parcellation of the image, where each part focuses more on the specific "fraction of the meaning" (of the feature vector) it is supposed to represent.

Therefore, I switched from a binary setting to a smooth interpolation between both approaches using a new averaging_weight. An example of this can be seen below for the prompts ["A wizard in blue robes is painting a completely red image in a castle", "Consciousness.", "Depression."]. I trained with a batch size of 32 for 10 epochs, with a hidden_size of 512 and 32 layers. The first row uses an averaging_weight of 0.2, the second 0.4, the third 0.6 and the fourth 0.8:

[image: wizard_consc_depr_as_averaging_is_increased]

I observe that for higher averaging weights the images seem to have more small details, and the optimization does not stagnate - it keeps improving over longer training durations. But the meaning also seems to be more fractured for high values, leading to highly detailed but disjoint scenes. Therefore, I suggest a medium value of 0.3 that sharpens up the image without parcellating everything.
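
For clarity, the interpolation boils down to something like this (a hypothetical sketch of the objective, not the exact code in the PR):

import torch
import torch.nn.functional as F

def cutout_loss(image_embeds: torch.Tensor, text_embed: torch.Tensor,
                averaging_weight: float = 0.3) -> torch.Tensor:
    # image_embeds: CLIP embeddings of all cutouts, shape (num_cutouts, dim)
    # text_embed: CLIP embedding of the text prompt, shape (1, dim)
    # Per-cutout term: every single cutout is pushed towards the text embedding.
    per_cutout = -F.cosine_similarity(image_embeds, text_embed, dim=-1).mean()
    # Averaged-features term (the old avg_feats=True): only the mean embedding
    # of all cutouts has to match the text embedding.
    averaged = -F.cosine_similarity(
        image_embeds.mean(dim=0, keepdim=True), text_embed, dim=-1
    ).mean()
    # averaging_weight=0 reproduces the pre-PR behavior, 1 is full averaging.
    return averaging_weight * averaged + (1 - averaging_weight) * per_cutout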

russelldc commented 3 years ago

Awesome job, @NotNANtoN! Can't wait to try this out after work...

So after these most recent changes, if you wanted to generate an image as it was pre-PR, you would now set averaging_weight=0 (rather than avg_feats=False)?

Are there any hints as to why the feature averaging seems to contribute some small random logos, as you've mentioned before?

NotNANtoN commented 3 years ago

@russelldc hopefully you like it! Try out the hidden_size parameter if you haven't already - I really like the faster convergence and more colorful results if it is increased to 512.

Yes, set averaging_weight to 0 to get pre-PR behavior.

As for the appearance of the logos, I can only hypothesize... I assume it's related to CLIP being trained on a large number of logos, so it "knows" them and their meaning quite well. If we now optimize the averaged meaning of all random cutouts to equal the meaning of our text prompt, CLIP might place specific logos it knows well in some locations to push the averaged meaning in a specific direction.

russelldc commented 3 years ago

Try out the hidden_size parameter if you haven't already - I really like the faster convergence and more colorful results if it is increased to 512.

I had a chance to try it out late Friday night, and was getting interesting results. I agree, it quickly reaches a colorful image, rather than being stuck in a blurry brown/gray zone for a while.

I was doing some random tests trying to push hidden_size super high, while balancing the other params so my 24GB of VRAM gets as close to 100% utilization as possible. I found I was able to use a hidden_size of 4000+ for 128px images, with 32 layers and a batch size of 16. I've been too busy since then to put together any sort of analysis. I couldn't tell whether it became more colorful with further increases beyond 512, but I got the feeling that the resolution of the image being painted was quite high/detailed before "getting shrunk down" into the 128px canvas. Might just be late-night delusional thinking on my part.

Just ran this a few minutes ago with those same settings, averaging_weight=0.3 and DiffGrad:

imagine --text="an overhead drone photo of the Suez canal blocked by a container ship" --num_layers=32 --batch_size=16 --learning_rate=0.0000008 --gradient_accumulate_every=1 --iterations=1050 --epochs=1 --save_every=1 --save_date_time --open_folder=False --image_width=128 --hidden_size=4000 --optimizer="DiffGrad" --averaging_weight=0.3
averaging_weight | Image   | Video
0                | (image) | (video)
0.3              | (image) | (video)
0.8              | (image) | (video)

On my local copy, I've added optimizer and model_name to the CLI. I'll try to suggest those changes directly here on GitHub.

russelldc commented 3 years ago

I took your lead with opening up the options for more optimizers, and I've been mostly blindly trying some others, like Ranger and AdaBelief. Those 2 were causing some interesting color fluctuations frame to frame... I'm not sure if this would be considered diverging, since the content was remaining mostly stable while the colors rotated.

Here's an example. It was several different generations of the same prompt (should be easy to guess!). If my bash history is correct, these were all using AdaBelief with varying center_focus and learning_rate:

https://user-images.githubusercontent.com/5100126/113137409-59798900-91d9-11eb-9456-93aaa678cf41.mp4

I've been meaning to ask: what's the effect of >1 epochs vs just more iterations?

NotNANtoN commented 3 years ago

Hi, I added the optimizer, the model name and a new save_video option to the PR.
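
So a command along these lines should work now (the prompt and values are placeholders; the flag names follow the options named above):

imagine --text="a placeholder prompt" --num_layers=32 --batch_size=16 --averaging_weight=0.3 --optimizer="DiffGrad" --model_name="ViT-B/32" --save_video=True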

Interesting ship generations! Do you mind trying out averaging_weight=0.3? I feel like that might be a good trade-off.

The optimizer definitely looks interesting - quite a different optimization behavior; DiffGrad, AdamP and Adam do not differ too wildly from one another. It looks like the learning rate is too high in some parts of the video.

russelldc commented 3 years ago

Hi, I added the optimizer, the model name and a new save_video to the PR.

Thanks!

Interesting ship generations! Do you mind trying out averaging_weight=0.3? I feel like that might be a good trade-off.

Yeah, that was the first one I had in the original comment. I updated that comment again to organize them into a table per averaging_weight.

It looks like the learning rate is too high in some parts of the video.

You're probably right

NotNANtoN commented 3 years ago

I also just finished some experiments showing the promise of hidden_size=512, as well as the problem of too large an averaging_weight. I used the wizard and consciousness prompts from above, next to "The sun setting spectacolously over the beautiful ocean", "A painting of a sunset" and "A painting of a sunrise". The first two rows use averaging_weight=0 (i.e. avg_feats=False), the last two use averaging_weight=1. The first and third rows have hidden_size=256, the second and fourth hidden_size=512.

Here are the results for the first epoch:

[image: wiz_conc_spec_sun_sunset_sunrise_avg_feats_256_512_hidden_ep1]

And here for the ninth epoch:

[image: wiz_conc_spec_sun_sunset_sunrise_avg_feats_256_512_hidden_ep9]

So you can see clearly how hidden_size=512 leads to much quicker convergence (epoch 1 already looks promising), while averaging_weight=1 takes longer to optimize but produces sharper, more parcellated images. So for a large averaging_weight, more epochs make sense.

I think it's quite funny that in the ninth epoch the network generates a casino machine named "sunrise" in the lower right corner for "A painting of a sunrise". Kind of accurate, but it shows how global coherence gets lost.

russelldc commented 3 years ago

On an RTX 3090, I'm able to reach as high as hidden_size=600 for 512px images before running out of VRAM. Mildly interesting fact: the generations are exactly 2x faster compared to using hidden_size=4000 for 128px images.

This is with these same other params as I had reported using previously: num_layers=32 and batch_size=16.

I think it's quite funny that in the ninth epoch the network generates a casino machine named "sunrise" in the lower right corner for "A painting of a sunrise".

Ha, yeah, I love those moments. Sometimes the associations being made are so "creative", or at least something the average human wouldn't think up.

NotNANtoN commented 3 years ago

Unless there are any other concerns, I think this PR can be merged @lucidrains.

I might work on trying to enforce more detail while keeping global coherence, but for now the averaging_weight seems like an acceptable solution to me.

lucidrains commented 3 years ago

@NotNANtoN thank you as always for your work :D