invoke-ai / InvokeAI

InvokeAI is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry-leading WebUI, supports terminal use through a CLI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

Stable Diffusion PR optimizes VRAM, generate 576x1280 images with 6 GB VRAM #364

Closed. mgcrea closed this issue 1 year ago.

mgcrea commented 2 years ago

Seen on HN; might be interesting to pull into this repo? (The PR looks a bit dirty with a lot of extra changes, though.)

https://github.com/basujindal/stable-diffusion/pull/103

neonsecret commented 2 years ago

Check out my other CompVis PR: https://github.com/CompVis/stable-diffusion/pull/177. It might be more suitable for you.

lstein commented 2 years ago

Thanks for the tip! I'll check them both out.

sunija-dev commented 2 years ago

Would be cool to get this implemented! ❤️ I got two users with only 4 GB VRAM and the model won't even load. If I saw it correctly, that should work with basujindal's version.

lstein commented 2 years ago

Don't I know it!

Vargol commented 2 years ago

By the looks of it, without all the whitespace changes we get...

diff ldm/modules/attention.py ldm/modules/attention.py.opt
181a182
>         del context, x
187a189
>         del q, k
193a196
>             del mask
196c199,200
<         attn = sim.softmax(dim=-1)
---
>         sim[4:] = sim[4:].softmax(dim=-1)
>         sim[:4] = sim[:4].softmax(dim=-1)
198,200c202,204
<         out = einsum('b i j, b j d -> b i d', attn, v)
<         out = rearrange(out, '(b h) n d -> b n (h d)', h=h)
<         return self.to_out(out)
---
>         sim = einsum('b i j, b j d -> b i d', sim, v)
>         sim = rearrange(sim, '(b h) n d -> b n (h d)', h=h)
>         return self.to_out(sim)

Attached in diff -u format for patching:

attn.patch.txt
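For anyone reading along without applying the patch, here is a rough reconstruction of what CrossAttention.forward looks like after these changes. It assumes the stock CompVis attention.py (the exists/default helpers and the torch/einops imports already present in that file); treat it as a sketch of the patch, not the merged code.

def forward(self, x, context=None, mask=None):
    h = self.heads
    q = self.to_q(x)
    context = default(context, x)
    k = self.to_k(context)
    v = self.to_v(context)
    del context, x                       # free the inputs as soon as q/k/v exist
    q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q, k, v))
    sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
    del q, k                             # only the (b*h, n, n) similarity matrix is needed from here on
    if exists(mask):
        mask = rearrange(mask, 'b ... -> b (...)')
        max_neg_value = -torch.finfo(sim.dtype).max
        mask = repeat(mask, 'b j -> (b h) () j', h=h)
        sim.masked_fill_(~mask, max_neg_value)
        del mask
    # softmax in place, in two halves, instead of allocating a second full-size attn tensor
    sim[4:] = sim[4:].softmax(dim=-1)
    sim[:4] = sim[:4].softmax(dim=-1)
    sim = einsum('b i j, b j d -> b i d', sim, v)
    sim = rearrange(sim, '(b h) n d -> b n (h d)', h=h)
    return self.to_out(sim)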

lstein commented 2 years ago

Oh thank you very much for that! I actually just did the same thing with @neonsecret 's attention optimization and it works amazingly. Without any change to execution speed my test prompt now uses 3.60G of VRAM. Previously it was using 4.42G.

Now I'm looking to see if image quality is affected.

Any reason to prefer basujindal's attention.py optimization?

smoke2007 commented 2 years ago

nothing to add to this conversation, except to say that i'm excited for this lol ;)

lstein commented 2 years ago

I've merged @neonsecret 's optimizations into the development branch "refactoring-simplet2i" and would welcome people testing it and sending feedback. This branch probably still has major bugs in it, but I refactored the code to make it much easier to add optimizations and new features (particularly inpainting, which I'd hoped to have done by today).

Vargol commented 2 years ago

Hmmm... on my barely-coping 8G M1 it's not so hot: the image is different and it took twice as long. But it's an old clone; let me try it on a fresher one.

lstein commented 2 years ago

Darn. I'd hoped that there was such a thing as a free lunch.

I'm on an atypical system with 32G of VRAM, so maybe my results aren't representative. I did timing and peak VRAM usage, and then looked at two images generated with the same seed and they were indistinguishable to the eye. Let me know what you find out.

Are you on an Apple? I didn't know there were clones. The M1 MPS support in this fork is really new, and I wouldn't be surprised it needs additional tweaking to get it to work properly with the optimization.

Vargol commented 2 years ago

Sorry, I meant it's an old local clone of your repo, as I didn't want to make changes in my local clone of the current one since that works quite nicely :-). But yes, I'm not surprised MPS is breaking things, and PyTorch is pretty buggy too; I've raised a few MPS-related issues over there that Stable Diffusion hits.

neonsecret commented 2 years ago

hey guys see https://www.reddit.com/r/StableDiffusion/comments/x56e8x/the_optimized_stable_diffusion_repo_got_a_pr_that/in032db

Vargol commented 2 years ago

okay, on the main branch the images are the same, but it is really slow, even compared to my normal times...

10/10 [06:49<00:00, 40.94s/it] compared to 10/10 [02:14<00:00, 13.44s/it] (spot the man with the 8GB M1)

I'll do some more digging

@magnusviri any chance you can check this out on a bigger M1 ?

lstein commented 2 years ago

Ok, this is @neonsecret's PR, which I just tested and merged into the refactor branch. I'm seeing a 20% reduction in memory footprint, but unfortunately not the 35% reduction reported in the Reddit post. Presumably this is due to the earlier optimizations in basujindal's branch. I haven't really wanted to use those opts because the code is complex and I hear it has a performance hit. Advice?

Vargol commented 2 years ago

Last I looked at basujindal's, there were loads of assumptions about using CUDA, and a big chunk of the memory saving seemed to come from forcing half precision. That was a week ago, things might have changed.

lstein commented 2 years ago

I've got half precision on already as the default. I think what I'm missing is basujindal's optimization of splitting the task into several chunks and loading them into the GPU sequentially.
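For illustration only (this is not the actual optimizedSD code, and the stage split is hypothetical): the idea is to keep just one chunk of the model resident on the GPU at a time, paying transfer time to save VRAM.

import torch
import torch.nn as nn

def run_in_stages(stages, x, device='cuda'):
    # stages: an iterable of nn.Module chunks that together form the model
    x = x.to(device)
    for stage in stages:
        stage.to(device)              # load this chunk's weights into VRAM
        with torch.no_grad():
            x = stage(x)
        stage.to('cpu')               # evict the weights before the next chunk is loaded
    torch.cuda.empty_cache()          # hand cached blocks back to the allocator
    return x.to('cpu')

# e.g. run_in_stages([nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)], torch.randn(1, 512))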

lstein commented 2 years ago

Frankly, I'm happy with the 20% savings for now.

Vargol commented 2 years ago

Seems the speed loss is coming from the twin calls to softmax.

<         attn = sim.softmax(dim=-1)
---
>         sim[4:] = sim[4:].softmax(dim=-1)
>         sim[:4] = sim[:4].softmax(dim=-1)

If I change it to use sim = sim.softmax(dim=-1)

instead, I get all my speed back (I assume more memory usage, though I need better diagnostic tooling than Activity Monitor).

I can do 640x512 images now, so there do appear to be some memory savings even after reverting that change. It would be interesting to see what happens on a larger box, and whether it is worth wrapping the two variants in an "if mps" statement.

EDIT:

Seems I can also now do 384x320 without it using swap.

lstein commented 2 years ago

No measurable slowdown at all on CUDA. Maybe we make the twin softmax conditional on not MPS?
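A possible shape for that conditional (a sketch, not the merged code; sim.device.type is the standard PyTorch way to tell MPS apart from CUDA/CPU):

# inside CrossAttention.forward, once sim holds the scaled q.k similarities
if sim.device.type == 'mps':
    # single full softmax: restores Vargol's speed on Apple Silicon
    sim = sim.softmax(dim=-1)
else:
    # in-place twin softmax: lower peak VRAM, no measured slowdown on CUDA
    sim[4:] = sim[4:].softmax(dim=-1)
    sim[:4] = sim[:4].softmax(dim=-1)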

veprogames commented 2 years ago

test run:

751283a2de81bee4bb571fbabe4adb19f1d85b97 (main) EDIT: corrected hash "test" -s50 -W512 -H512 -C7.5 -Ak_lms -S42 00:15<00:00, 3.30it/s Max VRAM used for this generation: 4.44G

89a762203449a1efc1d34632fe76bf942669031d (refactoring-simplet2i) "test" -s50 -W512 -H512 -C7.5 -Ak_lms -S42 00:13<00:00, 3.77it/s Max VRAM used for this generation: 3.61G

XOR of both images gave a pitch black result -> seems to be no difference

maybe this helps. Inference even seems to be a little faster in this test run.

lstein commented 2 years ago

CUDA platform? What hardware?

test run:

4406fd1 (main) "test" -s50 -W512 -H512 -C7.5 -Ak_lms -S42 00:15<00:00, 3.30it/s Max VRAM used for this generation: 4.44G

89a7622 (refactoring-simplet2i) "test" -s50 -W512 -H512 -C7.5 -Ak_lms -S42 00:13<00:00, 3.77it/s Max VRAM used for this generation: 3.61G

XOR of both images gave a pitch black result -> seems to be no difference

maybe this helps. Inference even seems to be a little faster in this test run.

veprogames commented 2 years ago

Windows 10, NVIDIA GeForce RTX 2060 SUPER 8GB, CUDA

if there's info missing I'll edit it in

blessedcoolant commented 2 years ago

I made a new local PR with just the changes to attention.py. There are definitely memory improvements but nothing as drastic as what the PR claims.

Here are some of my test results after extensive testing - RTX 3080 8GB Card.

Base Repo:


Updated attention.py

For a 512x768 image, the updated repo consumes 5.94GB of memory. That's approximately an 18% memory saving.


I saw no difference in performance or inference time when using a single or twin softmax. On CUDA, the difference seems to be negligible if there is any.


tl;dr -- Just the attention.py changes are giving an approximate 18% VRAM saving.

Vargol commented 2 years ago

I think it's a side effect of the unified memory architecture. It looks okay at 256x256, when all of the Python 'image' fits in memory, but as soon as swapping kicks in I get half the speed compared to the original code or a single softmax.

bmaltais commented 2 years ago

I just tried regenerating an image from the current development branch and the new refactor branch, and they are totally different... Not sure if it is the memory-saving feature doing it or something else. Is there a switch to activate/deactivate the memory saving?

Would be nice to isolate whether the difference is related to that or something else. Running on an RTX 3060.

EDIT:

OK... for some reason the picture I got when running the command the first time is different from the result when running the prompt logged in the prompt log file... Strange... but using the log-file prompt on both the dev and the refactoring branches does indeed produce the same result with much less VRAM usage...

I will try to reproduce the variation... This might be an issue with the variation code base not producing consistent results on the first run vs. reruns from logs.

EDIT 2: I tracked the issue with the different outputs... it was a PEBKAC... I pasted the file name and directory info in front of the prompt in the log... this is why it resulted in a different output... so all good, it was my error.

So as far as I can see, the memory optimisation has no side effect on the time nor the quality of image generation.

6.4G on the dev branch vs 4.84G on the refactoring branch... so a 34% memory usage reduction and exactly the same run time.

thelemuet commented 2 years ago

I did not do extensive testing to compare generation times, but so far I have gotten the exact same results visually when comparing to images I generated yesterday on main with the same prompt/seeds.

And I can crank up the resolution from a max of 576x576 up to 640x704, using 6.27G on my RTX 2070.

Last time I tried basujindal's, I could manage 704x768 but it was very slow. However, if they implemented this PR + their original optimization, and it uses up even less memory than it used to, I can imagine doing even higher resolutions. Very impressive.

cvar66 commented 2 years ago

The basujindal fork is very slow. I would take a 20% memory improvement with no speed hit over 35% that is much slower any day.

bmaltais commented 2 years ago

The basujindal fork is very slow. I would take a 20% memory improvement with no speed hit over 35% that is much slower any day.

On my 3060 with 12GB VRAM I am seeing a 34% memory improvement... so this is pretty great. That is when generating 512x704 images.

Ratinod commented 2 years ago

768x2048 or 1216x1216 on 8 gb vram (neonsecret/stable-diffusion). Incredible. 1024x1024 on 8 gb vram (and maybe even more)

My mistake... believed what was written before checking...

upd. It works!

blessedcoolant commented 2 years ago

768x2048 or 1216x1216 on 8 gb vram (neonsecret/stable-diffusion). Incredible.

How are you managing those sizes .. ? I've tested it out on an 8GB card. Just a minor step increase from 512x768 to 576x768.

i3oc9i commented 2 years ago

I checked out the branch refactoring-simplet2i on my Mac M1 Ultra with 128G.
Note 1: there is no VRAM on a Mac M1; the CPU and GPU share the same memory, but I'm able to generate a max of -W640 -H768.
Note 2: "Max VRAM used for this generation" always reports 0.00G, even when images are correctly generated.

1/ test -W1024 -H640

dream> a dog on the moon -W1024 -H640
>> This input is larger than your defaults. If you run out of memory, please use a smaller image.
100% 50/50 [02:05<00:00,  2.50s/it]
Generating: 100%| 1/1 [02:06<00:00, 126.06s/it]
>> Usage stats:
>>   1 image(s) generated in 126.26s
>>   Max VRAM used for this generation: 0.00G
Outputs:
outputs/img-samples/000017.951194398.png: "a dog on the moon" -s50 -W1024 -H640 -C7.5 -Ak_lms -F -S951194398

I get 000017 951194398

2/ test -W1024 -H1024

dream> a dog on the moon -W1024 -H1024
>> This input is larger than your defaults. If you run out of memory, please use a smaller image.
Generating:   0%| | 0/1 [00:00<?, ?it/s]
/AppleInternal/Library/BuildRoots/20d6c351-ee94-11ec-bcaf-7247572f23b4/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:705: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: product of dimension sizes > 2**31'
zsh: abort      python scripts/dream.py --full_precision
/Users/ivano/.miniconda/envs/dream-dev/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
conda list | grep torch
pytorch                   1.12.1                 py3.10_0    pytorch
pytorch-lightning         1.6.5              pyhd8ed1ab_0    conda-forge
torch-fidelity            0.3.0                    pypi_0    pypi
torchdiffeq               0.2.3                    pypi_0    pypi
torchmetrics              0.9.3              pyhd8ed1ab_0    conda-forge
torchvision               0.13.1                py310_cpu    pytorch
Ratinod commented 2 years ago

How are you managing those sizes .. ? I've tested it out on an 8GB card. Just a minor step increase from 512x768 to 576x768.

Yes, I should have checked everything first. I managed to generate 768x768 (7.8 / 8.0 VRAM) on my video card.
python optimizedSD/optimized_txt2img.py --prompt "Cyberpunk style image of a Telsa car reflection in rain" --n_iter 1 --n_samples 1 --H 768 --W 768

Vargol commented 2 years ago

@i3oc9i I've had noise when I've been at the limits of memory; see if 1024x576 renders and 1024x704 fails to render.

i3oc9i commented 2 years ago

@Vargol

1/ "a dog on the moon" -s50 -W1024 -H576 -C7.5 -Ak_lms -F -S3689890620 50/50 [01:45<00:00, 2.12s/it] 000018 3689890620

2/"a dog on the moon" -s50 -W1024 -H704 -C7.5 -Ak_lms -F -S1744562916 50/50 [02:44<00:00, 3.30s/it] Does not fail, bu I get noise.

Do you know the reason ?

Mac Studio M1 Ultra 128GRam

conda list | grep torch
pytorch                   1.12.1                 py3.10_0    pytorch
pytorch-lightning         1.6.5              pyhd8ed1ab_0    conda-forge
torch-fidelity            0.3.0                    pypi_0    pypi
torchdiffeq               0.2.3                    pypi_0    pypi
torchmetrics              0.9.3              pyhd8ed1ab_0    conda-forge
torchvision               0.13.1                py310_cpu    pytorch
JohnAlcatraz commented 2 years ago

@lstein There is an even more effective version of this optimization now that should ideally also be merged into this repo. The one this topic originally was about, which you already merged, allowed to generate up to 0.426 Megapixels on 8 GB VRAM. With the new improved version of the optimization, it goes up to 1.14 Megapixels. So a monumental improvement.

See discussion here: https://github.com/basujindal/stable-diffusion/pull/117

Thanks for the heads up. I've ported this into the refactor branch and pushed it. I tested on some large dimension images, and indeed it seems to work as advertised. We're going to have to do some work to make sure it doesn't break on some architectures but at least on my CUDA system it's a noticeable improvement in VRAM usage and has no effect on performance speed.
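For context, a minimal sketch of the idea behind that optimization (a paraphrase, not the exact code from basujindal#117 or the AUTOMATIC1111 commit): instead of materializing the full (batch*heads, n, n) attention matrix at once, the rows are processed in small slices, so only one slice-sized similarity matrix is alive at a time. The chunk parameter here roughly corresponds to the "loop steps" mentioned later in the thread.

import torch
from torch import einsum

def sliced_attention(q, k, v, scale, chunk=2):
    # q, k, v: (batch*heads, n_tokens, dim_head); chunk: rows of dim 0 handled per iteration
    out = torch.zeros(q.shape[0], q.shape[1], v.shape[2], device=q.device, dtype=q.dtype)
    for i in range(0, q.shape[0], chunk):
        end = i + chunk
        sim = einsum('b i d, b j d -> b i j', q[i:end], k[i:end]) * scale
        sim = sim.softmax(dim=-1)                       # only a chunk-sized slice exists here
        out[i:end] = einsum('b i j, b j d -> b i d', sim, v[i:end])
        del sim
    return out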

Ratinod commented 2 years ago

How are you managing those sizes .. ? I've tested it out on an 8GB card. Just a minor step increase from 512x768 to 576x768.

@blessedcoolant it really works! It is very important to change the file "attention.py".
3.0 / 8.0 VRAM: 832x768 (50 samples: 2:55)
7.8 / 8.0 VRAM: 1216x1216 (50 samples: crash after 50 samples ("Tried to allocate 1.41 GiB" error) ???)
4.9 / 8.0 VRAM: 1024x1024 (50 samples: 4:36) (in theory, the maximum for this model has been reached (2x the size of the dataset on which training was performed))
python optimizedSD/optimized_txt2img.py --prompt "Cyberpunk style image of a Telsa car reflection in rain" --n_iter 1 --n_samples 1 --H 1024 --W 1024
seed_689375_00002

blessedcoolant commented 2 years ago

@lstein There is an even more effective version of this optimization now that should ideally also be merged into this repo. The one this topic originally was about, which you already merged, allowed to generate up to 0.426 Megapixels on 8 GB VRAM. With the new improved version of the optimization, it goes up to 1.14 Megapixels. So a monumental improvement.

See discussion here: basujindal#117

The best way to implement it is likely similar to the way it was done in this PR, behind an extra flag, since it seems it might have some performance impact on some architectures: AUTOMATIC1111/stable-diffusion-webui@5bb126b

I have tested applying that optimization to this repo, and it works very well. The speed becomes very slightly slower for me (about 4.6%), but on 8 GB VRAM I can now generate images up to 512x2176 instead of just 512x832 with the optimization here that you already merged. I also checked that output from the same seed indeed stays identical. Very impressive optimization!

I applied these changes. I went from being able to generate a max resolution of 512x768 to 768x1024. It works surprisingly well. The inference speeds are not affected at lower resolutions, but as the size of the output goes up, so do the inference times, though they're still quite fast.

Gonna do more testing.


EDIT: Ok. Did some more optimization and now I'm able to generate a max res of 1088x1344 --- that is nearly 175% larger than what I was capable of doing earlier .. HOLY CRAP!!

There seems to be no inference speed impacts at smaller resolutions but larger resolutions (obviously take longer).

I generated this 1088x1344 image on an 8GB card in 3 minutes 6 seconds at 50 samples. Max VRAM used was 5.72G. Even though I have another 2GB free, it does not let me render larger images. Let me see if I can fix that.


Edit: Spoke too soon... The sampling is done at that res, but it crashes when trying to generate the final image. Let me see if I can fix that.


Edit 2: So far I have managed to get 960x1280 working properly by experimenting with various step and increment sizes. That is up from 512x768. Will do some more testing through the night and see if I can find ways to optimize this further.

magnusviri commented 2 years ago

I've read that the optimizations trade VRAM usage for speed. They're directly linked. Less VRAM, longer times. I think these additions are great but they need to be a flag. Unless there's a way to detect how much VRAM there is and switch when there isn't enough.
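Detecting it is possible on CUDA at least; a sketch along those lines (the 6 GB cut-off is only an example, not a tested threshold):

import torch

def pick_attention_path(threshold_gb=6.0):
    # No VRAM query on MPS/CPU, so default those to the low-memory path.
    if not torch.cuda.is_available():
        return 'low_memory'
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return 'standard' if total_gb >= threshold_gb else 'low_memory'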

blessedcoolant commented 2 years ago

Here are some findings from my own testing.


The inference time increases as you increase resolution as expected. So no surprise there.

In the optimized code, you can control the step size used for processing.

The higher the step count, the faster the inference, but you give up some maximum resolution for it.

With that in mind, here are some test results on an 8GB system. Prior to these changes, I could render a maximum of 512x768.

The ideal setting seems to be a step size of 4. It gives nearly 175% larger output at a relatively decent inference speed. But this requires a lot of testing to figure out what's actually worth the change.
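To put rough, illustrative numbers on that trade-off (the figures below assume a 512x768 image, i.e. a 64x96 latent, 8 attention heads, a cond+uncond batch of 2, and fp16; none of these were measured here):

# back-of-the-envelope size of the per-iteration attention buffer vs. chunk size
n_tokens = 64 * 96        # latent tokens for a 512x768 image (assumed)
rows = 2 * 8              # (cond + uncond) * heads (assumed)
bytes_per_el = 2          # fp16

for chunk in (rows, 8, 4, 2):   # rows == no slicing
    per_iter = chunk * n_tokens * n_tokens * bytes_per_el
    print(f"chunk {chunk:2d}: {per_iter / 2**30:.2f} GiB live at once")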

JohnAlcatraz commented 2 years ago

@blessedcoolant are you talking about some other changes you made on top of the changes I linked? I did benchmark the changes I linked, and the difference in speed between 8 steps and 2 steps is 3.3% for me. I posted the benchmark results in the discussion I linked above, but I can also post them here again:

Default: 5.0 it/s | 0.39 Megapixels Max Res
loop steps of 8: 4.94 it/s | Didn't test Max Res
loop steps of 4: 4.87 it/s | 0.79 Megapixels Max Res
loop steps of 2: 4.78 it/s | 1.14 Megapixels Max Res

So that optimization basically always makes sense for me with 2 steps; it's such a minimal slowdown. For me there is no reason to ever go for 4 or 8 steps. A flag to switch between the official code and the new code with a loop step of 2 is all that's needed.

blessedcoolant commented 2 years ago

So that optimization basically always makes sense for me with 2 steps, it's such a minimal slow down. There is no reason to ever go for 4 or 8 steps.

The difference between 2 and 8 is huge for me. Didn't log the numbers but felt like it took way longer. Let me test again.

JohnAlcatraz commented 2 years ago

I have done my benchmarks at 512x512 on a RTX 2070 Super. Always takes ~10 seconds to generate an image, default settings.

blessedcoolant commented 2 years ago

I have done my benchmarks at 512x512 on a RTX 2070 Super. Always takes ~10 seconds to generate an image, default settings.

I've been testing the max resolutions these can support. I'm guessing it'll be fast at 512, but the speed hit kicks in at higher resolutions.

JohnAlcatraz commented 2 years ago

It makes no sense to compare the speed at different resolutions. You need to use the same resolution for benchmarking the speed. A higher resolution is always way slower, even without any code changes.

blessedcoolant commented 2 years ago

It makes no sense to compare the speed at different resolutions. You need to use the same resolution for benchmarking the speed. A higher resolution is always slower.

I understand that. But I am more interested in checking out the inference speeds at higher resolutions, because the whole point of this change is to support higher resolutions within a set amount of memory. And the trade-off is inference time.

I'm not particularly looking to log it at the same res. Which I'm sure would have minimal difference. The idea is to compare them at their best and see which one delivers the best trade off between memory, resolution and inference time.

If I want to render at 512x512, I can do it right now. Why even bother with these changes, right?

JohnAlcatraz commented 2 years ago

I'm not particularly looking to log it at the same res. Which I'm sure would have minimal difference. The idea is to compare them at their best and see which one delivers the best trade off between memory, resolution and inference time.

That makes no sense regarding this discussion. Of course you can look at what resolution in SD gives you the "best trade off between memory, resolution and inference time", but that is completely independent of any optimizations. An optimization is about allowing you to go higher if you want to, but it does not force you to go higher. The only question is if an optimization has any downsides. It makes 0 sense to say that the "downside" of an optimization that allows you to generate an image at 1024x1024, which was impossible without the optimization, is that it's slower than generating an image at 512x512. That is not a downside of the optimization, that is a downside of a high resolution, that the optimization only allowed you to see.

The only relevant question regarding downsides of the optimization is if the same resolution that was possible to generate without the optimization is now slower to generate with the optimization enabled. That is what you need to benchmark, and what I benchmarked. That is what defines if an optimization should be enabled for everyone by default, or be optional behind some flag.

blessedcoolant commented 2 years ago

I'm not particularly looking to log it at the same res. Which I'm sure would have minimal difference. The idea is to compare them at their best and see which one delivers the best trade off between memory, resolution and inference time.

That makes no sense regarding this discussion. Of course you can look at what resolution in SD gives you the "best trade off between memory, resolution and inference time", but that is completely independent of any optimizations. An optimization is about allowing you to go higher if you want to, but it does not force you to go higher. The only question is if an optimization has any downsides. It makes 0 sense to say that the "downside" of an optimization that allows you to generate an image at 1024x1024, which was impossible without the optimization, is that it's slower than generating an image at 512x512. That is not a downside of the optimization, that is a downside of a high resolution, that the optimization only allowed you to see.

The only relevant question regarding downsides of the optimization is if the same resolution that was possible to generate without the optimization is now slower to generate with the optimization enabled. That is what you need to benchmark, and what I benchmarked. That is what defines if an optimization should be enabled for everyone by default, or be optional behind some flag.

Nothing to argue with there. Like I said, I wasn't looking to benchmark per se. I was just experimenting with the max resolution it can support and seeing what the inference times were at those levels.

lstein commented 2 years ago

I checked out the branch refactoring-simplet2i on my Mac M1 Ultra with 128G.
Note 1: there is no VRAM on a Mac M1; the CPU and GPU share the same memory, but I'm able to generate a max of -W640 -H768.
Note 2: "Max VRAM used for this generation" always reports 0.00G, even when images are correctly generated.

1/ test -W1024 -H640

dream> a dog on the moon -W1024 -H640
>> This input is larger than your defaults. If you run out of memory, please use a smaller image.
100% 50/50 [02:05<00:00,  2.50s/it]
Generating: 100%| 1/1 [02:06<00:00, 126.06s/it]
>> Usage stats:
>>   1 image(s) generated in 126.26s
>>   Max VRAM used for this generation: 0.00G
Outputs:
outputs/img-samples/000017.951194398.png: "a dog on the moon" -s50 -W1024 -H640 -C7.5 -Ak_lms -F -S951194398

I get 000017 951194398

2/ test -W1024 -H1024

dream> a dog on the moon -W1024 -H1024
>> This input is larger than your defaults. If you run out of memory, please use a smaller image.
Generating:   0%| | 0/1 [00:00<?, ?it/s]
/AppleInternal/Library/BuildRoots/20d6c351-ee94-11ec-bcaf-7247572f23b4/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:705: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: product of dimension sizes > 2**31'
zsh: abort      python scripts/dream.py --full_precision
/Users/ivano/.miniconda/envs/dream-dev/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
conda list | grep torch
pytorch                   1.12.1                 py3.10_0    pytorch
pytorch-lightning         1.6.5              pyhd8ed1ab_0    conda-forge
torch-fidelity            0.3.0                    pypi_0    pypi
torchdiffeq               0.2.3                    pypi_0    pypi
torchmetrics              0.9.3              pyhd8ed1ab_0    conda-forge
torchvision               0.13.1                py310_cpu    pytorch

A bit of instability is to be expected with a big change like this. In addition to the memory optimization, there have been oodles of internal code changes in the refactor branch. Take heart. Things will settle down in a few days.

lstein commented 2 years ago

There seems to be no inference speed impacts at smaller resolutions but larger resolutions (obviously take longer).

I have enough VRAM to generate large images on the unoptimized versions and I'll do some benchmarking of performance and VRAM tomorrow and post the results. Too much excitement for the day; I'm knocking off.

i3oc9i commented 2 years ago

@lstein

Note that I run on a Mac M1 Ultra with 128G of unified RAM, but there is a bug in PyTorch (1.12 and nightly) when requesting -W1024 -H1024:

/Users/ivano/Code/Ai/dream-dev/ldm/modules/embedding_manager.py:152: UserWarning: The operator 'aten::nonzero' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at  /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1659484612588/work/aten/src/ATen/mps/MPSFallback.mm:11.)
  placeholder_idx = torch.where(
  AppleInternal/Library/BuildRoots/20d6c351-ee94-11ec-bcaf-7247572f23b4/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:705: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: product of dimension sizes > 2**31'

  zsh: abort      python scripts/dream.py --full_precision --outdir ../@Stuffs/images/samples
/Users/ivano/.miniconda/envs/dream-dev/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d  

The bug has been reported to the PyTorch team:

https://github.com/pytorch/pytorch/issues/84039