invoke-ai / InvokeAI

InvokeAI is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, supports terminal use through a CLI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

MPS support for doggettx-optimizations #431

Closed Any-Winter-4079 closed 2 years ago

Any-Winter-4079 commented 2 years ago

Okay, so I've seen @lstein has added x = x.contiguous() if x.device.type == 'mps' else x to ldm/modules/attention.py in the doggettx-optimizations branch, but there's another error happening now, KeyError: 'active_bytes.all.current', and it has to do with this function in attention.py:

def forward(self, x, context=None, mask=None):
        h = self.heads

        q_in = self.to_q(x)
        context = default(context, x)
        k_in = self.to_k(context)
        v_in = self.to_v(context)
        del context, x

        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q_in, k_in, v_in))
        del q_in, k_in, v_in

        r1 = torch.zeros(q.shape[0], q.shape[1], v.shape[2], device=q.device)

        stats = torch.cuda.memory_stats(q.device)
        mem_active = stats['active_bytes.all.current']
        mem_reserved = stats['reserved_bytes.all.current']
        mem_free_cuda, _ = torch.cuda.mem_get_info(torch.cuda.current_device())
        mem_free_torch = mem_reserved - mem_active
        mem_free_total = mem_free_cuda + mem_free_torch

        gb = 1024 ** 3
        tensor_size = q.shape[0] * q.shape[1] * k.shape[1] * 4
        mem_required = tensor_size * 2.5
        steps = 1

        if mem_required > mem_free_total:
            steps = 2**(math.ceil(math.log(mem_required / mem_free_total, 2)))
            # print(f"Expected tensor size:{tensor_size/gb:0.1f}GB, cuda free:{mem_free_cuda/gb:0.1f}GB "
            #       f"torch free:{mem_free_torch/gb:0.1f} total:{mem_free_total/gb:0.1f} steps:{steps}")

        if steps > 64:
            max_res = math.floor(math.sqrt(math.sqrt(mem_free_total / 2.5)) / 8) * 64
            raise RuntimeError(f'Not enough memory, use lower resolution (max approx. {max_res}x{max_res}). '
                               f'Need: {mem_required/64/gb:0.1f}GB free, Have:{mem_free_total/gb:0.1f}GB free')

        slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
        for i in range(0, q.shape[1], slice_size):
            end = i + slice_size
            s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale

            s2 = s1.softmax(dim=-1)
            del s1

            r1[:, i:end] = einsum('b i j, b j d -> b i d', s2, v)
            del s2

        del q, k, v

        r2 = rearrange(r1, '(b h) n d -> b n (h d)', h=h)
        del r1

        return self.to_out(r2)

This is basically the code that detects your free memory and then splits the softmax operation into steps, to allow generating larger images.

Now, because we are on Mac, I'm not sure @lstein can help us much (unless he has one around), but I'm opening this issue for anyone who wants to collaborate on porting this functionality to M1.

lstein commented 2 years ago

If someone knows how to get free VRAM memory on MPS devices, we just need to replace the torch.cuda calls.

lstein commented 2 years ago

I Googled around, and there doesn't seem to be an equivalent set of memory interrogation calls for CPU.

I'm not sure how the M1 works, but if it is sharing main memory (i.e. RAM) you might be able to get the needed metrics using psutil
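Something along these lines, perhaps (untested sketch; assumes the GPU shares system RAM, so psutil's view of free memory is the relevant one):

import psutil

# On a unified-memory machine, available system RAM is a reasonable
# stand-in for "free VRAM"
mem_free_total = psutil.virtual_memory().available
print(f"approx. free memory: {mem_free_total / 1024**3:.1f} GB")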

Vargol commented 2 years ago

I just hacked it all out in my fork and set slice_size to 1 :-) That gets me doing 1024x1024 (very slowly) on an 8GB M1 mini. Be interesting to see the results of that on a larger GPU.

Any-Winter-4079 commented 2 years ago

I just hacked it all out in my fork and set slice_size to 1 :-) That gets me doing 1024x1024 (very slowly) on an 8GB M1 mini. Be interesting to see the results of that on a larger GPU.

It definitely works. I'll add the results below. In this v1, I've changed ldm/modules/diffusionmodules/model.py and ldm/modules/attention.py from the doggettx-optimizations branch. attention.py.zip model.py.zip

In model.py I've commented out

# stats = torch.cuda.memory_stats(q.device)
# mem_active = stats['active_bytes.all.current']
# mem_reserved = stats['reserved_bytes.all.current']
# mem_free_cuda, _ = torch.cuda.mem_get_info(torch.cuda.current_device())
# mem_free_torch = mem_reserved - mem_active
# mem_free_total = mem_free_cuda + mem_free_torch

and left steps at 1

tensor_size = q.shape[0] * q.shape[1] * k.shape[2] * 4
mem_required = tensor_size * 2.5
steps = 1

And commented out

# if mem_required > mem_free_total:
#     steps = 2**(math.ceil(math.log(mem_required / mem_free_total, 2)))

so there's probably an improvement to be made using psutil here.

Where I did use psutil is in attention.py. Again, commenting this out

# stats = torch.cuda.memory_stats(q.device)
# mem_active = stats['active_bytes.all.current']
# mem_reserved = stats['reserved_bytes.all.current']
# mem_free_cuda, _ = torch.cuda.mem_get_info(torch.cuda.current_device())
# mem_free_torch = mem_reserved - mem_active
# mem_free_total = mem_free_cuda + mem_free_torch

but importing psutil (import psutil) and using mem_free_total = psutil.virtual_memory().available. So we can use it to calculate steps (the same way they do), instead of leaving it at steps = 1:

gb = 1024 ** 3
tensor_size = q.shape[0] * q.shape[1] * k.shape[1] * 4
mem_required = tensor_size * 2.5
steps = 1

if mem_required > mem_free_total:
        steps = 2**(math.ceil(math.log(mem_required / mem_free_total, 2)))
Any-Winter-4079 commented 2 years ago

Definitely a step in the right direction

Screenshot 2022-09-08 at 16 20 44
lstein commented 2 years ago

This is looking pretty encouraging. When you are satisfied with the performance on MPS, could you make your changes conditional on the device type so that CUDA systems will work as well? Then make a PR against the doggettx-optimizations branch.

Think this might be done by tonight? I'm planning a development freeze, some testing, and then pulling into main over the weekend.

Any-Winter-4079 commented 2 years ago

That's the plan, yes, to make the changes conditional on device type. I'm not sure about the PR tonight (I've never done a pull request and it's already 00:00 here, so I might be a bit tired to look into how to make one tonight: fork, pull...), but I'll leave my code here in #431 in 1-1.5 hours max.
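Roughly, the change will be a device check around the memory query, something like this (a sketch, not the exact code for the PR):

import psutil
import torch

def get_mem_free_total(device):
    # psutil on MPS (unified memory), CUDA stats everywhere else
    if device.type == 'mps':
        return psutil.virtual_memory().available
    stats = torch.cuda.memory_stats(device)
    mem_active = stats['active_bytes.all.current']
    mem_reserved = stats['reserved_bytes.all.current']
    mem_free_cuda, _ = torch.cuda.mem_get_info(torch.cuda.current_device())
    mem_free_torch = mem_reserved - mem_active
    return mem_free_cuda + mem_free_torch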

Any-Winter-4079 commented 2 years ago

Okay, changes are done. I'm doing the testing.

Any-Winter-4079 commented 2 years ago

attention 3.py.zip model 2.py.zip

Any-Winter-4079 commented 2 years ago

@lstein Above are the files. ldm/modules/diffusionmodules/model.py and ldm/modules/attention.py

Performance seems comparable to vanilla CompVis with some M1 workarounds, dbc8fc79008795875eb22ebf0c57927061af86bc (lstein fork), which is the best performance I've seen on M1.

Regarding memory, I have to do more digging, because while this afternoon I could generate 896x896 and 1024x768 (results I couldn't generate before), now at night I'm back to memory errors.

In any case, this change should benefit CUDA users while allowing MPS devices to (apparently/presumably/hopefully) function at least as well as we currently do on the development branch.

Any-Winter-4079 commented 2 years ago

In the end, still awake :) And I've found something very interesting for version 2 of these changes (post-merge into development): https://pullanswer.com/questions/mps-mpsndarray-error-product-of-dimension-sizes-2-31

The common error I think all M1 users get, Error: product of dimension sizes > 2**31, is referenced here

Screenshot 2022-09-09 at 02 19 18

which made me think about tinkering with values, and setting steps = 64 (max value), it generates a 1024x1024 on my M1! However, if we compute it the way the doggettx-optimizations branch currently does, steps = 2**(math.ceil(math.log(mem_required / mem_free_total, 2))), it fails.

It takes a long time with steps=64, but testing around, it also works with steps=32, and even steps=4 (taking much less time).

Pretty nice, and calls for some testing tomorrow

PS: I'd just merge the 2 files above and leave this "finding" for a future PR.

heurihermilab commented 2 years ago

I can confirm that it's working and an improvement. Speedup was 2x over plain development branch, and now I'm testing larger image sizes...

Environment: Development branch with the two files above swapped in. Machine: MBP 14", M1 Pro, 16GB, latest OS, running miniforge with a base of Python 3.10.6. Browser: Firefox 104.0.2.

Vargol commented 2 years ago

which made me think about tinkering with values, and setting steps = 64 (max value), it generates a 1024x1024 on my M1! However, if we do it like this (currently the way in Doggettx-optimizations branch) steps = 2**(math.ceil(math.log(mem_required / mem_free_total, 2))) it fails

That's odd, I've managed a very slow 1024x1024 from Doggettx's optimizations on my 8GB M1. Here's the code I'm using; as I said before, it's a cut of lstein's code but with the Doggettx code added and hard-coded to 1 step.

https://github.com/Vargol/stable-diffusion_m1_8gb

Have you got a lot of other stuff running at the same time eating up memory?

Any-Winter-4079 commented 2 years ago

This is my memory usage right after booting the computer. Only the model loaded + 1 VS Code tab with the code.

Screenshot 2022-09-09 at 11 55 52

So, I introduced 2 prints

print('mem_required', mem_required / 10**9)
print('mem_free_total', mem_free_total / 10**9)

and I get

mem_required 25.17630976
mem_free_total 51.890225152

The mem_free_total is calculated with mem_free_total = psutil.virtual_memory().available and it makes sense with the picture above from Activity Monitor: (64 - 13) GB = 51 GB. However, it crashes with Error: product of dimension sizes > 2**31 even though, in theory, it only needs 25GB. Which doesn't make sense, does it?

In this discussion https://pullanswer.com/questions/mps-mpsndarray-error-product-of-dimension-sizes-2-31 they were saying that the problem was with Metal and that depending on the size/number of dimensions of the operation (e.g. einsum), a different algorithm might get selected.

So maybe you give it a smaller array and it fails, but feed it a bigger array and it chooses a different algorithm that doesn't have the Error: product of dimension sizes > 2**31 bug, and it works. That's my understanding.

Any-Winter-4079 commented 2 years ago

For example, setting the steps as you do (with a fixed value instead of calculating it), and "banana sushi" -s50 -C7.5 -n3 -W896 -H896, here are my results:

With steps = 1

mem_required 25.17630976
mem_free_total 53.156200448

Result: Error: product of dimension sizes > 2**31

With steps = 2

mem_required 25.17630976
mem_free_total 52.520927232

It works.

Why? It's really not apparent to me.


The problem with hard-setting the steps, though, is that, as the code progresses, mem_free_total decreases (note: leak or expected behavior?).

mem_required 25.17630976
mem_free_total 32.847593472

So, maybe there could be a point where it fails mid-execution, because we hard-set steps = 2 and it can't do it anymore.

The solution I'm thinking of is a mix of both techniques: setting the steps dynamically (so it doesn't run out of memory), but also setting steps = max(2, steps), not letting it be steps = 1, where it throws the error, for images larger than 512x512 or something like that.
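Very roughly, something like this helper (just a sketch; the function name is mine and the 4096 threshold is only a stand-in for "larger than 512x512" at this layer):

import math
import psutil

def pick_steps(q_shape, k_shape, bytes_per_elem=4, fudge=2.5):
    # dynamic step count from available RAM, with a floor of 2 for large inputs
    mem_free_total = psutil.virtual_memory().available
    mem_required = q_shape[0] * q_shape[1] * k_shape[1] * bytes_per_elem * fudge
    steps = 1
    if mem_required > mem_free_total:
        steps = 2 ** math.ceil(math.log(mem_required / mem_free_total, 2))
    # never run a large call in a single slice, where the
    # 'product of dimension sizes > 2**31' error appears
    if q_shape[1] > 4096:
        steps = max(2, steps)
    return steps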

Vargol commented 2 years ago

The 2**31 seems to be einsum trying to use a tensor with more than 2,147,483,648 values as part of its calculation. It's not a memory thing, just a bug or limitation in the einsum implementation somewhere.

I remember having a similar issue when I simply set steps to 1 but allowed the slice_size calculation to go ahead

slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]

which wasn't in an older cut of the code.

My code does this; is that what you tried?

        steps=1

        for i in range(0, q.shape[0], steps):
            end = i + steps 

And yes, I appreciate that if people try even bigger images they may run out of memory, but for me more steps just means slower renders, and 1024x1024 is already 50 s/it, as in n_sample=50, n_iter=1 takes 40-odd minutes to generate an image.

lstein commented 2 years ago

attention 3.py.zip model 2.py.zip

Looks like I was monitoring the wrong thread! I'll fold in these changes this morning and freeze development for testing. Thanks so much for this.

Any-Winter-4079 commented 2 years ago

@Vargol I can take slice_size up to ~10k. More than that, it seems to fail.

Screenshot 2022-09-09 at 15 45 55

Have you tried a bigger slice_size? You are basically using 1 as the slice_size, which seems very low (even for 8GB).

I tried something very similar to your code, simply with a larger slice_size, in my case using this formula they have, slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1], which depends on steps.

For example, for "banana sushi" -s5 -C7.5 -n3 -W896 -H896 then q.shape = torch.Size([16, 12544, 40]) so q.shape[1] is 12544 and given that my mem_free_total >= mem_required, it maintains steps = 1, so with the formula above, keeps slice_size = 12544 and fails.

We should be able to find a sweet spot, shouldn't we?

Any-Winter-4079 commented 2 years ago

For example, slice_size = 8192 (or below) allows me to run 896x896; slice_size = 6000 (or below) allows me to run 1024x1024.

Hopefully there's some formula we can come up with for all M1 machines (8GB to 128GB)


Update: So for 1024x1024, the doggettx-optimizations branch suggests I use slice_size = 8192 (which fails), but manually hard-setting slice_size = 8185 works. So their calculation is not far off, but not completely precise for M1.

Vargol commented 2 years ago

So with my fixed value steps = slice_size, running dream> "banana sushi" -s1 -C7.5 -n1 -W832 -H768

1 - 6 steps work; 6 steps are over 5x slower than 1 step

step = slice_size = 1
 1/1 [00:20<00:00, 20.45s/it

step = slice_size = 6
1/1 [01:50<00:00, 110.50s/it]

7 - 10 steps blow memory while sampling,

step = slice_size = 7 - 10
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: 
    (null)
    Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
    <AGXG13GFamilyCommandBuffer: 0x16d1049d0>

steps >= 11 fails with an oversized(?) buffer before sampling; the error shows up in dream.py

step = slice_size = 11
RuntimeError: Invalid buffer size: 5.94 GB
Any-Winter-4079 commented 2 years ago

@Vargol Hmm, I'll study your case too. The weirdest thing happened to me: the Doggettx branch suggested a slice size of 8192. Guess what? It failed. But 8191 works for 1024x1024.

Any-Winter-4079 commented 2 years ago

Oh, this is interesting. So my computer can take slice_size = 8191 for 1024x1024. Found by trial and error.

Okay, so what slice_size could it take for 896x896? Well, I did (1024x1024 / 896x896) * 8191 = 10698.4489796, which rounded up is 10699. I tried that value and... it works!

But, I tried 10700 (one more) and it fails!

I'm sure there is a formula to be found (including RAM), but at least we seem to be able to hack the max slice_size for our own devices, which is awesome!


Update: So, I picked a random size. I wanted a 3200x1600 image. I used the formula and got slice_size = 1677.5168. This time, I could not round up, but rounding down to 1677, it works again! (I only did it for 1 step, but hey, it completed successfully.) And 1678 fails.

Any-Winter-4079 commented 2 years ago

I'll study it a bit more, but the problem with Doggettx (besides 8192 vs 8191) is that sometimes it suggests an even larger slice_size (I guess when it computes steps = 1 instead of steps = 2 based on memory), and then it breaks. If it weren't for that, maybe we could have used Doggettx's slice_size - 1

Any-Winter-4079 commented 2 years ago

@i3oc9i can you try your max slice_size for say 1024x1024 in your Mac with 128GB? We might be able to work out a formula including the RAM. Or someone else with a Mac different than 64 GB (which I have)

i3oc9i commented 2 years ago

@Any-Winter-4079 Sorry, these last two days I was busy at work. I just checked out the development branch 75f633cda887d7bfcca3ef529d25c52461e11d99

and it fails. Maybe I'm missing something? Where is the new code to test?

dream> "a don on the moon" -s50 -W1024 -H1024 -C7.5 -Ak_lms
>> This input is larger than your defaults. If you run out of memory, please use a smaller image.
/Users/ivano/Code/Ai/dream-dev/ldm/modules/embedding_manager.py:152: UserWarning: The operator 'aten::nonzero' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at  /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1659484612588/work/aten/src/ATen/mps/MPSFallback.mm:11.)
  placeholder_idx = torch.where(
Generating:   0%|                                                                                                                                                                                                                                                                                                                          | 0/1 [00:00<?, ?it/s/AppleInternal/Library/BuildRoots/20d6c351-ee94-11ec-bcaf-7247572f23b4/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:705: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: product of dimension sizes > 2**31'                                                                  | 0/50 [00:00<?, ?it/s]
zsh: abort      python scripts/dream.py --full_precision --outdir ../@Stuffs/images/samples
/Users/ivano/.miniconda/envs/dream-dev/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

dream> "test" -s50 -W832 -H832 -C7.5 -Ak_lms -S12345678 (run but I get noise)

ryudrigo commented 2 years ago

Could someone with a Mac please run these lines?

import torch
print (torch.cuda.get_device_name(0))

That's the best way I know to detect if it's a Mac GPU, but I couldn't find what to check it against. Thanks!

Any-Winter-4079 commented 2 years ago

@ryudrigo

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/torch/cuda/__init__.py", line 329, in get_device_name
    return get_device_properties(device).name
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/torch/cuda/__init__.py", line 359, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/torch/cuda/__init__.py", line 211, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Basically, the command torch.cuda is going to fail because we don't have CUDA. You can detect it like this: device_type = 'mps' if x.device.type == 'mps' else 'cuda'. Here you have 2 files that check for Mac GPU: https://github.com/lstein/stable-diffusion/issues/431#issuecomment-1241327646
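If your torch build is new enough, there's also torch.backends.mps, which avoids poking at torch.cuda at all (sketch):

import torch

if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
    print(torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
print(device)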

Doggettx commented 2 years ago

I'll study it a bit more, but the problem with Doggettx (besides 8192 vs 8191) is that sometimes it suggests an even larger slice_size (I guess when it computes steps = 1 instead of steps = 2 based on memory), and then it breaks. If it weren't for that, maybe we could have used Doggettx's slice_size - 1

I wouldn't adjust the slice_size, because then it starts running incomplete parts of the whole array. It's best to increase the multiplier, which is probably too low then. So this part:

mem_required = tensor_size * 2.5

Probably needs more than .5 extra; you could try 2.6, or if you want to be safe just put it at 3. It'll just scale up the steps a bit earlier than needed, which scales down the slice_size.

On a side note, it doesn't really have to step up in powers of 2, I just found that that was faster on average. You could change this part:

    slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
    for i in range(0, q.shape[1], slice_size):
        end = i + slice_size

To something like

    slice_size = q.shape[1] // steps
    for i in range(0, q.shape[1], slice_size):
        end = min(q.shape[1], i + slice_size)

then it can run at any step or slice_size (even higher than 64, but you'll crash later anyhow due to other parts running out of memory)

Any-Winter-4079 commented 2 years ago

@i3oc9i I've run 3 tests on M1 64 GB. Try these in your M1 128 GB.

Test 1 Development branch

git checkout development

"banana sushi" -s50 -C7.5 -n1 -W512 -H512 50/50 [00:28<00:00, 1.74it/s] "banana sushi" -s50 -C7.5 -n1 -W896 -H896 Error: product of dimension sizes > 2**31' Note: if this test runs on your M1, try a size that fails


Test 2 Doggettx-optimizations branch with M1 changes

git checkout doggettx-optimizations

Then change ldm/modules/diffusionmodules/model.py and ldm/modules/attention.py with these two files: https://github.com/lstein/stable-diffusion/issues/431#issuecomment-1241327646
"banana sushi" -s50 -C7.5 -n1 -W512 -H512: 50/50 [00:28<00:00, 1.74it/s]
"banana sushi" -s50 -C7.5 -n1 -W896 -H896: Error: product of dimension sizes > 2**31
Note: I assume, if it failed or ran earlier (development branch), here it will be the same. So with the doggettx-optimizations branch (adapted to run on M1), CUDA devices can generate larger images, while M1 should have the same performance, although we can't generate larger images (yet).


Test 3 Doggettx-optimizations branch with M1 changes setting a fixed slice_size

In attention.py and in model.py, add this line slice_size = 8191 before this line for i in range(0, q.shape[1], slice_size):

Once both files are updated, let's test again:

"banana sushi" -s50 -C7.5 -n1 -W512 -H512 50/50 [00:29<00:00, 1.72it/s] "banana sushi" -s50 -C7.5 -n1 -W896 -H896 50/50 [02:42<00:00, 3.24s/it] Note: here that size that didn't run should run.

The performance for 512x512 is pretty much the same, and voilà, it runs at 896x896. In fact, it can do much larger images by changing the slice_size. How? Well, first you have to find the maximum slice_size your computer can use for, say, 1024x1024. In my case, it's slice_size = 8191. You find this by trial and error (if it throws Error: product of dimension sizes > 2**31, try reducing it).

Now, to generate 896x896, I can do (1024*1024) / (896*896) * 8191 = 10698.44. So either 10698 or 10699 should be the maximum slice_size for 896x896 on my M1.

If I want to generate 3200x1600, I can do (1024*1024) / (3200*1600) * 8191 = 1677.5168. So either 1677 or 1678 should be the maximum slice_size for 3200x1600 on my M1.
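In code, that rule of thumb looks something like this (quick sketch; 8191 is just my measured value for 1024x1024 on this machine):

def scaled_slice_size(width, height, ref_slice=8191, ref_pixels=1024 * 1024):
    # scale a known-good slice_size from a reference resolution to another one
    return int(ref_pixels / (width * height) * ref_slice)

print(scaled_slice_size(896, 896))    # 10698
print(scaled_slice_size(3200, 1600))  # 1677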

If you can do these tests, it would be great to know the results, to see first, if there is a loss in performance, and second what the maximum slice_size is on your M1 (say for 1024x1024).

Any-Winter-4079 commented 2 years ago

@Doggettx I will try and update with my results. So, I've tried with mem_required = tensor_size * 3 and mem_required = tensor_size * 4 and it still fails with "banana sushi" -s50 -C7.5 -n1 -W896 -H896

I'll test more thoroughly when I wake up, but when I tried earlier today, I printed the mem_required and mem_free_total, as well as the steps and slice_sizes chosen for each call to the forward function.

My understanding is that because the image is downsampled (?), the array size varies in different calls to forward. So sometimes steps = 1 is chosen, while other times steps = 2 is chosen.

The thing is, sometimes I'll have 30GB of available memory and the operation requires much less, so steps = 1 is chosen, and then slice_size is set to q.shape[1]. But if that slice_size is too large (see for example my limit of slice_size = 8191 for 896x896), I get an error, even if I have plenty of free memory. It seems to be an error related to Metal (Mac related) and not a memory error. So we need a workaround.

But anyway, I'll carefully re-read your comment tomorrow and will try to apply it. I'll also update with more/better info tomorrow.

Oh, and you are super correct in that the maximum slice_size is not necessarily the best. We should probably try to divide the array as well as we can, while not passing a certain limit (if that is the problem, anyway).

ryudrigo commented 2 years ago

Here is my stab towards what's happening:

If you take this line s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale

you can see that slice_size * q.shape[0] * k.shape[1] is the size of the tensor involved. q.shape[0] is always 16 in this repo. Since the actual batch_size is always 1, this first dimension doesn't change. q.shape[1] depends on the product of the image width and height. For 1024x1024, this product is 16384. One can use that to calculate proportionally, e.g. 512x512 is 8192.

So, that must not exceed 2**31. Let's take 10699 for 896x896 as an example. k.shape[1] is 12544. q.shape[0] is 16. 16*12544*10699 is 2147332096, while 16*12544*10700 is 2147532800. The latter is just a bit above 2**31.

There's some math that can be done based on that to calculate the number in advance. I'm trying to do that and the fix for the combination of my last optimization with Doggettx's in one fell swoop.
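If that holds, the limit can be computed up front instead of by trial and error (sketch; the helper name is mine):

def max_mps_slice_size(batch_heads, context_len, limit=2**31):
    # largest slice_size keeping batch_heads * slice_size * context_len below the limit
    return (limit - 1) // (batch_heads * context_len)

print(max_mps_slice_size(16, 12544))  # 10699 for 896x896, matching the numbers above
print(max_mps_slice_size(16, 16384))  # 8191 for 1024x1024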

i3oc9i commented 2 years ago

@Any-Winter-4079

I did a quick try before going to sleep: with 8191 I can get larger images, up to 1024, but I get a noisy image. I will do more tests when I wake up.

Doggettx commented 2 years ago

Here is my stab towards what's happening:

If you take this line s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale

you can see that slice_size * q.shape[0] * k.shape[1] is the size of the tensor involved. q_shape[0] is always 16 in this repo. Since the actual batch_size is always 1, this first dimension doesn't change. q.shape[1] depends on the product of the image width and height. For 1024x1024, this product is 16384. One can use that to calculate proportionally, e.g. 512x512 is 8192.

So, that must not exceed 2**31. Let's take 10699 for 896x896 as an example. k.shape[1] is 12544. q.shape[0] is 16. 16*12544*10699 is 2147332096, while 16*12544*10700 is 2147532800. The latter is just a bit above 2**31.

There's some math that can be done based on that to calculate the number in advance. I'm trying to do that and the fix for the combination of my last optimization with Doggettx's in one fell swoop.

Yeah, that makes sense, but if I understand it correctly, does that mean the maximum tensor size on the M1 is 2GB? You'd still run into issues at higher resolutions after the CrossAttention if that's the case.

Any-Winter-4079 commented 2 years ago

@i3oc9i About noisy images, here are some experiments (I'll update this comment with more experiments) "banana sushi" -s10 -W1024 -H1024 -n10 -C7.5 -Ak_lms Setting slice_size = 3000 in attention.py and model.py

Screenshot 2022-09-10 at 11 18 07

10 image(s) generated in 596.78s (~60s/image) 10/10 images generated correctly

"banana sushi" -s10 -W1024 -H1024 -n10 -C7.5 -Ak_lms Setting slice_size = 5000 in attention.py and model.py

Screenshot 2022-09-10 at 11 31 08

10 image(s) generated in 559.00s (~56s/image) 0/10 images generated correctly


Okay, so (for 1024x1024), max slice_size = 8192 generates images (does not crash), but they are noisy. Since slice_size = 3000 generates good images and slice_size = 5000 doesn't, could it be that max slice_size = 8192 / 2 - 1 to run and generate non-noisy images?

"banana sushi" -s10 -W1024 -H1024 -n10 -C7.5 -Ak_lms Setting slice_size = 4095 in attention.py and model.py

Screenshot 2022-09-10 at 11 48 47

10 image(s) generated in 563.59s (~56s/image) 10/10 images generated correctly

Actually, it's a bit more. slice_size = 4369 generates good images; slice_size = 4370 generates noisy images on my M1. Can anyone find the logic behind this number? And @i3oc9i, is 4369 your max slice_size to generate non-noisy 1024x1024 images too, even though you have 128 GB RAM / more GPU cores, I assume? And if I could get a test from someone with an M1 and less than 64 GB, that'd be great. @milezzz Tested with "banana sushi" -s10 -W1024 -H1024 -n1 -C7.5 -Ak_lms

i3oc9i commented 2 years ago

@Any-Winter-4079

1/ 4095 Works I got nice sushi

"banana sushi" -s10 -W1024 -H1024 -n10 -C7.5 -Ak_lms
10 image(s) generated in 403.70s ( ~40s/image)

2/ 4369 Still works I got nice sushi

"banana sushi" -s10 -W1024 -H1024 -n10 -C7.5 -Ak_lms
10 image(s) generated in 408.17s ( ~40s/image)

3/ 4370 >>NOISY IMAGE<< so this number is not related to the size of the memory, at least for the 64G and 128G configurations


Also, this approach using the 4369 number does not cost more speed

"banana sushi" -s10 -W896 -H512  -C7.5 -Ak_lms -n10
10 image(s) generated in 93.32s

when compared to my max image size on the 1.13 release

"banana sushi" -s10 -W896 -H512  -C7.5 -Ak_lms -n10
10 image(s) generated in 95.57s
Any-Winter-4079 commented 2 years ago

That's awesome news!

@Vargol could you test with your M1 8GB? If slice_size = 4369 works and slice_size = 4370 generates noise, I assume it's going to be the same for all M1s (8GB - 128GB).

If this is the case, the only thing left would be how to account for the M1's memory so it doesn't use swap (because I assume that is what makes @Vargol's performance decrease on his M1 with a bigger slice_size).

Vargol commented 2 years ago

I'll add it to the TODO list. I'm just running the last of the preflight tests; it's got around 10 minutes more to run, then I need to re-run the last txt2img test as it came up with a blank image for me.

Then I'll test this, but I'm sure that anything over 6 was failing for me, and 4369 is way bigger than 6 :-)

lstein commented 2 years ago

Glad to see you folks are working on the M1 issues. Do you have performance timings on "main" for comparison? Given the slow speeds on M1 to begin with, I don't want to make things worse when we officially release.

Any-Winter-4079 commented 2 years ago

Yes, I can add some comparisons. We've identified the issue, so it shouldn't take very long to have a great solution ready, but more testing is needed, especially from people with 8-32GB RAM.

Vargol commented 2 years ago

Well, it ran, which is more than I expected :-) but I got noise at the low value.

In the 1.1.4 release test code I changed

        slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
        for i in range(0, q.shape[1], slice_size):

to

        slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
        slice_size = 4369
        for i in range(0, q.shape[1], slice_size):

and got noise. I'll try a few lower values, but the runs take 20 minutes per image

Any-Winter-4079 commented 2 years ago

Try to do it for fewer steps maybe. "banana sushi" -s1 -W1024 -H1024 -C7.5 -Ak_lms -S2792018001 takes 5 seconds on my M1. It's not a great image, but it's clearly not noise.

Screenshot 2022-09-10 at 15 42 56

Since there aren't that many RAM sizes, as a last resort we could hard-code slice_sizes by device RAM (rough sketch after the list). For a 1024x1024 image:

8GB: slice_size = 1
16GB: ? (need tester)
32GB: ? (need tester)
64 GB: slice_size = 4095 or 4369
128 GB: slice_size = 4095 or 4369
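That fallback could look roughly like this (sketch; the 16GB/32GB entries are placeholders until someone tests):

import psutil

# slice_size fallback for 1024x1024 keyed by total RAM (values from this thread;
# None means it still needs a tester)
SLICE_SIZE_BY_RAM_GB = {8: 1, 16: None, 32: None, 64: 4369, 128: 4369}

def slice_size_for_this_machine(default=1):
    total_gb = round(psutil.virtual_memory().total / 1024**3)
    # use the largest known tier that doesn't exceed this machine's RAM
    known = {gb: s for gb, s in SLICE_SIZE_BY_RAM_GB.items() if s and gb <= total_gb}
    return known[max(known)] if known else default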

For anyone interested in testing: replace ldm/modules/diffusionmodules/model.py and ldm/modules/attention.py with the content of these files, model 3.py.zip attention 4.py.zip, and then try tinkering with slice_size (note: you have to update the value in both files).

For example, for my M1 64GB: "banana sushi" -s1 -W1024 -H1024 -C7.5 -Ak_lms -S2792018001 runs with slice_size = 8191 but fails for slice_size = 8192. Also, even if it runs with slice_size = 8191, it needs slice_size = 4369 or less to generate non-noisy images.

Now, where do these values come from? 8191, thanks to @ryudrigo, we know comes from k.shape[1] * q.shape[0] * slice_size, which when it passes 2**31 throws a Metal error (even if you have enough RAM). Where 4369 comes from, to generate non-noisy images, we're not sure yet.

ryudrigo commented 2 years ago

Here is my stab towards what's happening: If you take this line s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale you can see that slice_size * q.shape[0] * k.shape[1] is the size of the tensor involved. q.shape[0] is always 16 in this repo. Since the actual batch_size is always 1, this first dimension doesn't change. q.shape[1] depends on the product of the image width and height. For 1024x1024, this product is 16384. One can use that to calculate proportionally, e.g. 512x512 is 8192. So, that must not exceed 2**31. Let's take 10699 for 896x896 as an example. k.shape[1] is 12544. q.shape[0] is 16. 16*12544*10699 is 2147332096, while 16*12544*10700 is 2147532800. The latter is just a bit above 2**31. There's some math that can be done based on that to calculate the number in advance. I'm trying to do that and the fix for the combination of my last optimization with Doggettx's in one fell swoop.

Yea that makes sense, but If I understand it correctly, does that mean the maximum tensor size on the M1 is 2gb? You'd still run into issues at higher resolutions after the CrossAttention if that's the case.

Does that mean it would be useful to apply a similar partitioning in other parts of the code? At least for the M1?

Doggettx commented 2 years ago

@Vargol

and got noise I'l try a few lower values but the runs that 20 minutes per image

Are you making sure to use the code example I posted earlier when you're changing slice_sizes? Without that, you're going to get wrong array indices for the last slice.

If you don't do that, it'll start inserting data from the start of the array, which will mess up the data.

Vargol commented 2 years ago

Hmmm, the version in the pre-release is behaving differently from your original code, @Doggettx. slice_size=1 is now really slow?

Any-Winter-4079 commented 2 years ago

@ryudrigo With these changes alone, at least it doesn't seem to throw an error (with txt2img), but maybe there are other places to apply these optimisations.

Any-Winter-4079 commented 2 years ago

Hmmm, the version in the pre-release is behaving differently to your original code @Doggettx , slice_size=1 is now really slow ?

You mean it's different from the branch doggettx-optimizations? Or an older version you had?

Doggettx commented 2 years ago

Here is my stab towards what's happening: If you take this line s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale you can see that slice_size * q.shape[0] * k.shape[1] is the size of the tensor involved. q.shape[0] is always 16 in this repo. Since the actual batch_size is always 1, this first dimension doesn't change. q.shape[1] depends on the product of the image width and height. For 1024x1024, this product is 16384. One can use that to calculate proportionally, e.g. 512x512 is 8192. So, that must not exceed 2**31. Let's take 10699 for 896x896 as an example. k.shape[1] is 12544. q.shape[0] is 16. 16*12544*10699 is 2147332096, while 16*12544*10700 is 2147532800. The latter is just a bit above 2**31. There's some math that can be done based on that to calculate the number in advance. I'm trying to do that and the fix for the combination of my last optimization with Doggettx's in one fell swoop.

Yea that makes sense, but If I understand it correctly, does that mean the maximum tensor size on the M1 is 2gb? You'd still run into issues at higher resolutions after the CrossAttention if that's the case.

Does that mean it would be useful to apply a similar partitioning in other parts of the code? At least for the M1?

Ah no, now that I think about it, I think the error is that the array size can't go over max int, so the maximum size should be 2G elements * element_size, so 4GB at half precision and 8GB at full. Not quite sure how it's handled with multi-dimensional arrays.
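i.e., back-of-the-envelope:

# 2**31 elements at different precisions
max_elems = 2 ** 31
print(max_elems * 2 / 1024 ** 3)  # float16: 4.0 GiB
print(max_elems * 4 / 1024 ** 3)  # float32: 8.0 GiB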

Doggettx commented 2 years ago

Hmmm, the version in the pre-release is behaving differently to your original code @Doggettx , slice_size=1 is now really slow ?

Hmm, that should always be incredibly slow; at 1 you're going to iterate over that loop for every element in the second dimension. That's why it's easier to think in terms of steps, since that's the same no matter what resolution you use: it's just the number of iterations you need to complete, while slice_size depends on how big the array is.
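In other words (toy numbers, assuming q.shape[1] == 4096):

n = 4096                      # q.shape[1]
steps = 1
slice_size = n // steps       # 4096 -> the loop runs once over one big slice
print(n // slice_size)        # 1 iteration
print(n // 1)                 # forcing slice_size = 1 -> 4096 tiny einsum calls, very slow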

Are you sure you didn't use steps=1?

Vargol commented 2 years ago

Current code, forcing slice_size = 1, "banana sushi" -s10 -W896 -H512 -C7.5 -Ak_lms: 578.68s/it for the first sample

        slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
        slice_size = 1
        for i in range(0, q.shape[1], slice_size):
            end = i + slice_size
            s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale

            s2 = s1.softmax(dim=-1)
            del s1

            r1[:, i:end] = einsum('b i j, b j d -> b i d', s2, v)
            del s2

The code I took from some other comment or fork, other than the forced steps and a change of slice_size to steps in the code. slice_size(steps) = 1,
"banana sushi" -s10 -W896 -H512 -C7.5 -Ak_lms: 13.22s/it

        steps=1

        for i in range(0, q.shape[0], steps):
            end = i + steps
            s1 = einsum('b i d, b j d -> b i j', q[i:end], k[i:end])
            s1 *= self.scale

            s2 = s1.softmax(dim=-1)
            del s1

            r1[i:end] = einsum('b i j, b j d -> b i d', s2, v[i:end])
            del s2

EDIT: took out the speed comparison as I was comparing apples to oranges, well, 512x512 to -W896 -H512. EDIT 2: and one run later the comparison is back. EDIT 3: made it clearer where I'd forced the slice size to 1, as I'd deleted the line because I was testing max slice_size without noise.

lstein commented 2 years ago

Everyone, I really appreciate your work to resolve this problem. For now I think you should be working against the development branch, but if this turns out to be one of those problems in which a fix here breaks something there, then I'll be happy to make a new branch to make more aggressive changes against.

Another thing to bear in mind is that we have a very simple recent change #486 that has reduced model loading requirements to the point where I think we could load the model on a 4 GB card. The current image generation optimizations on my system use 3.90 G, which is too close to the max memory for my liking. The ancestral two-line @neonsecret optimization, on the other hand, only needs 3.60 G for a 512x512 image (on windows or linux), and even though its memory requirements go up quickly for larger images, it might be worth the hit if it lets 4 GB card owners run the model.