invoke-ai / InvokeAI

InvokeAI is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, supports terminal use through a CLI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

MPS support for doggettx-optimizations #431

Closed Any-Winter-4079 closed 2 years ago

Any-Winter-4079 commented 2 years ago

Okay, so I've seen @lstein has added x = x.contiguous() if x.device.type == 'mps' else x to ldm/modules/attention.py in the doggettx-optimizations branch, but there's another error happening now, KeyError: 'active_bytes.all.current', and it has to do with this function in attention.py:

def forward(self, x, context=None, mask=None):
        h = self.heads

        q_in = self.to_q(x)
        context = default(context, x)
        k_in = self.to_k(context)
        v_in = self.to_v(context)
        del context, x

        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q_in, k_in, v_in))
        del q_in, k_in, v_in

        r1 = torch.zeros(q.shape[0], q.shape[1], v.shape[2], device=q.device)

        stats = torch.cuda.memory_stats(q.device)
        mem_active = stats['active_bytes.all.current']
        mem_reserved = stats['reserved_bytes.all.current']
        mem_free_cuda, _ = torch.cuda.mem_get_info(torch.cuda.current_device())
        mem_free_torch = mem_reserved - mem_active
        mem_free_total = mem_free_cuda + mem_free_torch

        gb = 1024 ** 3
        tensor_size = q.shape[0] * q.shape[1] * k.shape[1] * 4
        mem_required = tensor_size * 2.5
        steps = 1

        if mem_required > mem_free_total:
            steps = 2**(math.ceil(math.log(mem_required / mem_free_total, 2)))
            # print(f"Expected tensor size:{tensor_size/gb:0.1f}GB, cuda free:{mem_free_cuda/gb:0.1f}GB "
            #       f"torch free:{mem_free_torch/gb:0.1f} total:{mem_free_total/gb:0.1f} steps:{steps}")

        if steps > 64:
            max_res = math.floor(math.sqrt(math.sqrt(mem_free_total / 2.5)) / 8) * 64
            raise RuntimeError(f'Not enough memory, use lower resolution (max approx. {max_res}x{max_res}). '
                               f'Need: {mem_required/64/gb:0.1f}GB free, Have:{mem_free_total/gb:0.1f}GB free')

        slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
        for i in range(0, q.shape[1], slice_size):
            end = i + slice_size
            s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale

            s2 = s1.softmax(dim=-1)
            del s1

            r1[:, i:end] = einsum('b i j, b j d -> b i d', s2, v)
            del s2

        del q, k, v

        r2 = rearrange(r1, '(b h) n d -> b n (h d)', h=h)
        del r1

        return self.to_out(r2)

Which is basically the code that detects your free memory and then splits the softmax operation into steps, to allow generating larger images.

Now, because we are on Mac, I'm not sure @lstein can help us much (unless he has one around), but I'm opening this issue for anyone who wants to collaborate on porting this functionality to M1.
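A minimal sketch of how the CUDA-only memory query could be guarded, falling back to psutil for free memory on MPS (the helper name and the psutil fallback are assumptions, not what ended up in the branch):

    import psutil
    import torch

    def estimate_free_memory(device: torch.device) -> int:
        """Rough free-memory estimate: CUDA statistics where available,
        otherwise system RAM reported by psutil (unified memory on Apple Silicon)."""
        if device.type == 'cuda':
            stats = torch.cuda.memory_stats(device)
            mem_active = stats['active_bytes.all.current']
            mem_reserved = stats['reserved_bytes.all.current']
            mem_free_cuda, _ = torch.cuda.mem_get_info(torch.cuda.current_device())
            return mem_free_cuda + (mem_reserved - mem_active)
        # MPS (and CPU) fall back to whatever the OS says is available.
        return psutil.virtual_memory().available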

Doggettx commented 2 years ago

@Any-Winter-4079 Another small thing I saw in your code is that you're always setting slice_size to the maximum possible size, which means it might actually be too high, causing it to run out of memory in the loop. It's probably better to do something like:

slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
slice_size = min(slice_size, math.floor(2**30 / (q.shape[0] * q.shape[1])))

so that it only picks the cut off once it goes over the limit

Any-Winter-4079 commented 2 years ago

@Any-Winter-4079 Another small thing I saw in your code is that you're always setting slice_size to the maximum possible size, which means it might actually be too high, causing it to run out of memory in the loop. It's probably better to do something like:

slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
slice_size = min(slice_size, math.floor(2**30 / (q.shape[0] * q.shape[1])))

so that it only picks the cut off once it goes over the limit

Since we don't use steps, it would be slice_size = min(q.shape[1], math.floor(2**30 / (q.shape[0] * q.shape[1]))) I guess. But I don't think it affects it at all. It runs without problems even when it's too large. I'll check again now with cooling, though.

Doggettx commented 2 years ago

@Any-Winter-4079 Another small thing I saw in your code is that you're always setting slice_size to the maximum possible size, which means it might actually be too high, causing it to run out of memory in the loop. It's probably better to do something like:

slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
slice_size = min(slice_size, math.floor(2**30 / (q.shape[0] * q.shape[1])))

so that it only picks the cut off once it goes over the limit

Since we don't use steps, it would be slice_size = min(q.shape[1], math.floor(2**30 / (q.shape[0] * q.shape[1]))) I guess. But I don't think it affects it at all. It runs without problems even when it's too large. I'll check again now with cooling, though.

No, that wouldn't work; that just limits it to the array size, which would have the same problem as just leaving the min out (and that's already caught by doing the min on end as well).

There's no way to get the actual free memory so you can calculate steps?

Any-Winter-4079 commented 2 years ago

We can get the memory with mem_free_total = psutil.virtual_memory().available, but it doesn't work well when it's used to calculate the steps.
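For reference, plugging that into the step calculation from the CUDA path would look roughly like this (the shapes are illustrative for a 512x512 image with 8 heads and batch size 1, not taken from a real run):

    import math
    import psutil

    mem_free_total = psutil.virtual_memory().available  # system RAM, not GPU-specific

    # Illustrative attention shapes: batch*heads = 8, tokens n = 64*64 = 4096.
    bh, n = 8, 64 * 64
    tensor_size = bh * n * n * 4      # bytes for the float32 attention matrix
    mem_required = tensor_size * 2.5  # same safety factor as the original code

    steps = 1
    if mem_required > mem_free_total:
        steps = 2 ** math.ceil(math.log(mem_required / mem_free_total, 2))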

Doggettx commented 2 years ago

Why not use that to calculate the steps then?

lstein commented 2 years ago

Do you have the logic in place to implement the second (or first) of the two options? I am apprehensive about having to code the conditional switch between the old code and the new code given how significantly they have diverged.

I do have a private release candidate branch which consists of the neonpixel changes plus more recent non-optimization related fixes and features. I'm leaning towards releasing this as 1.14 and getting the full doggettx-any-winter-mps optimizations into the next release (which should be a matter of days). This would also take the pressure off you. How do you feel about this?

Any-Winter-4079 commented 2 years ago

I mean, I've done a thousand different variations, so I don't remember, but I'm pretty sure it didn't work well. I'll try again.

Vargol commented 2 years ago

available seems to be unreliable in my case, as it doesn't account for memory that can be swapped.

Doggettx commented 2 years ago

You didn't cap off the slice_size back then yet, though. If you calculate the steps like in the original and then only cap off the slice size after that, it should work the same on M1 as on CUDA.

Any-Winter-4079 commented 2 years ago

You didn't cap off the slice_size back then yet, though. If you calculate the steps like in the original and then only cap off the slice size after that, it should work the same on M1 as on CUDA.

Damn it! I keep trying to quote and hit edit!

I was trying to say: Let me try :)

Doggettx commented 2 years ago

so basically just replace the original:

    slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
    for i in range(0, q.shape[1], slice_size):
        end = i + slice_size

with:

slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
if mps:
    slice_size = min(slice_size, math.floor(2**30 / (q.shape[0] * q.shape[1])))
for i in range(0, q.shape[1], slice_size):
    end = min(q.shape[1], i + slice_size)

then it works on both CUDA and MPS (eh, change that if statement to however you know it's on M1 ;)
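For the if mps: placeholder, one option is to reuse the device check that's already in the branch; a quick, untested sketch:

    # Reuse the query tensor's device to decide, mirroring the existing
    # x.device.type == 'mps' check in attention.py:
    mps = q.device.type == 'mps'
    # Alternative (torch >= 1.12): mps = torch.backends.mps.is_available()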

Doggettx commented 2 years ago

@Vargol

available seems to be unreliable in my case, as it doesn't account for memory that can be swapped.

Wouldn't that be better though, or else you're going to force it to swap constantly? You could always provide the usable memory via a config option or a startup parameter so the user can just define how much memory to use.

Vargol commented 2 years ago

attention.py.zip

Okay, this is my take on dealing with the variations on the different algorithms: replacing the functions on init with the best one for the current situation. At the moment that means it deals with the one closest to my heart, which is not slowing 512x512 down by 200% while still allowing 1024x1024 images to be generated on an 8GB M1, and I guess an 8GB M2 (I've not actually tried to push it further than 1024x1024).

I've updated the slice size calc for everyone else to the one from https://github.com/lstein/stable-diffusion/issues/431#issuecomment-1242950003

It does the for i in range(0, q.shape[1], slice_size): for everyone not on an 8GB MPS machine and for i in range(0, q.shape[0], slice_size): for those that are, whilst keeping all the mess out of the forward method.
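Roughly, the "replace the functions on init" idea looks like this (the function names and the 8GB threshold are a paraphrase of the description above, not the actual code in the zip):

    import psutil
    import torch

    def pick_attention_forward():
        """Choose a forward() implementation once, at init, so the per-call
        code stays free of device/memory special cases."""
        on_mps = torch.backends.mps.is_available()               # torch >= 1.12
        small_ram = psutil.virtual_memory().total <= 8 * 1024**3
        # 8GB MPS machines slice over q.shape[0]; everyone else over q.shape[1].
        return forward_slice_batch if (on_mps and small_ram) else forward_slice_tokens

    def forward_slice_tokens(self, x, context=None, mask=None):
        ...  # the q.shape[1]-sliced forward posted earlier in the thread

    def forward_slice_batch(self, x, context=None, mask=None):
        ...  # the q.shape[0]-sliced forward for low-memory MPS machines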

Vargol commented 2 years ago

Without swapping you're not going to get much better than 383x384 on an 8GB Mac. But it's not just SD: if I've left VSC open, for example, that's a whole chunk of memory that isn't available but can get swapped out.

Any-Winter-4079 commented 2 years ago

@Doggettx it seems to work (performance seems to be similar on 1024x1024). I mean, it's probably going to be a tad slower, since we're doing a few extra operations, but I'd be happy if it performs well for @Vargol, even with a 1% performance drop, so we can use it from 8 to 128GB. Here are the 2 files, in case Vargol or someone with less than 64GB wants to test: attention_doggettx.py.zip model_doggettx.py.zip

There's also mem_free_total = psutil.virtual_memory().free instead of mem_free_total = psutil.virtual_memory().available. Not sure if that would help you with the swap issue.

Also, @Vargol your other solution of different for loops (q.shape[0] vs q.shape[1]) sounds good too.

Doggettx commented 2 years ago

Also, @Vargol your other solution of different for loops (q.shape[0] vs q.shape[1]) sounds good too.

That's not really needed though; it's the same as just setting steps=16, instead of basing it on free memory, that is.

Vargol commented 2 years ago

@Any-Winter-4079 Okay, I'll give it a test. I was just running a few big images for the lols on 'my' version: 1280x1280 is too much, but it did run 1152x1152 at 90 s/it for a couple of samples before I killed it :-)

Doggettx commented 2 years ago

I still think it might be best to let the user supply the free memory, if there's no reliable way to get it automatically.

so if you want it to allocate 8gb max, just set mem_free_total = 8*1024**3 and make it configurable somehow
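A minimal sketch of "make it configurable somehow", using a hypothetical environment variable (the variable name and the psutil fallback are assumptions):

    import os
    import psutil

    def get_memory_budget() -> int:
        """Memory budget in bytes for the attention slicing heuristic.
        SD_MAX_MEM_GB is a made-up override, e.g. SD_MAX_MEM_GB=8 for 8GB;
        otherwise fall back to whatever psutil reports as available."""
        override = os.environ.get('SD_MAX_MEM_GB')
        if override:
            return int(float(override) * 1024**3)
        return psutil.virtual_memory().available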

Any-Winter-4079 commented 2 years ago

I still think it might be best to let the user supply the free memory, if there's no reliable way to get it automatically.

so if you want it to allocate 8gb max, just set mem_free_total = 8*1024**3 and make it configurable somehow

We can get our RAM with psutil.virtual_memory().total, but then how much of it is free depends on the user.

Vargol commented 2 years ago

Very initial test: it runs 512x512 but at 14.4 s/it compared to a best case of 5 s/it.

1024x1024 fails:

      File "/Volumes/Sabrent Media/Documents/Source/Python/lstein_1_4/stable-diffusion/ldm/modules/attention.py", line 208, in forward
        s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale
      File "/Volumes/Sabrent Media/Documents/Source/Python/lstein_1_4/lib/python3.10/site-packages/torch/functional.py", line 360, in einsum
        return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
    RuntimeError: Invalid buffer size: 4.00 GB

Any-Winter-4079 commented 2 years ago

So your best result is with this, if I remember correctly.

Screenshot 2022-09-11 at 15 18 42

And this works worse

Screenshot 2022-09-11 at 15 31 56

What I see is that the einsum is different.

For example, k[i:end] in the first is k[i:i+1] vs k in the second. I'm not sure but it may have an impact (?)

Vargol commented 2 years ago

@Doggettx if this is what you mean by the other code being equivalent to steps=16:

        steps = 16
        slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
        slice_size = min(slice_size, math.floor(2**30 / (q.shape[0] * q.shape[1])))

That ran at 8.76 s/it, still slower than my last run of the other code, which was 5.4 s/it, same command line.

Vargol commented 2 years ago

@Any-Winter-4079

Oversimplifying it, I think it's kind of like the loops are different, as one has einsum doing things left to right compared to the other doing it top to bottom. So the way the tensors are indexed for the different bits has to be changed.

Like I've said before, I got the code from another comment somewhere, possibly in another fork; it's just stuck there in my repo.

Any-Winter-4079 commented 2 years ago

@Vargol I'm trying your version of attention.py

Doggettx commented 2 years ago

@Any-Winter-4079

Oversimplifying it, I think it's kind of like the loops are different, as one has einsum doing things left to right compared to the other doing it top to bottom. So the way the tensors are indexed for the different bits has to be changed.

Like I've said before, I got the code from another comment somewhere, possibly in another fork; it's just stuck there in my repo.

The other code is from me as well ;) It was an earlier version. Yeah, it has to do slightly more slicing on the second dimension, but the difference for the work it has to do there compared to the work it has to do for the einsum should be negligible.

The only other difference is that this part, slice_size = min(slice_size, math.floor(2**30 / (q.shape[0] * q.shape[1]))), is added, which might make it loop more for you. So without that line it should be the same.

Which suddenly made me wonder: why are you using 2**30 instead of 2**31, @Any-Winter-4079?

Vargol commented 2 years ago

I suppose I ought to try 'my' attention.py with the new model.py

ryudrigo commented 2 years ago

@Any-Winter-4079 Oversimplifying it, I think it's kind of like the loops are different, as one has einsum doing things left to right compared to the other doing it top to bottom. So the way the tensors are indexed for the different bits has to be changed. Like I've said before, I got the code from another comment somewhere, possibly in another fork; it's just stuck there in my repo.

The other code is from me as well ;) It was an earlier version. Yeah, it has to do slightly more slicing on the second dimension, but the difference for the work it has to do there compared to the work it has to do for the einsum should be negligible.

The only other difference is that this part, slice_size = min(slice_size, math.floor(2**30 / (q.shape[0] * q.shape[1]))), is added, which might make it loop more for you. So without that line it should be the same.

Which suddenly made me wonder: why are you using 2**30 instead of 2**31, @Any-Winter-4079?

Earlier comment:

The explanation is this: q.shape[0] * q.shape[1] * slice_size must not equal or exceed 2**31. If that is met, it will run. If you take half of that slice_size (with ~5% allowance), it will not generate noise. Thus, 2**30.

So it's basically to avoid the noise bug
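As a concrete illustration of that cap (the shapes below are assumptions for a 1024x1024 image with 8 heads and batch size 1, not Vargol's 8GB numbers):

    import math

    bh, n = 8, 128 * 128            # batch*heads and token count for a 1024x1024 image
    # The buffer error shows up once bh * n * slice_size reaches 2**31 elements;
    # halving that budget (with some allowance) is what avoids the noise bug.
    max_slice = math.floor(2**30 / (bh * n))
    print(max_slice)                        # 8192 rows of the attention matrix per einsum
    assert bh * n * max_slice <= 2**30      # stays within the halved budget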

Doggettx commented 2 years ago

Ah, but the noise should be fixed by just using the correct end? Should probably try it with 2**31 again.

Vargol commented 2 years ago

I did a quick test with the min(...) on end and it didn't seem to make a difference to where the noise output started; it was a population of 1, stats-wise.

Doggettx commented 2 years ago

Just checking, but you're sure that was with the min on end? Cause earlier I saw a version with the min on the slice_size which doesn't help for that

Vargol commented 2 years ago

https://github.com/lstein/stable-diffusion/issues/431#issuecomment-1242771933 to quote...

If anyone's still interested, the image went to noise at slice_size = 4042 for me; 4041 was an image.

I changed the line of the loop to end = min(q.shape[1], i + slice_size) and 4042 was still noise.

Any-Winter-4079 commented 2 years ago

From my experience, at least on Mac, not using min doesn't seem to cause performance problems (maybe hardware stress at most, but I think when we don't use min and the index takes a value larger than the end of the array, it just cuts at the end of the array). This seems to happen for both end and slice_size.
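For what it's worth, a quick check of that clamping behaviour in PyTorch (independent of either file):

    import torch

    t = torch.arange(10)
    # Slicing past the end doesn't error; it just clamps to the tensor boundary,
    # which is why omitting the min() on `end` still yields valid (shorter) slices.
    print(t[7:50])        # tensor([7, 8, 9])
    print(t[7:50].shape)  # torch.Size([3])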

Also, here are some results with @Vargol 's version

Screenshot 2022-09-11 at 16 15 15

It's pretty reasonable so far. I thought there'd be a performance drop using the other for loop (with q.shape[0]) given that the operations inside are a bit different, but it doesn't look like that is the case.

Vrk3ds commented 2 years ago

Not sure if this helps or not, as I am completely ignorant of Python and scripting/coding, but my experience using the latest files from doggettx on an MBA M2 with 16GB is that 512x512 works (1.72/it, took 18 seconds), 768x768 works but slowly (77/it, 404 seconds to render), and when trying to render 1024x1024 it seems to be using way too much swap and never finished. See the screen cap. Let me know if you need any other info from my system to help diagnose this.

image
Any-Winter-4079 commented 2 years ago

Yes, it consumes about the same RAM for me (64GB total, though). Can you try @Vargol's code? https://github.com/lstein/stable-diffusion/issues/431#issuecomment-1242959721 attention.py @Vargol Also, don't you need to update model.py too?

Vrk3ds commented 2 years ago

Sure! Which files are his code?

Doggettx commented 2 years ago

I'd like to point out, I haven't posted any files ;)

Vargol commented 2 years ago

Okay, my attention.py and the last version of model.py posted both seem to interact well enough: 512x512 in the 5-6 s/it ballpark, 1024x1024 in the 46 s/it region, maybe a little slower, but that size does vary a bit more (the second run just finished at 42 s/it).

Vrk3ds commented 2 years ago

these are the files I tested with

Here are the 2 files, in case Vargol or someone with less than 64GB wants to test. attention_doggettx.py.zip model_doggettx.py.zip

Vargol commented 2 years ago

@Any-Winter-4079 Yes, possibly; I haven't got around to that yet. Doing attention at least got the speed back to mostly acceptable (for me) :-)

Any-Winter-4079 commented 2 years ago

these are the files I tested with

Here are the 2 files, in case Vargol or someone with less than 64GB wants to test. attention_doggettx.py.zip model_doggettx.py.zip

@Vrk3ds Taking this code you already have, now use the code from @Vargol here: https://github.com/lstein/stable-diffusion/issues/431#issuecomment-1242959721

Let us know if that works better. If it works slowly, it may be because @Vargol's option for less than 12GB available kicks in, which basically does the calculation in steps of one.

If that were the case, you may need an intermediate solution, like the slice_size we use at 64GB but capped at a max (e.g. min(X, our_slice_size)).

ryudrigo commented 2 years ago

So, I realize there are a lot of variations to test, but I'd like to point out that there are still no results for the modifications I've made. It's hard to gauge things since I don't have a Mac. So I'd be really grateful if someone could run these two versions I'm posting now on a big image (whatever is big for your system) and post the console output and time (no need to wait for the end; I think just the estimated time is enough for me).

I can also explain what I'm trying to do, or do whatever else you think helps. If there's nothing and I can't test, then I should probably look into some other issue.

attention_2.zip attention_1.zip

Doggettx commented 2 years ago

If my math is correct, then the changes from @Vargol should be the same as using my original code (from CUDA, before any changes), removing the CUDA stuff, just setting mem_free_total = 4 * 1024**3, and leaving the rest as is.

Shame I don't have a Mac or I could test both, but it would simplify it for merging with CUDA.
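Concretely, that suggestion amounts to replacing only the memory query in the forward() posted at the top, roughly (untested):

    # Instead of the CUDA-only query ...
    #     stats = torch.cuda.memory_stats(q.device)
    #     mem_active = stats['active_bytes.all.current']
    #     mem_reserved = stats['reserved_bytes.all.current']
    #     mem_free_cuda, _ = torch.cuda.mem_get_info(torch.cuda.current_device())
    #     mem_free_total = mem_free_cuda + (mem_reserved - mem_active)
    # ... use a fixed budget on MPS:
    mem_free_total = 4 * 1024**3   # assume 4GB usable for the attention matrix
    # and leave the rest of the step/slice logic exactly as it was.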

Any-Winter-4079 commented 2 years ago

So, I realize there are a lot of variations to test, but I'd like to point out that there are still no results for the modifications I've made. It's hard to gauge things since I don't have a Mac. So I'd be really grateful if someone could run these two versions I'm posting now on a big image (whatever is big for your system) and post the console output and time (no need to wait for the end; I think just the estimated time is enough for me).

I can also explain what I'm trying to do, or do whatever else you think helps. If there's nothing and I can't test, then I should probably look into some other issue.

attention_2.zip attention_1.zip

I'll test them. There's just so much going on, with so many versions, results, and opinions.

Any-Winter-4079 commented 2 years ago

If my math is correct, then the changes from @Vargol should be the same as using my original code (from CUDA, before any changes), removing the CUDA stuff, just setting mem_free_total = 4 * 1024**3, and leaving the rest as is.

Shame I don't have a Mac or I could test both.

Oh, that's right. @Vargol's version needs your changes for CUDA. I'll try to combine them.

ryudrigo commented 2 years ago

Shame I don't have a Mac

I hear ya. =P

Vrk3ds commented 2 years ago

That code works much better for me.

image
Doggettx commented 2 years ago

If my math is correct, then the changes from @Vargol should be the same as using my original code (from CUDA, before any changes), removing the CUDA stuff, just setting mem_free_total = 4 * 1024**3, and leaving the rest as is. Shame I don't have a Mac or I could test both.

Oh, that's right. @Vargol's version needs your changes for CUDA. I'll try to combine them.

Any chance you could try what I said? Just using the original and locking mem to 4GB? It seems to be effectively the same result when I calculate it.

So also no slice adjusting or limiting or anything.

Any-Winter-4079 commented 2 years ago

If my math is correct, then the changes from @Vargol should be the same as using my original code (from CUDA, before any changes), removing the CUDA stuff, just setting mem_free_total = 4 * 1024**3, and leaving the rest as is. Shame I don't have a Mac or I could test both.

Oh, that's right. @Vargol's version needs your changes for CUDA. I'll try to combine them.

Any chance you could try what I said? Just using the original and locking mem to 4GB? It seems to be effectively the same result when I calculate it.

So also no slice adjusting or limiting or anything.

It might not be the same, because @Vargol is using a different for loop, which seems to do the einsum operation a bit differently and seems to work much better for his machine, but I'll post what I understand you mean, so you can have a look at whether it is correct.

ryudrigo commented 2 years ago

While tests are running, @Doggettx what do you think of looking into other parts of attention? Is it worth it?

ryudrigo commented 2 years ago

For instance, I can run 2048x2048 (which is not practical, but just as an example), but then it gets cut off in other parts of the script.