invoke-ai / InvokeAI

InvokeAI is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, supports terminal use through a CLI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

MPS support for doggettx-optimizations #431

Closed Any-Winter-4079 closed 2 years ago

Any-Winter-4079 commented 2 years ago

Okay, so I've seen @lstein has added x = x.contiguous() if x.device.type == 'mps' else x to ldm/modules/attention.py in the doggettx-optimizations branch, but there's another error happening now, KeyError: 'active_bytes.all.current', and it has to do with this function in attention.py:

def forward(self, x, context=None, mask=None):
        h = self.heads

        q_in = self.to_q(x)
        context = default(context, x)
        k_in = self.to_k(context)
        v_in = self.to_v(context)
        del context, x

        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q_in, k_in, v_in))
        del q_in, k_in, v_in

        r1 = torch.zeros(q.shape[0], q.shape[1], v.shape[2], device=q.device)

        stats = torch.cuda.memory_stats(q.device)
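        # On MPS this CUDA-only query has no stats to report, which is why the lookup
        # on the next line raises the KeyError: 'active_bytes.all.current' reported above.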
        mem_active = stats['active_bytes.all.current']
        mem_reserved = stats['reserved_bytes.all.current']
        mem_free_cuda, _ = torch.cuda.mem_get_info(torch.cuda.current_device())
        mem_free_torch = mem_reserved - mem_active
        mem_free_total = mem_free_cuda + mem_free_torch

        gb = 1024 ** 3
        tensor_size = q.shape[0] * q.shape[1] * k.shape[1] * 4
        mem_required = tensor_size * 2.5
        steps = 1

        if mem_required > mem_free_total:
            steps = 2**(math.ceil(math.log(mem_required / mem_free_total, 2)))
            # print(f"Expected tensor size:{tensor_size/gb:0.1f}GB, cuda free:{mem_free_cuda/gb:0.1f}GB "
            #       f"torch free:{mem_free_torch/gb:0.1f} total:{mem_free_total/gb:0.1f} steps:{steps}")

        if steps > 64:
            max_res = math.floor(math.sqrt(math.sqrt(mem_free_total / 2.5)) / 8) * 64
            raise RuntimeError(f'Not enough memory, use lower resolution (max approx. {max_res}x{max_res}). '
                               f'Need: {mem_required/64/gb:0.1f}GB free, Have:{mem_free_total/gb:0.1f}GB free')

        slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
        for i in range(0, q.shape[1], slice_size):
            end = i + slice_size
            s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale

            s2 = s1.softmax(dim=-1)
            del s1

            r1[:, i:end] = einsum('b i j, b j d -> b i d', s2, v)
            del s2

        del q, k, v

        r2 = rearrange(r1, '(b h) n d -> b n (h d)', h=h)
        del r1

        return self.to_out(r2)

This is basically the code that detects your free memory and then splits the softmax operation into steps, allowing larger images to be generated.

Now, because we are on Mac, I'm not sure @lstein can help us much (unless he has one around), but I'm opening this issue for anyone who wants to collaborate on porting this functionality to M1.

Any-Winter-4079 commented 2 years ago

For 4GB cards, I've read there's a float16 version of model.ckpt?

Screenshot 2022-09-10 at 16 36 22
neonsecret commented 2 years ago

Yeah, if you noticed, I pushed more updates to my fork; they allow me to generate 1792x1792 on 8 GB VRAM and 1024x1024 (maybe even more) on 4 GB VRAM. The optimization covers not just one file but the repo as a whole, and it is enabled in low-VRAM mode via the low-VRAM config.

i3oc9i commented 2 years ago

@Any-Winter-4079 I have run another test with the new files: model 3.py.zip attention 4.py.zip

slice_size 4369

"banana sushi" -s10 -W1024 -H1024 -C7.5 -Ak_lms -n10
10 image(s) generated in 376.80s

So this is an improvement in speed with respect to the previous version; see my previous comment.

Any-Winter-4079 commented 2 years ago

@i3oc9i There were a few calculations with steps, mem_required, mem_free_total and slice_size, which I just commented out, since for now we're hard-coding slice_size anyway.

A nice 8% improvement :)

Any-Winter-4079 commented 2 years ago

@Vargol Then, in this version, which is the fastest for you:

        steps=1

        for i in range(0, q.shape[0], steps):
            end = i + steps
            s1 = einsum('b i d, b j d -> b i j', q[i:end], k[i:end])
            s1 *= self.scale

            s2 = s1.softmax(dim=-1)
            del s1

            r1[i:end] = einsum('b i j, b j d -> b i d', s2, v[i:end])
            del s2

I understand you do for i in range(0, q.shape[0], steps), which is for i in range(0, 16, 1), so i is going to be 0, 1, 2, ..., 15.

And then end = i + steps, which is end = i + 1. Then you do s1 = einsum('b i d, b j d -> b i j', q[i:i+1], k[i:i+1]).


Then, there's this version, which is much slower for you

        slice_size = 1
        for i in range(0, q.shape[1], slice_size):
            end = i + slice_size
            s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale

            s2 = s1.softmax(dim=-1)
            del s1

            r1[:, i:end] = einsum('b i j, b j d -> b i d', s2, v)
            del s2

So for i in range(0, q.shape[1], slice_size) is for i in range(0, 16384, 1), so i is going to be 0, 1, 2, ..., 16383. Then end = i + slice_size is end = i + 1, and finally you do s1 = einsum('b i d, b j d -> b i j', q[:, i:i+1], k) * self.scale.


Does anyone know why one is so much faster? I see the discrepancy in loop iterations, but I'm not very familiar with einsum. Maybe @Doggettx or @ryudrigo, who have implemented optimisations, know?

ryudrigo commented 2 years ago

The top one splits the arrays along the first (0) (batch) dimension, so you get fewer total splits (only 16 operations in total); the bottom one splits the arrays along the second (1) dimension, so you get more total splits (16384 operations in total).
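
To illustrate, here is a minimal, self-contained sketch (toy shapes standing in for the (16, 16384, d) tensors above, with the scale factor omitted) showing that both slicing strategies compute the same attention result with very different iteration counts:

    import torch
    from torch import einsum

    # Toy shapes: (batch*heads, tokens, dim_head); the thread's case is 16 x 16384
    q, k, v = (torch.randn(4, 32, 8) for _ in range(3))
    r_a = torch.zeros(q.shape[0], q.shape[1], v.shape[2])
    r_b = torch.zeros_like(r_a)

    # Version 1: slice along dim 0 (batch) -> q.shape[0] iterations (16 in the thread)
    for i in range(0, q.shape[0], 1):
        s = einsum('b i d, b j d -> b i j', q[i:i+1], k[i:i+1]).softmax(dim=-1)
        r_a[i:i+1] = einsum('b i j, b j d -> b i d', s, v[i:i+1])

    # Version 2: slice along dim 1 (tokens) -> q.shape[1] iterations (16384 in the thread)
    for i in range(0, q.shape[1], 1):
        s = einsum('b i d, b j d -> b i j', q[:, i:i+1], k).softmax(dim=-1)
        r_b[:, i:i+1] = einsum('b i j, b j d -> b i d', s, v)

    print(torch.allclose(r_a, r_b, atol=1e-5))  # True: same maths, 4 vs 32 loop iterations here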

What I was trying to do in that pull request is to have the option to split along either of those two dimensions.

I will have to be away for some hours, though; I hope I have clarified things enough.

Doggettx commented 2 years ago

Does anyone know why one is so much faster? I see the discrepancy in loop iterations, but I'm not very familiar with einsum. Maybe @Doggettx or @ryudrigo, who have implemented optimisations, know?

It's much slower because in that case you are doing 16384 iterations with a slice of 1, while with the other one you only do 16 iterations. That's why you shouldn't set slice_size directly but only change the steps.

In other words, slice_size=1 on the first version is the same as slice_size=1024 on the second version (considering a size of 16384)

Again, it's also very important not to use the code as is, but to use the changes I proposed earlier if you're going to change the slice_size without using a steps value with a factor of 2, or you will cause corruption in the array.

I wouldn't adjust the slice_size at all, though; instead, just control it with steps (a factor of 2), since then the resulting blocks are much easier for torch to handle and come out in the correct sizes.
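
As a reference, here is a rough sketch of a loop written that way (the sliced_attention helper is illustrative, not the upstream code; the steps calculation mirrors the snippet at the top of this issue, and end = min(...) is the guard mentioned above):

    import math
    import torch
    from torch import einsum

    def sliced_attention(q, k, v, scale, mem_required, mem_free_total):
        # Illustrative sketch: keep steps a power of two so slice_size divides
        # q.shape[1] evenly, and clamp `end` so an uneven slice_size can never
        # index past the end of the tensor (the corruption mentioned above).
        steps = 1
        if mem_required > mem_free_total:
            steps = 2 ** math.ceil(math.log2(mem_required / mem_free_total))

        slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
        r1 = torch.zeros(q.shape[0], q.shape[1], v.shape[2], device=q.device)
        for i in range(0, q.shape[1], slice_size):
            end = min(q.shape[1], i + slice_size)  # the end = min(...) guard
            s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * scale
            r1[:, i:end] = einsum('b i j, b j d -> b i d', s1.softmax(dim=-1), v)
            del s1
        return r1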

Any-Winter-4079 commented 2 years ago

@Doggettx As a curiosity: with slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1], since steps are powers of 2, why not just slice_size = q.shape[1] // steps? Can (q.shape[1] % steps) == 0 ever be false? Well, maybe if q.shape[1] is less than steps; I'm not sure that can happen. Also, I'm not sure whether q.shape[1] could be an odd number (that would also make it false).


About setting the steps and not the slice_size: for example, for a 1024x1024 image we need to make sure slice_size <= 4369 (else: Mac error), so given the max q.shape[1] = 16384, I guess we'd need 4 steps. slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1] evaluates to slice_size = 16384 // 4 if (16384 % 4) == 0 else 16384, which is slice_size = 4096.

However, it seems @Vargol reports it slows down this way (?). Could you test with steps = 4?

steps = 4
slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
for i in range(0, q.shape[1], slice_size):
    end = i + slice_size
    s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale

    s2 = s1.softmax(dim=-1)
    del s1

    r1[:, i:end] = einsum('b i j, b j d -> b i d', s2, v)
    del s2

del q, k, v

"banana sushi" -s1 -C7.5 -Ak_lms -S2792018001 -W 1024 -H 1024

netsvetaev commented 2 years ago

MBP M1 Pro, 16gb, OS 13.0

Test 2: Doggettx-optimizations branch with M1 changes. No noticeable changes, around 5.5 s/it.

Test 3: Doggettx-optimizations branch with M1 changes, setting a fixed slice_size. No noticeable changes, around 5.5 s/it.

8191 is slightly faster than 1677, but there's almost no difference between any of them. I don't understand how I need to calculate this correctly, though. If I can do 768x768 on the RC1.14 branch without major issues and without it, do I need it?

neonsecret-optimizations: very fast with 512x512, 1.58 s/it.

Error with 768, like before: The Metal Performance Shaders operations encoded on it may not have completed. Error: (null) Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) <AGXG13XFamilyCommandBuffer: 0x29e374750> label = device = <AGXG13XDevice: 0x15bbca600> name = Apple M1 Pro commandQueue = <AGXG13XFamilyCommandQueue: 0x126d2c000> label = device = <AGXG13XDevice: 0x15bbca600> name = Apple M1 Pro retainedReferences = 1

Any-Winter-4079 commented 2 years ago

I think I have a formula, thanks to all of the suggestions from @Doggettx and @ryudrigo . Let's see if it works...

ryudrigo commented 2 years ago

@netsvetaev or whoever is feeling kind, could you run tests with attention.py from #432? Especially if you have a Mac. attention.zip

Vargol commented 2 years ago

If anyone's still interested: the image went to noise at slice_size = 4042 for me; 4041 still produced an image.

I changed the last line of the loop to... end = min(q.shape[1], i + slice_size) and 4042 was still noise.

I then tried a slice size of 1024, and that seems to be the closest to the speed of the 'fast for me' version of the loop, at 17.00 s/it; I might play around and see if I can get that down.

Any-Winter-4079 commented 2 years ago

What size was the image with slice_size=4041 @Vargol ? 1024x1024?

Vargol commented 2 years ago

-W896 -H512

Vargol commented 2 years ago

Second and subsequent runs seem to be slower, ~20% for me; I think dreams.py is holding on to some memory somewhere. A second run of the same command in the same session was 20 s/it. Something to be aware of.

Any-Winter-4079 commented 2 years ago

Wait, I didn't know Collaborators could edit comments from any other person? Anyway, I meant to quote you :)

Second and subsequent runs seem to be slower, ~20% for me; I think dreams.py is holding on to some memory somewhere. A second run of the same command in the same session was 20 s/it. Something to be aware of.

I experience the same.

Vargol commented 2 years ago

slice_size = 768 seems to be the sweet spot at ~15.5 s/it, still a little slower than the reigning champ. Gonna try a few more image sizes now.

Doggettx commented 2 years ago

@Doggettx As a curiosity: with slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1], since steps are powers of 2, why not just slice_size = q.shape[1] // steps? Can (q.shape[1] % steps) == 0 ever be false? Well, maybe if q.shape[1] is less than steps; I'm not sure that can happen. Also, I'm not sure whether q.shape[1] could be an odd number (that would also make it false).

Yes, that's why the check is there; the function also gets called with the tokens, which have a size of 77, and with lower-res versions, which are smaller.
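
A quick illustration with the numbers from the thread (just a demonstration of the fallback, not upstream code):

    # steps = 4 divides the 16384 tokens of a 1024x1024 image evenly, but not the
    # 77-token conditioning call, which therefore falls back to a single full slice.
    steps = 4
    for n in (16384, 77):
        slice_size = n // steps if (n % steps) == 0 else n
        print(n, slice_size)  # 16384 -> 4096, 77 -> 77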

Doggettx commented 2 years ago

    slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
    for i in range(0, q.shape[1], slice_size):
        end = i + slice_size
        s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale

        s2 = s1.softmax(dim=-1)
        del s1

        r1[:, i:end] = einsum('b i j, b j d -> b i d', s2, v)
        del s2

    del q, k, v

I see you still didn't add the end = min(q.shape[1], i + slice_size), though; that's going to cause corruption if you adjust the slice size with the % steps == 0 check.

P.S. If you need a lower slice_size, you can just multiply steps by 2.

Vargol commented 2 years ago

Okay, so at 1024x1024 I get RuntimeError: Not enough memory, use lower resolution (max approx. 960x960). Need: 0.6GB free, Have: 0.6GB free (nope, I've run 1024x1024 on this machine before with the other code).

At 512x512 it's a little slower: 7.25 compared to 5.7-6.1.

At 256x256 it's 1.03 s/it, which may be faster (it compares to 1.26), but that could be statistical noise; either way it's at least as fast.

Right, I'll try a couple of runs with @ryudrigo's PR attention.py before I call it a day.

Any-Winter-4079 commented 2 years ago

Maybe it's because of

if steps > 64:
    max_res = math.floor(math.sqrt(math.sqrt(mem_free_total / 2.5)) / 8) * 64
    raise RuntimeError(f'Not enough memory, use lower resolution (max approx. {max_res}x{max_res}). '
                       f'Need: {mem_required/64/gb:0.1f}GB free, Have:{mem_free_total/gb:0.1f}GB free')

I just commented out that part

netsvetaev commented 2 years ago

@netsvetaev , @ryudrigo

I think you've lost a contiguous statement as that error looks very familiar.

Sorry, my mistake, I was on another branch. But here is another error:

Traceback (most recent call last):
  File "/Users/artur/stable-diffusion/ldm/generate.py", line 317, in prompt2image
    results = generator.generate(
  File "/Users/artur/stable-diffusion/ldm/dream/generator/base.py", line 70, in generate
    image = make_image(x_T)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/dream/generator/txt2img.py", line 30, in make_image
    samples, _ = sampler.sample(
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/models/diffusion/ksampler.py", line 83, in sample
    K.sampling.__dict__[f'sample_{self.schedule}'](
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/artur/stable-diffusion/src/k-diffusion/k_diffusion/sampling.py", line 187, in sample_lms
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/models/diffusion/ksampler.py", line 16, in forward
    uncond, cond = self.inner_model(x_in, sigma_in, cond=cond_in).chunk(2)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/src/k-diffusion/k_diffusion/external.py", line 115, in forward
    eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
  File "/Users/artur/stable-diffusion/src/k-diffusion/k_diffusion/external.py", line 141, in get_eps
    return self.inner_model.apply_model(*args, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/models/diffusion/ddpm.py", line 1440, in apply_model
    x_recon = self.model(x_noisy, t, **cond)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/models/diffusion/ddpm.py", line 2148, in forward
    out = self.diffusion_model(x, t, context=cc)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 806, in forward
    h = module(h, emb, context)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 88, in forward
    x = layer(x, context)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 322, in forward
    x = block(x, context=context)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 273, in forward
    return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
  File "/Users/artur/stable-diffusion/ldm/modules/diffusionmodules/util.py", line 157, in checkpoint
    return func(*inputs)
  File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 276, in _forward
    x = self.attn1(self.norm1(x)) + x
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 190, in forward
    return F.layer_norm(
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/functional.py", line 2511, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

>> Could not generate image.
>> Usage stats:
>>   0 image(s) generated in 0.42s
>>   Max VRAM used for this generation: 0.00G
ryudrigo commented 2 years ago

Do you want to summarize the current problems, @Any-Winter-4079 ? I will likely get back to the code after the tests on my version of attention.py and would like to know what to address first.

Vargol commented 2 years ago

@netsvetaev , @ryudrigo

I think you've lost a contiguous statement as that error looks very familiar.

ryudrigo commented 2 years ago

Yeah, it's literally the first thing on this issue. I think I skipped that modification for Macs. Just a moment...

ryudrigo commented 2 years ago

OK, updated the PR, added x = x.contiguous() if x.device.type == 'mps' else x at line 276. Also here is the file attention.zip

netsvetaev commented 2 years ago

OK, updated the PR, added x = x.contiguous() if x.device.type == 'mps' else x at line 276. Also here is the file attention.zip

    mem_free = psutil.virtual_memory().available
NameError: name 'psutil' is not defined
Any-Winter-4079 commented 2 years ago

@ryudrigo 1) Well, the first problem is that some values of slice_size seem to flat out fail. For example: "banana sushi" -s10 -C7.5 -Ak_lms -S2792018001 -W 1024 -H 1024

slice_size = 2047 works.
slice_size = 2048 fails with Error: product of dimension sizes > 2**31.
slice_size = 2049 works.

I'm assuming there is some problem with certain powers of 2. However, slice_size = 1024 works, so maybe only powers of 2 above some value fail.

Then, we can see how (for the same 1024x1024 image):

slice_size = 8191 works.
slice_size = 8192 fails with Error: product of dimension sizes > 2**31.
slice_size = 8193 fails with Error: product of dimension sizes > 2**31.
slice_size = 8420 fails with Error: product of dimension sizes > 2**31.

So, unless we find a slice_size bigger than 8191 that works, I'm assuming that it fails when q.shape[0] * q.shape[1] * slice_size >= 2^31. With slice_size 8192, 16 * 16384 * 8192 == 2^31.

Up to this point, two hypotheses: (a) certain powers of 2 above some value fail on their own, and (b) anything where q.shape[0] * q.shape[1] * slice_size >= 2^31 fails.

2) With the rules above, we can generate images (i.e. not crash), but they may be pure noise. To get an actual image, we need to decrease the slice_size. By how much? It's not clear. slice_size = 4369 seems to be the highest it can go and still produce an actual image (not noise), but I don't see any logic behind that number. However, a good approximation may be half of the max value from the previous step. So if slice_size = 8191 was the max from the previous step, because 16 * 16384 * 8191 < 2^31 but 16 * 16384 * 8192 >= 2^31, then we halve that value: 8191 / 2 = 4095.5. To be safer, we can take 4095.

3) Now, the third problem is that these numbers work great on 64 GB and 128 GB M1s, but people with less RAM seem to have trouble. I guess some slice_size values are too much for their memory, but we haven't figured out a formula for that yet.


Note: Some nice math I've been able to use: if the max slice_size = 8191 for 1024x1024, then to generate a 3200x1600 image we can do (1024 * 1024) / (3200 * 1600) * 8191 = 1677 (the new slice_size to use). PS: I aborted the 3200x1600 process, so no guarantees whether it works or generates noise (taking 1677/2). But for smaller sizes, the formula seems to work reasonably well.

But this is for 64-128 GB. It would be awesome to find values for lower-memory Macs, to see if we can make some sense of it.
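
Put together, a small sketch of the two empirical rules above (the 2^31 threshold and the halving are hypotheses from this thread, not a documented MPS limit, and the helper name is just for illustration):

    import math

    def mps_noise_safe_slice_size(q_shape0, q_shape1):
        # Hypothesis 1: q.shape[0] * q.shape[1] * slice_size must stay below 2**31 to run
        max_that_runs = (2 ** 31 - 1) // (q_shape0 * q_shape1)
        # Hypothesis 2: halve it so the output is an actual image rather than noise
        return max(1, max_that_runs // 2)

    # 1024x1024 example from above: q.shape[0] = 16, max q.shape[1] = 16384
    print(mps_noise_safe_slice_size(16, 16384))  # 4095 (half of the 8191 that still runs)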

ryudrigo commented 2 years ago

OK, updated the PR, added x = x.contiguous() if x.device.type == 'mps' else x at line 276. Also here is the file attention.zip

    mem_free = psutil.virtual_memory().available
NameError: name 'psutil' is not defined

attention.zip

Well, that's what I get for not working on the same file (and not having a Mac =P). Added the import.

I should merge whatever I'm doing with Doggettx's at some point.

netsvetaev commented 2 years ago

Well, that's what I get for not working on the same file (and not having a Mac =P). Added the import.

Exception occurred during processing of request from ('192.168.0.10', 63261)
Traceback (most recent call last):
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/socketserver.py", line 683, in process_request_thread
    self.finish_request(request, client_address)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/socketserver.py", line 747, in __init__
    self.handle()
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/http/server.py", line 432, in handle
    self.handle_one_request()
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/http/server.py", line 420, in handle_one_request
    method()
  File "/Users/artur/stable-diffusion/ldm/dream/server.py", line 221, in do_POST
    self.model.prompt2image(**vars(opt), step_callback=image_progress, image_callback=image_done)
  File "/Users/artur/stable-diffusion/ldm/generate.py", line 317, in prompt2image
    results = generator.generate(
  File "/Users/artur/stable-diffusion/ldm/dream/generator/base.py", line 70, in generate
    image = make_image(x_T)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/dream/generator/txt2img.py", line 30, in make_image
    samples, _ = sampler.sample(
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/models/diffusion/ksampler.py", line 83, in sample
    K.sampling.__dict__[f'sample_{self.schedule}'](
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/artur/stable-diffusion/src/k-diffusion/k_diffusion/sampling.py", line 187, in sample_lms
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/models/diffusion/ksampler.py", line 16, in forward
    uncond, cond = self.inner_model(x_in, sigma_in, cond=cond_in).chunk(2)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/src/k-diffusion/k_diffusion/external.py", line 115, in forward
    eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
  File "/Users/artur/stable-diffusion/src/k-diffusion/k_diffusion/external.py", line 141, in get_eps
    return self.inner_model.apply_model(*args, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/models/diffusion/ddpm.py", line 1440, in apply_model
    x_recon = self.model(x_noisy, t, **cond)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/models/diffusion/ddpm.py", line 2148, in forward
    out = self.diffusion_model(x, t, context=cc)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 806, in forward
    h = module(h, emb, context)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 88, in forward
    x = layer(x, context)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 324, in forward
    x = block(x, context=context)
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 274, in forward
    return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
  File "/Users/artur/stable-diffusion/ldm/modules/diffusionmodules/util.py", line 157, in checkpoint
    return func(*inputs)
  File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 278, in _forward
    x = self.attn1(self.norm1(x)) + x
  File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 219, in forward
    self.compute_steps(q.device, outer_limit, inner_limit, k.shape[1])
  File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 200, in compute_steps
    self.inner_step = min (self.inner_step, (2<<31) // (self.outer_step * dim_softmax[1]))
TypeError: 'int' object is not subscriptable

:-(

ryudrigo commented 2 years ago

=/ Thanks for the patience, though attention.zip

Vargol commented 2 years ago

Yes, comment out that check and 1024x1024 runs fine, but slower by the looks of it: 60 s/it versus the normal 50-something. At 50 samples I can do images in 35-45 minutes, depending on the sampler, with the other code (on the spreadsheet run I averaged 46 minutes, but that included two 'longer' runs); obviously 60 s/it would suggest 50 minutes for this version if the speed held up.

Right, a couple of runs with @ryudrigo's code.

ryudrigo commented 2 years ago

Yes, comment out that check and 1024x1024 runs fine, but slower by the looks of it: 60 s/it versus the normal 50-something. At 50 samples I can do images in 35-45 minutes, depending on the sampler, with the other code (on the spreadsheet run I averaged 46 minutes, but that included two 'longer' runs); obviously 60 s/it would suggest 50 minutes for this version if the speed held up.

Right, a couple of runs with @ryudrigo's code.

Not sure if I get it; do you mean you're getting 60 s/it with my code?

netsvetaev commented 2 years ago

=/ Thanks for the patience, though attention.zip

It works on RC1.14 without out-of-memory errors, but slower than the original: ~50 s/it with 768, 5.5 with 512. The original was ~35 s/it.

Vargol commented 2 years ago

Yes, comment out that check and 1024x1024 runs fine, but slower by the looks of it: 60 s/it versus the normal 50-something. At 50 samples I can do images in 35-45 minutes, depending on the sampler, with the other code (on the spreadsheet run I averaged 46 minutes, but that included two 'longer' runs); obviously 60 s/it would suggest 50 minutes for this version if the speed held up. Right, a couple of runs with @ryudrigo's code.

Not sure if I get it; do you mean you're getting 60 s/it with my code?

Nope, not your code; that's next :0)

It was RC1.14 with a hack for a fixed slice size of 768, as that had been the fastest it ran at -W896 -H512.

Any-Winter-4079 commented 2 years ago

Here is a summary of errors: https://github.com/lstein/stable-diffusion/issues/431#issuecomment-1242786666. If you have experienced others, please comment with them.

Any-Winter-4079 commented 2 years ago

8191 is slightly faster than 1677, but there's almost no difference between any of them. I don't understand how I need to calculate this correctly, though. If I can do 768x768 on the RC1.14 branch without major issues and without it, do I need it?

@netsvetaev Was this for a 1024x1024 image? I mean, if I understand correctly, you ran "banana sushi" -s10 -C7.5 -Ak_lms -S2792018001 -W 1024 -H 1024 with slice_size = 8191 and it worked, right? And I assume it generated a noisy image. Question: if you set slice_size = 4369 and run the same command again, is the image no longer noisy?

All this is using the doggettx-optimizations branch, with these 2 files updated: https://github.com/lstein/stable-diffusion/issues/431#issuecomment-1242749937

Vargol commented 2 years ago

Okay this is all I've got time for... @ryudrigo your #432 attention.py

All tests using "banana sushi" -s10 -Wnnn -Hnnn -C7.5 -Ak_lms -F

256x256: 1.10 s/it
512x512: 10.06 s/it
1024x1024: 41.86 s/it

So your small and bigger image speeds are good, but the middle is a little slow; I really need to throw the spreadsheet commands at it so I can compare properly.

netsvetaev commented 2 years ago

8191 is slightly faster than 1677, but there's almost no difference between any of them. I don't understand how I need to calculate this correctly, though. If I can do 768x768 on the RC1.14 branch without major issues and without it, do I need it?

@netsvetaev Was this for 1024x1024 image? I mean, if I understand correctly, you did

"banana sushi" -s10 -C7.5 -Ak_lms -S2792018001 -W 1024 -H 1024 with slice_size = 8191 and it worked, right?

And I assume it generated a noisy image.

Question: if you set slice_size = 4369 and run the same command again, is the image no longer noisy?

All this is using doggettx-optimizations branch. With these 2 files updated. https://github.com/lstein/stable-diffusion/issues/431#issuecomment-1242749937

No, it was 768x768. I will test again later.

ryudrigo commented 2 years ago

Okay this is all I've got time for... @ryudrigo your #432 attention.py

All tests using "banana sushi" -s10 -Wnnn -Hnnn -C7.5 -Ak_lms -F: 256x256 1.10 s/it, 512x512 10.06 s/it, 1024x1024 41.86 s/it

So your small and bigger image speeds are good, but the middle is a little slow; I really need to throw the spreadsheet commands at it so I can compare properly.

Thank you. Be sure to use the latest attention.py from the PR (#432) when you come back, since it reflects the latest discussions. Once again, thanks!

Any-Winter-4079 commented 2 years ago

@lstein and everyone! Update

For 64 GB and 128 GB: slice_size = math.floor(2**30 / (q.shape[0] * q.shape[1])) might be a generic formula that works reasonably well to make sure you don't get noise while still being able to generate larger images.

The explanation is this: q.shape[0] * q.shape[1] * slice_size must not equal or exceed 2**31. If that condition is met, it will run. If you then take half of that slice_size (with a ~5% allowance), it will not generate noise. Hence 2**30.


slice_size = 4369 should also work reasonably well, but it's a value specifically picked for 1024x1024 (note it comes from the 4096 slice_size calculated with the formula above, given that q.shape[0] is 16 and the max q.shape[1] is 16384 for 1024x1024 images, plus a 6.7% allowance, so 2**30/(16*16384)*~1.067 = 4369). But performance for smaller image sizes with a fixed 4369 may suffer.

For example, with "banana sushi" -s50 -C7.5 -n1 -W640 -H448: 1 image(s) generated in 34.72s with the first formula, versus 1 image(s) generated in 36.74s with a fixed 4369.

That is because we are restricting slice_size to be 4369 instead of allowing it to take larger values.

Something to note with this formula is that it will produce a very large slice_size when q.shape[1] is small, like slice_size 262144 for q.shape[1] = 256.

Screenshot 2022-09-11 at 02 35 38

We could always add the min function to the formula, so slice_size = min(q.shape[1], math.floor(2**30 / (q.shape[0] * q.shape[1]))). But the code seems to run perfectly fine without it (and images look good and seem identical with both versions), so I removed the extra operation. A short sketch of the full formula follows below.

Screenshot 2022-09-11 at 02 57 51
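
Putting it together, a minimal sketch of the formula described above (the helper name is illustrative; the min() clamp is the optional variant just mentioned):

    import math

    def mps_slice_size(q_shape0, q_shape1):
        # 2**30 rather than 2**31: the run limit halved so outputs aren't pure noise
        slice_size = math.floor(2 ** 30 / (q_shape0 * q_shape1))
        return min(q_shape1, slice_size)  # optional clamp for small q.shape[1]

    print(mps_slice_size(16, 16384))  # 4096 for 1024x1024
    print(mps_slice_size(16, 256))    # 256 (262144 without the clamp)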

For people with less than 64GB, it might be a good idea to play around and share your own solution here (note @Vargol already has his!). I've been trying to think of a generic solution for devices with less RAM (it's 3 am here!), but it's not easy to find correlations, especially with the amount of bugs and weird behavior on Macs (like slice_size 2047 and 2049 working but 2048 failing, or the strange 5% allowance needed to avoid noisy images); the inability to test on 32GB, 16GB and 8GB myself, and the time it takes to run on M1, certainly don't make it easier!

I think a good solution might be to use psutil to get our RAM, and then do something similar to:

if device_type == 'mps':
    mem_gb_device = psutil.virtual_memory().total

    if mem_gb_device >= 64:
        optimize_64_gb()
    elif mem_gb_device >= 32:
        optimize_32_gb()
    elif mem_gb_device >= 16:
        optimize_16_gb()
    else:
        optimize_8_gb()

It's only a small for loop that we need to adapt, so it's not that crazy to run it in 2 or 3 different ways. And the most important thing is that the optimization from @Doggettx works (@Vargol has made it work for his 8GB device, and @i3oc9i and I for 128GB and 64GB). I think the added benefit of this (vs. trying to merge all of our solutions into one formula) is that we can push the update to the development branch faster, and people with CUDA devices will get their optimization too (which they're currently waiting on, for us).

As a suggestion, you could try my solution with something like slice_size = math.floor(2**30 / (q.shape[0] * q.shape[1] * MEM_ADJUSTMENT)), where you try to find a value (MEM_ADJUSTMENT) that lowers your slice_size based on your available RAM and hopefully generalizes across different image sizes; see the sketch below.
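
For example, something along these lines (the tier cut-offs and MEM_ADJUSTMENT values are placeholders to be tuned by testers, not measured numbers):

    import math
    import psutil

    def mem_adjusted_slice_size(q_shape0, q_shape1):
        mem_gb = psutil.virtual_memory().total / 1024 ** 3
        # Placeholder adjustment factors per RAM tier -- to be tuned experimentally
        if mem_gb >= 64:
            mem_adjustment = 1
        elif mem_gb >= 32:
            mem_adjustment = 2
        elif mem_gb >= 16:
            mem_adjustment = 4
        else:
            mem_adjustment = 8
        return math.floor(2 ** 30 / (q_shape0 * q_shape1 * mem_adjustment))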

Doggettx commented 2 years ago

@Any-Winter-4079

We could always add the min function to the formula, so slice_size = min(q.shape[1], math.floor(2**30 / (q.shape[0] * q.shape[1]))).

Have you set the min on the end= part as well, like I mentioned before? I want to keep reiterating how important that is, since I haven't seen that change yet ;) It could be the source of your noise issues.

i3oc9i commented 2 years ago

@Any-Winter-4079

Do you have another version we can test?

Also, following your good idea to adjust the numbers w.r.t. the memory available on the device, we can test different memory sizes by forcing mem_gb_device to lower values on our machines:

if device_type == 'mps':
    mem_gb_device = psutil.virtual_memory().total / 1024 ** 3

    if mem_gb_device >= 64:
        optimize_64_gb()
    elif mem_gb_device >= 32:
        optimize_32_gb()
    elif mem_gb_device >= 16:
        optimize_16_gb()
    else:
        optimize_8_gb()
Any-Winter-4079 commented 2 years ago

@Any-Winter-4079

We could always add the min function to the formula, so slice_size = min(q.shape[1], math.floor(2**30 / (q.shape[0] * q.shape[1]))).

Have you set the min on the end= part as well, like I mentioned before? I want to keep reiterating how important that is, since I haven't seen that change yet ;) It could be the source of your noise issues.

I'll check

Any-Winter-4079 commented 2 years ago

@Any-Winter-4079

Do you have another version we can test?

Also, following your good idea to adjust the numbers w.r.t. the memory available on the device, we can test different memory sizes by forcing mem_gb_device to lower values on our machines:

if device_type == 'mps':
    mem_gb_device = psutil.virtual_memory().total / 1024 ** 3

    if mem_gb_device >= 64:
        optimize_64_gb()
    elif mem_gb_device >= 32:
        optimize_32_gb()
    elif mem_gb_device >= 16:
        optimize_16_gb()
    else:
        optimize_8_gb()

Yes, here is the code: model 4.py.zip attention 5.py.zip

Screenshot 2022-09-11 at 11 45 25 Screenshot 2022-09-11 at 11 46 02

I think for what our computers can generate without optimization, it might be faster to use the vanilla version and only add the Doggettx optimisations for larger images. That should be the best of both worlds for 64-128GB.

Any-Winter-4079 commented 2 years ago

@Doggettx Here's the code I'm using. What change are you suggesting with end? end = min(q.shape[1], i + slice_size) if I'm understanding correctly?

Screenshot 2022-09-11 at 11 53 18
Vargol commented 2 years ago

@Any-Winter-4079 do you have another version we can test ? Also following your good idea to adjust numbers w.r.t. memory available on the device, we can test different memory size forcing mem_gb_device to lower values on our machine

if device_type == 'mps':
    mem_gb_device = psutil.virtual_memory().total / 1024 ** 3

    if mem_gb_device >= 64:
        optimize_64_gb()
    elif mem_gb_device >= 32:
        optimize_32_gb()
    elif mem_gb_device >= 16:
        optimize_16_gb()
    else:
        optimize_8_gb()

If we're going that route, would it be easier to set it all up in the __init__ function? Something like:

class CrossAttention(nn.Module):
    def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        ...
        if 'mps' and '8gb':
            self.compute_slice_size = self.compute_slice_size_8g_mps;
            self.einsum_loop = self.ein_sum_loop_mps
        else:    
            self.compute_slice_size = self.compute_slice_size_non_mps;
            self.einsum_loop = self.ein_sum_loop_non_mps

    def compute_slice_size_8g_mps(self):
          return 1

    def compute_slice_size_non_mps(self):
          steps = 1 
          # proper calcualtion
          ...
          return slice_size

    def ein_sum_loop_non_mps(self, q, r1, slice_size):
        for i in range(0, q.shape[1], slice_size):
            # einsum calc
            ...
        return r1   

    def ein_sum_loop_8g_mps(self, q, r1, slice_size):
        for i in range(0, q.shape[0], slice_size):
            #einsum calc
            ...
        return r1    
...

    def forward(self, x, context=None, mask=None):
        ...
        r1 = torch.zeros(q.shape[0], q.shape[1], v.shape[2], device=q.device)

        slice_size = self.compute_slice_size()
        r1 = self.einsum_loop(q, r1, slice_size)

Obviously not working code, but hopefully it conveys the idea; I'm not even sure you can patch a method from inside __init__.

EDIT: yes, it looks like you can patch from within __init__.

I replaced the slice_size calc with two noddy functions:

class CrossAttention(nn.Module):
    def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        inner_dim = dim_head * heads
        context_dim = default(context_dim, query_dim)

        self.scale = dim_head ** -0.5
        self.heads = heads

        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, query_dim),
            nn.Dropout(dropout)
        )

        if torch.backends.mps.is_available():
            self.compute_slice_size = self.compute_slice_size_8g_mps
        else:
            self.compute_slice_size = self.compute_slice_size_non_mps

    def compute_slice_size_8g_mps(self):
          return 768

    def compute_slice_size_non_mps(self):
          steps = 1 
          # proper calcualtion
          return 2000

I changed forward too:

        r1 = torch.zeros(q.shape[0], q.shape[1], v.shape[2], device=q.device)

        slice_size = self.compute_slice_size()
        print(slice_size)

        for i in range(0, q.shape[1], slice_size):

and it prints out the expected value and no change to the output image.

Then I changed the condition to 'not mps' and got the alternative slice size.

Any-Winter-4079 commented 2 years ago

I'm doing tests with @Doggettx's end fix, if I understood it correctly.

Screenshot 2022-09-11 at 13 47 00

Also, added some cooling, which is helping.

Screenshot 2022-09-11 at 13 47 59

Results (will be updated).

Screenshot 2022-09-11 at 14 05 25 Screenshot 2022-09-11 at 14 05 49
lstein commented 2 years ago

I'm preparing another release candidate today, and since you seem to be converging on a solution, I will happily wait until you give the thumbs up. If and when you feel these fixes are stable, could you post a PR against the current HEAD of the development branch? Alternatively, if it looks like there is still a lot of work to do, let me know so that I can release a version containing the original CompViz code, the neonpixel changes to attention.py, and the MPS fixes.

Any-Winter-4079 commented 2 years ago

We have a very simple solution for 64-128GB, and it's currently working. The 2 best options I can see are:

Eventually, we should have fixes for all mem sizes, which can be added as a PR.

I'll post the 'final' code for 64-128GB soon. As an almost final version for 64-128GB, this is the code with the end fix from @Doggettx. Performance seems pretty similar, but I still decided to leave the fix in. attention_end_fix.py.zip model_end_fix.py.zip