Closed Any-Winter-4079 closed 2 years ago
For 4GB cards, I've read there's a float16 version of model.ckpt?
Yeah, if you noticed, I pushed another update to my fork; it allows me to generate 1792x1792 on 8 GB VRAM and 1024x1024 (maybe even more) on 4 GB VRAM. The optimization is not limited to one file but covers the repo as a whole. It is enabled in low-VRAM mode via the low-VRAM config.
@Any-Winter-4079 I have run a test again with the new model 3.py.zip and attention 4.py.zip
slice_size 4369
"banana sushi" -s10 -W1024 -H1024 -C7.5 -Ak_lms -n10
10 image(s) generated in 376.80s
So this is an improvement in speed with respect to the previous version; see my previous comment.
@i3oc9i There were a few calculations with steps, mem_required, mem_free_total and slice_size, which I just commented out, since for now we're hard-coding slice_size anyway.
A nice 8% improvement :)
@Vargol Then, in this version, which is the fastest for you:
    steps = 1
    for i in range(0, q.shape[0], steps):
        end = i + steps
        s1 = einsum('b i d, b j d -> b i j', q[i:end], k[i:end])
        s1 *= self.scale
        s2 = s1.softmax(dim=-1)
        del s1
        r1[i:end] = einsum('b i j, b j d -> b i d', s2, v[i:end])
        del s2
I understand you do for i in range(0, q.shape[0], steps), which is for i in range(0, 16, 1).
So i is going to be 0, 1, 2, ..., 15.
And then end = i + steps, which is end = i + 1.
Then you do s1 = einsum('b i d, b j d -> b i j', q[i:i+1], k[i:i+1])
Then, there's this version, which is much slower for you:
    slice_size = 1
    for i in range(0, q.shape[1], slice_size):
        end = i + slice_size
        s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale
        s2 = s1.softmax(dim=-1)
        del s1
        r1[:, i:end] = einsum('b i j, b j d -> b i d', s2, v)
        del s2
So for i in range(0, q.shape[1], slice_size) is for i in range(0, 16384, 1).
So i is going to be 0, 1, 2, ..., 16383.
Then end = i + slice_size is end = i + 1.
And finally you do s1 = einsum('b i d, b j d -> b i j', q[:, i:i+1], k) * self.scale
Do you or anyone know why one is much faster? I see the discrepancy in loop iterations, but I'm not very familiar with einsum. Maybe @Doggettx or @ryudrigo, who have implemented optimisations, know?
The top one splits the arrays in the first (0) (batch) dimension, so you get fewer total splits (only 16 operations total). The bottom one splits the arrays in the second (1) dimension, so you get more total splits (16384 operations total).
What I was trying to do on that pull request is to have the option to split in any of those two dimensions.
I will have to be away for some hours, though, hope I have clarified enough
It's much slower because in that case you are doing 16384 iterations with a slice of 1, while with the other one you only do 16 iterations. That's why you shouldn't use the slice_size but only change the steps.
In other words, slice_size=1 on the first version is the same as slice_size=1024 on the second version (considering a size of 16384)
Again it's also very important to not use the code as is, but use the changes I proposed earlier if you're going to change the slice_size without using a steps with a factor of 2, or you will cause corruption in the array.
I wouldn't adjust the slice_size at all though, but instead just control it with the steps (with a factor of 2) since then the resulting blocks are much easier to handle for torch and in the correct sizes.
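To make the iteration counts above concrete, here is a small pure-Python sketch (no torch needed). It assumes q has shape (16, 16384, d) as quoted in this thread for a 1024x1024 image; the helper names and the last dimension are made up for illustration.

```python
import math

# Shape assumed from the thread for a 1024x1024 image.
# The last dimension (40) is a placeholder and is not taken from the thread.
Q_SHAPE = (16, 16384, 40)


def loop_iterations(shape, axis, chunk):
    """Number of passes made by `for i in range(0, shape[axis], chunk)`."""
    return math.ceil(shape[axis] / chunk)


def rows_per_pass(shape, axis, chunk):
    """Query rows (counted across both leading dims) handled in one pass."""
    other = shape[1] if axis == 0 else shape[0]
    return min(chunk, shape[axis]) * other


# Splitting on dim 0 with steps=1 gives 16 passes.
print(loop_iterations(Q_SHAPE, axis=0, chunk=1))
# Splitting on dim 1 with slice_size=1 gives 16384 passes.
print(loop_iterations(Q_SHAPE, axis=1, chunk=1))
# One dim-0 slice and a dim-1 slice of 1024 cover the same number of rows per pass.
print(rows_per_pass(Q_SHAPE, axis=0, chunk=1), rows_per_pass(Q_SHAPE, axis=1, chunk=1024))
```

Under these assumptions both variants handle 16384 query rows per pass, which is one way to read the "slice_size=1 on the first version is the same as slice_size=1024 on the second" remark.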
@Doggettx
As a curiosity: with slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1], since steps are powers of 2, why not slice_size = q.shape[1] // steps?
Can it happen that (q.shape[1] % steps) == 0 is not true?
Well, maybe if q.shape[1] is less than steps. Not sure if it can happen. Also, not sure if q.shape[1] could be an odd number (that would also make it not true).
About setting the steps and not the slice_size: for example, for a 1024x1024 image, we need to make sure slice_size <= 4369 (else: Mac error), so I guess, given the max q.shape[1] = 16384, we'd need 4 steps.
slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
evaluates to slice_size = 16384 // 4 if (16384 % 4) == 0 else 16384, which is slice_size = 4096.
However, it seems @Vargol reports it slows down this way (?) Could you test with steps = 4?
    steps = 4
    slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
    for i in range(0, q.shape[1], slice_size):
        end = i + slice_size
        s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale
        s2 = s1.softmax(dim=-1)
        del s1
        r1[:, i:end] = einsum('b i j, b j d -> b i d', s2, v)
        del s2
    del q, k, v
"banana sushi" -s1 -C7.5 -Ak_lms -S2792018001 -W 1024 -H 1024
MBP M1 Pro, 16gb, OS 13.0
Test 2: Doggettx-optimizations branch with M1 changes. No noticeable changes, around 5.5s/it.
Test 3: Doggettx-optimizations branch with M1 changes, setting a fixed slice_size. No noticeable changes, around 5.5s/it.
8191 is slightly faster than 1677, but almost no difference between all of them. But I don’t understand how I need to calculate this correctly. if I can do 768x768 on RC1.14 branch without major issues without it, do I need it?
neonsecret-optimizations Very fast with 512x512, 1.58s/it
Error with 768 like before
The Metal Performance Shaders operations encoded on it may not have completed.
Error:
(null)
Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
<AGXG13XFamilyCommandBuffer: 0x29e374750>
label =
I think I have a formula, thanks to all of the suggestions from @Doggettx and @ryudrigo . Let's see if it works...
@netsvetaev or whoever is feeling kind, could you run tests with attention.py from #432? Especially if you have a Mac. attention.zip
If anyone's still interested: the image went to noise at slice_size = 4042 for me; 4041 was an image.
I changed the last line of the loop to... end = min(q.shape[1], i + slice_size) and 4042 was still noise.
I then tried slice size of 1024 and that seems to be the closest to the speed of the 'fast for me' version of the loop with 17.00 s/it , might play and see if I can get that down.
What size was the image with slice_size=4041 @Vargol ? 1024x1024?
-W896 -H512
Second and subsequent runs seem to be slower, ~20% for me; I think dreams.py is holding on to some memory somewhere. Second run of the same command in the same session was 20 s/it. Something to be aware of.
Wait, I didn't know Collaborators could edit comments from any other person? Anyway, I meant to quote you :)
Second and subsequent runs seem to be slower, ~20% for me; I think dreams.py is holding on to some memory somewhere. Second run of the same command in the same session was 20 s/it. Something to be aware of.
I experience the same.
slice_size = 768 seems to be the sweet spot at ~ 15.5 s/it still a little slower than the reigning champ. Gonna try a few more image sizes now.
@Doggettx As a curiosity: with
slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
since steps are powers of 2, why not slice_size = q.shape[1] // steps? Can it happen that (q.shape[1] % steps) == 0 is not true? Well, maybe if q.shape[1] is less than steps. Not sure if it can happen. Also, not sure if q.shape[1] could be an odd number (that would also make it not true)
Yes, that's why the check is there. The function also gets called with the tokens, which have a size of 77, and with lower-res versions, which are smaller.
    slice_size = q.shape[1] // steps if (q.shape[1] % steps) == 0 else q.shape[1]
    for i in range(0, q.shape[1], slice_size):
        end = i + slice_size
        s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k) * self.scale
        s2 = s1.softmax(dim=-1)
        del s1
        r1[:, i:end] = einsum('b i j, b j d -> b i d', s2, v)
        del s2
    del q, k, v
I see you didn't do the end = min(q.shape[1], i + slice_size) still though, that's gonna cause corruption if you adjust slice size with the % steps == 0 check.
P.S. if you need a lower slice_size you can just multiply steps by 2
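A minimal sketch of the two pieces discussed here: the steps-based slice_size expression and the clamped end. It is pure Python and only computes index ranges; the function names are hypothetical and not taken from any attention.py in this thread.

```python
def slice_size_from_steps(q_len, steps):
    """Same expression as quoted above: fall back to q_len if steps does not divide it."""
    return q_len // steps if q_len % steps == 0 else q_len


def chunk_bounds(q_len, slice_size):
    """(start, end) pairs covering [0, q_len), with end clamped to q_len."""
    return [(i, min(q_len, i + slice_size)) for i in range(0, q_len, slice_size)]


print(slice_size_from_steps(16384, 4))  # 4096
print(slice_size_from_steps(77, 4))     # 77, because 77 is not divisible by 4
print(chunk_bounds(77, 32))             # [(0, 32), (32, 64), (64, 77)]
```

With steps as a power of 2 and a q.shape[1] of 16384 every chunk has the same length; the clamp only changes the last chunk when slice_size is set by hand to something that does not divide q.shape[1].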
okay, so 1024x1024 I get
RuntimeError: Not enough memory, use lower resolution (max approx. 960x960). Need: 0.6GB free, Have:0.6GB free
(Nope I've run 1024x1024 on this machine before on the other code)
512x512: it's a little slower, 7.25 compared to 5.7-6.1.
256x256: 1.03s/it, which may be faster (it compares to 1.26), but it could be statistical noise; either way it's as fast.
Right I'll try a couple of runs with @ryudrigo 's pr attention.py before I call it a day.
Maybe it's because of
    if steps > 64:
        max_res = math.floor(math.sqrt(math.sqrt(mem_free_total / 2.5)) / 8) * 64
        raise RuntimeError(f'Not enough memory, use lower resolution (max approx. {max_res}x{max_res}). '
                           f'Need: {mem_required/64/gb:0.1f}GB free, Have:{mem_free_total/gb:0.1f}GB free')
I just commented out that part
@netsvetaev , @ryudrigo
I think you've lost a contiguous statement as that error looks very familiar.
sorry, my mistake, was on another branch. But here is another error
Traceback (most recent call last):
File "/Users/artur/stable-diffusion/ldm/generate.py", line 317, in prompt2image
results = generator.generate(
File "/Users/artur/stable-diffusion/ldm/dream/generator/base.py", line 70, in generate
image = make_image(x_T)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/Users/artur/stable-diffusion/ldm/dream/generator/txt2img.py", line 30, in make_image
samples, _ = sampler.sample(
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/Users/artur/stable-diffusion/ldm/models/diffusion/ksampler.py", line 83, in sample
K.sampling.__dict__[f'sample_{self.schedule}'](
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/Users/artur/stable-diffusion/src/k-diffusion/k_diffusion/sampling.py", line 187, in sample_lms
denoised = model(x, sigmas[i] * s_in, **extra_args)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/models/diffusion/ksampler.py", line 16, in forward
uncond, cond = self.inner_model(x_in, sigma_in, cond=cond_in).chunk(2)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/src/k-diffusion/k_diffusion/external.py", line 115, in forward
eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
File "/Users/artur/stable-diffusion/src/k-diffusion/k_diffusion/external.py", line 141, in get_eps
return self.inner_model.apply_model(*args, **kwargs)
File "/Users/artur/stable-diffusion/ldm/models/diffusion/ddpm.py", line 1440, in apply_model
x_recon = self.model(x_noisy, t, **cond)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/models/diffusion/ddpm.py", line 2148, in forward
out = self.diffusion_model(x, t, context=cc)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 806, in forward
h = module(h, emb, context)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 88, in forward
x = layer(x, context)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 322, in forward
x = block(x, context=context)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 273, in forward
return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
File "/Users/artur/stable-diffusion/ldm/modules/diffusionmodules/util.py", line 157, in checkpoint
return func(*inputs)
File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 276, in _forward
x = self.attn1(self.norm1(x)) + x
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 190, in forward
return F.layer_norm(
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/functional.py", line 2511, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
>> Could not generate image.
>> Usage stats:
>> 0 image(s) generated in 0.42s
>> Max VRAM used for this generation: 0.00G
Do you want to summarize the current problems, @Any-Winter-4079 ? I will likely get back to the code after the tests on my version of attention.py and would like to know what to address first.
@netsvetaev , @ryudrigo
I think you've lost a contiguous statement as that error looks very familiar.
Yeah, it's literally the first thing on this issue. I think I skipped that modification for macs. Just a moment...
OK, updated the PR, added x = x.contiguous() if x.device.type == 'mps' else x
at line 276. Also here is the file
attention.zip
OK, updated the PR, added
x = x.contiguous() if x.device.type == 'mps' else x
at line 276. Also here is the file attention.zip
mem_free = psutil.virtual_memory().available
NameError: name 'psutil' is not defined
@ryudrigo
1) Well, the first problem is that some values of slice_size seem to flat out fail. For example:
"banana sushi" -s10 -C7.5 -Ak_lms -S2792018001 -W 1024 -H 1024
slice_size = 2047 works.
slice_size = 2048 fails. Error: product of dimension sizes > 2**31
slice_size = 2049 works.
I'm assuming there is some problem with some powers of 2. However, slice_size = 1024 works.
Maybe some powers of 2, above some value, fail.
Then, we can see how (for the same 1024x1024 image)
slice_size = 8191 works.
slice_size = 8192 fails. Error: product of dimension sizes > 2**31
slice_size = 8193 fails. Error: product of dimension sizes > 2**31
slice_size = 8420 fails. Error: product of dimension sizes > 2**31
So, unless we find a slice_size bigger than 8191 that works, I'm assuming that when q.shape[0] * q.shape[1] * slice_size >= 2^31, it fails. With slice_size 8192, 16 * 16384 * 8192 == 2^31.
Up to this point, 2 hypotheses:
2) With the rules above, we can generate images (i.e. not crash). But they may be pure noise. To get an actual image, we need to decrease the slice_size. How much? It's not clear.
slice_size = 4369 seems to be the highest it can go to get an actual image (not noise). But I don't see any logic behind that number.
However, a good approximation may be half of the max value from the previous step.
So if slice_size = 8191 was the max from the previous step, because 16 * 16384 * 8191 < 2^31 but 16 * 16384 * 8192 >= 2^31, then we halve that value:
8191 / 2 = 4095.5
To be safer, we can take 4095.
3) Now, the third problem is that these numbers work great for 64 GB and 128 GB M1. But people with less RAM seem to have trouble. I guess some slice_size values are too much for their memory, but we haven't figured out a formula yet.
Note: Some nice math I've been able to use is: if max slice_size = 8191 for 1024x1024, then if I want to generate a 3200x1600 image, we can do
(1024 * 1024) / (3200 * 1600) * 8191 = 1677 (the new slice_size to use)
PS: I aborted the 3200x1600 process, so no guarantees that it works or that it does not generate noise (taking 1677/2). But for smaller sizes, the formula seems to work reasonably well.
But this is for 64-128 GB. It would be awesome to find values for lower-GB Macs, to see if we can make some sense of it.
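The arithmetic in points 1) to 3) can be checked with a few lines of Python. This is only a sketch of the hypothesis stated above, not a confirmed description of the MPS limit. It assumes q.shape[0] stays at 16 and that q.shape[1] equals (width / 8) * (height / 8), which matches the 16384 quoted for 1024x1024 but is otherwise an assumption; the function names are made up.

```python
LIMIT = 2 ** 31  # threshold named in the reported error message


def max_running_slice_size(batch, q_len):
    """Largest slice_size with batch * q_len * slice_size strictly below 2**31."""
    return (LIMIT - 1) // (batch * q_len)


def tokens_for_image(width, height, downsample=8):
    """Assumed relation between image size and q.shape[1] (not stated in the thread)."""
    return (width // downsample) * (height // downsample)


def scaled_slice_size(base_slice, base_wh, new_wh):
    """Area-ratio rule of thumb from the note above, rounded down."""
    base_area = base_wh[0] * base_wh[1]
    new_area = new_wh[0] * new_wh[1]
    return base_slice * base_area // new_area


print(max_running_slice_size(16, tokens_for_image(1024, 1024)))  # 8191
print(scaled_slice_size(8191, (1024, 1024), (3200, 1600)))       # 1677
print(max_running_slice_size(16, tokens_for_image(3200, 1600)))  # 1677
```

Under these assumptions the area-ratio shortcut and the direct 2**31 bound agree for 3200x1600, which suggests they are the same rule written two ways.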
OK, updated the PR, added
x = x.contiguous() if x.device.type == 'mps' else x
at line 276. Also here is the file attention.zip
mem_free = psutil.virtual_memory().available
NameError: name 'psutil' is not defined
Well that's what I get for not working on the same file (and not having a mac =P) Added the import.
I should merge whatever I'm doing with Doggettx's at some point.
Well that's what I get for not working on the same file (and not having a mac =P) Added the import.
Exception occurred during processing of request from ('192.168.0.10', 63261)
Traceback (most recent call last):
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/socketserver.py", line 683, in process_request_thread
self.finish_request(request, client_address)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/socketserver.py", line 360, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/socketserver.py", line 747, in __init__
self.handle()
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/http/server.py", line 432, in handle
self.handle_one_request()
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/http/server.py", line 420, in handle_one_request
method()
File "/Users/artur/stable-diffusion/ldm/dream/server.py", line 221, in do_POST
self.model.prompt2image(**vars(opt), step_callback=image_progress, image_callback=image_done)
File "/Users/artur/stable-diffusion/ldm/generate.py", line 317, in prompt2image
results = generator.generate(
File "/Users/artur/stable-diffusion/ldm/dream/generator/base.py", line 70, in generate
image = make_image(x_T)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/Users/artur/stable-diffusion/ldm/dream/generator/txt2img.py", line 30, in make_image
samples, _ = sampler.sample(
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/Users/artur/stable-diffusion/ldm/models/diffusion/ksampler.py", line 83, in sample
K.sampling.__dict__[f'sample_{self.schedule}'](
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/Users/artur/stable-diffusion/src/k-diffusion/k_diffusion/sampling.py", line 187, in sample_lms
denoised = model(x, sigmas[i] * s_in, **extra_args)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/models/diffusion/ksampler.py", line 16, in forward
uncond, cond = self.inner_model(x_in, sigma_in, cond=cond_in).chunk(2)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/src/k-diffusion/k_diffusion/external.py", line 115, in forward
eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
File "/Users/artur/stable-diffusion/src/k-diffusion/k_diffusion/external.py", line 141, in get_eps
return self.inner_model.apply_model(*args, **kwargs)
File "/Users/artur/stable-diffusion/ldm/models/diffusion/ddpm.py", line 1440, in apply_model
x_recon = self.model(x_noisy, t, **cond)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/models/diffusion/ddpm.py", line 2148, in forward
out = self.diffusion_model(x, t, context=cc)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 806, in forward
h = module(h, emb, context)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 88, in forward
x = layer(x, context)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 324, in forward
x = block(x, context=context)
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 274, in forward
return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
File "/Users/artur/stable-diffusion/ldm/modules/diffusionmodules/util.py", line 157, in checkpoint
return func(*inputs)
File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 278, in _forward
x = self.attn1(self.norm1(x)) + x
File "/Users/artur/.conda/envs/ldm3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 219, in forward
self.compute_steps(q.device, outer_limit, inner_limit, k.shape[1])
File "/Users/artur/stable-diffusion/ldm/modules/attention.py", line 200, in compute_steps
self.inner_step = min (self.inner_step, (2<<31) // (self.outer_step * dim_softmax[1]))
TypeError: 'int' object is not subscriptable
:-(
=/ Thanks for the patience, though attention.zip
Yes, comment out that check and 1024x1024 runs fine, but slower by the looks of it: 60s/it versus the normal 50-something. At 50 samples I can do images in 35 - 45 minutes, depending on the sampler, on the other code (on the spread run I averaged 46 minutes, but that includes two 'longer' runs); obviously 60s/it would suggest 50 minutes for this version if the speed held up.
Right a couple of runs with @ryudrigo's code
Not sure if I get it, do you mean you're getting 60s/it with my code?
=/ Thanks for the patience, though attention.zip
It works on rc1.14 without out of memory errors, but slower than original. ~50s/it with 768, 5.5 with 512. Original was ~35s/it.
Yes, comment out that check and 1024x1024 runs fine but slower by the looks of it, 60s/it normal 50 something at 50 samples I can do images in 35 - 45 minutes depending on the sampler on the other code (on the spread run I averaged 46 minutes but that its two 'longer' runs), obvs 60s/its would suggest 50 minutes for this version its the speed held up. Right a couple of runs with @ryudrigo's code
No sure if I get it, do you mean you're getting 60s/it with my code?
Nope not your code, that next :0)
It was 1.1.4 with a hack for a fixed slice size of 768, as that had been the fastest it ran at -W896 -H512
Here is a summary of errors. https://github.com/lstein/stable-diffusion/issues/431#issuecomment-1242786666 If you have experienced others, please comment on them.
8191 is slightly faster than 1677, but almost no difference between all of them. But I don’t understand how I need to calculate this correctly. if I can do 768x768 on RC1.14 branch without major issues without it, do I need it?
@netsvetaev Was this for 1024x1024 image? I mean, if I understand correctly, you did
"banana sushi" -s10 -C7.5 -Ak_lms -S2792018001 -W 1024 -H 1024
with slice_size = 8191 and it worked, right?
And I assume it generated a noisy image.
Question. If you set slice_size = 4369 and run the same command again, is the image not noisy now?
All this is using doggettx-optimizations branch. With these 2 files updated. https://github.com/lstein/stable-diffusion/issues/431#issuecomment-1242749937
Okay this is all I've got time for... @ryudrigo your #432 attention.py
All tests using "banana sushi" -s10 -Wnnn_-Hnnn -C7.5 -Ak_lms -F
256x256 1.10s/it
512x512 10.06s/it
1024x1024 41.86s/it
So your small and bigger image speeds are good but the middle is a little slow; I really need to throw the spreadsheet commands at it so I can compare properly.
8191 is slightly faster than 1677, but almost no difference between all of them. But I don’t understand how I need to calculate this correctly. if I can do 768x768 on RC1.14 branch without major issues without it, do I need it?
@netsvetaev Was this for 1024x1024 image? I mean, if I understand correctly, you did
"banana sushi" -s10 -C7.5 -Ak_lms -S2792018001 -W 1024 -H 1024
with slice_size = 8191 and it worked, right?
And I assume it generated a noisy image.
Question. If you set slice_size = 4369 and run the same command again, is the image not noisy now?
All this is using doggettx-optimizations branch. With these 2 files updated. https://github.com/lstein/stable-diffusion/issues/431#issuecomment-1242749937
No, it was 768x768. I will test again later.
Okay this is all I've got time for... @ryudrigo your #432 attention.py
All tests using
"banana sushi" -s10 -Wnnn_-Hnnn -C7.5 -Ak_lms -F
256x256 1.10s/it
512x512 10.06s/it
1024x1024 41.86s/it
So your small and bigger image speeds are good but the middle is a little slow; I really need to throw the spreadsheet commands at it so I can compare properly.
thank you. be sure to use the latest attention.py from the PR (432) when you come back, since it reflects latest discussions. Once again, thanks!
@lstein and everyone! Update
For 64GB and 128 GB:
slice_size = math.floor(2**30 / (q.shape[0] * q.shape[1]))
might be a generic formula that works reasonably well to make sure you don't get noise but can generate larger images.
The explanation is this. q.shape[0] * q.shape[1] * slice_size must not equal or exceed 2**31. If that is met, it will run. If you take half of that slice_size (with ~5% allowance), it will not generate noise. Thus, 2**30.
slice_size = 4369 should also work reasonably well, but it's a value specifically picked for 1024x1024 (note it comes from the 4096 slice_size calculated with the formula above, given that q.shape[0] is 16 and the max q.shape[1] is 16384 for 1024x1024 images, and adding a 6.7% allowance, so 2**30/(16*16384)*~1.067 = 4369). But performance for lower image sizes with fixed 4369 may suffer.
For example:
"banana sushi" -s50 -C7.5 -n1 -W640 -H448
1 image(s) generated in 34.72s with the first formula
1 image(s) generated in 36.74s with fixed 4369
That is because we are restricting slice_size to be 4369 instead of allowing it to take larger values.
Something to note with this formula, is that it will generate some large slice_size when q.shape[1] is small, like generating slice_size 262144 for q.shape[1] 256.
We could always add the min function to the formula, so slice_size = min(q.shape[1], math.floor(2**30 / (q.shape[0] * q.shape[1]))). But the code seems to run perfectly fine without it (and images look good and seem identical with both versions) so I removed the extra operation.
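For reference, here is the formula from this update as a small standalone function, with the optional min clamp as a flag. The function name and the clamp parameter are only for illustration; the shapes are the ones quoted in this thread.

```python
import math


def slice_size_for(q_shape0, q_shape1, clamp=False):
    """slice_size = floor(2**30 / (q.shape[0] * q.shape[1])), optionally capped at q.shape[1]."""
    size = math.floor(2 ** 30 / (q_shape0 * q_shape1))
    return min(q_shape1, size) if clamp else size


print(slice_size_for(16, 16384))            # 4096
print(slice_size_for(16, 256))              # 262144
print(slice_size_for(16, 256, clamp=True))  # 256
```

An oversized slice_size makes range(0, q.shape[1], slice_size) run a single pass, and slicing past the end of a dimension just returns everything up to the end, which would be consistent with both versions producing identical images.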
For people with less than 64GB, it might be a good idea to play around and share your own solution here (note @Vargol already has his solution!). I've been trying to think of a generic solution for devices with less RAM (it's 3 am here!), but it's not easy to find correlations, especially with the number of bugs and weird behavior on Macs (like slice_size 2047 and 2049 working but 2048 failing, or the strange 5% allowance to generate non-noisy images). The inability to test myself on 32GB, 16GB and 8GB, and the time it takes to run on M1, certainly don't make it easy!
I think a good solution might be to use psutil to get our RAM, and then do something similar to:
    if device_type == 'mps':
        # psutil reports total memory in bytes, so convert to GB before comparing
        mem_gb_device = psutil.virtual_memory().total / 1024 ** 3
        if mem_gb_device >= 64:
            optimize_64_gb()
        elif mem_gb_device >= 32:
            optimize_32_gb()
        elif mem_gb_device >= 16:
            optimize_16_gb()
        else:
            optimize_8_gb()
It's only a small for loop that we need to adapt, so it's not that crazy to run it in 2 or 3 different ways. And the most important thing is that the optimization from @Doggettx works (@Vargol has made it work for his 8GB device, and @i3oc9i and myself for 128GB and 64GB). I think the added benefit of this (vs. trying to merge all of our solutions into one formula) is that we can push the update to the development branch faster, and people with CUDA devices are going to enjoy their optimization too (which they're currently waiting on us for).
As a suggestion, you may try my solution with something like slice_size = math.floor(2**30 / (q.shape[0] * q.shape[1] * MEM_ADJUSTMENT)), where you try to find a value (MEM_ADJUSTMENT) to lower your slice_size based on your available RAM, which hopefully generalizes across the different image sizes.
@Any-Winter-4079
We could always add the min function to the formula, so slice_size = min(q.shape[1], math.floor(2**30 / (q.shape[0] * q.shape[1]))).
Have you set the min on the end= part as well, like I mentioned before? I want to keep reiterating how important that is, since I haven't seen the change yet ;) Could be the source of your noise issues
@Any-Winter-4079
do you have another version we can test ?
Also, following your good idea to adjust numbers w.r.t. the memory available on the device, we can test different memory sizes by forcing mem_gb_device to lower values on our machines
    if device_type == 'mps':
        mem_gb_device = psutil.virtual_memory().total / 1024 ** 3
        if mem_gb_device >= 64:
            optimize_64_gb()
        elif mem_gb_device >= 32:
            optimize_32_gb()
        elif mem_gb_device >= 16:
            optimize_16_gb()
        else:
            optimize_8_gb()
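A sketch of that dispatch as a pure function, so the tiers can be tested without a Mac. The override_gb parameter is a hypothetical addition for forcing a lower tier on a big machine, and the function name is made up. In real use the byte count would come from psutil.virtual_memory().total, which is reported in bytes; comparing the raw byte count against 64 would always select the top tier.

```python
GIB = 1024 ** 3  # bytes per GiB


def memory_tier_gb(total_bytes, override_gb=None):
    """Return 64, 32, 16 or 8 depending on total RAM; override_gb forces a value for testing."""
    total_gb = override_gb if override_gb is not None else total_bytes / GIB
    for tier in (64, 32, 16):
        if total_gb >= tier:
            return tier
    return 8


print(memory_tier_gb(128 * GIB))                 # 64
print(memory_tier_gb(16 * GIB))                  # 16
print(memory_tier_gb(64 * GIB, override_gb=8))   # 8
```

The returned tier could then select one of the optimize_*_gb() variants from the snippet above.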
@Any-Winter-4079
We could always add the min function to the formula, so slice_size = min(q.shape[1], math.floor(2**30 / (q.shape[0] * q.shape[1]))).
Have you set the min on the end= part as well, like I mentioned before? I want to keep reiterating how important that is, since I haven't seen the change yet ;) Could be the source of your noise issues
I'll check
@Any-Winter-4079
do you have another version we can test ?
Also, following your good idea to adjust numbers w.r.t. the memory available on the device, we can test different memory sizes by forcing mem_gb_device to lower values on our machines
    if device_type == 'mps':
        mem_gb_device = psutil.virtual_memory().total / 1024 ** 3
        if mem_gb_device >= 64:
            optimize_64_gb()
        elif mem_gb_device >= 32:
            optimize_32_gb()
        elif mem_gb_device >= 16:
            optimize_16_gb()
        else:
            optimize_8_gb()
Yes, here is the code: model 4.py.zip attention 5.py.zip
I think for what our computer can generate without optimization, it might be faster with the vanilla version, and then add Doggettx optimisations for larger images only. That should be the best of both worlds for 64-128GB.
@Doggettx Here's the code I'm using. What change are you suggesting with end? end = min(q.shape[1], i + slice_size), if I'm understanding correctly?
@Any-Winter-4079 do you have another version we can test? Also, following your good idea to adjust numbers w.r.t. the memory available on the device, we can test different memory sizes by forcing mem_gb_device to lower values on our machines
    if device_type == 'mps':
        mem_gb_device = psutil.virtual_memory().total / 1024 ** 3
        if mem_gb_device >= 64:
            optimize_64_gb()
        elif mem_gb_device >= 32:
            optimize_32_gb()
        elif mem_gb_device >= 16:
            optimize_16_gb()
        else:
            optimize_8_gb()
If we're going that route, would it be easier to set it all up in the init function? Something like:
class CrossAttention(nn.Module):
    def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        ...
        if 'mps' and '8gb':  # placeholder for the real device and memory check
            self.compute_slice_size = self.compute_slice_size_8g_mps
            self.einsum_loop = self.ein_sum_loop_8g_mps
        else:
            self.compute_slice_size = self.compute_slice_size_non_mps
            self.einsum_loop = self.ein_sum_loop_non_mps

    def compute_slice_size_8g_mps(self):
        return 1

    def compute_slice_size_non_mps(self):
        steps = 1
        # proper calculation
        ...
        return slice_size

    def ein_sum_loop_non_mps(self, q, r1, slice_size):
        for i in range(0, q.shape[1], slice_size):
            # einsum calc
            ...
        return r1

    def ein_sum_loop_8g_mps(self, q, r1, slice_size):
        for i in range(0, q.shape[0], slice_size):
            # einsum calc
            ...
        return r1

    ...

    def forward(self, x, context=None, mask=None):
        ...
        r1 = torch.zeros(q.shape[0], q.shape[1], v.shape[2], device=q.device)
        slice_size = self.compute_slice_size()
        r1 = self.einsum_loop(q, r1, slice_size)
Obviously not working code, but hopefully it conveys the idea. I'm not even sure you can patch a method from inside init.
EDIT: yes, it looks like you can patch from within init.
I replaced the slice_size calc with two noddy functions:
class CrossAttention(nn.Module):
    def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        inner_dim = dim_head * heads
        context_dim = default(context_dim, query_dim)
        self.scale = dim_head ** -0.5
        self.heads = heads
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, query_dim),
            nn.Dropout(dropout)
        )
        if torch.backends.mps.is_available():
            self.compute_slice_size = self.compute_slice_size_8g_mps
        else:
            self.compute_slice_size = self.compute_slice_size_non_mps

    def compute_slice_size_8g_mps(self):
        return 768

    def compute_slice_size_non_mps(self):
        steps = 1
        # proper calculation
        return 2000
I changed forward too:
r1 = torch.zeros(q.shape[0], q.shape[1], v.shape[2], device=q.device)
slice_size = self.compute_slice_size()
print(slice_size)
for i in range(0, q.shape[1], slice_size):
    ...
It prints out the expected value, and there is no change to the output image.
Then I changed the condition to 'not mps' and got the alternative slice size.
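The same idea can be tried without torch. This is a minimal sketch of choosing a method once in `__init__` and calling it through a single attribute afterwards. The class name `SliceSizePicker` and the `use_mps` flag are hypothetical stand-ins for `CrossAttention` and the `torch.backends.mps.is_available()` check; the values 768 and 2000 are the placeholder values from the test above.

```python
class SliceSizePicker:
    """Pick a slice-size strategy once, at construction time (illustrative only)."""

    def __init__(self, use_mps: bool):
        # Bind one of the two methods to a common attribute name, so callers
        # never need to repeat the device check.
        if use_mps:
            self.compute_slice_size = self._slice_size_mps
        else:
            self.compute_slice_size = self._slice_size_default

    def _slice_size_mps(self):
        return 768

    def _slice_size_default(self):
        return 2000


mps_size = SliceSizePicker(use_mps=True).compute_slice_size()
default_size = SliceSizePicker(use_mps=False).compute_slice_size()
```

Because the branch is taken once per instance, the per-step cost in `forward` is just an attribute lookup and a call.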
I'm doing tests with @Doggettx's end fix, if I understood it correctly.
Also, added some cooling, which is helping.
Results (will be updated).
I'm preparing another release candidate today, and since you seem to be converging on a solution I will happily wait until you give the thumbs up. If and when you feel these fixes are stable, could you post a PR against current HEAD of the development branch? Alternatively, if it looks like there is still a lot of work to do, let me know so that I can release a version containing the original CompVis code, neonpixel changes to attention.py, and MPS fixes.
We have a very simple solution for 64-128GB, and it's currently working. The 2 best options I can see are:
Eventually, we should have fixes for all mem sizes, which can be added as a PR.
I'll post the 'final' code for 64-128GB soon.
As an almost final version for 64-128GB, this is the code with the end fix from @Doggettx. Performance seems pretty similar, but I still decided to leave the fix in.
attention_end_fix.py.zip
model_end_fix.py.zip
Okay, so I've seen @lstein has added
x = x.contiguous() if x.device.type == 'mps' else x
to ldm/modules/attention.py in the doggettx-optimizations branch, but there's another error happening now:
KeyError: 'active_bytes.all.current'
This has to do with a function in attention.py, which is basically the code that detects your free memory and then splits the softmax operation into steps, so that larger images can be generated.
Now, because we are on Mac, I'm not sure @lstein can help us much (unless he has one around), but I'm opening this issue for anyone who wants to collaborate on porting this functionality to M1.