mgcrea closed this issue 2 years ago.
Using the updated copy of forward in this link https://github.com/basujindal/stable-diffusion/pull/117#issuecomment-1236422783 is spectacular.
Apple 8GB Mac Mini M1 (so 8GB unified memory).
I can do 512x576 without swapping (after the models are loaded; loading them always causes a bit of swapping). I've gone up to 704x704 without any of the massive slowdown I got from the split-softmax version once it started to use swap. 704x640 runs at the same speed 512x512 did before the optimisations (swapping is a killer).
How does one test this refactoring-simplet2i branch for the optimized code?
I did git switch -c test origin/refactoring-simplet2i
and ran dream.py, but I get CUDA errors when going over 512x768 (it's the same as main).
Hey, just to add here, maybe it's of help: "Fork for automatic memory allocation"
https://www.reddit.com/r/StableDiffusion/comments/x6dhks/comment/in6gpow/ https://github.com/Doggettx/stable-diffusion/tree/main/ldm
Haven't gotten it to work yet on my system (3070 Ti, 8GB VRAM, refactoring-simplet2i branch [4e45dec]).
Yep, it's the same commit [4e45dec] here and it doesn't work.
How does one test this refactoring-simplet2i branch for the optimized code? I did git switch -c test origin/refactoring-simplet2i
and ran dream.py, but I get CUDA errors when going over 512x768 (it's the same as main).
I usually do a checkout:
% git checkout refactoring-simplet2i
branch 'refactoring-simplet2i' set up to track 'origin/refactoring-simplet2i'.
Switched to a new branch 'refactoring-simplet2i'
Talking of refactoring-simplet2i, @lstein, I think you've lost an MPS fix in attention.py; it needs an extra line at line 215:
237d215
< x = x.contiguous() if x.device.type == 'mps' else x
It's in
class BasicTransformerBlock(nn.Module):
and the second def of forward should look like:
def _forward(self, x, context=None):
x = x.contiguous() if x.device.type == 'mps' else x
x = self.attn1(self.norm1(x)) + x
x = self.attn2(self.norm2(x), context=context) + x
x = self.ff(self.norm3(x)) + x
return x
One of the things no one seems to be discussing is the quality of the images produced. The larger I go, the more distorted and convoluted the results become for me, especially in prompts that involve humanoid characters. Another pattern I've been noticing is that the images tend to produce duplicates of the prompt's subjects within the same image as the resolution grows.
I'm guessing this is because the initial model was trained at 512?
Yes, that's the reason for the duplicates.
I've realized that generating at high res is only really good when your prompt has subjects that are completely abstract and lack any form of coherency. If it is something more definite, you are far better off doing resolutions closer to the model's and then upscaling to get better results.
However, these memory optimizations are really good if and when they release a larger model. Even if it takes more memory, we should be able to run it right off the bat.
Sorry I missed that; I guess it got overwritten when I brought over the second round of attention optimizations. I'll fix it.
Lincoln
Just pushed a fix for @Vargol's bug.
I only managed to increase to 512x832; any further increase still gives CUDA memory errors. Is that normal, @lstein?
git switch -c test origin/refactoring-simplet2i
is the command I used to get that branch (and it seems good).
How did you test those high resolutions with 8GB @blessedcoolant ?
I think it's this rewrite of the forward function, which isn't in refactoring-simplet2i yet, that makes the difference: https://github.com/basujindal/stable-diffusion/pull/117#issuecomment-1236422783
Using the updated copy of forward in this link basujindal#117 (comment) is spectacular.
Apple 8GB Mac Mini M1 (so 8GB unified memory).
I can do 512x576 without swapping (after the models are loaded; loading them always causes a bit of swapping). I've gone up to 704x704 without any of the massive slowdown I got from the split-softmax version once it started to use swap. 704x640 runs at the same speed 512x512 did before the optimisations (swapping is a killer).
M1 with 64 GB RAM.
With the version of def forward(self, x, context=None, mask=None): from the comment you reference:
Peak memory usage (from Activity Monitor): 12.39 GB
Time: 00:40, 1.23 it/s
Without that version:
Peak memory usage (from Activity Monitor): 14.84 GB
Time: 00:29, 1.70 it/s
dream > Anubis the Ancient Egyptian God of Death riding a motorbike in Grand Theft Auto V cover, with palm trees in the background, cover art by Stephen Bliss, artstation, high quality -m ddim -S 1469565
So code performance (measured in time) seems to be dependent on RAM. Meaning someone with 8GB RAM may not suffer a performance hit at all running a modified version, while someone else with 8x the RAM may see a massive hit (like I am seeing).
Maybe we should start creating code versions for 8GB, 32GB, 64GB, etc. (I assume more RAM can take more aggressive approaches). Or, as a second option, allow for a number of flags to customize things for your own machine specs.
The best implementation of the new optimization can now be found here, in this branch by @Doggettx: https://github.com/Doggettx/stable-diffusion/commits/main
@lstein so I think that is the version you should add to this repo too.
It is utilized in my repo too: https://github.com/neonsecret/stable-diffusion
Your repo does not yet have the improvements @Doggettx made for automatically selecting the best number of steps for the resolution you want to render based on the available memory, right? What he explained here: https://github.com/neonsecret/stable-diffusion/commit/52660098a5fdfba824d498d56a32c0733543cd47#commitcomment-83093054
It doesn't matter, as the algorithm by @Doggettx is still being tested and may not work as expected (it may not determine the needed memory correctly).
stats = torch.cuda.memory_stats(q.device)
mem_total = torch.cuda.get_device_properties(0).total_memory
mem_active = stats['active_bytes.all.current']
mem_free = mem_total - mem_active
That's not going to work well on non-CUDA devices.
Yeah, but there's probably also a way to do that which works on non-CUDA devices. If not, then that optimization should simply only be enabled on CUDA devices.
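For illustration, one (untested) way such a fallback could look; this is just a sketch of the idea and not part of @Doggettx's code, and it assumes psutil is available in the environment:

import psutil   # assumption: installed; not part of this repo's requirements
import torch

def free_memory_bytes(device):
    """Rough free-memory estimate that degrades gracefully off CUDA."""
    if device.type == 'cuda':
        stats = torch.cuda.memory_stats(device)
        mem_total = torch.cuda.get_device_properties(device).total_memory
        return mem_total - stats['active_bytes.all.current']
    # MPS / CPU: memory is unified (or plain system RAM), so free system
    # memory is the closest available proxy
    return psutil.virtual_memory().available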
Just to let everyone know, I'm going to wait for things to settle down on the memory optimization front and spend my coding time on getting the refactor branch into development cleanly. There seem to be a number of tradeoffs with regards to execution speed, peak memory usage, and average memory utilisation. I'm hoping that someone will do systematic benchmarking on the various solutions so that we can understand the tradeoffs.
I managed to increase to 512x832, but any further increase still gives CUDA memory errors, is that normal @lstein? How did you test those high resolutions with 8GB @blessedcoolant?
The confounding factor is that video cards installed in personal computers are also doing stuff for the system, such as displaying the desktop windows and running any 3D graphics programs you have open. So the maximum image size you can achieve depends both on how many GB you have and on what else the GPU happens to be doing at the time. This is why I suggest using the peak VRAM usage statistics that are printed after every generation (on CUDA devices) to understand what memory stable diffusion requires.
The VRAM usage does not scale proportionally to the area of the desired image, but seems to follow a cubic polynomial in the linear dimension, so a 1024x1024 image needs about 8 times more free memory than 512x512 does. (You need to correct for a constant memory consumption of about 5G just to load the model.)
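If that cubic rule of thumb holds, a quick way to ballpark it; purely illustrative, the baseline number is something you'd measure on your own card, and the 1.5G figure in the example is hypothetical:

def estimate_vram_gb(width, height, baseline_512_gb, model_overhead_gb=5.0):
    """Ballpark generation VRAM using the cubic-in-linear-size rule of thumb above.

    baseline_512_gb is the per-generation memory you measured yourself for a
    512x512 image (on top of the ~5G model overhead)."""
    linear_scale = (width * height) ** 0.5 / 512   # effective linear size vs 512
    return model_overhead_gb + baseline_512_gb * linear_scale ** 3

# Hypothetical: if 512x512 needed ~1.5G on top of the model,
# 1024x1024 would need about 5 + 1.5 * 8 = 17G.
print(estimate_vram_gb(1024, 1024, 1.5))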
I just ran a test on my Mac Studio M1 Ultra and found a huge difference in execution time between the development and refactoring-simplet2i branches.
The refactoring-simplet2i branch is about 60% slower (79s vs 49s).
1/ git checkout refactoring-simplet2i (commit 3879270e57195df3ff153e58b6475b0cd07a5c78)
dream> "test" -s50 -W704 -H704
100% 50/50 [01:18<00:00, 1.58s/it]
Generating: 100% 1/1 [01:19<00:00, 79.18s/it]
>> Usage stats:
>> 1 image(s) generated in 79.10s
>> Max VRAM used for this generation: 0.00G
Also note that the reported usage stats seem inconsistent with the execution time in this case.
2/ git checkout development (commit 52d8bb2836cf05994ee5e2c5cf9c8d190dac0524)
dream> "test" -s50 -W704 -H704
100% 50/50 [00:47<00:00, 1.04it/s]
Generating: 100% 1/1 [00:49<00:00, 49.24s/it]
>> Usage stats:
>> 1 image(s) generated in 49.27s
>> Max VRAM used for this generation: 0.00G
@lstein said:
video cards installed in personal computers are also doing stuff for the system
Yep, and it's much worse on Windows single-GPU systems, because there's no text-only mode like you could get by shutting down your GUI session on Linux.
Also: this is literally the first time I'm hugging Optimus on my laptop - the shitty internal Intel "GPU" handles all of the desktop crap, leaving max mem free on the NVIDIA device 😁
stats = torch.cuda.memory_stats(q.device)
mem_total = torch.cuda.get_device_properties(0).total_memory
mem_active = stats['active_bytes.all.current']
mem_free = mem_total - mem_active
That's not going to work well on non-CUDA devices
Author's answer on reddit about M1/MPS: I don't see any way in the torch documentation to get memory stats from MPS. You could just remove the whole auto-scaling and set a static number of steps. If you set it to something like 16, you should be able to go really high res, but you'll lose some performance on lower resolutions too. So basically replace this:
stats = torch.cuda.memory_stats(q.device)
mem_total = torch.cuda.get_device_properties(0).total_memory
mem_active = stats['active_bytes.all.current']
mem_free = mem_total - mem_active
mem_required = q.shape[0] * q.shape[1] * k.shape[1] * 4 * 2.5
steps = 1
if mem_required > mem_free:
steps = 2**(math.ceil(math.log(mem_required / mem_free, 2)))
with just
steps = 16
Valid values for steps are 1/2/4/8/16/32/64; the higher you go, the bigger the resolution you can render, but the slower it will run. steps = 1 basically disables the optimization, which is what usually happens at lower resolutions.
The first few steps don't really lower performance much, so if you want a nice balance you could just set it to 4 or 8.
But I get an error after that.
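To make the auto-scaling formula above concrete, here is a hypothetical worked example. The shapes are my assumption (SD v1-style 8 attention heads and a latent that is 1/8 of the image size), so treat the numbers as illustrative only:

import math

# Hypothetical 1024x1024 render: the latent is 128x128, flattened to 16384 tokens,
# with batch 1 and 8 heads -> q.shape[0] = 8, q.shape[1] = k.shape[1] = 16384
q_shape0, q_shape1, k_shape1 = 8, 128 * 128, 128 * 128

mem_required = q_shape0 * q_shape1 * k_shape1 * 4 * 2.5   # ~21.5 GB for the attention matrix
mem_free = 8e9                                             # pretend 8 GB is free

steps = 1
if mem_required > mem_free:
    steps = 2 ** math.ceil(math.log(mem_required / mem_free, 2))
print(steps)  # -> 4, i.e. the attention would be computed in 4 slices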
Author’s answer on reddit about m1/mps:
Probably helpful to link that reddit post here, as it got a lot of good discussion: https://www.reddit.com/r/StableDiffusion/comments/x6dhks/fork_for_automatic_memory_allocation_allows_for/
@netsvetaev @Vargol
stats = torch.cuda.memory_stats(q.device)
mem_total = torch.cuda.get_device_properties(0).total_memory
mem_active = stats['active_bytes.all.current']
mem_free = mem_total - mem_active
That's not going to work well on non-CUDA devices
This may be because on M1 there is no VRAM associated with the device; the memory is unified between the CPU and GPU on the M1 architecture.
Maybe we could add the amount of memory used by the dream process, to help compare across architectures;
something like a "Max RAM used by python process:" line alongside "Max VRAM used for this generation:".
About recording memory usage on Mac: yesterday I tried a couple of libraries (psutil and resource) but wasn't super satisfied with the memory values I got (they seemed too low).
The best value I've got is with top -l 1 | grep "python", which shows the same value as Activity Monitor. This includes all of the memory being used (model initialization too).
The problem is this command seems to take a snapshot, so we'd need to run it mid-execution (?). Not sure if someone knows of an alternative to replace torch.cuda.max_memory_allocated(), etc. on Mac.
psutil.Process(os.getpid()).memory_info().rss / 1024 and resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
are quite consistent. Do they not fit the need?
Maybe Activity Monitor is showing
psutil.Process(os.getpid()).memory_info().vms / 1024
but if that is the case, vms is less accurate than rss, because it does not reflect the actual usage of memory; it is the size of memory the OS has given to a process, which doesn't necessarily mean the process is using all of that memory.
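For anyone comparing these numbers, a small sketch of how rss, vms and ru_maxrss relate. One easy source of "too low" or "too high" values: ru_maxrss is reported in bytes on macOS but in kilobytes on Linux.

import os
import resource
import sys

import psutil

proc = psutil.Process(os.getpid())
info = proc.memory_info()
print(f"rss: {info.rss / 1e9:.2f} GB")   # memory actually resident in RAM
print(f"vms: {info.vms / 1e9:.2f} GB")   # virtual size handed out by the OS

# peak resident size over the whole process lifetime;
# bytes on macOS, kilobytes on Linux
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
divisor = 1e9 if sys.platform == 'darwin' else 1e6
print(f"peak rss: {peak / divisor:.2f} GB")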
You could just change it from mem_free to a definable amount of memory it's allowed to use. The calculation in that version was wrong too, btw; I've since changed it to:
stats = torch.cuda.memory_stats(q.device)
mem_active = stats['active_bytes.all.current']
mem_reserved = stats['reserved_bytes.all.current']
mem_free_cuda, _ = torch.cuda.mem_get_info(torch.cuda.current_device())
mem_free_torch = mem_reserved - mem_active
mem_free_total = mem_free_cuda + mem_free_torch
Maybe interesting to know, you can also calculate the max res you can do at a certain amount of free memory:
max res for a square = sqrt(sqrt(max_bytes_to_use / 4 / 2.5 / 16 * steps) * 64)
for example, I have about 16gb free so that's:
floor(sqrt(sqrt(17000000000 / 10 / 16 * 64) * 64) / 64) * 64 = 2240x2240
or calculate how much memory you need to complete the loop
bytes needed = 16 * (width/8 * height/8) ^ 2 * 4 * 2.5 / steps
that is of course free memory needed at the point the loop starts...
P.S. that's assuming the loop from my fork
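Transcribing those two formulas into a small Python sketch (same constants as quoted above, nothing new; it assumes the slicing loop from @Doggettx's fork):

import math

def max_square_res(free_bytes, steps):
    """Largest square resolution (rounded down to a multiple of 64) per the formula above."""
    res = math.sqrt(math.sqrt(free_bytes / 4 / 2.5 / 16 * steps) * 64)
    return math.floor(res / 64) * 64

def bytes_needed(width, height, steps):
    """Approximate free memory needed at the point the loop starts."""
    return 16 * (width / 8 * height / 8) ** 2 * 4 * 2.5 / steps

print(max_square_res(17_000_000_000, 64))    # -> 2240, matching the example above
print(bytes_needed(2240, 2240, 64) / 1e9)    # -> ~15.4 (GB)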
but if that is the case, vms is less accurate than rss, because it does not reflect the actual usage of memory; it is the size of memory the OS has given to a process, which doesn't necessarily mean the process is using all of that memory
Plus, going by Activity Monitor, Python is frequently using 12 GB of my 8 GB unified memory when running bigger resolutions (i.e. it is using swap).
@i3oc9i
The issue I get with resource is that it always displays the same value:
test -W 256 -H 256 -m ddim -s 40 -S 1919095582
Usage stats: 1 image(s) generated in 8.50s Max VRAM used for this generation: 10.61G
test -W 512 -H 512 -m ddim -s 40 -S 1919095582
Usage stats: 1 image(s) generated in 23.67s Max VRAM used for this generation: 10.61G
I basically changed
print(
f'>> Max VRAM used for this generation:',
'%4.2fG' % (torch.cuda.max_memory_allocated() / 1e9),
)
to
if self.device.type == 'mps':
print(
f'>> Max VRAM used for this generation:',
'%4.2fG' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e9),
)
else:
print(
f'>> Max VRAM used for this generation:',
'%4.2fG' % (torch.cuda.max_memory_allocated() / 1e9),
)
Maybe that's because it's read post-execution and needs to be recorded earlier (?)
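One rough way to at least capture a per-generation peak would be to sample RSS from a background thread while the generation runs; a sketch only, not tied to the actual dream.py internals (generate_fn is a stand-in for whatever does the work), and it assumes psutil is available:

import os
import threading
import time

import psutil

def run_with_peak_rss(generate_fn, interval=0.1):
    """Run generate_fn() while sampling this process's RSS in the background."""
    proc = psutil.Process(os.getpid())
    start_rss = proc.memory_info().rss
    peak = start_rss
    stop = threading.Event()

    def sampler():
        nonlocal peak
        while not stop.is_set():
            peak = max(peak, proc.memory_info().rss)
            time.sleep(interval)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    try:
        result = generate_fn()
    finally:
        stop.set()
        t.join()
    print('>> Peak RSS during generation: %4.2fG (started at %4.2fG)'
          % (peak / 1e9, start_rss / 1e9))
    return result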
OK, I see. I don't think there is a solution. Indeed, when the OS gives memory to the process, that memory stays allocated for the whole duration of the run; it grows when the process requires more, but it is never released back to the OS, at least in Python as far as I know.
Anyway guys, I pushed an update to https://github.com/neonsecret/stable-diffusion with both modes; changes are made through the config, speed is not affected, and there are no stupid memory compromises.
Does it work on Macs?
Probably; I can't test it myself, but I've heard people have used it on Macs.
Just want to add that high res generations are not necessarily incoherent.
https://ps.reddit.com/r/StableDiffusion/comments/x6yeam/1024x576_with_6gb_nice/
Yes, landscapes work great. I've generated some wonderful photos of mountains in the winter.
Lincoln
Anyway guys, I pushed an update to https://github.com/neonsecret/stable-diffusion with both modes; changes are made through the config, speed is not affected, and there are no stupid memory compromises.
Is this the one to use, i.e. on the main branch? https://github.com/neonsecret/stable-diffusion/blob/main/ldm/modules/attention.py
Yes, but it's a bit more complicated now because many modules are affected by the optimization, plus there's a new config, so... well, you can read the latest commit's changes.
@lstein I think it makes more sense for you to look at the implementation of the optimization from @Doggettx: https://github.com/Doggettx/stable-diffusion/commits/main
And for non-CUDA devices, make it possible to set the amount of memory that should be used manually, like he explained in this comment: https://github.com/lstein/stable-diffusion/issues/364#issuecomment-1237964924
Strangely enough, in my own private clone I'm using the optimisation from @Doggettx's fork, hacked back to manually select the number of steps/slices (because I'm on an 8GB Mac), and I seem to get better performance with steps=1 than with 2, 4, 8 or 16, and I'd swear I get the lowest memory usage at 1 step too.
I can second this. I've been testing these changes and they're quite stable, offering significant performance improvements.
On an RTX 3080 that was capable of rendering 512x768 at most using this repo, I am now able to generate 960x1408. That's nearly double the size in each dimension.
A few things to be noted:
@blessedcoolant While going much above 512x512 often breaks coherency with txt2img, it does not break coherency if you do img2img. So what you can and should do now with these high resolutions is: first generate a coherent image at close to 512x512, scale that image up to a higher resolution like 1024x1024 with whatever software you like, and then use that 1024x1024 image as the input for the img2img script, with the exact same prompt and a strength around 0.5. You will then get something that looks like a natively generated 1024x1024 image, without the duplication issues it would have if you generated at 1024x1024 directly. So it's not just great for landscapes, but for all other types of images too.
So if you have a GPU that can go up to 2048x2048 now with these optimizations, you can generate something that looks like a native 2048x2048 SD image through that procedure, while still having the same coherency as a 512x512 image.
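As a concrete sketch of that workflow with dream.py (I'm assuming the usual -I init-image and -f strength flags here; the prompt, seed, and filename are made up):

dream> "a portrait of a knight in ornate armour, highly detailed" -W512 -H512 -s50 -S424242
(upscale the chosen output to 1024x1024 with any external upscaler, e.g. to knight_1024.png, then:)
dream> "a portrait of a knight in ornate armour, highly detailed" -I knight_1024.png -f 0.5 -s50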
That's actually a nice hack. Didn't think of it that way. Let me try some examples out and see how it works.
Edit: It seems to be producing some great results. Will try across a variety of prompts and see how it goes.
I'd love to test out https://github.com/Doggettx/stable-diffusion/commits/main, but I tried just replacing the changed files from that repo into my lstein install, and it breaks.
Are you testing it separately from the lstein fork?
Seconded. I poked at it briefly last night but got nowhere. Can someone post a summary of how to test this?
The easiest way to test the fork from @Doggettx is by just running it on its own. Running it works the same as running the official default SD repo; it just additionally has those optimizations that allow for much higher resolutions.
Copying over only the attention.py from the @Doggettx fork to this lstein fork also works, I think, if you really need to test on this fork, but then you get only part of the optimizations rather than all of them.
- The SD model is meant to work at 512x512. Any resolution above that starts creating repeated renditions of the prompt in various parts of the image. This is fine when you are trying to render something abstract
It also seems to work well with anything that is easy to get from noise and that can be seamless. It may be similar to 3D rendering, where realistic clouds, grass, water and bark textures are made from noise textures. Any nature pictures and textures would probably look good. Anyway, like you said, it's more useful with img2img and post-processing.
Seen on HN, might be interesting to pull into this repo? (The PR looks a bit dirty with a lot of extra changes though.)
https://github.com/basujindal/stable-diffusion/pull/103