mgcrea closed this issue 2 years ago.
Using the updated copy of forward in this link https://github.com/basujindal/stable-diffusion/pull/117#issuecomment-1236422783 is spectacular.
Apple 8GB Mac Mini M1 (so 8GB unified memory).
I can do 512x576 without swapping (after the models are loaded; loading them always causes a bit of swapping). I've gone up to 704x704 without any of the massive slowdown I got from the split-softmax version once it started to use swap. 704x640 runs at the same speed 512x512 did before the optimisations (swapping is a killer).
How does one test this refactoring-simplet2i branch for the optimized code?
I did git switch -c test origin/refactoring-simplet2i
and ran dream.py, but I get CUDA errors when going over 512x768 (it's the same as main).
Hey, just to add here, maybe it's of help: "Fork for automatic memory allocation"
https://www.reddit.com/r/StableDiffusion/comments/x6dhks/comment/in6gpow/ https://github.com/Doggettx/stable-diffusion/tree/main/ldm
Haven't gotten it to work yet on my system (3070 Ti, 8GB VRAM, refactoring-simplet2i branch [4e45dec]).
Yep, it's the same commit [4e45dec] here and it doesn't work.
How does one test this refactoring-simplet2i branch for the optimized code? I did git switch -c test origin/refactoring-simplet2i
and ran dream.py, but I get CUDA errors when going over 512x768 (it's the same as main).
I usually do a checkout:
% git checkout refactoring-simplet2i
branch 'refactoring-simplet2i' set up to track 'origin/refactoring-simplet2i'.
Switched to a new branch 'refactoring-simplet2i'
Talking of refactoring-simplet2i, @lstein, I think you've lost an MPS fix in attention.py; it needs an extra line at line 215:
237d215
< x = x.contiguous() if x.device.type == 'mps' else x
It's in
class BasicTransformerBlock(nn.Module):
and the second def of forward should look like:
def _forward(self, x, context=None):
x = x.contiguous() if x.device.type == 'mps' else x
x = self.attn1(self.norm1(x)) + x
x = self.attn2(self.norm2(x), context=context) + x
x = self.ff(self.norm3(x)) + x
return x
One of the things no one seems to be discussing is the quality of the images produced. The larger I go, the more distorted and convoluted the results become for me, especially in prompts that involve humanoid characters. Another pattern I've been noticing is that the images tend to produce duplicates of the prompt's subjects within the same image as the resolution grows.
I'm guessing this is because the initial model was trained at 512?
Yes, that's the reason for the duplicates.
I've realized that generating at high res is only really good when your prompt has subjects that are completely abstract and lack any form of coherency. If it is something more definite, you are far better off doing resolutions closer to the model's and then upscaling to get better results.
However, these memory optimizations are really good if and when they release a larger model. Even if it takes more memory, we should be able to run it right off the bat.
Sorry I missed that; I guess it got overwritten when I brought over the second round of attention optimizations. I'll fix it.
Lincoln
Just pushed a fix for @Vargol's bug.
I only managed to increase to 512x832; any further increase still gives CUDA memory errors. Is that normal, @lstein?
git switch -c test origin/refactoring-simplet2i
is the command I used to get that branch (and it seems good).
How did you test those high resolutions with 8GB @blessedcoolant ?
I think it's this rewrite of the forward function, which isn't in refactoring-simplet2i yet, that makes the difference: https://github.com/basujindal/stable-diffusion/pull/117#issuecomment-1236422783
Using the updated copy of forward in this link basujindal#117 (comment) is spectacular.
Apple 8GB Mac Mini M1 (so 8GB unified memory).
I can do 512x576 without swapping (after the models are loaded; loading them always causes a bit of swapping). I've gone up to 704x704 without any of the massive slowdown I got from the split-softmax version once it started to use swap. 704x640 runs at the same speed 512x512 did before the optimisations (swapping is a killer).
M1 with 64 GB RAM.
With the version of def forward(self, x, context=None, mask=None): from the comment you reference:
Peak memory usage (from Activity Monitor): 12.39 GB
Time: 00:40, 1.23 it/s
Without that version:
Peak memory usage (from Activity Monitor): 14.84 GB
Time: 00:29, 1.70 it/s
dream > Anubis the Ancient Egyptian God of Death riding a motorbike in Grand Theft Auto V cover, with palm trees in the background, cover art by Stephen Bliss, artstation, high quality -m ddim -S 1469565
So code performance (measured in time) seems to be dependent on RAM. Meaning someone with 8GB RAM may not suffer a performance hit at all running a modified version, while someone else with 8x the RAM may see a massive hit (like I am seeing).
Maybe we should start creating code versions for 8GB, 32GB, 64GB, etc. (I assume more RAM can take more aggressive approaches). Or, as a second option, allow for a number of flags to customize things for your own machine specs.
The best implementation of the new optimization can now be found here, in this branch by @Doggettx: https://github.com/Doggettx/stable-diffusion/commits/main
@lstein so I think that is the version you should add to this repo too.
It is utilized in my repo too: https://github.com/neonsecret/stable-diffusion
Your repo does not yet have the improvements @Doggettx made for automatically selecting the best number of steps for the resolution you want to render based on the available memory, right? What he explained here: https://github.com/neonsecret/stable-diffusion/commit/52660098a5fdfba824d498d56a32c0733543cd47#commitcomment-83093054
It doesn't matter, as the algorithm by @Doggettx is still being tested and may not work as expected (it may not determine the needed memory correctly).
stats = torch.cuda.memory_stats(q.device)
mem_total = torch.cuda.get_device_properties(0).total_memory
mem_active = stats['active_bytes.all.current']
mem_free = mem_total - mem_active
That's not going to work well on non-CUDA devices.
Yeah, but there's probably also a way to do that which works on non-CUDA devices. If not, then that optimization should simply only be enabled on CUDA devices.
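For illustration, one (untested) way such a fallback could look; this is just a sketch of the idea and not part of @Doggettx's code, and it assumes psutil is available in the environment:

import psutil   # assumption: installed; not part of this repo's requirements
import torch

def free_memory_bytes(device):
    """Rough free-memory estimate that degrades gracefully off CUDA."""
    if device.type == 'cuda':
        stats = torch.cuda.memory_stats(device)
        mem_total = torch.cuda.get_device_properties(device).total_memory
        return mem_total - stats['active_bytes.all.current']
    # MPS / CPU: memory is unified (or plain system RAM), so free system
    # memory is the closest available proxy
    return psutil.virtual_memory().available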
Just to let everyone know, I'm going to wait for things to settle down on the memory optimization front and spend my coding time on getting the refactor branch into development cleanly. There seem to be a number of tradeoffs with regards to execution speed, peak memory usage, and average memory utilisation. I'm hoping that someone will do systematic benchmarking on the various solutions so that we can understand the tradeoffs.
I managed to increase to 512x832, but any further increase still gives CUDA memory errors, is that normal @lstein? How did you test those high resolutions with 8GB @blessedcoolant?
The confounding factor is that video cards installed in personal computers are also doing stuff for the system, such as displaying the desktop windows and running any 3D graphics programs you have open. So the maximum image size you can achieve depends both on how many GB you have and on what else the GPU happens to be doing at the time. This is why I suggest using the peak VRAM usage statistics that are printed after every generation (on CUDA devices) to understand what memory stable diffusion requires.
The VRAM usage does not scale proportionally to the area of the desired image, but seems to follow a cubic polynomial in the linear dimension, so a 1024x1024 image needs about 8 times more free memory than 512x512 does. (You need to correct for a constant memory consumption of about 5G just to load the model.)
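If that cubic rule of thumb holds, a quick way to ballpark it; purely illustrative, the baseline number is something you'd measure on your own card, and the 1.5G figure in the example is hypothetical:

def estimate_vram_gb(width, height, baseline_512_gb, model_overhead_gb=5.0):
    """Ballpark generation VRAM using the cubic-in-linear-size rule of thumb above.

    baseline_512_gb is the per-generation memory you measured yourself for a
    512x512 image (on top of the ~5G model overhead)."""
    linear_scale = (width * height) ** 0.5 / 512   # effective linear size vs 512
    return model_overhead_gb + baseline_512_gb * linear_scale ** 3

# Hypothetical: if 512x512 needed ~1.5G on top of the model,
# 1024x1024 would need about 5 + 1.5 * 8 = 17G.
print(estimate_vram_gb(1024, 1024, 1.5))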
I just ran a test on my Mac Studio M1 Ultra and found a huge difference in execution time between the development and refactoring-simplet2i branches.
The refactoring-simplet2i branch is about 60% slower (79s vs 49s).
1/ git checkout refactoring-simplet2i (commit 3879270e57195df3ff153e58b6475b0cd07a5c78)
dream> "test" -s50 -W704 -H704
100% 50/50 [01:18<00:00, 1.58s/it]
Generating: 100% 1/1 [01:19<00:00, 79.18s/it]
>> Usage stats:
>> 1 image(s) generated in 79.10s
>> Max VRAM used for this generation: 0.00G
Also note that the reported usage stats seem inconsistent with the execution time in this case.
2/ git checkout development (commit 52d8bb2836cf05994ee5e2c5cf9c8d190dac0524)
dream> "test" -s50 -W704 -H704
100% 50/50 [00:47<00:00, 1.04it/s]
Generating: 100% 1/1 [00:49<00:00, 49.24s/it]
>> Usage stats:
>> 1 image(s) generated in 49.27s
>> Max VRAM used for this generation: 0.00G
@lstein said:
video cards installed in personal computers are also doing stuff for the system
Yep, and it's much worse on Windows single-GPU systems, because there's no text-only mode like you could get by shutting down your GUI session on Linux.
Also: this is literally the first time I'm hugging Optimus on my laptop - the shitty internal Intel "GPU" handles all of the desktop crap, leaving max mem free on the NVIDIA device 😁
stats = torch.cuda.memory_stats(q.device)
mem_total = torch.cuda.get_device_properties(0).total_memory
mem_active = stats['active_bytes.all.current']
mem_free = mem_total - mem_active
That's not going to work well on non-CUDA devices
Author's answer on reddit about M1/MPS: I don't see any way in the torch documentation to get memory stats from MPS. You could just remove the whole auto-scaling and set a static number of steps. If you set it to something like 16, you should be able to go really high res, but you'll lose some performance on lower resolutions too. So basically replace this:
stats = torch.cuda.memory_stats(q.device)
mem_total = torch.cuda.get_device_properties(0).total_memory
mem_active = stats['active_bytes.all.current']
mem_free = mem_total - mem_active
mem_required = q.shape[0] * q.shape[1] * k.shape[1] * 4 * 2.5
steps = 1
if mem_required > mem_free:
steps = 2**(math.ceil(math.log(mem_required / mem_free, 2)))
with just
steps = 16
Valid values for steps are 1/2/4/8/16/32/64; the higher you go, the bigger the resolution you can render, but the slower it will run. steps = 1 basically disables the optimization, which is what usually happens at lower resolutions.
The first few steps don't really lower performance much, so if you want a nice balance you could just set it to 4 or 8.
But I get an error after that.
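To make the auto-scaling formula above concrete, here is a hypothetical worked example. The shapes are my assumption (SD v1-style 8 attention heads and a latent that is 1/8 of the image size), so treat the numbers as illustrative only:

import math

# Hypothetical 1024x1024 render: the latent is 128x128, flattened to 16384 tokens,
# with batch 1 and 8 heads -> q.shape[0] = 8, q.shape[1] = k.shape[1] = 16384
q_shape0, q_shape1, k_shape1 = 8, 128 * 128, 128 * 128

mem_required = q_shape0 * q_shape1 * k_shape1 * 4 * 2.5   # ~21.5 GB for the attention matrix
mem_free = 8e9                                             # pretend 8 GB is free

steps = 1
if mem_required > mem_free:
    steps = 2 ** math.ceil(math.log(mem_required / mem_free, 2))
print(steps)  # -> 4, i.e. the attention would be computed in 4 slices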
Author’s answer on reddit about m1/mps:
Probably helpful to link that reddit post here, as it got a lot of good discussion: https://www.reddit.com/r/StableDiffusion/comments/x6dhks/fork_for_automatic_memory_allocation_allows_for/
@netsvetaev @Vargol
stats = torch.cuda.memory_stats(q.device)
mem_total = torch.cuda.get_device_properties(0).total_memory
mem_active = stats['active_bytes.all.current']
mem_free = mem_total - mem_active
That's not going to work well on non-CUDA devices
This may be because on M1 there is no VRAM associated with the device; the memory is unified between the CPU and GPU on the M1 architecture.
Maybe we could add the amount of memory used by the dream process, to help compare across architectures;
something like a "Max RAM used by python process:" line alongside "Max VRAM used for this generation:".
About recording memory usage on Mac: yesterday I tried a couple of libraries (psutil and resource) but wasn't super satisfied with the memory values I got (they seemed too low).
The best value I've got is with top -l 1 | grep "python", which shows the same value as Activity Monitor. This includes all of the memory being used (model initialization too).
The problem is this command seems to take a snapshot, so we'd need to run it mid-execution (?). Not sure if someone knows of an alternative to replace torch.cuda.max_memory_allocated(), etc. on Mac.
psutil.Process(os.getpid()).memory_info().rss / 1024 and resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
are quite consistent. Do they not fit the need?
Maybe Activity Monitor is showing
psutil.Process(os.getpid()).memory_info().vms / 1024
but if that is the case, vms is less accurate than rss, because it does not reflect the actual usage of memory; it is the size of memory the OS has given to a process, which doesn't necessarily mean the process is using all of that memory.
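For anyone comparing these numbers, a small sketch of how rss, vms and ru_maxrss relate. One easy source of "too low" or "too high" values: ru_maxrss is reported in bytes on macOS but in kilobytes on Linux.

import os
import resource
import sys

import psutil

proc = psutil.Process(os.getpid())
info = proc.memory_info()
print(f"rss: {info.rss / 1e9:.2f} GB")   # memory actually resident in RAM
print(f"vms: {info.vms / 1e9:.2f} GB")   # virtual size handed out by the OS

# peak resident size over the whole process lifetime;
# bytes on macOS, kilobytes on Linux
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
divisor = 1e9 if sys.platform == 'darwin' else 1e6
print(f"peak rss: {peak / divisor:.2f} GB")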
You could just change it from mem_free to a definable amount of memory it's allowed to use. The calculation in that version was wrong too, btw; I've since changed it to:
stats = torch.cuda.memory_stats(q.device)
mem_active = stats['active_bytes.all.current']
mem_reserved = stats['reserved_bytes.all.current']
mem_free_cuda, _ = torch.cuda.mem_get_info(torch.cuda.current_device())
mem_free_torch = mem_reserved - mem_active
mem_free_total = mem_free_cuda + mem_free_torch
Maybe interesting to know, you can also calculate the max res you can do at a certain amount of free memory:
max res for a square = sqrt(sqrt(max_bytes_to_use / 4 / 2.5 / 16 * steps) * 64)
for example, I have about 16gb free so that's:
floor(sqrt(sqrt(17000000000 / 10 / 16 * 64) * 64) / 64) * 64 = 2240x2240
or calculate how much memory you need to complete the loop
bytes needed = 16 * (width/8 * height/8) ^ 2 * 4 * 2.5 / steps
that is of course free memory needed at the point the loop starts...
P.S. that's assuming the loop from my fork
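Transcribing those two formulas into a small Python sketch (same constants as quoted above, nothing new; it assumes the slicing loop from @Doggettx's fork):

import math

def max_square_res(free_bytes, steps):
    """Largest square resolution (rounded down to a multiple of 64) per the formula above."""
    res = math.sqrt(math.sqrt(free_bytes / 4 / 2.5 / 16 * steps) * 64)
    return math.floor(res / 64) * 64

def bytes_needed(width, height, steps):
    """Approximate free memory needed at the point the loop starts."""
    return 16 * (width / 8 * height / 8) ** 2 * 4 * 2.5 / steps

print(max_square_res(17_000_000_000, 64))    # -> 2240, matching the example above
print(bytes_needed(2240, 2240, 64) / 1e9)    # -> ~15.4 (GB)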
but if that is the case, vms is less accurate than rss, because it does not reflect the actual usage of memory; it is the size of memory the OS has given to a process, which doesn't necessarily mean the process is using all of that memory
Plus, going by Activity Monitor, Python is frequently using 12 GB of my 8 GB unified memory when running bigger resolutions (i.e. it is using swap).
@i3oc9i
The issue I get with resource is that it always displays the same value:
test -W 256 -H 256 -m ddim -s 40 -S 1919095582
Usage stats: 1 image(s) generated in 8.50s Max VRAM used for this generation: 10.61G
test -W 512 -H 512 -m ddim -s 40 -S 1919095582
Usage stats: 1 image(s) generated in 23.67s Max VRAM used for this generation: 10.61G
I basically changed
print(
f'>> Max VRAM used for this generation:',
'%4.2fG' % (torch.cuda.max_memory_allocated() / 1e9),
)
to
if self.device.type == 'mps':
print(
f'>> Max VRAM used for this generation:',
'%4.2fG' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e9),
)
else:
print(
f'>> Max VRAM used for this generation:',
'%4.2fG' % (torch.cuda.max_memory_allocated() / 1e9),
)
Maybe that's because it's read post-execution and needs to be recorded earlier (?)
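One rough way to at least capture a per-generation peak would be to sample RSS from a background thread while the generation runs; a sketch only, not tied to the actual dream.py internals (generate_fn is a stand-in for whatever does the work), and it assumes psutil is available:

import os
import threading
import time

import psutil

def run_with_peak_rss(generate_fn, interval=0.1):
    """Run generate_fn() while sampling this process's RSS in the background."""
    proc = psutil.Process(os.getpid())
    start_rss = proc.memory_info().rss
    peak = start_rss
    stop = threading.Event()

    def sampler():
        nonlocal peak
        while not stop.is_set():
            peak = max(peak, proc.memory_info().rss)
            time.sleep(interval)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    try:
        result = generate_fn()
    finally:
        stop.set()
        t.join()
    print('>> Peak RSS during generation: %4.2fG (started at %4.2fG)'
          % (peak / 1e9, start_rss / 1e9))
    return result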
OK, I see. I don't think there is a solution. Indeed, when the OS gives memory to the process, that memory stays allocated for the whole duration of the run; it grows when the process requires more, but it is never released back to the OS, at least in Python as far as I know.
Anyway guys, I pushed an update to https://github.com/neonsecret/stable-diffusion with both modes; changes are made through the config, speed is not affected, and there are no stupid memory compromises.
Does it work on Macs?
Probably; I can't test it myself, but I've heard people have used it on Macs.
Just want to add that high res generations are not necessarily incoherent.
https://ps.reddit.com/r/StableDiffusion/comments/x6yeam/1024x576_with_6gb_nice/
Yes, landscapes work great. I've generated some wonderful photos of mountains in the winter.
Lincoln
Anyway guys, I pushed an update to https://github.com/neonsecret/stable-diffusion with both modes; changes are made through the config, speed is not affected, and there are no stupid memory compromises.
Is this the one to use, i.e. on the main branch? https://github.com/neonsecret/stable-diffusion/blob/main/ldm/modules/attention.py
Yes, but it's a bit more complicated now because many modules are affected by the optimization, plus there's a new config, so... well, you can read the latest commit's changes.
@lstein I think it makes more sense for you to look at the implementation of the optimization from @Doggettx: https://github.com/Doggettx/stable-diffusion/commits/main
And for non-CUDA devices, make it possible to set the amount of memory that should be used manually, like he explained in this comment: https://github.com/lstein/stable-diffusion/issues/364#issuecomment-1237964924
Strangely enough, in my own private clone I'm using the optimisation from @Doggettx's fork, hacked back to manually select the number of steps/slices (because I'm on an 8GB Mac), and I seem to get better performance with steps=1 than with 2, 4, 8 or 16, and I'd swear I get the lowest memory usage at 1 step too.
I can second this. I've been testing these changes and they're quite stable, offering significant performance improvements.
On an RTX 3080 that was capable of rendering 512x768 at most using this repo, I am now able to generate 960x1408. That's nearly double the size in each dimension.
A few things to be noted:
@blessedcoolant While going much above 512x512 often breaks coherency with txt2img, it does not break coherency if you do img2img. So what you can and should do now with these high resolutions is: first generate a coherent image at close to 512x512, scale that image up to a higher resolution like 1024x1024 with whatever software you like, and then use that 1024x1024 image as the input for the img2img script, with the exact same prompt and a strength around 0.5. You will then get something that looks like a natively generated 1024x1024 image, without the duplication issues it would have if you generated at 1024x1024 directly. So it's not just great for landscapes, but for all other types of images too.
So if you have a GPU that can go up to 2048x2048 now with these optimizations, you can generate something that looks like a native 2048x2048 SD image through that procedure, while still having the same coherency as a 512x512 image.
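As a concrete sketch of that workflow with dream.py (I'm assuming the usual -I init-image and -f strength flags here; the prompt, seed, and filename are made up):

dream> "a portrait of a knight in ornate armour, highly detailed" -W512 -H512 -s50 -S424242
(upscale the chosen output to 1024x1024 with any external upscaler, e.g. to knight_1024.png, then:)
dream> "a portrait of a knight in ornate armour, highly detailed" -I knight_1024.png -f 0.5 -s50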
That's actually a nice hack. Didn't think of it that way. Let me try some examples out and see how it works.
Edit: It seems to be producing some great results. Will try across a variety of prompts and see how it goes.
I'd love to test out https://github.com/Doggettx/stable-diffusion/commits/main, but I tried just replacing the changed files from that repo into my lstein install, and it breaks.
Are you testing it separately from the lstein fork?
Seconded. I poked at it briefly last night but got nowhere. Can someone post a summary of how to test this?
The easiest way to test the fork from @Doggettx is by just running it on its own. Running it works the same as running the official default SD repo; it just additionally has those optimizations that allow for much higher resolutions.
Copying over only the attention.py from the @Doggettx fork to this lstein fork also works, I think, if you really need to test on this fork, but then you get only part of the optimizations rather than all of them.
- The SD model is meant to work at 512x512. Any resolution above that starts creating repeated renditions of the prompt in various parts of the image. This is fine when you are trying to render something abstract
It also seems to work well with anything that is easy to get from noise and that can be seamless. It may be similar to 3D rendering, where realistic clouds, grass, water and bark textures are made from noise textures. Any nature pictures and textures would probably look good. Anyway, like you said, it's more useful with img2img and post-processing.
Seen on HN, might be interesting to pull into this repo? (The PR looks a bit dirty with a lot of extra changes though.)
https://github.com/basujindal/stable-diffusion/pull/103