invoke-ai / InvokeAI

InvokeAI is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, supports terminal use through a CLI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

CUDA out of memory #42

Closed thezveroboy closed 1 year ago

thezveroboy commented 1 year ago

Still trying to use this version of SD. When I type python scripts\dream.py I get an error:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 3.46 GiB already allocated; 0 bytes free; 3.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is there a way to use optimizations for low-VRAM cards, or is this version of SD not intended for that?
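
(As an aside, and not an official recommendation from this repo: the error message itself points at PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of setting it before PyTorch initializes its CUDA allocator follows; the 128 MB value is purely illustrative, and on a 4GB card this only mitigates fragmentation rather than adding capacity.)

```python
import os

# Set the allocator option before torch first touches CUDA; the value is
# illustrative, and this only mitigates fragmentation, it does not add VRAM.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the env var so the caching allocator picks it up

print(torch.cuda.get_device_properties(0).total_memory / 2**30, "GiB total")
```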

bmaltais commented 1 year ago

Anything with less than 12GB of GPU memory will be hard to get working...

morganavr commented 1 year ago

Even the optimized version of SD from https://github.com/basujindal/stable-diffusion, which runs partially on the CPU, used around 6GB of RAM, I believe, on my RTX 2080. This branch uses 6.8/8GB to generate a single 512x512 image at -s 50.

tildebyte commented 1 year ago

> Anything with less than 12GB of GPU memory will be hard to get working...

I have a mobile RTX 2070 w/ 8G VRAM, and I have no issues generating 512x768 images.

OTOH, this is an Optimus system, and the NVIDIA sysinfo shows "Total available graphics memory" of 24G, although I've never heard of "real" VRAM being shared on systems like these... 🤷

Either way, 4G VRAM is almost certainly too little, although I did see mentions of a fork which is somehow "paging" the SD model into and out of VRAM (obviously generations take longer) - wish I could find a link to it.
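
(For context, the "paging" being described is roughly: move submodules onto the GPU only while they are needed, then back to the CPU. A minimal, hypothetical sketch of that idea, not the fork's actual code:)

```python
import torch
import torch.nn as nn
from contextlib import contextmanager

@contextmanager
def on_gpu(module: nn.Module):
    # Hypothetical helper: "page" a module onto the GPU for the duration of a
    # block, then move it back to the CPU and release cached VRAM afterwards.
    # Trades generation speed for memory, as noted above.
    module.cuda()
    try:
        yield module
    finally:
        module.cpu()
        torch.cuda.empty_cache()

# Usage with a stand-in module:
block = nn.Linear(512, 512)
with on_gpu(block) as b:
    out = b(torch.randn(1, 512, device="cuda"))
```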

thezveroboy commented 1 year ago

Thanks for all the comments, but the optimized SD works perfectly for me. Let's wait for an answer from the respected author of this version.

morganavr commented 1 year ago

> Is there a way to use optimizations for low-VRAM cards, or is this version of SD not intended for that?

I would extend this request not only to low-end cards but to 8GB VRAM cards as well. For example, on my RTX 2080 I can only generate 512x512 images, and it would be fantastic if it were possible to generate 512x768, because this size (and 512x704) is used very frequently for creating portraits.

JigenD commented 1 year ago

I just noticed there has been a regression in memory usage in lstein specifically. I can no longer create 640x640 images with the half model, whereas I could before. I can still do so utilizing k_euler_ancestor sampling in my own script.

BlueAmulet commented 1 year ago

RTX 3050 8GB here, still able to easily do 768x512. Try to get your VRAM usage as low as possible before trying to generate images. In Task Manager's Details view, you can right-click the headers (where it says Name, CPU, Memory, etc.) and hit Select Columns. Add the "Dedicated GPU memory" column and you can see what is taking up your VRAM.
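
(As a side note, the same information Task Manager shows can be read from Python via torch.cuda; a minimal sketch, assuming a recent PyTorch with torch.cuda.mem_get_info available:)

```python
import torch

# Free/total VRAM as reported by the CUDA driver (includes other processes),
# plus what this PyTorch process has allocated and reserved.
free_b, total_b = torch.cuda.mem_get_info()
print(f"free:      {free_b / 2**30:.2f} GiB")
print(f"total:     {total_b / 2**30:.2f} GiB")
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
```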

EDIT: NVM, after pulling in the latest commits, this resolution no longer works for me, something did get worse in terms of memory usage.

Narrowed it down to the following commit: 56f7b0f434c52ae58c8fd5d9e79f3a1e69d9adba. After reverting it, I no longer have issues generating 768x512.

JigenD commented 1 year ago

On 10GB I could still do 768x512, but after commit c24a16ccb02343e0fe4565f3cf1ca99113f9cc31 I am no longer able to do 640x640 (slightly more pixels).

I know for absolute certain that is the commit: all commits before it work fine, all commits after do not.

Edit: to be clear, it seems completely independent of which applications are loaded and is 100% consistent for me. So I can simply use the commit just prior to the one above for now!

BlueAmulet commented 1 year ago

That commit seems to encompass the same changes as the one I linked (probably some merge funkiness), but it definitely seems like #44 has issues.

morganavr commented 1 year ago

Did some experimentation. I reverted commits c24a16c and 56f7b0f. Before launching dream.py, VRAM usage was 0.2/8 GB; after launching it, 4.8/8 GB.

Image generation (after each generation I closed the script to release VRAM back to 0.2/8 GB):

512x512px: VRAM increased from 4.8 to 6 GB
512x576px: VRAM increased from 4.8 to 6.5 GB
512x640px: VRAM increased from 4.8 to 7.1 GB
512x704px: <--- FAILED, CUDA NOT ENOUGH RAM

If this pattern of +0.5GB per additional 64px holds, VRAM usage should have increased to about 7.6GB in Task Manager, but it failed, saying that only 7.34GB of the 8GB was available. Is it possible that some hidden motherboard BIOS process or driver reserved those 640MB of VRAM?
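
(A quick back-of-the-envelope check of that extrapolation, using only the numbers reported above rather than any general model of SD memory use:)

```python
# Observed peak VRAM (GB) per image height at width 512, from the runs above.
observed_gb = {512: 6.0, 576: 6.5, 640: 7.1}

heights = sorted(observed_gb)
# Average growth per +64 px step across the observed runs.
per_step = (observed_gb[heights[-1]] - observed_gb[heights[0]]) / (len(heights) - 1)

predicted_704 = observed_gb[640] + per_step
print(f"~{per_step:.2f} GB per +64 px; predicted peak for 512x704: ~{predicted_704:.2f} GB")
# ~0.55 GB per step -> ~7.65 GB predicted, against ~7.34 GB actually reported as available.
```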

I have no idea @BlueAmulet how you are able to generate 512x768 images with 8GB of VRAM...

BlueAmulet commented 1 year ago

Seems related to the removal of model.cuda() from _load_model_from_config:

If the model is moved to the GPU before model.half(), then 768x512 generates fine.
If the model is moved after model.half(), then 768x512 runs out of memory.
Even the funky model.to(device=self.device, dtype=torch.float16) to do both at the same time didn't work out.

Perhaps this line should be restored as model.to(self.device), to accommodate the changes fixing forced CUDA usage.
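
(A minimal, self-contained sketch of the ordering being described, using a stand-in module rather than InvokeAI's actual _load_model_from_config:)

```python
import torch
import torch.nn as nn

# Stand-in for the SD model; the point is only the order of the two calls below.
model = nn.Sequential(nn.Conv2d(4, 320, kernel_size=3, padding=1), nn.SiLU())

model.cuda()   # 1. move the fp32 weights onto the GPU first ...
model.half()   # 2. ... then cast them to fp16 in place
model.eval()

# Reversing the two calls (half() on the CPU, then cuda()) is the ordering that
# reportedly regressed peak VRAM after the linked commits.
print(next(model.parameters()).device, next(model.parameters()).dtype)
```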

morganavr commented 1 year ago

For 512x768 I use https://github.com/basujindal/stable-diffusion fork - works like a charm!

JigenD commented 1 year ago

Just verified BlueAmulet's comment. Restoring model.cuda() in _load_model_from_config fixed the issue. That seems more like a PyTorch bug than an incorrect implementation, but it really does save actual VRAM...

abloch0 commented 1 year ago

I can also confirm that adding model.cuda() above model.eval() fixed the issue in _load_model_from_config.