lllyasviel / stable-diffusion-webui-forge


OOM When switching models #2127

Open shaun-ba opened 1 month ago

shaun-ba commented 1 month ago

24GB VRAM 3090, 32GB RAM

Is this almost expected behavior when changing a base model? It happens 99% of the time for me. I've tried so many different combinations of settings and every one crashes. This is particularly annoying because I use (or would like to use) Forge remotely, and every time this happens I can't use it again until I'm back at the PC.

Specifically, the error is `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU`
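
For context, a quick way to see how much VRAM is actually free versus cached by PyTorch at the moment this happens (a small diagnostic sketch using standard `torch.cuda` calls, nothing Forge-specific):

```python
import torch

free, total = torch.cuda.mem_get_info()    # what the driver says is free/total
allocated = torch.cuda.memory_allocated()  # bytes held by live tensors
reserved = torch.cuda.memory_reserved()    # bytes held in PyTorch's cache

print(f"free {free / 2**20:.0f} MiB of {total / 2**20:.0f} MiB, "
      f"allocated {allocated / 2**20:.0f} MiB, reserved {reserved / 2**20:.0f} MiB")
```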

I've tried the Never OOM options for UNet and VAE, and they do seem to work, but I'm unsure of the downsides. I've tried to Google this but cannot find any information, and I'm unsure why it wouldn't be baked into the backend if there weren't downsides.

Shouldn't Forge just unload any loaded models before loading a new one? You don't offer queuing like Comfy does, so why would two base models ever need to be loaded at once?
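
For illustration, this is roughly what an explicit unload looks like in plain PyTorch (a minimal sketch, not Forge's actual internals): drop the reference to the old checkpoint, collect garbage, then release the cached VRAM back to the driver.

```python
import gc
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096).to("cuda")  # stand-in for a loaded checkpoint

# Explicit unload: drop the reference, collect garbage, release the CUDA cache.
model = model.to("cpu")
del model
gc.collect()
torch.cuda.empty_cache()

print(f"{torch.cuda.memory_allocated() / 2**20:.0f} MiB still allocated")  # ~0
```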

I love the simplicity of Forge, but I have never once had an OOM with Comfy, so surely something is wrong here.

DenOfEquity commented 1 month ago

There are settings that control how many models can be kept loaded at once:

- Settings > Stable Diffusion, for checkpoints
- Settings > Extra Networks, for LoRAs
- Settings > ControlNet

There is also the command-line option `--always-offload-from-vram`.

shaun-ba commented 1 month ago

@DenOfEquity While testing other things, I wondered: could this also be related to me trying to use Async and Shared for swap? I've read the explanation of these settings a few times and I'm still not 100% sure I should be using them on my setup.

However, I then found a resource mentioning shared GPU memory, which said that the Nvidia drivers on Linux cannot do this at all. So was this code tailored for Windows users, is that right?

I have 64GB RAM and 24GB VRAM and just don't think I'm using them as efficiently as possible.

DenOfEquity commented 1 month ago

I can't claim any deep knowledge in this area. Async and Shared can both give improved performance, but they can also go awry; the safest settings are Queue and CPU. The best explanation is lllyasviel's, in #981. Also, make sure you have a large swapfile. It could be worth lowering GPU Weights a little at a time, to make sure there is enough VRAM left for inference. By the way, Never OOM for UNet forces the lowest VRAM usage: model layers are loaded to the GPU one at a time. That is useful for users with very low VRAM or when upscaling very large images, but there is a big performance hit if the model could have been fully loaded.
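
Conceptually, that layer-at-a-time mode works something like the sketch below (a simplification, not Forge's actual code): only the block currently executing sits in VRAM, everything else stays in system RAM.

```python
import torch

def run_offloaded(blocks, x, device="cuda"):
    """Run a sequence of model blocks, keeping only one on the GPU at a time."""
    for block in blocks:              # e.g. the UNet's down/mid/up blocks
        block.to(device)              # copy this block's weights into VRAM
        with torch.no_grad():
            x = block(x.to(device))   # run just this block
        block.to("cpu")               # move the weights back out of VRAM
        torch.cuda.empty_cache()      # release the cache before the next block
    return x
```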

shaun-ba commented 1 month ago

I also set up 32GB of swap, but it isn't being used at all, since I have 64GB of RAM that is only about half utilised. VRAM is at 100%.
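
To put numbers on that, this is the kind of snapshot I mean (a sketch; `psutil` isn't part of Forge, it's just a convenient way to read RAM and swap from Python):

```python
import psutil
import torch

vm = psutil.virtual_memory()
sm = psutil.swap_memory()

print(f"RAM  {vm.used / 2**30:.1f} / {vm.total / 2**30:.1f} GiB used")
print(f"swap {sm.used / 2**30:.1f} / {sm.total / 2**30:.1f} GiB used")
print(f"VRAM {torch.cuda.memory_allocated() / 2**30:.1f} GiB allocated by PyTorch")
```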

Things aren't slow, I'd just like them faster if that's possible, and as above I don't feel the system is being used to its full potential.

shaun-ba commented 1 month ago

Strangely, I still get the occasional OOM when running the same model, one LoRA, and a batch size of 4.