huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Massive SD XL memory spike #4164

Closed patrickvonplaten closed 11 months ago

patrickvonplaten commented 1 year ago

Issue with SD-XL: a massive memory spike on the second run. The first run is fine, but the second run spikes memory to full VRAM, and then it stays constant.

eliphatfs commented 1 year ago

I ran into similar behavior in one of my models and found it to be a memory fragmentation issue. Sometimes the PyTorch CUDA cache manager fails to recognize the allocation patterns and produces a lot of fragmentation, even if the control flow is unchanged between runs. I don't know if that is relevant in this case.
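
One mitigation that is sometimes worth trying for allocator fragmentation is capping how large a cached block the allocator is allowed to split, via the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch (the 512MB cap is just an example value, not a recommendation):

import os

# Must be set before the first CUDA allocation so the setting takes effect;
# it caps the size of cached blocks the allocator will split up, which can
# reduce fragmentation at some cost in allocation flexibility.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch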

adhikjoshi commented 1 year ago

For me, this is not the case. I load the base and refiner in memory and it takes 12GB of VRAM total, which stays the same unless more samples are passed.

patrickvonplaten commented 1 year ago

Thanks for the comments! @eliphatfs do you have a reproducible code snippet by any chance?

eliphatfs commented 1 year ago

I can try to shrink it down and put one up, but it was not SD-XL, and my solution was simply adding torch.cuda.empty_cache() before returning from forward, which I don't think is the correct way to go and probably has some performance impact.

eliphatfs commented 1 year ago

That said, perhaps you could check whether the memory spike is caused by a similar problem by adding the empty_cache call.
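
For reference, this is roughly where the call went - a minimal sketch with a hypothetical module, not the actual model:

import torch

class MyBlock(torch.nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(1024, 1024)

    def forward(self, x):
        out = self.proj(x)
        # Workaround: hand cached blocks back to the driver before returning.
        # empty_cache is expensive, so this is a diagnostic, not a real fix.
        torch.cuda.empty_cache()
        return out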

vladmandic commented 1 year ago

In my case, empty_cache does not help. I basically have it set to run if VRAM reaches a 95% threshold, but once it does, memory never goes down again until the entire model is deleted.
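
The threshold check is roughly the following (a simplified sketch, not the actual sdnext code):

import torch

def maybe_empty_cache(threshold: float = 0.95, device: int = 0):
    # free/total device memory in bytes, as reported by the driver
    free, total = torch.cuda.mem_get_info(device)
    if 1.0 - free / total >= threshold:
        torch.cuda.empty_cache()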

vladmandic commented 1 year ago

The issue is still open.

patrickvonplaten commented 1 year ago

I think I need to check this by going into SDNext - does it only happen when using medvram, or does it also happen in normal mode?

vladmandic commented 1 year ago

Good question, I need to double-check, as SDNext now has two separate offloading mechanisms - one from diffusers (model/sequential) and one native (moving base/refiner/vae).
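
For reference, the diffusers side is just the two standard pipeline calls (a sketch, assuming pipe is an already-loaded StableDiffusionXLPipeline; these should not be combined with pipe.to("cuda")):

# model offload: moves whole components (text encoders, unet, vae) to the GPU
# one at a time and back to CPU when done - moderate savings, small slowdown
pipe.enable_model_cpu_offload()

# sequential offload: streams individual weights on demand - lowest memory,
# but much slower
# pipe.enable_sequential_cpu_offload()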

patrickvonplaten commented 1 year ago

I looked a bit into it. I have not looked into any offloading - just used the default setting where all components (text encoder, unet, vae) are kept on GPU.

I'm getting exactly the expected memory usage (~15GB of VRAM) when using the full official model for images of size (1024, 1024), and ~12GB when using the full model where the vae is replaced with the fp16 vae fix - see here.

But when monitoring nvidia-smi, it does seem like the memory spikes after the first generation (e.g. the first generation presumably consumes only 8GB and then it gets stuck at 12GB) - however, it should always be 12GB (even at the first generation). We can get the memory consumption as low as 6GB, but for this we need to use model offloading.

vladmandic commented 1 year ago

That makes sense, but the question is: how does the first gen succeed at only 8GB? Anyhow, this issue has simmered down now that offloading is working.

patrickvonplaten commented 1 year ago

Actually, I looked a bit more into it just now, and even during the first run there is a peak at 12GB, but it then goes down to 8GB. Only after the second run is it stuck at 12GB.

patrickvonplaten commented 1 year ago

PyTorch does have a tendency to cache GPU memory rather than free it if there is enough available.
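
One way to tell caching apart from a real leak is to compare PyTorch's own counters (a small sketch); nvidia-smi roughly tracks the reserved number, so a reading that looks "stuck" at 12GB can just be the allocator's cache:

import torch

def report(tag: str):
    allocated = torch.cuda.memory_allocated() / 2**30    # live tensors
    reserved = torch.cuda.memory_reserved() / 2**30      # live + cached
    peak = torch.cuda.max_memory_allocated() / 2**30     # high-water mark
    print(f"{tag}: allocated={allocated:.1f}GB reserved={reserved:.1f}GB peak={peak:.1f}GB")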

vladmandic commented 1 year ago

Yup, that's for sure. But users reported that the first gen works and then the second fails with OOM, as it nudges memory over the limit. And zero changes in between.

patrickvonplaten commented 1 year ago

Hmm interesting, keeping this issue open then. I might have to play around with https://pytorch.org/docs/stable/generated/torch.cuda.set_per_process_memory_fraction.html
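
For reference, that call caps the caching allocator for the current process (a sketch; the 0.5 fraction is just an example value):

import torch

# Allow this process at most 50% of device 0's memory; allocations beyond
# the cap raise an OOM error instead of letting the cache keep growing.
torch.cuda.set_per_process_memory_fraction(0.5, device=0)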

Any idea where the additional memory consumption might come from? When I run diffusers code multiple times as follows, I don't see the same pattern:

from diffusers import StableDiffusionXLPipeline, AutoencoderKL
import torch

path = "stabilityai/stable-diffusion-xl-base-1.0"
vae_path = "madebyollin/sdxl-vae-fp16-fix"

# Load the fp16-fixed VAE separately and pass it into the pipeline
vae = AutoencoderKL.from_pretrained(vae_path, torch_dtype=torch.float16)
pipe = StableDiffusionXLPipeline.from_pretrained(
    path,
    torch_dtype=torch.float16,
    vae=vae,
    variant="fp16",
    use_safetensors=True,
    local_files_only=True,
    add_watermarker=False,
)
pipe.to("cuda")

prompt = "An astronaut riding a green horse on Mars"
steps = 20

# Run several generations back-to-back to check whether memory keeps growing
for _ in range(5):
    image = pipe(prompt=prompt, num_inference_steps=steps).images[0]

Here memory usage is at ~12 GB after the first generation and stays there as expected

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.