thecooltechguy / ComfyUI-Stable-Video-Diffusion

ComfyUI nodes for Stable Video Diffusion

Out of Memory in SVD Sampler? #6

Open PaulFidika opened 10 months ago

PaulFidika commented 10 months ago

The SVD sampler is giving the error:

Error occurred when executing SVDSampler:

Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 15.96 GiB
Requested : 2.26 GiB
Device limit : 23.65 GiB
Free (according to CUDA): 25.19 MiB
PyTorch limit (set by user-supplied memory fraction)
: 17179869184.00 GiB

I'm confused; I have 24 GB of VRAM, yet it's saying it's running out of VRAM beyond 16 GB. Am I doing something wrong? This is the only process running on my GPU.

thecooltechguy commented 10 months ago

Ah, can you let me know which model checkpoint you're testing, and how many frames you're using as input?

Kaz03 commented 10 months ago

try to decrease decoder frames

PaulFidika commented 10 months ago

(1) The error is happening in the sampler, not the decoder. Reducing the decoder's decoding_t param to 1 fixed the decoder running out of memory, but the sampler still runs out of memory.

(2) I'm running 25 frames with 36 steps for the svd_xt model.

If I reduce the number of frames, I'm able to get past it and generate a video successfully. But if I want 25 frames, I still get the error:

Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 18.31 GiB
Requested : 1.92 GiB
Device limit : 23.65 GiB
Free (according to CUDA): 33.19 MiB
PyTorch limit (set by user-supplied memory fraction)
: 17179869184.00 GiB

What confuses me is that it says I have 24 GB, it wants 2 GB on top of the 18 GB it already has, and yet it's out of memory? Why? I should have enough VRAM for this.
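The numbers in the traceback actually hint at the cause: the gap between "device limit minus allocated" and what CUDA reports as free is memory that PyTorch's caching allocator has reserved (or that fragmentation and other allocations make unusable), not memory that is actually available. A quick back-of-the-envelope check on the figures above (a hypothetical helper, just arithmetic, not anything from this repo):

```python
# Hypothetical helper: compare the "apparent" free VRAM implied by the
# allocator's bookkeeping with what CUDA actually reports as free.
# A large gap means memory is reserved/fragmented, not available.
def apparent_vs_actual_free(device_limit_gib, allocated_gib, cuda_free_mib):
    apparent_free_gib = device_limit_gib - allocated_gib
    actual_free_gib = cuda_free_mib / 1024.0
    unaccounted_gib = apparent_free_gib - actual_free_gib
    return apparent_free_gib, unaccounted_gib

# Numbers from the error message above:
apparent, gap = apparent_vs_actual_free(23.65, 18.31, 33.19)
# apparent free looks like ~5.3 GiB, but nearly all of it is unaccounted
# for by CUDA, i.e. held in reserved/cached blocks rather than free.
```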

PaulFidika commented 10 months ago

Is it possible that VRAM is left over between generations and not freed after the job completes? With lower frame counts, I do get the generation to run, but then it won't work a second time.
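One thing worth trying between runs is explicitly releasing cached blocks. A minimal sketch (`free_cached_vram` is a hypothetical helper; note that `torch.cuda.empty_cache()` only returns cached, unreferenced blocks to the driver, so it cannot free tensors that a loaded model still holds):

```python
import gc

def free_cached_vram():
    """Drop unreferenced Python objects, then ask PyTorch's caching
    allocator to release unused cached blocks back to the driver.
    Returns True if a CUDA cache flush was actually performed."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()
            return True
    except ImportError:
        pass
    return False
```

Calling this between generations can reduce fragmentation-related failures on the second run, though it won't help if the workflow itself keeps references to large intermediate tensors.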

Nebuluss commented 10 months ago

I have exactly the same issue, if you find a fix let me know please !

sborys-ai commented 10 months ago

I'm facing the same problem! No matter which parameters I set, even 1 frame per second and a single step, I get an error:

'Allocated 16.67 GB, need an additional 1.92 GB, 23.69 GB available - not enough memory.'

I'm using Ubuntu 23.10, with 64GB RAM and an RTX3090 with 24GB. It's so frustrating...

sborys-ai commented 10 months ago

Update! The issue with insufficient memory was resolved by disabling the FreeU Advanced node.

However, this led to a new problem: an xformers error:

Error occurred when executing SVDSampler:

No operator found for memory_efficient_attention_forward with inputs:
    query : shape=(1, 64512, 1, 512) (torch.float32)
    key   : shape=(1, 64512, 1, 512) (torch.float32)
    value : shape=(1, 64512, 1, 512) (torch.float32)
    attn_bias : p : 0.0
decoderF is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 128
    xFormers wasn't build with CUDA support
    attn_bias type is operator wasn't built - see python -m xformers.info for more info
flshattF@0.0.0 is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 256
    xFormers wasn't build with CUDA support
    dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
    operator wasn't built - see python -m xformers.info for more info
tritonflashattF is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 128
    xFormers wasn't build with CUDA support
    dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
    operator wasn't built - see python -m xformers.info for more info
    triton is not available
cutlassF is not supported because:
    xFormers wasn't build with CUDA support
    operator wasn't built - see python -m xformers.info for more info
smallkF is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 32
    xFormers wasn't build with CUDA support
    operator wasn't built - see python -m xformers.info for more info
    unsupported embed per head: 512

By the way, I tried using the simplified svd-fp16 models, but I received a response that the configuration lacks a yaml file for this model. It would be good to add it, as it would reduce the model size by half.
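The key lines in that error are `dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})` and `unsupported embed per head: 512`: none of the available xformers kernels handle fp32 with a head dimension that large. PyTorch's built-in `scaled_dot_product_attention` (available since torch 2.0) does handle both via its math fallback, which is presumably why launching ComfyUI with `--use-pytorch-cross-attention` is a common workaround. A minimal sketch of the same shapes going through the built-in op (toy sequence length to keep it cheap; this is an illustration, not the repo's code):

```python
import torch
import torch.nn.functional as F

# Shapes mirror the failing call: (batch, heads, seq, head_dim),
# with the head_dim of 512 that the xformers kernels rejected.
q = torch.randn(1, 1, 16, 512)
k = torch.randn(1, 1, 16, 512)
v = torch.randn(1, 1, 16, 512)

# PyTorch's built-in SDPA falls back to a math kernel that supports
# fp32 and large head dims, unlike the xformers operators above.
out = F.scaled_dot_product_attention(q, k, v)
```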

emourdavid commented 10 months ago

I have the same issue, using a 4090 with 24 GB. It consumes only 15 GB but still reports out of memory. I'm using the default SVD settings: 14 frames, with a video fps of 8 or even just 6.

emourdavid commented 10 months ago

I've added more screenshots for this issue: Screenshot 2023-11-26 132258-issues-svd, Screenshot 2023-11-26 132401-issue

mrmeseeks23 commented 10 months ago

Thought I would drop in here to say that I have everything up and running. I'm actually using the Automatic1111 extension for ComfyUI (I run ComfyUI inside my A1111). Everything works well, except I too am running into CUDA out-of-memory issues. I'm using an NVIDIA L40 with 48 GB of VRAM, so I should have plenty for the job; at least I thought so.

filsanet commented 10 months ago

I'm also seeing this issue, on a 3090 with 24 GB of VRAM.

torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated     : 18.23 GiB
Requested               : 1.09 GiB
Device limit            : 23.48 GiB
Free (according to CUDA): 33.00 MiB
PyTorch limit (set by user-supplied memory fraction)
                        : 17179869184.00 GiB

I note that a similar bug was reported in ComfyUI, but it has apparently been fixed. There might be a clue here: https://github.com/comfyanonymous/ComfyUI/issues/1918#issuecomment-1806871089

yhyu13 commented 10 months ago

@thecooltechguy @PaulFidika @ALL

> (1) the error is happening in the sampler, rather than the decoder. Reducing decoder's decoder_t param to 1 fixed the decoder running out of memory, but the sampler is still running out of memory

Let me summarize the solutions so far.

TL;DR: No need to reduce your frame count; just reduce decoding_t and scale down the image size. You can always upscale every frame or interpolate more frames in a postprocessing stage.

I scanned through this repo's code quickly and found many places that just assume CUDA devices (e.g. torch.autocast('cuda')), so this repo currently won't respect accelerate offloading, low-VRAM, no-VRAM, or CPU-only modes. It's CUDA-only at the moment!

(1) Sampler out of memory: only the model size and image size matter, so either wait for smarter people to create a better quantized model, or try scaling down your image with the "UpScaleImageBy" node with a scale < 1.0. Adjust the scale until you no longer OOM!

(screenshot: Comfy_svd1)
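The downscale-before-sampling idea can be sketched as follows (a hypothetical standalone helper, not a node from this repo; ComfyUI image tensors are (batch, height, width, channels) in [0, 1]):

```python
import torch
import torch.nn.functional as F

def downscale_for_sampler(image, scale=0.75):
    """Downscale a ComfyUI-style image batch (B, H, W, C) before it
    reaches the SVD sampler; frames can be upscaled again afterwards."""
    x = image.permute(0, 3, 1, 2)  # -> (B, C, H, W), the layout interpolate expects
    x = F.interpolate(x, scale_factor=scale, mode="bilinear",
                      align_corners=False)
    return x.permute(0, 2, 3, 1)   # back to (B, H, W, C)

# 576x1024 input halved to 288x512, roughly quartering sampler memory
# for the spatial attention maps.
img = torch.rand(1, 576, 1024, 3)
small = downscale_for_sampler(img, 0.5)
```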

(2) Decoder out of memory: the author exposed an obscurely named parameter called decoding_t; under the hood, in Python, it is called en_and_decode_n_samples_a_time. It divides your decoding into frames / decoding_t phases, each phase running a small batch, and finally concatenates these batches together:

https://github.com/thecooltechguy/ComfyUI-Stable-Video-Diffusion/blob/2b2891c8046ade10d32f59d0178a500fd925de4c/libs/sgm/models/diffusion.py#L125

So you should set decoding_t somewhere in the range [1, frames] and find a value at which you no longer OOM.

(screenshot: Comfy_svd2)
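The chunked decoding that decoding_t controls boils down to something like this (a simplified sketch of the idea, not the repo's actual code; `decode_fn` stands in for the VAE decoder):

```python
import torch

def decode_in_chunks(decode_fn, latents, decoding_t):
    """Decode `latents` (frames stacked on dim 0) in batches of
    `decoding_t` frames, then concatenate the results, so peak decoder
    memory scales with decoding_t instead of the total frame count."""
    outs = []
    for i in range(0, latents.shape[0], decoding_t):
        outs.append(decode_fn(latents[i:i + decoding_t]))
    return torch.cat(outs, dim=0)

# Toy decoder stand-in: 25 latent "frames", decoded 2 at a time.
frames = decode_in_chunks(lambda z: z * 2.0, torch.ones(25, 4), decoding_t=2)
```

Smaller decoding_t means lower peak VRAM but more (small) decoder passes; the output is identical either way.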

Romanio1997 commented 9 months ago

You're talking about video, but I can't even upscale a picture in ComfyUI, and I also get this error:

torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 6.98 GiB
Requested : 50.00 MiB
Device limit : 11.00 GiB

superintendent2521 commented 9 months ago

> try to decrease decoder frames

This doesn't do anything to the VRAM requirement.

dnalbach commented 6 months ago

Lowering decoding_t to 2 fixed it for me, per @yhyu13's comment.