kijai / ComfyUI-CogVideoXWrapper


Latest dependencies seem to increase VRAM requirements and cause OOMs #276

Closed mayoisugi closed 14 hours ago

mayoisugi commented 1 day ago

First off thanks so much for working on all these nodes, it's been really fun getting to experiment with so many new models!

I'm not sure if this is the right place for this issue, since it's likely the dependencies themselves that need to be changed, but I thought I'd bring this up here since at the very least it might be useful information for someone.

Overview

Basically I've tried updating to the latest wrapper, and in doing so updated ComfyUI and Diffusers. The new dependencies seem to increase VRAM requirements to the point where even the first CogVideoX model can no longer run at good quality on an RTX 4090, and I was interested to know whether this is known/documented anywhere and how likely it is to improve.

Details

Running all of this on Win10 through WSL (so Linux, really), with sage attention installed.

The setup that worked for me was:

- ComfyUI: the commit right before v0.1.3 (#9230f658232fd94d0beeddb94aed)
- Diffusers: 0.30.3
- ComfyUI-CogVideoXWrapper: "Support 5b controlnet" (#9e488568b2005156fdb922250e00)

On this setup it takes ~16GB to generate with the (old) default T2V workflow, spiking to ~18GB during decode. With a few tweaks it can also run in ~5GB (it worked on an RTX 2060, for those interested).

The problems occur when updating:

This is all with the same workflow and CogVideoXWrapper version, so I wouldn't expect any differences. I did briefly update the wrapper to the latest version as well, but that didn't help either. Using an FP8 transformer or sequential CPU offloading makes it work again (I might have needed tiled VAE decode as well, I can't recall, sorry), though with a quality/speed hit.
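As an aside, sequential CPU offloading avoids the OOM precisely because it trades speed for VRAM: only one chunk of weights is resident on the GPU at a time. A toy illustration of the principle in plain Python (stand-in objects, not the actual Diffusers implementation, which is enabled via `pipe.enable_sequential_cpu_offload()`):

```python
class Block:
    """Stand-in for one chunk of transformer weights."""
    def __init__(self, name):
        self.name, self.device = name, "cpu"

    def to(self, device):
        self.device = device
        return self

    def forward(self, x):
        assert self.device == "gpu", "weights must be resident to compute"
        return x + 1  # placeholder compute

def run_with_sequential_offload(blocks, x):
    # Only one block occupies the "GPU" at any moment; every transfer
    # costs time, which is why this mode is slow but VRAM-friendly.
    for block in blocks:
        block.to("gpu")
        x = block.forward(x)
        block.to("cpu")  # release VRAM before loading the next block
    return x
```

Peak "VRAM" here is one block's worth instead of the whole model's, which matches the observed behavior: no OOM, but a noticeable speed hit.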

Questions...

I suppose I was wondering if anyone else has observed this, or am I just doing something wrong?

I was also wondering if there was any way to get the 1.5 model running without updating ComfyUI/Diffusers (since that might be easier than fixing both of those). Thanks!

kijai commented 1 day ago

The VAE decode generally needs the tiling option enabled; with tiling it should essentially never run out of memory.

The ComfyUI version would not change anything about the sampling, as none of ComfyUI's own code is used for it; this is a wrapper. The only thing it may affect is the text encoder's memory use, and from your description it sounds like the text encoder is maybe not getting offloaded. I do have force_offload in those nodes, which should offload it regardless of what ComfyUI would normally do, so make sure that's enabled.
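Tiled decoding keeps memory down because decoder activations scale with the tile's area rather than the full frame's. A toy sketch of the idea in plain Python, using a hypothetical nearest-neighbour "decoder" (a real VAE additionally overlaps and blends tiles to hide seams, so its tiled output is only approximately equal to a full decode):

```python
def decode(latent_tile, scale=8):
    # Stand-in for the VAE decoder: expand every latent value into a
    # scale x scale pixel block. Intermediate memory is proportional
    # to the input tile's area, as in a real decoder.
    return [[v for v in row for _ in range(scale)]
            for row in latent_tile for _ in range(scale)]

def decode_tiled(latent, tile=16, scale=8):
    # Decode small tiles one at a time and stitch them into the output,
    # so peak intermediate memory depends on `tile`, not the frame size.
    h, w = len(latent), len(latent[0])
    out = [[0] * (w * scale) for _ in range(h * scale)]
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            sub = [row[x:x + tile] for row in latent[y:y + tile]]
            for dy, drow in enumerate(decode(sub, scale)):
                out[y * scale + dy][x * scale:x * scale + len(drow)] = drow
    return out
```

Because this toy decoder is purely local, the tiled result matches the full decode exactly; the memory saving is the point, and it is why the spike during decode disappears when tiling is on.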

Personally I have not noticed these issues, and I always run with a memory monitor on. With a 4090 on Windows I can do 81 frames at 1360x768 just fine, using about 21GB VRAM during sampling.

And 1.0 with its default 49 frames at 720x480 uses the 16GB like you said.

zazoum-art commented 1 day ago

I only care about the 1.5 model, so I can test these things on it. I have the latest ComfyUI/CogVideoX requirements. Just give me your 1.5 image and prompt so we can test the same thing together; I'm interested to see what max settings I can manage around these issues. I'm on Win11 with a 4090 (where ComfyUI runs exclusively) and a 3060 (which handles all my other programs). So I'll test on a clean 4090, maxed out.

mayoisugi commented 21 hours ago

Wow, thanks so much for the quick replies everyone! Sorry for the delay; I wanted to make sure I tested things thoroughly.

@kijai I agree changing the ComfyUI version shouldn't have any effect; I think this part was an error on my side / funkiness with WSL. I re-upgraded everything, and this time around, after force-killing and restarting WSL, things are working a bit better.

I2V with the 1.5 model works well! However, T2V hard-crashes WSL unless quantization is enabled, and leaves no error, so it's hard to tell exactly why (I'm guessing an OOM of some sort).

1.0 also has a regression and OOMs, but from what I can tell this is due to Diffusers, so I should probably open an issue there about it πŸ˜ƒ

Just for reference here are the results I get:


| Workflow | Original | Updated Diffusers (0.31.0) | All updated |
| --- | --- | --- | --- |
| cogvideox_5b_example_01.json | Sampler: ~16GB, Decode: ~16GB | Sampler: ~16GB, Decode: ramps up to 17GB then OOM | - |
| cogvideox_5b_example_01.json w/ auto-tiled | Sampler: ~16GB, Decode: ~6GB | Sampler: ~16GB, Decode: ~7GB | - |
| cogvideox_1_0_5b_T2V_02.json (sdpa & sageattn are ~same) | - | - | Sampler: ~16GB, Decode: OOM or "Failed to decode, retrying with tiling" ~17GB [1] |
| cogvideox_1_0_5b_T2V_02.json w/ auto-tiled | - | - | Sampler: ~16GB, Decode: ~7GB |
| cogvideox_1_5_5b_I2V_01.json | - | - | Sampler: ~18GB, Decode: ~7GB |
| cogvideox_1_5_5b_I2V_01.json with empty latent and 1.5 T2V model [2] | - | - | Sampler: ~18GB, Decode: WSL hard crashes |
| cogvideox_1_5_5b_I2V_01.json with empty latent and 1.5 T2V model, quantization: fp8_e4m3fn | - | - | Sampler: ~16GB, Decode: ~7GB |

\* all set to 2 steps to speed up testing

[1] In case it's useful, here's the full message:

```
Allocated memory: memory=0.033 GB
Max allocated memory: max_memory=13.067 GB
Max reserved memory: max_reserved=14.719 GB
Failed to decode, retrying with tiling
```
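That "Failed to decode, retrying with tiling" line suggests the wrapper catches the decode OOM and retries with tiling enabled. The same guard is easy to add to a custom script; here is a sketch where `decode_fn` and `enable_tiling` are hypothetical stand-ins, and a plain `MemoryError` stands in for the `torch.cuda.OutOfMemoryError` a real script would catch:

```python
def decode_with_fallback(decode_fn, latents, enable_tiling,
                         oom_errors=(MemoryError,)):
    """Try a full decode; on OOM, switch to tiled decoding and retry once."""
    try:
        return decode_fn(latents)
    except oom_errors:
        print("Failed to decode, retrying with tiling")
        enable_tiling()
        return decode_fn(latents)
```

One retry is usually enough here, since the tiled path has a much lower peak; anything that still OOMs after that points at a problem outside the decode itself.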

[2] Here's the workflow I'm using that fails to decode: 1.5T2VTest.json - have I just done something dumb in it? πŸ˜…

@zazoum-art I'm interested in the 1.5 model as well, haha. Thanks for offering to help; if you have time I'd be really interested to know if the T2V workflow above works for you!

Also that's a nice setup! I have a spare 2060 but no space on the motherboard to put it sadly πŸ˜…

kijai commented 21 hours ago

> [2] Here's the workflow I'm using that fails to decode: 1.5T2VTest.json - have I just done something dumb in it? πŸ˜…

Ran this to see, and it only uses this much for me to decode:

(screenshot of memory usage during decode)

zazoum-art commented 20 hours ago

(screenshot: t2v_stag, the working T2V workflow)

Works fine. Load the VAE into the Decode node; you were missing this.

zazoum-art commented 20 hours ago

There are a lot of errors in the inputs and outputs of the nodes. Everyone: to solve this, right-click on the first problematic node -> select "Set Node" -> a new node pops up -> swap the new node in for the problematic one.

@kijai there are errors on the nodes. It can be worked around as above, but it needs to be fixed. People end up using broken workflows because of this.

(screenshot: SET_NODE)

kijai commented 20 hours ago

> There are a lot of errors in the inputs and outputs of the nodes. Everyone: to solve this, right-click on the first problematic node -> select "Set Node" -> a new node pops up -> swap the new node in for the problematic one.
>
> @kijai there are errors on the nodes. It can be worked around as above, but it needs to be fixed. People end up using broken workflows because of this.

There are no errors; that is just an old workflow loading an old node configuration which doesn't match the current, new version of the nodes. Those nodes need to be recreated, or fixed with the right-click menu option "Fix node".

zazoum-art commented 20 hours ago

It works for me; there is indeed no problem. But people come here and share workflows, and not everyone reads the notes elsewhere; they end up coming to the issues.

My advice: update the readme with a notice about this fix.

kijai commented 19 hours ago

> It works for me; there is indeed no problem. But people come here and share workflows, and not everyone reads the notes elsewhere; they end up coming to the issues.
>
> My advice: update the readme with a notice about this fix.

It is already included in the update notice from the last big update.

zazoum-art commented 19 hours ago

I mean something like an image; there are around 20 issues opened for this, saying the update broke things and such.

KrakeyMTL commented 16 hours ago

zazzle - There are NO issues - just you bitching. Again. I closed my own thread because you were being a clown in it, so you make a new one to complain?

ALL the workflows work 100%. I've tested and used them all yesterday while building a new H100 server. I'm on @kijai's side here.

All of them work. There is no OOM: the new 1.5 model itself needs 34-38GB VRAM to load fully with no options enabled, AND another 15GB+ VRAM for the VAE step IF you are not using tiling. IF your card only has 24GB VRAM, welcome to fiddling with YOUR options on YOUR machine; this isn't a workflow issue, it's a server issue. [With no options enabled, the entire 1.5 model takes 64.3GB VRAM/swap to run a 1920x1080 video production on my L40 server, and it doesn't OOM because I configured the swap properly. It sits at 39GB while running, and when the VAE spikes it jumps into the 60GB+ range.]

Everyone complaining seems to have 24GB VRAM and isn't reading that 1.5 needs vastly more, OR that you need to use switches which reduce quality and add defects to the video output due to compression.

That's how it is right now. Buy a better video card with more VRAM, rent one in the cloud, or fix your own damn workflows for YOUR setup. This one is 100% on you, zazzle. On nobody else.

I'm really ticked off about your attitude here after some incredible coding work was done on the wrapper. <_<

mayoisugi commented 14 hours ago

Thanks for your help everyone! I'm not entirely sure what happened above. I agree the workflows are fine, as are mine (they're based directly on the examples for the respective version I was testing; I'm well aware of the breaking changes).

I'm going to close this due to lack of reproducibility and, as mentioned, the likelihood that it's setup-specific. I've found that if I get an OOM in ComfyUI, repeatedly queuing the same prompt (10+ times) sometimes eventually just magically works (with a 50/50 chance WSL hard-crashes when it gets to VAE decoding, sometimes with a BSOD as well).

No idea what causes these OOMs in the first place, as I should have about 6GB of headroom when using tiling, and I always start each run with <0.5GB used. I'll probably just write some custom scripts instead so I can hopefully get more consistent behavior; evidently my setup is just messed up.

I'm still curious why 1.0 OOMs consistently without tiled VAE decoding on the latest Diffusers (this is the one thing I can reproduce without fail), but I suppose most people are moving to 1.5 now anyway, so it probably doesn't matter. I just wonder whether whatever change caused it affects 1.5 as well and makes it use more VRAM than it needs.

Thanks again for the wrapper Kijai, and all your work across this and the other nodes!