kijai / ComfyUI-CogVideoXWrapper


Is this normal speed? #66

Open pondloso opened 2 days ago

pondloso commented 2 days ago

CogVideoX 5B I2V, 16GB VRAM: 33.67s/it

30 steps took me like 18-19 min.

If I remember right, CogVideoX 5B T2V was much quicker than this.

And for some reason CogVideoX-Fun always wants to find the closest trained size to the image instead of using the real input image size, so the output always comes out at the wrong size.

And how much VRAM is needed for 768 base resolution?

I just want to check that everything is right with my install; it seems to take much longer than I expected.

Thank you for your great work as always.

kijai commented 2 days ago

Which GPU? That doesn't sound normal at all. Check your VRAM usage to see how close it is to the max; the NVIDIA system memory fallback might be triggering and slowing things down.

CogVideoX-Fun is designed like that, it finds the closest resolution it was trained with.
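
(As an aside, here's a minimal sketch of what "finds the closest resolution" typically means in practice; the bucket list below is hypothetical, not the model's actual training set:)

```python
# Illustrative sketch of resolution bucketing: pick the trained bucket whose
# aspect ratio best matches the input image. The bucket list here is made up
# for illustration, not CogVideoX-Fun's actual training resolutions.
BUCKETS = [(512, 512), (768, 512), (512, 768), (1024, 576), (576, 1024)]

def closest_bucket(width: int, height: int) -> tuple[int, int]:
    target = width / height
    return min(BUCKETS, key=lambda wh: abs(wh[0] / wh[1] - target))

print(closest_bucket(1200, 1200))  # -> (512, 512): the input gets resized
```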

I don't know the exact VRAM requirements for each dimension, but I'd guess 768 is something like 20GB+
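
If you want to check the headroom from inside Python rather than eyeballing Task Manager or nvidia-smi, a quick sketch using torch's built-in counters:

```python
import torch

# Compare allocated/reserved VRAM against the card's total. If nvidia-smi
# reports usage above this total, the driver's system-memory fallback is
# likely active and sampling will crawl.
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}")
print(f"allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
print(f"total:     {props.total_memory / 1e9:.2f} GB")
```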

rhystz commented 2 days ago

I'm getting the same s/it on a 16GB RTX A4500 mobile. I've got another Comfy install with a different PyTorch that's almost twice as slow.

I'm also finding that the differences in overall image quality and facial/finger cohesion are pretty small between 9 and 50 steps. At 50 steps, though, I'd say it's right on par with the current 5B-I2V Hugging Face space.

pondloso commented 2 days ago

> Which GPU? That doesn't sound normal at all. Check your VRAM usage to see how close it is to the max; the NVIDIA system memory fallback might be triggering and slowing things down.
>
> CogVideoX-Fun is designed like that, it finds the closest resolution it was trained with.
>
> I don't know the exact VRAM requirements for each dimension, but I'd guess 768 is something like 20GB+

4060 Ti 16GB. After opening Comfy it uses around 1.3-1.5GB of VRAM; during processing it uses nearly all of it, and during decode it has to use shared VRAM, so I had to wait an extra minute or so just for the decode.

I also think this is not normal, but I can't find what I'm doing wrong.

pondloso commented 2 days ago

I want to report another test. I tested CogVideoX-Fun 2B and it's very fast: 10 steps looks almost the same as 20 steps, and 10 steps finishes in just 1 minute.

During processing it uses just 6GB of VRAM, but during decode it overflows even my 16GB of VRAM. CogVideoX-Fun 2B looks like it has great potential speed-wise, but we'll have to see about the quality.
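
For the decode overflow specifically, the diffusers CogVideoX VAE exposes tiled and sliced decoding that trades a little speed for a much smaller VRAM spike; a sketch assuming the plain diffusers pipeline (the Fun variant and the wrapper have their own equivalent options, if I recall correctly):

```python
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
).to("cuda")

# Decode the latents in tiles/slices instead of one huge tensor, which
# keeps the decode-time VRAM spike far below the all-at-once peak.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()
```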

ArtemKo7v commented 2 days ago

CogVideoX 5B I2V, 4070 Ti 12GB: ~25s/it

It takes about 20 minutes to generate I2V with 50 steps with enable_sequential_cpu_offload=true. Besides that, it uses only 2-5GB of VRAM during sampling, and GPU utilization does not exceed 30%, which tells me there is room for optimization and it could potentially run noticeably faster.
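
For reference, that flag corresponds to diffusers' sequential offload, which streams each submodule to the GPU only for its forward pass; that's exactly why VRAM stays at 2-5GB and the GPU idles between transfers. A sketch:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)

# Streams each submodule to the GPU only for its forward pass, so VRAM
# stays low (a few GB) but PCIe transfers dominate and the GPU idles.
# Note: no .to("cuda") here; the offload hooks manage device placement.
pipe.enable_sequential_cpu_offload()
```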

morganavr commented 2 days ago

CogVideoX 5B I2V, 3080 10GB

10 minutes have already passed and the terminal in Windows 10 still shows 0/50. I used this workflow: examples/cogvideox_I2V_example_01.json

kijai commented 2 days ago

> CogVideoX 5B I2V, 3080 10GB
>
> 10 minutes have already passed and the terminal in Windows 10 still shows 0/50. I used this workflow: examples/cogvideox_I2V_example_01.json

10GB is probably not gonna be enough without sequential_cpu_offloading

morganavr commented 2 days ago

Thanks for the suggestion, but it did not help.

I tried a different workflow (cogvidex_fun_i2v_example_01) but the result is the same: no progress for 10 minutes. Still 0/30.

sequential_cpu_offloading is enabled.


morganavr commented 2 days ago

12 minutes later, only one iteration is done: 728.32s/it. The GPU is at 100% usage.

3%|██▋ | 1/30 [12:08<5:52:01, 728.32s/it]
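
(Those numbers are at least internally consistent; a quick check of the tqdm arithmetic:)

```python
# Sanity check on the tqdm readout: one step took 12:08 = 728.32 s, so the
# estimated time remaining is simply the other 29 steps at the same rate.
remaining = 29 * 728.32            # ~21121 s
h, rem = divmod(remaining, 3600)
m, s = divmod(rem, 60)
print(f"{int(h)}:{int(m):02d}:{int(s):02d}")  # -> 5:52:01
```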

CambridgeComputing commented 2 days ago

For most people, the speed issue can probably be traced back to VRAM capacity. When I run the 5B I2V model and T5 fp8, I use around 20GB of VRAM. If I try to raise the resolution beyond the standard 512px, I spill over into system RAM and generation is over 10x slower.

tl;dr: watch your VRAM and look for it to spill over into system RAM. If it does, it's going to be very slow.
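
One way to catch the spill-over as it happens is to poll NVML alongside the sampler; a sketch using the pynvml bindings (assumes `pip install nvidia-ml-py`):

```python
import time
import pynvml

# Poll dedicated-VRAM usage once a second; if usage pins at the card's
# total while step times blow up, the overflow has moved into shared
# system RAM and generation will be very slow.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"{mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB", end="\r")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```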

alphatest8422 commented 2 days ago

The CogVideoX authors mention that CPU offloading is incredibly slow. They added it to support lower-end machines on Colab, but note in their repository that the byproduct is incredibly slow generation times and that there is currently no workaround. So if you're using enable_sequential_cpu_offload=true, expect some painfully long generations.

KrakeyMTL commented 2 days ago

About 10 min here using the ComfyUI interface on an A6000 GPU. This is on par with using Comfy and the 2B text-to-video model, so I'd say I2V runs at the same speed. So if you test the 2B and get a certain speed, copy the same switches (all off, really) and re-run to match it in the I2V.

Note: running this from the command line outside ComfyUI nets you roughly 8 min if everything is configured properly. I posted the various GPU speeds in the 2B main question thread (still open in that forum).

image to video I2V sample 1

synystersocks commented 2 days ago

The speed loss from CPU offloading comes from transferring data back and forth, as well as the read/write operations. VRAM capacity isn't really the issue; it's getting the data to the cores fast enough, and big VRAM is just our best current solution. (The US Department of Energy recently released a paper on supercluster parallelization in which they re-timed the data flow to compensate for this loss; a strange paper though, cough cough, NVIDIA bias :P) When you look at GPU design you can see how close the VRAM sits to the cache, and how the cache scales down as it gets closer to the cores. This is because charge ("electrons") is bound by the speed of light, so the distance data has to travel costs time. APUs with shared CPU/GPU memory kind of get around this, but their cycle speeds aren't high enough to really benefit yet. It's the same as lag from servers: it's all physics, baby :D
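
Rough back-of-the-envelope numbers for why the shuttling hurts; both bandwidth figures below are ballpark assumptions, not measurements:

```python
# Ballpark transfer-time math: moving a 10 GB fp16 model over PCIe 4.0 x16
# (~25 GB/s effective) versus reading it from on-card GDDR6 (~500 GB/s).
# Both bandwidth figures are illustrative assumptions.
model_gb = 10
pcie_gbps, vram_gbps = 25, 500
print(f"PCIe round trip per step: ~{2 * model_gb / pcie_gbps:.2f} s")
print(f"On-card read:             ~{model_gb / vram_gbps:.3f} s")
```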

KrakeyMTL commented 2 days ago

Disable all of these things in fuchsia color. NOTE: I have 48GB of VRAM on the A6000, so I am able to do this. IF YOU HAVE LESS THAN 24GB of VRAM this will probably break, and you'll need to enable at least one of the memory-saving switches.

The 4 switches swap portions of the model and VAE between RAM and VRAM, just like the --medvram switch does in SDXL. So disabling them makes it try to shove everything into VRAM (5GB of system RAM for me and 22-26GB of VRAM on spikes; under load it only uses 16.7GB of VRAM until it brings the VAE into the mix near the end).
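
For reference, the diffusers-side middle ground between "everything in VRAM" and full sequential offload is per-model offload, which moves whole components (text encoder, transformer, VAE) on and off the GPU as needed; a sketch of that idea, not the wrapper's exact switches:

```python
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)

# Coarser than sequential offload: whole components hop between RAM and
# VRAM as they're needed, roughly what a --medvram-style switch does.
pipe.enable_model_cpu_offload()
```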

2nd note: this I2V model is terrible compared to SDV2, LOL. Almost no movement overall, and fixed sizing. While I'm happy it's running, the model just needs a ton more training. The 5B text-to-video is vastly better!

Edit: 3rd note: that image was from my own custom SDXL model, and is actually 1200x1200 natural gen sizing, which SDV2 can easily animate without the sizing restrictions this model has. FYI!

image to video I2V sample 2

https://github.com/user-attachments/assets/0fcf6b81-ee11-4860-8b58-e26b353bb7da

KrakeyMTL commented 2 days ago

CROSS POST LINKS FOR THE INFO ON SPEEDS AND VRAM

NOTE: these links will take you to the same company's 5B model, where we just recently had a huge discussion on all of this. It will help shed light on the VRAM and swapping-switches stuff above!

https://huggingface.co/THUDM/CogVideoX-5b/discussions/7 (open. not mine. info near bottom last post)

https://huggingface.co/THUDM/CogVideoX-5b/discussions/8 (closed by me. completed)

pondloso commented 1 day ago

> Disable all of these things in fuchsia color. NOTE: I have 48GB of VRAM on the A6000, so I am able to do this. IF YOU HAVE LESS THAN 24GB of VRAM this will probably break, and you'll need to enable at least one of the memory-saving switches.
>
> Edit: 3rd note: that image was from my own custom SDXL model, and is actually 1200x1200 natural gen sizing, which SDV2 can easily animate without the sizing restrictions this model has. FYI!

What is SDV2?