kijai / ComfyUI-LivePortraitKJ

ComfyUI nodes for LivePortrait
MIT License
763 stars · 53 forks

CPU High #17

Open xllusion-dong opened 1 week ago

xllusion-dong commented 1 week ago

When running the sample video d6, the CPU stays at 100% for a long time.

Any improvement for that?

Celtmant commented 1 week ago

I agree, it heavily loads the processor, and the RAM as well! One more thing worth noting: even if you load a smaller image and don't choose a very long video, it doesn't get much easier; perhaps reduce the upscale.

funwithforks commented 1 week ago

pip install onnxruntime-gpu

xllusion-dong commented 1 week ago

I already installed onnxruntime-gpu, but sometimes it seems to use the CPU and sometimes the GPU. I'll keep watching it to find out the reason.
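For anyone debugging this, a minimal sketch of how to confirm which execution provider onnxruntime actually picks (the `model.onnx` path is hypothetical):

```python
import onnxruntime as ort

# If "CUDAExecutionProvider" is missing here, the CPU-only onnxruntime
# package is installed (or its CUDA dependencies were not found), and
# inference will silently fall back to the CPU.
print(ort.get_available_providers())

# "model.onnx" is a hypothetical path; request CUDA first with a CPU
# fallback, then check which providers the session actually uses.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())
```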

wandrzej commented 1 week ago

Overall I wonder about the performance. The paper claims 12.8 ms per frame, but in my case it's far from that, and for 3/4 of the time it's not even utilizing the GPU or the CPU (they sit at about 20% and 10% respectively). So apart from the onnx issue, I wonder whether anything else could be a bottleneck; it looks like a single-core process is running and blocking the whole thing.

funwithforks commented 1 week ago

I don't have the issue, so I have no input beyond onnx, but after getting that going my 4090 sits steadily at 55% during the run, for reference. CPU usage for the process is 345%. Without the GPU, CPU usage was much higher.

LubuLubu2 commented 1 week ago

For me it uses 100% CPU + 100% GPU and around 2.5 GB VRAM for the entire generation with an 832x1152 image, but it generates pretty quickly. 3060 Ti.

Celtmant commented 1 week ago

> For me it uses 100% CPU + 100% GPU and around 2.5 GB VRAM for the entire generation with an 832x1152 image, but it generates pretty quickly. 3060 Ti.

And there's even more to it than that. Longer videos consumed a lot of CPU and RAM resources: almost all 29 GB of my RAM was eaten up and the computer was freezing. I used "pip install onnxruntime-gpu" and it didn't get much easier; with long videos the RAM still fills up and I'm afraid anything could happen. I have an RTX 3060 / 12 GB video card and 32 GB of RAM.

LubuLubu2 commented 1 week ago

> For me it uses 100% CPU + 100% GPU and around 2.5 GB VRAM for the entire generation with an 832x1152 image, but it generates pretty quickly. 3060 Ti.
>
> And there's even more to it than that. Longer videos consumed a lot of CPU and RAM resources: almost all 29 GB of my RAM was eaten up and the computer was freezing. I used "pip install onnxruntime-gpu" and it didn't get much easier; with long videos the RAM still fills up and I'm afraid anything could happen. I have an RTX 3060 / 12 GB video card and 32 GB of RAM.

Yep, the 35-second example video (I even tried 1 minute) can eat all the resources you have, and if you don't have enough, your PC will freeze for minutes :)) Mine was frozen for 15 minutes on a 1-minute video. Again, generation itself is fine, but at the end every single frame of, say, a 1-minute video at 24 fps has to be processed; that creates more than a thousand images and eats all your RAM. 20 seconds or less is fine; for longer videos we have to cap the frame count, generate a couple of shorter videos, and join them together later (see the sketch below).
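For the joining step, a minimal sketch using ffmpeg's concat demuxer from Python; the chunk filenames are hypothetical:

```python
import subprocess

# Hypothetical chunk files produced by separate, shorter runs.
chunks = ["chunk_000.mp4", "chunk_001.mp4", "chunk_002.mp4"]

# Write the file list in the format ffmpeg's concat demuxer expects.
with open("chunks.txt", "w") as f:
    for c in chunks:
        f.write(f"file '{c}'\n")

# Stream-copy the chunks into a single file without re-encoding.
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0",
     "-i", "chunks.txt", "-c", "copy", "joined.mp4"],
    check=True,
)
```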

kosmicdream commented 1 week ago

Same problem here; I've been trying to run the example video on an A40 instance on Runpod and everything freezes.

wandrzej commented 1 week ago

On my side, I think it's not really a matter of some resource bottleneck: I have 128 GB RAM, so frame off-loading is not the problem, and same with VRAM at 24 GB. I do have onnxruntime-gpu installed, but it's 1.5 I believe; maybe there's a version mismatch, but even that wouldn't explain the low load on both CPU and GPU in the pre-processing phase.

Anyway, this could work far more efficiently, and given the low utilization numbers others have reported, I think that with proper use of both CPU and GPU the claimed 12.8 ms per frame is possible regardless of the video's length. It could be an issue with Comfy itself: it needs to finish one 'block' from the pre-process node before moving on to generation.
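To illustrate the kind of overlap being suggested (and only that; this is not how Comfy schedules nodes), here is a minimal producer/consumer sketch where `preprocess` and `generate` are hypothetical stand-ins for the two stages:

```python
import queue
import threading
import time

def preprocess(frame):
    """Stand-in for the detection/cropping stage (hypothetical)."""
    time.sleep(0.01)
    return frame

def generate(frame):
    """Stand-in for the GPU generation stage (hypothetical)."""
    time.sleep(0.01)

# A bounded queue lets generation start as soon as the first frames
# are pre-processed, instead of waiting for the whole block to finish.
q = queue.Queue(maxsize=8)

def producer(frames):
    for f in frames:
        q.put(preprocess(f))
    q.put(None)  # sentinel: no more frames

t = threading.Thread(target=producer, args=(range(100),))
t.start()
while (item := q.get()) is not None:
    generate(item)
t.join()
```

The bounded queue also caps how many pre-processed frames sit in memory at once.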

kijai commented 1 week ago

> On my side, I think it's not really a matter of some resource bottleneck: I have 128 GB RAM, so frame off-loading is not the problem, and same with VRAM at 24 GB. I do have onnxruntime-gpu installed, but it's 1.5 I believe; maybe there's a version mismatch, but even that wouldn't explain the low load on both CPU and GPU in the pre-processing phase.
>
> Anyway, this could work far more efficiently, and given the low utilization numbers others have reported, I think that with proper use of both CPU and GPU the claimed 12.8 ms per frame is possible regardless of the video's length. It could be an issue with Comfy itself: it needs to finish one 'block' from the pre-process node before moving on to generation.

Their code has a lot of inefficiencies; I don't know whether their speed claim covers the whole process or just part of it. For example, skipping the pasteback gives a ~30% speed boost.

For reference, the numbers I'm currently getting for video editing on the develop branch with a 4090: the detection/cropping part, using CUDA for onnx, runs at 33 it/s.

The rest mostly uses the GPU, but there are also lots of CV2/numpy operations done on the CPU; I'm getting 12 it/s on a Ryzen 7950X.

So something like ~14 fps without pasteback and ~11 fps with it.

kijai commented 1 week ago

Oh, and about the memory issue... that's common in Comfy when the frame count gets really high. It's not really designed to handle that in general, as everything is kept in memory with no disk caching.
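As a back-of-the-envelope illustration of why long clips exhaust RAM, assuming frames are kept as float32 RGB tensors (an assumption for the estimate, not a statement about Comfy's internals):

```python
# Rough estimate of the RAM needed to hold a whole clip in memory,
# assuming frames are stored as float32 RGB tensors.
width, height, channels = 832, 1152, 3
bytes_per_value = 4          # float32
fps, seconds = 24, 60        # a 1-minute clip

frames = fps * seconds                                  # 1440 frames
bytes_per_frame = width * height * channels * bytes_per_value
total_gb = frames * bytes_per_frame / 1024**3
print(f"{frames} frames ~= {total_gb:.1f} GB")          # ~= 15.4 GB
```

That is for the input frames alone, before models and output frames, which is consistent with the ~29 GB usage and freezes reported above.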