GPU not fully utilized - Githubissues

Dave04O4 commented 1 year ago

Hello, I use my GPU, and it is fully utilized for roughly the first 10% of the run. After that it drops to a permanent 10-20%.

I once had a version from June/July where the GPU was fully utilized, but that is no longer the case in the current version.

I have already experimented with the batch size, but without much success.

[Screenshot: 2022-10-03 (1)]

tfaehse commented 1 year ago

Hi!

This took me quite a while to even look at, apologies for that. The answer isn't very straightforward, but I can try:

- Pure inference just doesn't faze the GPU for long, so the utilisation graph (per time unit) might not be granular enough to show the (short) spikes.
- Reading and writing video files introduces a pretty significant overhead, and every inference batch can only start when all frames for the batch are ready. In my case, with a fairly old CPU, just reading and writing back the frames (no detection, no blurring) already results in <30 fps for a 4K video.
- The current blurring algorithm produces much nicer results by fading blurred areas into the unblurred image. The downside is that this costs quite a bit of performance. It is yet another step that runs sequentially, so the next inference batch has to wait for it to finish.
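
To put rough numbers on the three points above: because the stages run one after another, the GPU is only busy during the detection step of each batch. Here is a minimal sketch of that arithmetic. The function name and the stage timings are made up for illustration; they are not measurements from the tool.

```python
def gpu_busy_fraction(read_s: float, detect_s: float, blur_write_s: float) -> float:
    """Fraction of wall-clock time spent in GPU inference for one batch,
    assuming the three stages run strictly one after another."""
    total = read_s + detect_s + blur_write_s
    if total <= 0:
        raise ValueError("stage times must add up to a positive duration")
    return detect_s / total


# Hypothetical timings for one batch, in seconds (illustrative, not measured).
fraction = gpu_busy_fraction(read_s=0.5, detect_s=0.15, blur_write_s=0.35)
print(f"GPU busy for {fraction:.0%} of the batch")
```

With timings in that ballpark, a utilisation reading of 10-20% is what you would expect even though the GPU itself is not the bottleneck.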

Generally, you want to choose the biggest batch size your GPU memory allows in order to optimise inference time, but you seem to have done that already. Do note that the 1080p_medium weights result in a pretty massive network that needs a lot of memory and compute. If your GPU can't fit more than one image per batch, you might want to reduce the inference size and/or choose a smaller model, e.g. 720p_small_mosaic at 720p.

There are a few things in the pipeline to improve this though!

- A feature is currently being implemented to run blurring in multiple processes within each batch. This makes it quite a bit faster.
- I'm rewriting the blurring algorithm to be a bit faster and less memory hungry, which should also make it possible to use more processes for blurring without running into memory limits.
- Newer, faster models are currently in training. That should also help with inference time.
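
For anyone curious what "blurring in multiple processes within each batch" could look like, here is a rough sketch using only the standard library. It is not the tool's actual code: `blur_frame`, `blur_batch`, the nested-list frame format and the flat mean fill (instead of a real blur) are all stand-ins chosen to keep the example self-contained.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Frame = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def blur_frame(job: Tuple[Frame, Sequence[Box]]) -> Frame:
    """Stand-in blur: fill every box with the mean of the pixels inside it."""
    frame, boxes = job
    out = [row[:] for row in frame]  # copy so the input frame stays untouched
    for x0, y0, x1, y1 in boxes:
        pixels = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        if not pixels:
            continue  # empty box, nothing to blur
        mean = sum(pixels) // len(pixels)
        for y in range(y0, y1):
            for x in range(x0, x1):
                out[y][x] = mean
    return out


def blur_batch(
    frames: Sequence[Frame],
    boxes_per_frame: Sequence[Sequence[Box]],
    workers: int = 2,
    executor_cls: Callable[..., Executor] = ProcessPoolExecutor,
) -> List[Frame]:
    """Blur all frames of one batch in parallel; results keep the input order."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(blur_frame, zip(frames, boxes_per_frame)))
```

Every frame is pickled and copied to a worker process, which is one reason more workers also means more memory. On Windows, a call that uses the default `ProcessPoolExecutor` has to sit under an `if __name__ == "__main__":` guard; passing `executor_cls=ThreadPoolExecutor` is a handy way to try the function out without spawning processes.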

Dave04O4 commented 1 year ago

Hi, thanks for the feedback. I have tried everything I can think of, including the other versions.

I had a version in June/July/August where a 1-minute clip took about 3 minutes and now it's about 10 minutes with the latest version.

With so much different hardware, it's sometimes really difficult to keep things running well.

Thanks for the great tool.

Intel Core i7-9700K, ASUS ROG Strix GeForce RTX 2080 OC

tfaehse commented 1 year ago

I'd assume that's due to the blurring. The new version (currently being developed here: https://github.com/tfaehse/DashcamCleaner/tree/feature/yolov8) addresses this somewhat:

- blurring is a bit faster
- you can set how many processes you want to use for blurring in the GUI
- the GUI shows which "stage" the tool is currently in (for each batch: get frames -> detect plates/faces -> blur and write frames)

With this version, the workflow for users is a bit simpler:

1. choose the weights file (maybe try 720p_small_v8 to see what that would look like)
2. choose the batch size (for best performance: as big as possible without getting CUDA out of memory errors)
3. choose the number of blurring workers (also as large as possible, but on Windows you run into memory errors very quickly)

I want to look into automating the second step (choosing the batch size), but that's for another day.
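
One way the batch size search could be automated is a binary search over trial runs. This is only a sketch under assumptions: `largest_batch_size` and `fits_in_memory` are hypothetical names, `run_batch` stands for "run one inference batch of n frames", and `MemoryError` is a placeholder for whatever out-of-memory exception the inference framework actually raises.

```python
from typing import Callable


def fits_in_memory(run_batch: Callable[[int], None], n: int) -> bool:
    """Try one batch of n frames and report whether it ran without an OOM error."""
    try:
        run_batch(n)
    except MemoryError:  # placeholder: a real probe would catch the framework's CUDA OOM error
        return False
    return True


def largest_batch_size(fits: Callable[[int], bool], upper_limit: int = 64) -> int:
    """Largest n in [1, upper_limit] for which fits(n) is True, or 0 if none fits.

    Assumes that whenever a batch of n fits, every smaller batch fits as well.
    """
    low, high = 0, upper_limit  # low is known to fit (or is 0); nothing above high fits
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

Usage would be something like `largest_batch_size(lambda n: fits_in_memory(run_batch, n))`. It needs only a handful of trial batches (about log2 of the upper limit), so the probing cost stays small compared to processing a whole video.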

tfaehse / DashcamCleaner

GPU not fully utilized #52