tfaehse / DashcamCleaner

Censor identifiable information in videos, in particular dashcam recordings in Germany.
GNU Affero General Public License v3.0

CPU not fully utilized #36

Open joshinils opened 2 years ago

joshinils commented 2 years ago

[image: CPU utilization screenshot]

I own an AMD Ryzen 3800XT, an 8-core CPU with 16 threads. When blurring a video, the CPU is not fully utilized.

My guess is this has something to do with the loop of reading frames from memory, detecting info, blurring it, storing the blurred frame, and repeating. My Python profiling skills are non-existent, so I can only guess that this is why CPU utilization is not at 100%. Interleaving the IO-bound tasks with the CPU-bound ones so they run concurrently would be ideal, so that reading and writing happen in the background while detection takes place.
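The overlap described above can be sketched as a simple producer/consumer pipeline. This is a hypothetical illustration, not DashcamCleaner's actual code: frame reading and the detect-and-blur step are stubbed out, and the point is only that a reader thread can fill a bounded queue while the main thread does the CPU-bound work.

```python
# Sketch: overlap frame I/O with detection using a reader thread and a queue.
# read_frames, blur_frame and the integer "frames" are stand-ins.
import queue
import threading

def read_frames(n_frames, frame_queue):
    """Producer: simulates reading frames from a video file."""
    for i in range(n_frames):
        frame_queue.put(i)      # a real reader would put ndarray frames here
    frame_queue.put(None)       # sentinel: no more frames

def blur_frame(frame):
    """Stand-in for the CPU-bound detect-and-blur step."""
    return frame

def process_video(n_frames):
    frame_queue = queue.Queue(maxsize=32)   # bounded, so memory use stays flat
    reader = threading.Thread(target=read_frames, args=(n_frames, frame_queue))
    reader.start()
    written = []
    while True:
        frame = frame_queue.get()
        if frame is None:
            break
        written.append(blur_frame(frame))   # detection overlaps with reading
    reader.join()
    return written
```

With real frames, the bounded queue is what keeps the reader from racing ahead of detection and filling memory.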

tfaehse commented 2 years ago

Lots of bottlenecks, unfortunately. Detection can run at almost any batch size, and especially on GPUs, the higher the better. But there are two issues with that: you need quite a bit of memory, and frames need to arrive quickly enough. So far, neither is really a given.

I've tried offloading reading and writing frames to separate processes, but surprisingly that didn't help much: the additional overhead, at least on my system, ate the gains from parallelisation. So far, the best approach I could find was to massively speed up reading/writing frames. https://github.com/tfaehse/DashcamCleaner/pull/32 contains a very much WIP draft, but the performance improvements are already quite large.

Basically, it makes sure that extracting frames, reading frames, and detection (given a high enough batch size) all attempt to saturate the given resources. As long as each of the steps uses near 100% of your CPU, it doesn't matter whether they're properly pipelined or not. So far, that has led to really nice results, on my laptop at least: up to 20 fps for inference at 360p with small weights. Maybe not the most realistic workload, but it's my default test case...

I still have to fix a lot of things there though, and check how this works when using a GPU.

TL;DR: Increasing the batch size and improving frame extraction should help.
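To make the batch-size suggestion concrete, here is a minimal sketch of batched detection. It assumes a detector that accepts a list of frames per call, as YOLO-style models do; `detect` and the frame values are illustrative stubs, not DashcamCleaner's actual API.

```python
# Sketch: group frames into batches so each model call amortizes its
# per-call overhead. detect() is a stub standing in for the real model.
def detect(batch):
    """Stub detector: returns an (empty) detection list per frame."""
    return [[] for _ in batch]

def batched(frames, batch_size):
    """Yield successive batches of at most batch_size frames."""
    batch = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch     # final partial batch

def run_detection(frames, batch_size=16):
    results = []
    for batch in batched(frames, batch_size):
        results.extend(detect(batch))   # one model call per batch
    return results
```

The trade-off mentioned above applies: larger batches mean fewer calls but more memory held at once, and they only help if frames arrive fast enough to fill them.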

joshinils commented 2 years ago

Hm, I found that when using the CPU, increasing the batch size does not really help; if anything, the opposite.

With batch_size == 5:

[image: CPU utilization screenshot]

The dips are where the progress bar updates, i.e. when the batch of frames is saved and the next one is loaded... so even then, detection does not use more CPU. :-/

joshinils commented 2 years ago

Almost makes me think that running detections for two batches in parallel would be beneficial, however weird that may sound, since a single batch of detections does not utilize the CPU well.
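The two-batches-in-parallel idea could be sketched with a small thread pool. This is purely hypothetical; `detect` is a stub, and in CPython the gain depends on the model releasing the GIL during inference (which PyTorch/NumPy ops generally do), so whether this actually helps would need measuring.

```python
# Sketch: run detection on two batches concurrently so one batch can
# compute while the other waits on memory/IO. detect() is a stand-in.
from concurrent.futures import ThreadPoolExecutor

def detect(batch):
    """Stub for the real model call."""
    return [frame * 2 for frame in batch]

def detect_two_at_a_time(batches):
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        # pool.map preserves batch order, so results stay in frame order
        for out in pool.map(detect, batches):
            results.extend(out)
    return results
```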

tfaehse commented 2 years ago

Hmm... it's not perfect for me, but much better. Batch size 16:

[image: CPU utilization screenshot]

I'll do some profiling soon.