Speed: Very slow processing and GPU hardly used?

nadermx / backgroundremover

Background Remover lets you Remove Background from images and video using AI with a simple command line interface that is free and open source.

https://www.backgroundremoverai.com

MIT License


Speed: Very slow processing and GPU hardly used? #18

Closed rcoenen closed 1 year ago

rcoenen commented 2 years ago

Hi, OK, I have successfully processed an 8-minute clip. It was a really slow process and took a couple of hours.

How is it that Zoom and MS Teams can do basically the same job in real time?

I have tried this on my M1 Mac and also on my PC with a GeForce GTX 1060 / 6 GB. (Yeah, I know this isn't mind-blowing, but still, it's a proper GPU, I'd reckon?) CUDA drivers are all the latest (V11.x).

In both cases, processing was slow as molasses, on the M1 and the PC alike.

On the PC I monitored the GPU usage, and "Python" was never taking more than 1% of the available GPU resources. Is that even right? It feels like it hardly uses the CPU at all?

In any case, it feels like the whole process could be done much faster, near real time even, given that Zoom, Teams, etc. can do it. Where is the bottleneck?

nadermx commented 2 years ago

Zoom and MS do something along these lines: https://github.com/nadermx/BackgroundMattingV2, which I plan to work on integrating into this at some point.

Did you try changing the batch size or workers? https://github.com/nadermx/backgroundremover#advance-usage-for-video

For images it barely uses the GPU, but for video it does.

rcoenen commented 2 years ago

Yes, I was processing video, and hardly any GPU was used. I was not expecting any GPU activity on the M1 because of the different architecture, but the PC is bog-standard NVIDIA hardware, and the process monitor still showed hardly any activity: just a few percentage points. So some activity was happening, which I guess shows that it did not fail to detect the GPU; it simply did not seem to use it very much.
Re BackgroundMatting: interesting. The video I was processing is one of me talking into my webcam, with the exact same setup and background as when I am on Zoom. I think this is a very standard use case for a lot of vlogger-type content. It's a good idea to implement that 'shortcut' technique. I realize the approach might not work for scenes other than someone in front of a static background, but that use case is still very valid.
Yes, I messed around with workers (up to 4) and also experimented with the GPU batch size. The only difference I noticed was that the process took very long to start up and froze my machine, without any noticeable increase in FPS. TBH, the documentation on what these settings do is limited, so I was just messing about to see what difference they made without really knowing what to expect.

rcoenen commented 2 years ago

Here is a screenshot of what I am getting. Actually, no GPU seems to be used... or is it? (Updated screenshot comparing the TechPowerUp GPU-Z info tool with MS's own Process Monitor.) It's quite odd that MS Process Monitor shows 0.2% GPU use while the GPU-Z tool says 100% CPU load. Either way, ~4 fps seems to be the fastest I can get.

backgroundremover -i .\me_recorded.mov -fr 30 -wn 2 -mk -o output.mp4

Update: I have tried loads of different values for both the worker nodes and the GPU batch size.

Default settings, measured over the first 100 frames: 4.0 FPS

backgroundremover -i .\me_recorded.mov -fr 30 -wn 1 -gb 1 -mk -o output.mp4

Increasing the GPU batch size step by step from 2 to 5 (6 and above results in "RuntimeError: CUDA out of memory") slightly increased the rate, to 4.3 FPS at most (measured over the first 100 frames):

backgroundremover -i .\me_recorded.mov -fr 30 -wn 1 -gb 5 -mk -o output.mp4

So there you go: 4.3 fps is the maximum performance on my machine. (The video is 23095 frames, so 23095/4.3 = 5371 seconds = 90 minutes of processing time for an 8-minute video.)
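The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name `estimated_processing_minutes` is made up for illustration and is not part of backgroundremover.

```python
def estimated_processing_minutes(total_frames: int, frames_per_second: float) -> float:
    """Estimate wall-clock processing time in minutes at a constant frame rate."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    seconds = total_frames / frames_per_second
    return seconds / 60.0


if __name__ == "__main__":
    # Figures taken from the comment above: 23095 frames at 4.3 fps.
    minutes = estimated_processing_minutes(23095, 4.3)
    print(f"Estimated processing time: {minutes:.1f} minutes")
```

With the numbers from the comment this comes out at roughly 90 minutes, matching the estimate.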

lue commented 2 years ago

I had the same problem with no GPU use on Windows. Essentially, PyTorch was not installed properly. I was able to fix it by creating a clean conda environment with CUDA 11.3, using the command recommended here: https://pytorch.org/get-started/locally/

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
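One way to check whether a PyTorch install can see the GPU at all is to print what the build reports. The sketch below assumes the problem is a CPU-only PyTorch build, which is one common cause and not something confirmed in this thread. The helper name `describe_torch_build` is hypothetical.

```python
def describe_torch_build() -> dict:
    """Report whether PyTorch is installed and whether it can use CUDA."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment at all.
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        # torch.version.cuda is None when the build was compiled without CUDA.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` prints `None`, the installed build cannot use an NVIDIA GPU regardless of the driver version, and reinstalling as described above is a reasonable next step.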

nadermx commented 2 years ago

I have only used this on Ubuntu. I am slowly working through some of these bugs and other things I noticed were wrong. I will see what I can do to speed it up more. I am open to ideas and patches as well.

nadermx commented 1 year ago

Try the newest version. I am consistently getting 30 fps with the default options:

pip install --upgrade backgroundremover

mindsocket commented 1 year ago

The new Mac M1/M2 architecture doesn't use Nvidia CUDA. With PyTorch 2.0.1, I confirmed that "mps" is available...

>>> import torch
>>> torch.cuda.is_available()
False
>>> torch.backends.mps.is_available()
True

Knowing that, I hacked bg.py and was able to get my machine to use the Mac GPU and run significantly faster:

#DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
DEVICE = torch.device('mps')

I'd submit a pull request to make this a 3-way check (cuda, mps or cpu), but I'm not sure I understand the version dependencies and nuances of PyTorch well enough to say confidently that I've got it right for everyone.
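A 3-way check along those lines might look like the sketch below. It is not the fix that was later applied to the project. The function names are hypothetical, and the `getattr` guard assumes that older PyTorch builds may lack `torch.backends.mps` entirely.

```python
def pick_device_name(cuda_ok: bool, mps_ok: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to the CPU."""
    if cuda_ok:
        return "cuda:0"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device_name() -> str:
    """Ask PyTorch which backends are usable and return a device string."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe; report the CPU.
        return "cpu"
    cuda_ok = torch.cuda.is_available()
    # Older PyTorch versions have no torch.backends.mps, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device_name(cuda_ok, mps_ok)


if __name__ == "__main__":
    print(detect_device_name())
```

The returned string could then be passed to `torch.device(...)` in place of the hard-coded `'mps'` above.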

nadermx commented 1 year ago

@mindsocket I just applied a fix. I don't have a Mac, but did it fail before? I ask because I put in a try/except. If not, I can add another validation.