locaal-ai / obs-localvocal

OBS plugin for local speech recognition and captioning using AI
https://obsproject.com/forum/resources/localvocal-live-stream-ai-assistant.1769/
GNU General Public License v2.0

Performance improvement #9

Closed by ogmkp 1 year ago

ogmkp commented 1 year ago

Hi, I'm testing on Debian 12 with OBS 29.1.3 using the preset parameters. My 4-thread CPU grinds and I get a randomly generated sentence with a huge delay. I've looked at Whisper.cpp, but I can't map its parameters to the plugin's settings. Do you have any recommended settings for fast, resource-efficient transcription?

Thanks a lot!
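
For context, whisper.cpp exposes several parameters that trade accuracy for latency. Here is a minimal sketch of latency-oriented settings, assuming the standard whisper.cpp C API; the field names come from whisper.h, but which of them the plugin actually exposes is an assumption:

#include "whisper.h"

// A sketch of latency-oriented whisper.cpp settings (not necessarily what
// the plugin uses; field names are from whisper.h).
whisper_full_params make_fast_params() {
    whisper_full_params p =
        whisper_full_default_params(WHISPER_SAMPLING_GREEDY); // greedy is cheaper than beam search
    p.n_threads      = 4;      // match physical cores; oversubscribing can hurt
    p.no_context     = true;   // don't feed previous text back in, saves decode time
    p.single_segment = true;   // one segment per audio chunk, suits live captions
    p.max_tokens     = 32;     // cap decoded tokens per chunk
    p.language       = "en";   // skip automatic language detection
    p.translate      = false;  // transcribe only
    return p;
}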

Destroy666x commented 1 year ago

Confirming here: even on a clean launch (after a laptop restart), CPU usage jumps from about 8% to 40-50% whenever I speak. Sentences then get generated very late and often not too accurately. Using default settings. Bigger models are way worse in terms of latency and CPU usage, though of course better in accuracy.

Intel i7 7820 CPU, Windows 10, OBS 29.1.3

Destroy666x commented 1 year ago

Would it be possible to allow GPU usage instead? In general my GPU has more headroom, as I'm mainly streaming games that rely more on the CPU. I see Whisper can run on a GPU, and this comparison also shows better performance on GPU, if I understand correctly: https://github.com/MiscellaneousStuff/openai-whisper-cpu#results

royshil commented 1 year ago

Yes, I'm working on an accelerated Whisper.cpp build and I'll open a pull request as soon as I get it working on my PC.

There are several options... but the general goal of GGML is to enable running on CPUs with their inherent acceleration, e.g. SIMD.

I'm still unpacking this, but it's important to get it right.
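
One quick way to see which of those CPU acceleration paths a given whisper.cpp build actually has compiled in is the library's system-info call; a minimal sketch, assuming the standard whisper.cpp C API:

#include <cstdio>
#include "whisper.h"

int main() {
    // Prints the acceleration features this build was compiled with
    // (AVX / AVX2 / FMA / NEON, BLAS, etc.), useful for verifying that
    // a SIMD- or BLAS-enabled build is actually the one being tested.
    std::printf("%s\n", whisper_print_system_info());
    return 0;
}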

royshil commented 1 year ago

@Destroy666x can you try the build in https://github.com/royshil/obs-localvocal/actions/runs/6142210185#artifacts ?

it should be much faster

Destroy666x commented 1 year ago

For me, CPU usage still seems rather high with that build, maybe a few percent lower on average.

royshil commented 1 year ago

@Destroy666x so there is an improvement! That's a good thing; for me it improves by ~100%, i.e. it's 2x faster.

were you able to benchmark whisper.cpp separately?

i think i will merge this in anyway, since it's an improvement

Destroy666x commented 1 year ago

Well, I think it is, but I don't quite know how to check consistently, as it ran under different conditions. Similar, but different, as Windows definitely had different random processes like indexers and whatnot running. But yeah, usage was around 35-45% compared to the previous 40-50% reported by OBS.

As for separately, do you mean checking Whisper's different options outside of this plugin? I can do that when I have time.

Destroy666x commented 1 year ago

I see there's bench.exe.

I haven't found a way to do multiple runs for a consistent test.

According to this, it should work better in OBS with the tiny model at least, as I also had a bigger delay with that.

And, interestingly, after increasing threads from 4 to 8, the small model went up to ~15 seconds 🤔
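
The 4-to-8-thread slowdown is plausible on an SMT CPU: GGML's compute-bound kernels gain little from hyper-threads, and oversubscription adds synchronization overhead (an assumption here, not something measured in this thread). For repeatable numbers, one option is a small harness that averages several runs; a sketch against the whisper.cpp C API, where `pcm` is a hypothetical pre-loaded 16 kHz mono sample buffer:

#include <chrono>
#include "whisper.h"

// Hypothetical benchmark harness: time whisper_full over several runs and
// average, to smooth out background-process noise. `pcm` is assumed to be
// 16 kHz mono float samples already loaded by the caller.
double bench_avg_ms(whisper_context *ctx, const float *pcm, int n_samples,
                    int n_threads, int runs) {
    whisper_full_params p = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    p.n_threads = n_threads;
    double total_ms = 0.0;
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        whisper_full(ctx, p, pcm, n_samples); // one full transcription pass
        auto t1 = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    return total_ms / runs;
}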

royshil commented 1 year ago

thanks for this research @Destroy666x. i'm looking into CLBlast acceleration next; it should be supported on many platforms and will be able to use the GPU

royshil commented 1 year ago

here are some timings i get consistently

No BLAS

whisper_print_timings:     load time =   137.18 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =  1551.64 ms /     1 runs ( 1551.64 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1724.93 ms

OpenBLAS

whisper_print_timings:     load time =   145.05 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =  1107.12 ms /     1 runs ( 1107.12 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1287.19 ms

CLBlast

whisper_print_timings:     load time =  1163.69 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =  2474.72 ms /     1 runs ( 2474.72 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  3670.20 ms

I conclude OpenBLAS brings the most performance on my PC.

Destroy666x commented 1 year ago

With what model and CPU/GPU, out of curiosity?

royshil commented 1 year ago

This is with an Intel i7-8700T. It has an NVidia GPU, but it's not being used. The Intel GPU is UHD Graphics 630, which is being used by CLBlast, but as you can see it doesn't bring any performance boost.

Destroy666x commented 1 year ago

And the tiny model, I assume? Weird that it doesn't use the "real" GPU.

royshil commented 1 year ago

Yes, this is the tiny model. The Nvidia/CUDA GPU is not being used since Whisper wasn't built to use it. I'm trying Whisper with CUDA now to see if it makes a difference...

Destroy666x commented 1 year ago

Oh, so there's yet another backend, cuBLAS, just for CUDA; I see that now: https://github.com/ggerganov/whisper.cpp/pull/834 I'll test it on my machine too, assuming compilation is as easy as shown there.

royshil commented 1 year ago

this is the timing for whisper with CUDA

whisper_print_timings:     load time =  1227.32 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =   728.45 ms /     1 runs (  728.45 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1991.31 ms

it is faster than the rest, but not a huge gain over OpenBLAS

the downside with CUDA is that it's so big there's no hope of shipping it with the plugin. and the compatibility is horrendous: e.g. if i compile against CUDA v12.2 and the client has v11.1, it doesn't work.
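
One mitigation (a sketch, not anything the plugin actually does) would be to probe the client's CUDA driver and runtime versions at load time and fall back to the CPU path with a clear message instead of failing opaquely. This assumes the standard CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical startup guard: compare the CUDA version the plugin was built
// against (CUDART_VERSION) with what the client machine actually provides,
// so a mismatch can trigger a CPU fallback instead of an opaque failure.
bool cuda_usable() {
    int driver = 0, runtime = 0;
    if (cudaDriverGetVersion(&driver) != cudaSuccess ||
        cudaRuntimeGetVersion(&runtime) != cudaSuccess || driver == 0)
        return false; // no CUDA driver installed at all
    std::printf("CUDA driver %d.%d, runtime %d.%d (built with %d.%d)\n",
                driver / 1000, (driver % 1000) / 10,
                runtime / 1000, (runtime % 1000) / 10,
                CUDART_VERSION / 1000, (CUDART_VERSION % 1000) / 10);
    // The installed driver must be at least as new as the runtime we link.
    return driver >= runtime;
}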

Destroy666x commented 1 year ago

For me it was 1.5x+ faster on an NVidia GeForce 1080.

Perhaps the binary could optionally be provided through a path setting, since there are so many different options? They're compatible with your code, right? Then additional options, like downloading CUDA and compiling that version, could be described in the documentation.
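
A hypothetical sketch of what such a path setting could look like on Windows, assuming the user-supplied whisper.dll exports the same C ABI as the whisper.h the plugin was compiled against (the function name is from whisper.cpp of that era):

#include <windows.h>
#include "whisper.h"

// Hypothetical loader: pick up a user-supplied whisper.dll from a path set
// in the plugin's settings instead of using the bundled build. This only
// works if the DLL's exported C ABI matches the bundled whisper.h.
typedef whisper_context *(*whisper_init_fn)(const char *model_path);

whisper_init_fn load_user_whisper(const char *dll_path) {
    HMODULE lib = LoadLibraryA(dll_path);
    if (!lib)
        return nullptr; // fall back to the bundled whisper build
    return reinterpret_cast<whisper_init_fn>(
        GetProcAddress(lib, "whisper_init_from_file"));
}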

royshil commented 1 year ago

@Destroy666x ok i've added CUDA building instructions.

as soon as this clears i'm going to merge since i'd like to release a new version

royshil commented 1 year ago

#12 has landed and introduced performance improvements

closing for now; we can reopen for further discussion and requests

ogmkp commented 1 year ago

Hey, I opened this issue because the plugin is extremely slow and CPU-hungry on Linux. Please keep it open!