ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

is it possible to run openai-whisper ggml model on raspberry pi hardware? #7

Closed nyadla-sys closed 2 years ago

nyadla-sys commented 2 years ago

Is it possible to run this ggml model on Raspberry Pi hardware?

StuartIanNaylor commented 1 year ago

You have to have the delegate, and actually TFLite doesn't automatically select the best delegate; it only selects the best internal delegate. I have used the ArmNN delegate, which is supposedly faster, but you have to specify it explicitly as an external delegate to be used.

The Pi GPU is much less capable, as it's not really a GPU: the Pi is an Arm chip on top of a DSP that is already doing various functions, and somewhere someone has probably calculated that it should be able to do a theoretical 13.5 - 32.0 GFLOPS. When it comes to actual performance, from those who have played with the Pi3/Pi4 and various retro emulators out of interest, it sort of sucks: it can probably do that theoretical 13.5 - 32.0 GFLOPS, but paging memory to it is a major bottleneck. I have read so many times about the supposed possibilities of VC4/VC6 and have never seen anything to back them up; for retro gaming or OpenCL the results have always been pretty poor, and using NEON on the Arm chip always seems to take preference.

https://github.com/ARM-software/armnn/blob/main/delegate/DelegateQuickStartGuide.md

That is how you use the ArmNN delegate. There also used to be an OpenCL delegate, and I am unsure why that isn't a current thing; ArmNN uses OpenCL, as does its delegate, so why there isn't simply an OpenCL delegate has always confused me, but it probably has limited operation support.
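
For reference, the quick-start guide above boils down to loading the ArmNN .so as an external delegate from Python, roughly like this (the library path, backend list and model name are placeholders for whatever your installation provides):

import tflite_runtime.interpreter as tflite

# load the ArmNN external delegate; path and backend priority are placeholders
armnn_delegate = tflite.load_delegate(
    library="/path/to/libarmnnDelegate.so",
    options={"backends": "GpuAcc,CpuAcc,CpuRef", "logging-severity": "info"})

# hand the delegate to the interpreter explicitly; ops the delegate cannot
# handle fall back to the default CPU kernels
interpreter = tflite.Interpreter(
    model_path="whisper.tflite",
    experimental_delegates=[armnn_delegate])
interpreter.allocate_tensors()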

nyadla-sys commented 1 year ago

One important thing to note: the current Whisper tflite model is hybrid (activations are in float and weights are in int8), which may be restricting how efficiently it can be delegated to the GPU or various hardware accelerators, as the GPU only offloads efficiently a model that is entirely float or entirely int8.

nyadla-sys commented 1 year ago

I am actively working on generating a fully int8 model; however, the tflite converter is having issues, which I have reported to Google, and Google is actively working on it.
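
For anyone curious, "full int8" here means running the TFLite converter with full integer quantization and a representative dataset, roughly like the sketch below (the saved-model path, input shape and calibration data are placeholders, not the actual conversion script):

import numpy as np
import tensorflow as tf

def representative_dataset():
    # placeholder calibration data; real mel spectrograms should be used,
    # and the (1, 80, 3000) input shape is only an assumption
    for _ in range(8):
        yield [np.random.rand(1, 80, 3000).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("whisper_saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# require an int8 kernel for every op; conversion fails where none exists,
# which is where the converter issues mentioned above show up
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("whisper-full-int8.tflite", "wb") as f:
    f.write(converter.convert())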

j1nx commented 1 year ago

@StuartIanNaylor Yes, indeed. It selects the best internal delegate; that logic does not take external delegates into account.

StuartIanNaylor commented 1 year ago

You can build it with the OpenCL GPU delegate: cmake ../tensorflow_src/tensorflow/lite -DTFLITE_ENABLE_GPU=ON. But the supported ops are really limited (https://www.tensorflow.org/lite/performance/gpu#supported_ops) to really simple models, otherwise it will be swapping support from layer to layer...? It could probably be done: https://www.tensorflow.org/lite/performance/implementing_delegate

j1nx commented 1 year ago

@StuartIanNaylor

I have planned to check out the OpenCL implementation for the RPi3: https://github.com/doe300/VC4CL

And for the RPi4, Mesa3D Vulkan with the clvk implementation: https://github.com/kpet/clvk

But I won't expect anything from it. Just for fun.

StuartIanNaylor commented 1 year ago

Nope :) but there is also the thought that maybe the encoder and decoder could be split, so you run both simultaneously on CPU & GPU, maybe?

I think this is where the Mali Arm boards gain an advantage, as I presume they have DMA methods rather than just paging in and out of memory. With ArmNN the Mali G610 is near the 8-core RK3588 CPU for ML, so if you can partition a model and run it on both, there will be some overhead I guess, but it should give a perf boost.

Also, I have always wondered why there is not a Vulkan delegate rather than OpenCL, as isn't Vulkan lower level with more ops, and likely would make a better delegate? I am trying to get my hands on an OrangePi02, as it's a quad A53 at 1.5 GHz with a Mali G31 MP2 that was £30 delivered (when delivered). I recently purchased a Pi0W purely out of interest, and that was £20; before even testing it I think it's rather pointless at that price/perf level :)

j1nx commented 1 year ago

Also, I have always wondered why there is not a Vulkan delegate rather than OpenCL, as isn't Vulkan lower level with more ops, and likely would make a better delegate?

I believe if you build tensorflow-lite with the GPU support on, it builds multiple GPU delegates, or at least with multiple backends supported, as it only pulls in the headers, including the OpenGL and Vulkan headers. I believe, again, it auto-selects the best one possible.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/CMakeLists.txt#L254

StuartIanNaylor commented 1 year ago

It only seems to be CL or OpenGL: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/gpu

Metal seems to be in the works, which might be of interest to Georgi, and probably, from what you posted, maybe Vulkan, but apart from headers I can not find much else. If you look at the kernels there does seem to be more support than is suggested on the web pages: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/gpu/gl/kernels

PS did you run the new benchmark tests? https://github.com/ggerganov/whisper.cpp/issues/89#issuecomment-1397840049

j1nx commented 1 year ago

Looks like Vulkan has not yet been released: https://github.com/tensorflow/tensorflow/issues/57861#issuecomment-1262662236

Hence, you know what, I will install the headers and enable the GPU. Perhaps the OpenGL path is not that bad compared to XNNPACK. At least perhaps we can use the CPU cores for other things while tflite runs on the GPU.

Just because we can. 😉

StuartIanNaylor commented 1 year ago

Lols, I like the attitude. I think OpenCL is the faster one though, as I'm just trying it with the RK3588 Mali G610 & tflite. Hey, https://github.com/tensorflow/tensorflow/issues/57861#issuecomment-1262662236 is a bummer, but at least it's in the wings! I presume my mention of running the encoder/decoder as two models on both in parallel is a no-go!?

j1nx commented 1 year ago

I can get things to work and patch the hell out of it; however, coding to separate the encoder and decoder is far beyond my reach.

StuartIanNaylor commented 1 year ago

I can not seem to get tflite-minimal to use the OpenCL delegate; it still uses XNNPACK even though the compile does create the binaries. Maybe it does need hardcoding. It also seems to try to use OpenCL 3.0 even though I did export and prefix with CL_TARGET_OPENCL_VERSION=210. The pip package seems to be the same, but if anyone manages it and can save me the pain, post a URL for a binary so I can test.

nyadla-sys commented 1 year ago

I can get things to work and patch the hell out of it; however, coding to separate the encoder and decoder is far beyond my reach.

I managed to generate encoder and decoder tflite models; I'm just pending completing the decoder post-processing to generate tokens and text: https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_encoder_decoder_tflite.ipynb
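
Rough idea of how the two models chain together, as a Python sketch (file names, tensor ordering and shapes are placeholders; the real details are in the notebook):

import numpy as np
import tflite_runtime.interpreter as tflite

# placeholder file names; the real models come from the notebook linked above
encoder = tflite.Interpreter(model_path="whisper-encoder.tflite")
decoder = tflite.Interpreter(model_path="whisper-decoder.tflite")
encoder.allocate_tensors()
decoder.allocate_tensors()

# run the encoder once over the 30 s mel spectrogram (zeros stand in for real
# audio here; shape and dtype are taken from the model itself)
mel_input = encoder.get_input_details()[0]
mel = np.zeros(mel_input["shape"], dtype=mel_input["dtype"])
encoder.set_tensor(mel_input["index"], mel)
encoder.invoke()
audio_features = encoder.get_tensor(encoder.get_output_details()[0]["index"])
print("encoder output:", audio_features.shape)

# the decoder is then invoked repeatedly, each step taking audio_features plus
# the token ids produced so far; mapping those ids back to text is the
# post-processing step that is still pending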

j1nx commented 1 year ago

I can not seem to get tflite-minimal to use the OpenCL delegate; it still uses XNNPACK even though the compile does create the binaries. Maybe it does need hardcoding. It also seems to try to use OpenCL 3.0 even though I did export and prefix with CL_TARGET_OPENCL_VERSION=210. The pip package seems to be the same, but if anyone manages it and can save me the pain, post a URL for a binary so I can test.

Can you download the proper benchmark binary from here: https://www.tensorflow.org/lite/performance/measurement

With that one you can benchmark models and force different delegates. What output do you get when you force the GPU delegate?

--help gives you all the options

From what I read online, OpenCL requires some configuration at the OS level. There is some sort of OpenCL info binary that tests whether it really uses the GPU and not the LLVM CPU fallback; "clinfo" I believe.

j1nx commented 1 year ago

@StuartIanNaylor

https://www.tensorflow.org/api_docs/python/tf/lite/experimental/load_delegate

And for C++ https://www.tensorflow.org/lite/performance/gpu

StuartIanNaylor commented 1 year ago

Yeah, I was expecting it to act as an internal delegate, but I don't think I will bother, as generally it's always slower than the CPU on SBC-type boards. If the Mali on the RK3588 was an MP6 rather than an MP4 then we would probably be talking about running the main model on the GPU and ancillary functions on the CPU. There is also a 3-core 2 TOPS (6 TOPS) NPU, but I still have to work out the framework, which seems to favour ONNX. If you could partition a model and run both concurrently then obviously you're going to get a perf boost, but how to do that is beyond me. Or, if you have a system that is running several models, you can offload a model to the GPU or NPU if you have one.

I am finding the optimisations for Armv8.2 give quite a boost, which is why I was trying the compile.

fquirin commented 1 year ago

Maybe interesting to try out and compare? They claim it's pretty fast: https://github.com/openai/whisper/discussions/937

fquirin commented 1 year ago

I've done some testing with the CTranslate2 port of Whisper and it seems to be the same speed as the optimized Bazel build of tflite_runtime. In addition it is much smaller since it doesn't require the Whisper package itself and it works well for other languages 🙂. Just the tiny.en model seems to behave a bit weird in my tests ^^.
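
If anyone wants to try the CTranslate2 port themselves, the faster-whisper Python package is the usual entry point; a minimal sketch (model size, thread count and audio file are examples only, not the exact test setup):

from faster_whisper import WhisperModel

# int8 compute keeps memory use low on Pi-class boards
model = WhisperModel("tiny", device="cpu", compute_type="int8", cpu_threads=4)

# fix the language and keep the beam small, as in the other benchmarks here
segments, info = model.transcribe("samples/jfk.wav", language="en", beam_size=1)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))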

StuartIanNaylor commented 1 year ago

I don't know if my version will run on yours; it would be interesting, as maybe the Cortex-A76 instructions are not implemented. The A76 is a strange Armv8.2 core, as it contains elements all the way up to Armv8.5: https://developer.arm.com/documentation/100798/0300/xdc1477563390075

I will link you a tflite build, as if it works on an A72 then Bazel doesn't optimise for the A76, but maybe I am wrong, as I didn't realise fp16/dotprod were optional.

 lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 8
  On-line CPU(s) list:  0-7
Vendor ID:              ARM
  Model name:           Cortex-A55
    Model:              0
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r2p0
    CPU max MHz:        1800.0000
    CPU min MHz:        408.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
                        asimdhp cpuid asimdrdm lrcpc dcpop asimddp
  Model name:           Cortex-A76
    Model:              0
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s):          2
    Stepping:           r4p0
    CPU max MHz:        2400.0000
    CPU min MHz:        408.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
                        asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Caches (sum of all):
  L1d:                  384 KiB (8 instances)
  L1i:                  384 KiB (8 instances)
  L2:                   2.5 MiB (8 instances)
  L3:                   3 MiB (1 instance)
Vulnerabilities:
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Vulnerable: Unprivileged eBPF enabled
  Srbds:                Not affected
  Tsx async abort:      Not affected

I should try, as maybe it's just not listed as a feature, and maybe it also needs -mtune=cortex-a76.cortex-a55.

fquirin commented 1 year ago

Are you trying to optimize tflite further? 🙂. Let me know when you want me to test a wheel file etc..

Btw my Orange Pi 5 8GB just arrived, can't wait to start experimenting 🤩

StuartIanNaylor commented 1 year ago

https://github.com/radxa/apt/blob/gh-pages/bullseye-stable/pool/main/libm/libmali/libmali-valhall-g610-g6p0-x11_1.9-1_arm64.deb if you need it. Also, https://github.com/JeffyCN/rockchip_mirrors/raw/libmali/firmware/g610/mali_csffw.bin should be in /lib/firmware, and then you should be good to go with OpenCL.

fquirin commented 1 year ago

I've added some Orange Pi 5 results using latest Armbian image (without any custom modifications): https://github.com/fquirin/speech-recognition-experiments . Speed is really incredible! 🚀

What I've noticed is that you should limit threads to 4 in most cases (see new Whisper-org test) to avoid problems with big-little architecture (4 performance, 4 efficiency cores)!

Unfortunately latest tests with Whisper-Cpp on Arm64 (Rpi and OPi) confirm that it is actually slower than the original 🤔

https://github.com/radxa/apt/blob/gh-pages/bullseye-stable/pool/main/libm/libmali/libmali-valhall-g610-g6p0-x11_1.9-1_arm64.deb

@StuartIanNaylor what does that do exactly? Some experimental GPU driver?

StuartIanNaylor commented 1 year ago

It's the Rockchip Mali driver that is part of the Radxa install; with the OPi you have to source it yourself, it's more DIY. Yeah, the RK3588x is a tri-cluster of 2x perf + 2x perf + 4x efficiency cores. If you use more than 4 threads then you will also be running on the efficiency cores, which often has the effect of a slowdown, as you hit a scaling deficiency plus slower cores.
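
If you are driving the tflite/Python versions rather than the CLI, the same pinning can be done in-process; a small sketch (the 4-7 core numbers assume the RK3588 layout above where those are the A76 cores, so check lscpu on your board):

import os

# keep the process on the four Cortex-A76 performance cores; 4-7 matches the
# taskset -c 4-7 used later in this thread, but core numbering is board-specific
os.sched_setaffinity(0, {4, 5, 6, 7})
print("running on cores:", sorted(os.sched_getaffinity(0)))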

Speed is really incredible! 🚀

Yeah, it's the first SBC that you can actually work on. Also, you're probably running from an SD card, but the default install has zram running with a swap and log dir defined. That is really great, as nearly all SD card wear is log writes, while the majority at runtime is reads. One thing they have missed: if the zram swap hits an uncompressible file you will OOM, so install a 2 GB dphys-swap; you will still use the zram swap, but dphys will act as a failover on the rare occasion an OOM might happen.

Radxa are also releasing an RK3588S board, but the sheer volume of vendors makes for a big herd, and with kernel 6.2 we might see basic mainline support already; all are sharing the same SoC, so there is much inertia.

I still haven't worked out whether "The Cortex-A76 core optionally implements the SDOT and UDOT instructions introduced in the Armv8.4-A extensions" applies here. I presumed it must, as the speed increase is far more than just the clock, but maybe not. Confused, but yeah, it's extremely fast if you're used to a Pi.

https://patchwork.ozlabs.org/project/gcc/patch/5B335DF0.2010201@foss.arm.com/

The OPi02 is not a bad alternative to a Pi3, as it's also a tad faster than a Pi3B, but it's nothing like this; it is Pi3 money though.

@fquirin PS: Manjaro have the basics of UART and LAN running on 6.2 on the OPi5.

fquirin commented 1 year ago

Thanks for the info. A mainline kernel would certainly be great, as I struggle to get my RTL8811CU WiFi stick running on Armbian (rumors say it works on other distros). I haven't experimented with the GPU yet, but I installed an M.2 SATA SSD (20 bucks for 240 GB 🤑) and it runs at a neat 360 MB/s 😎. I'm not using any desktop apps so far (Chromium etc.), just a headless server, but I wonder what ML libs could profit from the GPU 🤔

StuartIanNaylor commented 1 year ago

ML libs could profit from the GPU

ArmNN will utilise the GPU. https://github.com/StuartIanNaylor/rock5b-wav2letter-bench is where I experimented with the Arm tutorial, which was really bad, so I made some changes. It just needs the models converting, same with the NPU, but both can do the ML heavy lifting at a similar load level to the CPU, leaving the CPU to do other things... ML-wise the CPU is approx 4-5x a Pi4, the GPU seems approx 3-4x a Pi4, and I think each core of the 3-core NPU is approximately the same. Whisper would likely process on the NPU, as its toolkit seems very ONNX-centric.

fquirin commented 1 year ago

ArmNN will utilise the GPU. https://github.com/StuartIanNaylor/rock5b-wav2letter-bench is where I experimented with the Arm tutorial, which was really bad, so I made some changes.

Thanks, I'll try to do some experiments when I find time. It's an interesting topic, but I'll probably do an implementation of Whisper into SEPIA STT server next, so it can actually be used for the assistant :-)

Whisper would likely process on the NPU, as its toolkit seems very ONNX-centric.

All the Whisper versions I've tried ran completely on the CPU (RPi4 and OPi5), at least that's what I saw via htop.

virtuary commented 1 year ago

Some more experiments - enabling NEON instructions reduces the time all the way down to just ~15 seconds to process a 30 second audio.

New to this thread and exploring the Whisper/AI world on Raspberry Pi. I see you've reduced the processing time (sorry, I don't know the correct terms) down to 15 secs from voice input to output results. For something similar to an AI assistant, that is still a very long time to wait and seems "slow", even though it goes without saying that what you've done is an impressive job, and this is just the beginning of playing with these combinations.

Suggestion: I'm no good at building and coding from scratch, but I've tinkered with some projects. Concerning improving the performance of this Whisper/Raspberry setup, would it be possible to remove any and all config related to the various languages Whisper recognizes? For example, I personally would not need it to recognize any language other than English, yet the system will still have 71 other languages in its models as bloat. While I can understand how "cutting down the fat" this way might improve something like storage or the size of the code/build/program, I can't help thinking that it would improve speed as well. My thinking is that it would have to scrub through significantly less data or code to reach the junctions in the workflow necessary to output a result derived from a request. If I had any knowledge of how to code, or had Python experience, or anything remotely close to what some of you guys do here, I would try going down this route. I've seen someone ask something similar in another thread (https://github.com/openai/whisper/discussions/849), but I don't think a performance increase was their objective.

There must be some point in the process where the speech recognizer first has to determine which language is being spoken, aside from processing/holding/reading the audio input, before finally scrubbing through the datasets in the English language model to output information. I can't help thinking that if it has less of a jungle to crawl through, it would get there and back with a lot more ease. At least a little more processing time could be chipped away, even if it's just 2 or 3 seconds; that's still an improvement! I'm already blown away by the Whisper/OpenAI mash-up, but man, if this flow can become as fast as the response time of, let's say, Alexa or any of the other big assistants, the possibilities are even more mind-boggling than they already are. Thanks to all the peeps here working on these projects! You are more valuable than you think, and a hugely underrated part of the tech world that is rarely given its credit in this often thankless section of innovation. Cheers!!

fquirin commented 1 year ago

Hi @virtuary. When I tested all the Whisper variants (https://github.com/fquirin/speech-recognition-experiments) I made sure auto-detection of the language was off and the model was called with a fixed value. This saves some time, but not much. The tiny.en model is kind of what you've described, a model for English only, but I haven't seen any significant speed improvements when using it. Currently it feels like the Tflite and Ct2 versions hit the limit of what is possible on a Raspberry Pi. 3-4 s is already amazing considering the performance of Whisper, but in the context of a voice assistant (I wrote some comments about that as well) it's a bit too slow. On next-gen SBCs like the Orange Pi 5 the situation is a bit different; they might even be able to handle the next bigger model with acceptable speed 😎

Antigen-1 commented 1 year ago

Good news!

I just tried it on a Raspberry Pi 4 Model B from 2018 and it works!

The tiny.en model takes 140 sec to transcribe a 30 sec audio, but I think this can be improved, because I disabled all SIMD instructions to make it compile. I will improve this in the following days.

If you want to try it, use the raspberry branch:

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
git checkout raspberry
make tiny.en

But I got an error when building v1.2.1 with make base.en CC=clang on Ubuntu Server 22.04 on my Raspberry Pi 4:

whisper_model_load: ERROR not all tensors loaded from model file - expected 245, got 3
whisper_init: failed to load model
error: failed to initialize whisper context

fquirin commented 1 year ago

Hey @StuartIanNaylor, I thought it's better to continue discussing your latest benchmark results here :-). Did you use any special flags when building for the Rock 5B/Orange Pi 5? Any other Linux packages we might need to check out?

StuartIanNaylor commented 1 year ago

No, straight build, no compile changes. The only difference is the distro, from what I can see. Temperature-wise, 60 °C max.

fquirin commented 1 year ago

Could you run a simple test for comparison: ./main -m "models/ggml-tiny.bin" -f samples/jfk.wav -t 4 -l en --beam-size 1

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 |
...
whisper_print_timings:     load time =   152.68 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   301.16 ms
whisper_print_timings:   sample time =    26.09 ms /    25 runs (    1.04 ms per run)
whisper_print_timings:   encode time =  1798.86 ms /     1 runs ( 1798.86 ms per run)
whisper_print_timings:   decode time =   187.75 ms /    25 runs (    7.51 ms per run)
whisper_print_timings:    total time =  2549.35 ms

StuartIanNaylor commented 1 year ago
whisper_print_timings:     load time =  1384.94 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   328.22 ms
whisper_print_timings:   sample time =    25.17 ms /    25 runs (    1.01 ms per run)
whisper_print_timings:   encode time =  1203.73 ms /     1 runs ( 1203.73 ms per run)
whisper_print_timings:   decode time =   188.75 ms /    25 runs (    7.55 ms per run)
whisper_print_timings:    total time =  3290.21 ms

StuartIanNaylor commented 1 year ago

taskset -c 4-7 ./main -m "models/ggml-tiny.bin" -f samples/jfk.wav -t 4 -l en --beam-size 1

whisper_print_timings:     load time =   136.70 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   286.95 ms
whisper_print_timings:   sample time =    24.52 ms /    25 runs (    0.98 ms per run)
whisper_print_timings:   encode time =  1171.08 ms /     1 runs ( 1171.08 ms per run)
whisper_print_timings:   decode time =   119.49 ms /    25 runs (    4.78 ms per run)
whisper_print_timings:    total time =  1794.83 ms

Load time is low though, as the 2nd run loaded the model from memory.

fquirin commented 1 year ago

Thanks. Taskset doesn't really change anything for me, just the usual random fluctuations:

taskset -c 4-7 ./main -m "models/ggml-tiny.bin" -f samples/jfk.wav -t 4 -l en --beam-size 1

whisper_print_timings:     load time =   237.44 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   444.52 ms
whisper_print_timings:   sample time =    26.07 ms /    25 runs (    1.04 ms per run)
whisper_print_timings:   encode time =  1719.09 ms /     1 runs ( 1719.09 ms per run)
whisper_print_timings:   decode time =    96.07 ms /    25 runs (    3.84 ms per run)
whisper_print_timings:    total time =  2579.70 ms

I'm very confused 😅

StuartIanNaylor commented 1 year ago

Try the vendor-supplied distros. Taskset gives a tad more than running without taskset.

whisper_print_timings:     load time =   135.55 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   293.48 ms
whisper_print_timings:   sample time =    24.70 ms /    25 runs (    0.99 ms per run)
whisper_print_timings:   encode time =  1196.19 ms /     1 runs ( 1196.19 ms per run)
whisper_print_timings:   decode time =   186.75 ms /    25 runs (    7.47 ms per run)
whisper_print_timings:    total time =  1892.93 ms

fquirin commented 1 year ago

Did a sudo apt upgrade and it might have done a little bit ... or I simply get lucky shots from time to time:

whisper_print_timings:     load time =   198.58 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   276.80 ms
whisper_print_timings:   sample time =    26.17 ms /    25 runs (    1.05 ms per run)
whisper_print_timings:   encode time =  1247.96 ms /     1 runs ( 1247.96 ms per run)
whisper_print_timings:   decode time =    97.54 ms /    25 runs (    3.90 ms per run)
whisper_print_timings:    total time =  1902.06 ms

followed by 🤦‍♂️ :

whisper_print_timings:     load time =   148.12 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   428.73 ms
whisper_print_timings:   sample time =    27.10 ms /    25 runs (    1.08 ms per run)
whisper_print_timings:   encode time =  2160.85 ms /     1 runs ( 2160.85 ms per run)
whisper_print_timings:   decode time =   255.73 ms /    25 runs (   10.23 ms per run)
whisper_print_timings:    total time =  3101.65 ms

ggerganov commented 1 year ago

@fquirin Is it more stable with -t 3?

fquirin commented 1 year ago

Doesn't look like it:

whisper_print_timings:     load time =   205.04 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   538.64 ms
whisper_print_timings:   sample time =    25.99 ms /    25 runs (    1.04 ms per run)
whisper_print_timings:   encode time =  2296.48 ms /     1 runs ( 2296.48 ms per run)
whisper_print_timings:   decode time =    95.20 ms /    25 runs (    3.81 ms per run)
whisper_print_timings:    total time =  3244.47 ms

Directly after this:

whisper_print_timings:     load time =   225.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   366.83 ms
whisper_print_timings:   sample time =    26.04 ms /    25 runs (    1.04 ms per run)
whisper_print_timings:   encode time =  1449.13 ms /     1 runs ( 1449.13 ms per run)
whisper_print_timings:   decode time =    95.01 ms /    25 runs (    3.80 ms per run)
whisper_print_timings:    total time =  2245.81 ms

fquirin commented 1 year ago

Before trying a completely new OS I gave the Ubuntu 23 slim Docker image a chance:

Got a new all-time low ^^:

whisper_print_timings:     load time =   147.25 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   279.83 ms
whisper_print_timings:   sample time =    26.91 ms /    25 runs (    1.08 ms per run)
whisper_print_timings:   encode time =  1149.59 ms /     1 runs ( 1149.59 ms per run)
whisper_print_timings:   decode time =    97.91 ms /    25 runs (    3.92 ms per run)
whisper_print_timings:    total time =  1781.86 ms

but average looks more like:

whisper_print_timings:     load time =   287.24 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   337.92 ms
whisper_print_timings:   sample time =    26.78 ms /    25 runs (    1.07 ms per run)
whisper_print_timings:   encode time =  1549.98 ms /     1 runs ( 1549.98 ms per run)
whisper_print_timings:   decode time =    97.53 ms /    25 runs (    3.90 ms per run)
whisper_print_timings:    total time =  2382.50 ms

I've seen everything from 1.7 s to 3.3 s in no particular order. Of course, this could still be an issue with the host OS.

ggerganov commented 1 year ago

Just tried the 8-bit model on my RPi4 which is running a 32-bit OS:

pi@raspberrypi:~/whisper.cpp $ getconf LONG_BIT
32
pi@raspberrypi:~/whisper.cpp $ ./main -m ./models/ggml-tiny.en-q8_0.bin ./samples/jfk.wav -t 3
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 7
whisper_model_load: type          = 1
whisper_model_load: mem required  =  172.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   43.18 MB
whisper_model_load: model size    =   43.14 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 3 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 3 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:08.000]   And so my fellow Americans ask not what your country can do for you
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.

whisper_print_timings:     load time =   433.38 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =  1068.06 ms
whisper_print_timings:   sample time =   192.17 ms /    27 runs (    7.12 ms per run)
whisper_print_timings:   encode time =  9107.05 ms /     1 runs ( 9107.05 ms per run)
whisper_print_timings:   decode time =   762.21 ms /    27 runs (   28.23 ms per run)
whisper_print_timings:    total time = 11918.20 ms
pi@raspberrypi:~/whisper.cpp $ ./main -m ./models/ggml-tiny.en-q8_0.bin ./samples/jfk.wav -t 3
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 7
whisper_model_load: type          = 1
whisper_model_load: mem required  =  172.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   43.18 MB
whisper_model_load: model size    =   43.14 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 3 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 3 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:08.000]   And so my fellow Americans ask not what your country can do for you
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.

whisper_print_timings:     load time =   429.34 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =  1062.75 ms
whisper_print_timings:   sample time =    77.46 ms /    27 runs (    2.87 ms per run)
whisper_print_timings:   encode time = 10014.02 ms /     1 runs (10014.02 ms per run)
whisper_print_timings:   decode time =   413.60 ms /    27 runs (   15.32 ms per run)
whisper_print_timings:    total time = 12351.25 ms
pi@raspberrypi:~/whisper.cpp $ ./main -m ./models/ggml-tiny.en-q8_0.bin ./samples/jfk.wav -t 3
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 7
whisper_model_load: type          = 1
whisper_model_load: mem required  =  172.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   43.18 MB
whisper_model_load: model size    =   43.14 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 3 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 3 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:08.000]   And so my fellow Americans ask not what your country can do for you
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.

whisper_print_timings:     load time =   433.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   890.49 ms
whisper_print_timings:   sample time =    77.42 ms /    27 runs (    2.87 ms per run)
whisper_print_timings:   encode time =  9910.22 ms /     1 runs ( 9910.22 ms per run)
whisper_print_timings:   decode time =   417.30 ms /    27 runs (   15.46 ms per run)
whisper_print_timings:    total time = 12083.65 ms
pi@raspberrypi:~/whisper.cpp $ ./main -m ./models/ggml-tiny.en-q8_0.bin ./samples/jfk.wav -t 3
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 7
whisper_model_load: type          = 1
whisper_model_load: mem required  =  172.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   43.18 MB
whisper_model_load: model size    =   43.14 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 3 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 3 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:08.000]   And so my fellow Americans ask not what your country can do for you
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.

whisper_print_timings:     load time =   435.73 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =  1075.16 ms
whisper_print_timings:   sample time =    77.48 ms /    27 runs (    2.87 ms per run)
whisper_print_timings:   encode time =  8273.19 ms /     1 runs ( 8273.19 ms per run)
whisper_print_timings:   decode time =   414.45 ms /    27 runs (   15.35 ms per run)
whisper_print_timings:    total time = 10632.44 ms
pi@raspberrypi:~/whisper.cpp $  

The total time fluctuates around 12s but there is a big variation as well. Last run dropped to 10.6s. Not sure what is the cause of this variation.

StuartIanNaylor commented 1 year ago

Download the Focal OPi image to take it out of the equation. I don't get how, with the exact same SBC/board, you get so much variance.

whisper_print_timings:    total time =  1791.24 ms
whisper_print_timings:    total time =  1797.02 ms
whisper_print_timings:    total time =  1780.25 ms
whisper_print_timings:    total time =  1791.39 ms
whisper_print_timings:    total time =  1784.43 ms

Consecutive runs.

fquirin commented 1 year ago

Indeed, I quickly flashed Ubuntu Jammy server (Ubuntu 22.04.2 LTS) onto an SD card 😲:

taskset -c 4-7 ./main -m "models/ggml-tiny.bin" -f samples/jfk.wav -t 4 -l en --beam-size 1

whisper_print_timings:    total time =  1799.18 ms
whisper_print_timings:    total time =  1818.56 ms
whisper_print_timings:    total time =  1795.55 ms
whisper_print_timings:    total time =  1811.46 ms

[EDIT] Benchmark is still pretty slow though (maybe the CPU is throttling due to heat idk):

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| OrangePi5 | Ubuntu 22.04.2 LTS | NEON | tiny | 4 | 111 | 3397 | 05bef0f |

ggerganov commented 1 year ago

@fquirin and @StuartIanNaylor

Can you bench the Q8_0 model as well?

# quantize to 8-bits
./quantize models/ggml-tiny.bin models/ggml-tiny-q8_0.bin q8_0

fquirin commented 1 year ago

With Q8_0 I'm getting pretty consistent: whisper_print_timings: total time = 1553.91 ms

and with Q5_0: whisper_print_timings: total time = 1888.17 ms

fquirin commented 1 year ago

Running the benchmark with only the encoder gives pretty stable results:

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| OPi5 | Ubuntu 22.04.2 LTS | NEON | tiny | 4 | 106 | 1179 | 05bef0f |
| OPi5 | Ubuntu 22.04.2 LTS | NEON | tiny-q5_0 | 4 | 77 | 1339 | 05bef0f |
| OPi5 | Ubuntu 22.04.2 LTS | NEON | tiny-q8_0 | 4 | 91 | 1027 | 05bef0f |

Maybe the 'ggml_mul_mat' benchmark leads to throttling of the CPU after some time 🤔, but a drop from '3397' to '1179' seems pretty drastic.

StuartIanNaylor commented 1 year ago

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 122 | 1184 | 05bef0f |
| <todo> | <todo> |  NEON | tiny-q5_0 | 4 | 91 | 1358 | 05bef0f |
| <todo> | <todo> |  NEON | tiny-q8_0 | 4 | 103 | 1042 | 05bef0f |
| <todo> | <todo> |  NEON | base | 4 | 167 | 2908 | 05bef0f |
| <todo> | <todo> |  NEON | base-q5_0 | 4 | 103 | 3203 | 05bef0f |
| <todo> | <todo> |  NEON | base-q8_0 | 4 | 125 | 2542 | 05bef0f |
| <todo> | <todo> |  NEON | small | 4 | 382 | 10883 | 05bef0f |
| <todo> | <todo> |  NEON | small-q5_0 | 4 | 190 | 11475 | 05bef0f |
| <todo> | <todo> |  NEON | small-q8_0 | 4 | 264 | 8009 | 05bef0f |
| <todo> | <todo> |  NEON | medium | 4 | 3253 | 35805 | 05bef0f |
| <todo> | <todo> |  NEON | medium-q5_0 | 4 | 441 | 37224 | 05bef0f |
| <todo> | <todo> |  NEON | medium-q8_0 | 4 | 5922 | 26390 | 05bef0f |
| <todo> | <todo> |  NEON | large | 4 | 46942 | 85866 | 05bef0f |
| <todo> | <todo> |  NEON | large-q5_0 | 4 | 826 | 69961 | 05bef0f |
| <todo> | <todo> |  NEON | large-q8_0 | 4 | 26708 | 47956 | 05bef0f |

fquirin commented 1 year ago

Since we get very similar and stable results in single runs now, I decided to investigate the degrading performance for longer runs a bit more. Here are 2 consecutive benchmark runs (encoder-only):

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
|  |  | NEON | tiny | 4 | 114 | 1173 | 05bef0f |
|  |  | NEON | tiny-q5_0 | 4 | 70 | 1342 | 05bef0f |
|  |  | NEON | tiny-q8_0 | 4 | 87 | 1035 | 05bef0f |
|  |  | NEON | small | 4 | 374 | 12469 | 05bef0f |
|  |  | NEON | medium | 4 | 1063 | 67746 | 05bef0f |

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
|  |  | NEON | tiny | 4 | 110 | 1399 | 05bef0f |
|  |  | NEON | tiny-q5_0 | 4 | 70 | 2060 | 05bef0f |
|  |  | NEON | tiny-q8_0 | 4 | 92 | 1756 | 05bef0f |
|  |  | NEON | small | 4 | 383 | 22989 | 05bef0f |
|  |  | NEON | medium | 4 | 1098 | 90906 | 05bef0f |

And here are a few consecutive single runs with the small model:

whisper_print_timings:    total time = 13441.03 ms
whisper_print_timings:    total time = 13952.61 ms
whisper_print_timings:    total time = 14572.80 ms
whisper_print_timings:    total time = 16029.31 ms
whisper_print_timings:    total time = 17203.41 ms

I'd say this is a pretty strong indication that my Orange Pi 5 is throttling after about 30s of cooking the CPU 🤔. I'm starting to think that Armbian is handling this throttling differently.

StuartIanNaylor commented 1 year ago

@fquirin Have you ever just opened up another CLI window and monitored the temps vs clock speed? 75 °C is the throttle point, which is quite low for a CPU. I have an extremely good cooling solution now with the armour case, but as it comes by default just about everything is wrong, and it was far inferior to the 40 mm stick-on heatsink I used at first.

I can run stress-ng --cpu 8 --vm 2 --vm-bytes 128M --fork 4 constantly and settle at approx 60 °C max.

phoronix-test-suite benchmark stockfish is supposedly the heaviest load, but the peak of the ggml_mul_mat benchmark can reach 11 watts at the plug, which is the highest I have seen on this SoC.

watch -n 1 cat /sys/class/thermal/thermal_zone*/temp
watch -n 1 cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq

If you want to go all out on a cooling solution then maybe https://www.amazon.com/dp/B0C2T9N9L2? A stick-on 30-40 mm heatsink with a fan is a much cheaper option, but it sounds like what you are using is not adequate.
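
If you prefer a single window, a quick Python loop can poll the same sysfs files as the two watch commands above (temps are in millidegrees C and frequencies in kHz; cpuinfo_cur_freq may need root, scaling_cur_freq works unprivileged):

import glob
import time

while True:
    # same data as the watch commands: SoC temperatures and per-core clocks
    temps = [int(open(p).read()) // 1000
             for p in sorted(glob.glob("/sys/class/thermal/thermal_zone*/temp"))]
    freqs = [int(open(p).read()) // 1000
             for p in sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq"))]
    print("temp C:", temps, "| MHz:", freqs)
    time.sleep(1)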