Results for Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-4790K | Debian | | tiny.en | 4 | 165 | 808 |
i7-4790K | Debian | | tiny.en | 8 | 165 | 783 |
i7-4790K | Debian | | base.en | 4 | 212 | 1813 |
i7-4790K | Debian | | base.en | 8 | 214 | 1746 |
Results for Ryzen 5 4500U 6C/6T laptop CPU (I've just included one result for 8 threads as Encode time is much higher when threads > CPU cores).
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Ryzen 5 4500U (6C/6T) | Opensuse Leap | tiny.en | 4 | 170.00 | 829.43 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | tiny.en | 6 | 143.03 | 671.74 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | base.en | 4 | 305.92 | 2,092.39 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | base.en | 6 | 188.05 | 1,495.61 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | small.en | 4 | 408.03 | 6,919.31 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | small.en | 6 | 359.23 | 6,370.83 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | medium.en | 4 | 2,238.11 | 25,863.28 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | medium.en | 6 | 1,113.04 | 19,672.63 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | medium.en | 8 | 973.65 | 39,619.20 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-11800H | WSL2 Ubuntu | AVX2 | tiny | 2 | 164.35 | 1087.61 |
i7-11800H | WSL2 Ubuntu | AVX2 | tiny | 4 | 128.94 | 733.24 |
i7-11800H | WSL2 Ubuntu | AVX2 | tiny | 8 | 137.57 | 619.88 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | tiny | 2 | 143.02 | 1087.15 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | tiny | 4 | 127.60 | 730.57 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | tiny | 8 | 125.62 | 616.27 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 BLAS | tiny | 2 | 132.59 | 1511.38 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 BLAS | tiny | 4 | 132.48 | 1407.49 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 BLAS | tiny | 8 | 133.82 | 1458.27 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-11800H | WSL2 Ubuntu | AVX2 | base | 2 | 174.34 | 2533.79 |
i7-11800H | WSL2 Ubuntu | AVX2 | base | 4 | 166.68 | 1830.67 |
i7-11800H | WSL2 Ubuntu | AVX2 | base | 8 | 165.53 | 1478.73 |
i7-11800H | WSL2 Ubuntu | AVX2 | small | 2 | 340.12 | 8714.24 |
i7-11800H | WSL2 Ubuntu | AVX2 | small | 4 | 394.32 | 6021.41 |
i7-11800H | WSL2 Ubuntu | AVX2 | small | 8 | 305.98 | 4828.84 |
i7-11800H | WSL2 Ubuntu | AVX2 | large | 2 | 3205.36 | 57109.10 |
i7-11800H | WSL2 Ubuntu | AVX2 | large | 4 | 2720.25 | 38519.89 |
i7-11800H | WSL2 Ubuntu | AVX2 | large | 8 | 3716.34 | 27739.99 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | large | 2 | 1954.21 | 54966.84 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | large | 4 | 1455.40 | 37320.62 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | large | 8 | 1372.58 | 27937.64 |
This performance is impressive!
M1 Pro | MacOS | | large | 8 | 1973 | 4208
This performance is impressive!
Yes, there is a huge performance boost due to using the built-in BLAS implementation on these devices. I will soon add OpenBLAS support for x86 architectures and see how this compares.
By the way, AVX-512 is not supported on master. I have added initial support here, but I am not sure if it works: https://github.com/ggerganov/whisper.cpp/pull/95
CPU | OS | Config | Model | Threads | Load[ms] | encode[ms] |
---|---|---|---|---|---|---|
Intel® Core™ i5-8250U | Win11 Home | AVX2 | Large | 8 | 2226.85 | 61547.61 |
compiled with MinGW64 gcc 11.3
Valve Jupiter (AMD Custom APU 0405, Zen 2 microarch, 4c8t, 16GB DDR5 @ 5200 MT/s)
CPU | OS | Config | Model | Threads | Load[ms] | encode[ms] |
---|---|---|---|---|---|---|
AMD Custom APU 0405 | SteamOS 3.2 | AVX2 | Base | 8 | 326.32 | 2592.96 |
Compiled with cc (GCC) 11.3.0
The performance gains on jfk.wav since the last test (two weeks or so ago) are extremely impressive - a ~10-20x speedup, from ~40 seconds down to 2-4 seconds.
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
MacBook M1 Max | macOS Ventura | BLAS | small | 1 | 299.09 | 4166.00 |
MacBook M1 Max | macOS Ventura | BLAS | small | 4 | 329.45 | 1304.32 |
MacBook M1 Max | macOS Ventura | BLAS | base | 1 | 139.10 | 1302.17 |
MacBook M1 Max | macOS Ventura | BLAS | base | 4 | 135.96 | 399.45 |
On an AMD EPYC 64-core, 240-thread cloud instance it is stuck like this with 240 threads. I noticed that above a certain number of threads it is slow, or the cloud provider is CPU-limiting. Can anyone else with real hardware check if this is the case?
time ./main -m models/ggml-base.en.bin -f elon.wav -t 240
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 670.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 240 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing 'elon.wav' (34466688 samples, 2154.2 sec), 240 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ..
So I have tried various numbers of threads with the above-mentioned cloud provider.
I found that anything above 64 threads gets slower, and it stays usable up to 120 threads. Anything above that hangs. It must be that the cloud provider is throttling the free trial, or too many threads could actually slow things down.
...
...
processor : 239
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7742 64-Core Processor
stepping : 0
microcode : 0x830104d
cpu MHz : 2245.780
cache size : 512 KB
physical id : 1
siblings : 120
core id : 59
cpu cores : 60
apicid : 247
initial apicid : 247
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4491.56
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
time ./main -m models/ggml-base.en.bin -f elon.wav -t 64
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 670.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 64 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing 'elon.wav' (34466688 samples, 2154.2 sec), 64 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:03.960] [MUSIC PLAYING]
[00:00:03.960 --> 00:00:18.240] In life, we've seen within this part of the world
...
...
[00:35:40.320 --> 00:35:41.920] Thank you, and have a great day.
[00:35:41.920 --> 00:35:43.920] [APPLAUSE]
[00:35:43.920 --> 00:35:45.920] [MUSIC PLAYING]
[00:35:45.920 --> 00:35:56.240] [VIDEO PLAYBACK]
whisper_print_timings: load time = 249.61 ms
whisper_print_timings: mel time = 1267.11 ms
whisper_print_timings: sample time = 1718.69 ms
whisper_print_timings: encode time = 63702.25 ms / 10617.04 ms per layer
whisper_print_timings: decode time = 381317.66 ms / 63552.94 ms per layer
whisper_print_timings: total time = 448362.19 ms
real 7m28.411s
user 347m2.230s
sys 22m42.511s
32 threads was faster than 64 threads. I think 32 threads took around 7 minutes or so.
Env: Restricted Cloud / Throttled Maybe
CPU: AMD EPYC 7742 64-Core Processor
OS:
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
Linux XXXX 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Compiler:
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04.1' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
$ ./bench -m ./models/ggml-small.en.bin -t 4
whisper_model_load: loading model from './models/ggml-small.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 768
whisper_model_load: n_text_head = 12
whisper_model_load: n_text_layer = 12
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 3
whisper_model_load: mem_required = 1588.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size = 68.48 MB
whisper_model_load: model size = 464.44 MB
system_info: n_threads = 4 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
whisper_print_timings: load time = 515.02 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 6878.32 ms / 573.19 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 7393.42 ms
If you wish, you can submit these results here:
https://github.com/ggerganov/whisper.cpp/issues/89
Please include the following information:
- CPU model
- Operating system
- Compiler
$ ./bench -m ./models/ggml-small.en.bin -t 240
whisper_model_load: loading model from './models/ggml-small.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 768
whisper_model_load: n_text_head = 12
whisper_model_load: n_text_layer = 12
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 3
whisper_model_load: mem_required = 1588.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size = 68.48 MB
whisper_model_load: model size = 464.44 MB
system_info: n_threads = 240 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
whisper_print_timings: load time = 528.66 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 12898.34 ms / 1074.86 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 13427.03 ms
If you wish, you can submit these results here:
https://github.com/ggerganov/whisper.cpp/issues/89
Please include the following information:
- CPU model
- Operating system
- Compiler
I'll remove the above posts if they're too much clutter.
@trholding Thanks for the results.
You can generate a table with performance results by simply running the extra/bench_all.sh script.
Regarding the threads - yes, it seems that going beyond 8 threads does not help regardless of how many cores you have. My guess is that the computation is memory-bound so that's why using more threads does not improve the performance.
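If anyone wants to sanity-check the memory-bound hypothesis on their own machine, below is a rough, self-contained C++ sketch (not part of whisper.cpp; the file name, buffer size and thread counts are arbitrary choices) that measures aggregate memcpy bandwidth as the thread count grows. If the aggregate number plateaus well before the core count, that is consistent with the computation being limited by memory bandwidth rather than by compute.

```cpp
// bandwidth_probe.cpp - hypothetical helper, build with: g++ -O2 -pthread bandwidth_probe.cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>

int main() {
    const size_t buf_size = 64 * 1024 * 1024; // 64 MiB per thread (two buffers each, arbitrary)
    const int    reps     = 8;                // repeat copies to reduce timing noise

    for (int n_threads : {1, 2, 4, 8, 16}) {
        // one private source/destination buffer pair per thread
        std::vector<std::vector<char>> src(n_threads, std::vector<char>(buf_size, 1));
        std::vector<std::vector<char>> dst(n_threads, std::vector<char>(buf_size, 0));

        const auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int i = 0; i < n_threads; ++i) {
            pool.emplace_back([&, i] {
                for (int r = 0; r < reps; ++r) {
                    std::memcpy(dst[i].data(), src[i].data(), buf_size);
                }
            });
        }
        for (auto &t : pool) t.join();
        const double sec = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();

        const double gb = double(n_threads) * reps * buf_size / 1e9;
        std::printf("%2d threads: %6.1f GB/s aggregate copy bandwidth\n", n_threads, gb / sec);
    }
    return 0;
}
```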
Okay, 8 threads max - so for a large file, is there a possibility of splitting the file into chunks, using silences as terminators, and dividing the conversion across ((total threads/cores)/8) processes while also keeping track of timestamps? This could be awesome for batch conversion.
You can generate a table with performance results by simply running the extra/bench_all.sh script.
Oh, I didn't know, I'll update with tables soon and remove my previous comments in a few hours.
You can generate a table with performance results by simply running the extra/bench_all.sh script.
Hey, sorry. That didn't pan out well: I did the benchmark three times and my account got deleted without notice. I couldn't get the logs as it was a web terminal. On the other hand, I am happy that this happened - I was giving serious thought to purchasing a GPU+CPU plan there, so a performance check of the CPU was equally important. Technically it was probably my fault - I probably shouldn't have used a reverse shell and done benchmarks on a free trial, but how does one know if a service is really good or all just vapor...
Dell Precision 5560 laptop results:
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-11850H | Ubuntu | AVX2 | tiny | 4 | 115.87 | 538.43 |
i7-11850H | Ubuntu | AVX2 | base | 4 | 145.14 | 1241.84 |
i7-11850H | Ubuntu | AVX2 | small | 4 | 299.30 | 4343.57 |
i7-11850H | Ubuntu | AVX2 | medium | 4 | 760.98 | 15238.31 |
i7-11850H | Ubuntu | AVX2 | large | 4 | 1404.32 | 27476.86 |
i7-11850H | Ubuntu | AVX2 | tiny | 8 | 131.96 | 358.81 |
i7-11850H | Ubuntu | AVX2 | base | 8 | 166.61 | 839.31 |
i7-11850H | Ubuntu | AVX2 | small | 8 | 320.29 | 2854.86 |
i7-11850H | Ubuntu | AVX2 | medium | 8 | 756.20 | 9829.62 |
i7-11850H | Ubuntu | AVX2 | large | 8 | 1382.38 | 19872.81 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Tiny | 4 | 124.28 | 656.41 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Tiny | 8 | 123.70 | 696.41 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Base | 4 | 159.91 | 1754.44 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Base | 8 | 164.47 | 1658.55 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Small | 4 | 330.91 | 6161.86 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Small | 8 | 346.22 | 5187.85 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-1065G7 | Windows 11 | - | small.en | 4 | 1,314.25 | 294,168.09 |
Compiled with VS 2022
Something is off, right?
Yup - you are missing the AVX2 flag. See if some of the comments in https://github.com/ggerganov/whisper.cpp/issues/5 can help you resolve this.
OK, the AVX2 flag seems to help :)
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-1065G7 | Windows 11 | AVX2 | small.en | 4 | 527.59 | 9,648.67 |
Compiled with VS 2022
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Remarks |
---|---|---|---|---|---|---|---|
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 861.34 | 29428.21 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 1 | 843.80 | 16145.62 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 835.68 | 21509.08 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 4 | 824.24 | 13187.96 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 1146.02 | 87615.00 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 1 | 1103.39 | 52228.30 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 1183.47 | 55256.20 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 4 | 1161.32 | 29851.40 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 752.64 | 24018.10 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 1 | 751.96 | 13082.95 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 743.37 | 10122.80 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 4 | 742.90 | 9564.89 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 974.46 | 71587.61 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 1 | 979.65 | 43852.07 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 982.24 | 24814.62 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 4 | 982.80 | 19910.19 | Without OVOS services running |
From the stream repo
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
RK3588 | Ubuntu20.04 | NEON | tiny.en | 4 | 243.54 ms | 779.49 ms |
RK3588 | Ubuntu20.04 | NEON | base.en | 4 | 316.52 ms | 1821.06 ms |
RK3588 | Ubuntu20.04 | NEON | small.en | 4 | 618.93 ms | 7117.69 ms |
RK3588 | Ubuntu20.04 | NEON | medium.en | 4 | 1514.88 ms | 24139.92 ms |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
RK3588 | Ubuntu20.04 | NEON | tiny | 4 | 233.86 ms | 791.01 ms |
RK3588 | Ubuntu20.04 | NEON | base | 4 | 297.93 ms | 1813.69 ms |
RK3588 | Ubuntu20.04 | NEON | small | 4 | 592.18 ms | 7102.28 ms |
RK3588 | Ubuntu20.04 | NEON | medium | 4 | 1587.36 ms | 24147.87 ms |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
RK3588 | Ubuntu20.04 | NEON | tiny | 8 | 226.48 ms | 740.34 ms |
RK3588 | Ubuntu20.04 | NEON | base | 8 | 300.48 ms | 1723.42 ms |
RK3588 | Ubuntu20.04 | NEON | small | 8 | 620.58 ms | 6392.47 ms |
RK3588 | Ubuntu20.04 | NEON | medium | 8 | 1533.75 ms | 21899.08 ms |
I still haven't worked out the little (0-3) / big (4-7) core arrangement on this thing, because if I pin to the big cores with taskset -c 4-7:
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
RK3588 | Ubuntu20.04 | NEON | tiny.en | 4 | 234.14 ms | 681.53 ms |
RK3588 | Ubuntu20.04 | NEON | base.en | 4 | 297.08 ms | 1679.75 ms |
RK3588 | Ubuntu20.04 | NEON | small.en | 4 | 599.98 ms | 6867.66 ms |
RK3588 | Ubuntu20.04 | NEON | medium.en | 4 | 1492.73 ms | 23600.45 ms |
I tried to compile with OpenBLAS, but it seemed to kill the make.
From the master repo, as I didn't think about which repo I was using after trying the streaming input:
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
RK3588 | Ubuntu20.04 | NEON | tiny | 8 | 226.48 ms | 2681.05 ms |
RK3588 | Ubuntu20.04 | NEON | base | 8 | 283.56 ms | 6132.44 ms |
RK3588 | Ubuntu20.04 | NEON | small | 8 | 583.39 ms | 24397.78 ms |
RK3588 | Ubuntu20.04 | NEON | medium | 8 | 1490.98 ms | 85099.45 ms |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | tiny.en | 8 | 136.29 | 454.52 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | tiny | 8 | 134.64 | 486.01 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | base | 8 | 180.22 | 1184.80 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | base.en | 8 | 192.86 | 1197.85 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | small | 8 | 367.55 | 4179.00 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | small.en | 8 | 378.27 | 4557.73 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | medium | 8 | 923.48 | 15552.61 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | medium.en | 8 | 952.48 | 15708.63 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | large | 8 | 1650.28 | 28357.09 |
8 threads seemed to be the fastest. However, I managed to squeeze out a bit more performance by pinning the CPUs:
$ taskset -c 0-15 ./extra/bench-all.sh 16
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | tiny | 16 | 143.17 | 437.73 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | base | 16 | 184.10 | 1061.14 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | small | 16 | 374.41 | 3645.64 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | medium | 16 | 935.45 | 13029.54 |
Results for the AWS Graviton 3 processor (c7g.4xlarge instance type).
Compiled with -march=native -ffast-math.
./extra/bench-all.sh 8
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Graviton 3 | Ubuntu 22.04 | NEON | tiny | 8 | 125.92 | 230.33 |
Graviton 3 | Ubuntu 22.04 | NEON | base | 8 | 160.17 | 547.88 |
Graviton 3 | Ubuntu 22.04 | NEON | small | 8 | 299.59 | 2138.86 |
Graviton 3 | Ubuntu 22.04 | NEON | medium | 8 | 741.49 | 6999.33 |
Graviton 3 | Ubuntu 22.04 | NEON | large | 8 | 1313.95 | 14174.00 |
./extra/bench-all.sh 16
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Graviton 3 | Ubuntu 22.04 | NEON | tiny | 16 | 121.92 | 158.61 |
Graviton 3 | Ubuntu 22.04 | NEON | base | 16 | 156.01 | 386.78 |
Graviton 3 | Ubuntu 22.04 | NEON | small | 16 | 299.85 | 1596.38 |
Graviton 3 | Ubuntu 22.04 | NEON | medium | 16 | 750.93 | 5351.24 |
Graviton 3 | Ubuntu 22.04 | NEON | large | 16 | 1313.82 | 11115.69 |
@matth Do you observe a significant performance difference with / without -march=native -ffast-math?
@ggerganov -ffast-math seems to make only a very small difference that could be noise between runs.
-march=native does seem to make a big difference; without it, FP16_VA is not reported as being enabled (I can get this with -march=armv8.4-a+bf16+fp16fml) - I think -march=native is enabling more intrinsics than this, though.
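One way to see part of what -march changes is to query the ARM feature macros the compiler defines. Here is a minimal sketch, assuming the FP16_VA entry in system_info corresponds to the __ARM_FEATURE_FP16_VECTOR_ARITHMETIC macro (the file name is made up); compile it with the different -march values and compare:

```cpp
// fp16_check.cpp - illustrative only; try e.g.:
//   g++ fp16_check.cpp && ./a.out
//   g++ -march=native fp16_check.cpp && ./a.out
//   g++ -march=armv8.4-a+bf16+fp16fml fp16_check.cpp && ./a.out
#include <cstdio>

int main() {
#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    std::printf("FP16 vector arithmetic: enabled\n");     // what FP16_VA is assumed to reflect
#else
    std::printf("FP16 vector arithmetic: not enabled\n");
#endif
    return 0;
}
```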
Results without any -march or -ffast-math flags ...
./extra/bench-all.sh 16
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Graviton 3 | Ubuntu 22.04 | NEON | tiny | 16 | 124.25 | 320.53 |
Graviton 3 | Ubuntu 22.04 | NEON | base | 16 | 156.91 | 734.22 |
Graviton 3 | Ubuntu 22.04 | NEON | small | 16 | 301.78 | 2812.75 |
Graviton 3 | Ubuntu 22.04 | NEON | medium | 16 | 714.23 | 9139.86 |
Graviton 3 | Ubuntu 22.04 | NEON | large | 16 | 1298.33 | 18147.47 |
I have tried to improve things by using OpenBLAS and armpl.h, but they both slow it down considerably - I'll keep trying with the latter.
Are there any possibilities for further optimisations in ggml.c that can take advantage of the situation where you have bf16 functions but not BLAS or Accelerate?
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
E5-2640 | Ubuntu 18.04 | AVX2 | tiny | 8 | 235.10 | 1094.45 |
E5-2640 | Ubuntu 18.04 | AVX2 | base | 8 | 326.11 | 2307.32 |
E5-2640 | Ubuntu 18.04 | AVX2 | small | 8 | 669.31 | 7706.24 |
@matth My experiments with OpenBLAS on x86 showed that it is not faster compared to hand-written AVX2 + FP16: https://github.com/ggerganov/whisper.cpp/commit/fbd513b813ea42a500ba92be3dcfea0b6b6a4fa3
It seems this is also the case for Arm based on your experiments. My guess is that we don't see improvement because the computation is memory-bound and OpenBLAS works with FP32.
Using CBLAS on Apple Silicon is so fast because it utilizes the matrix co-processor, which is somehow very efficient even for FP32. At least this is how I explain the results that I am seeing.
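For readers wondering what the "hand-written AVX2 + FP16" path means in practice, here is a minimal illustrative sketch (not the actual ggml kernel; the function name dot_f16_f32 is made up): the weights stay in FP16 so only half the bytes cross the memory bus, and they are widened to FP32 registers with F16C right before the FMA. A BLAS sgemm would instead need the weights expanded to FP32 in memory first, doubling the bytes read - which is exactly where a memory-bound kernel loses time.

```cpp
// dot_f16.cpp - illustrative only; build with: g++ -O3 -mavx2 -mfma -mf16c -c dot_f16.cpp
#include <immintrin.h>
#include <cstdint>

// Dot product of FP16-stored weights with an FP32 activation vector.
// Assumes n is a multiple of 8 and both pointers are valid.
float dot_f16_f32(const uint16_t *w_f16, const float *x, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        // load 8 half-precision weights and widen them to single precision
        const __m256 w = _mm256_cvtph_ps(
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(w_f16 + i)));
        const __m256 v = _mm256_loadu_ps(x + i);
        acc = _mm256_fmadd_ps(w, v, acc); // acc += w * v
    }
    // horizontal sum of the 8 partial sums
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    float sum = 0.0f;
    for (int i = 0; i < 8; ++i) {
        sum += tmp[i];
    }
    return sum;
}
```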
Interesting if armpl.h can provide some more insight - I haven't used it.
The heaviest stuff in ggml.c is the mul_mat_f16 and flash_attn_f16 calls. I think the conv_1d_... calls could probably be optimized more, but they are called only once at the start of the Encoder, so the improvement would be marginal.
Also, I am just looking at whisper.cpp and I realize I have forgotten why I use Flash Attention only in the Encoder and don't also use it in the Decoder. Maybe this can help, because Flash Attention reduces the memory transfers and improves cache locality.
Not sure about bf16 compared to fp16. I don't expect it to provide a big improvement, based on a quick search through some articles about the difference between the 2 data types.
https://medium.com/swlh/apples-m1-secret-coprocessor-6599492fc1e1 gives a good write-up, if Medium doesn't try to charge you.
https://nod.ai/comparing-apple-m1-with-amx2-m1-with-neon/
Maybe after the M3 comes out I might be able to pick up a bargain M1 Mini.
I think fp16 support is coming though and may help a bit: https://github.com/xianyi/OpenBLAS/pull/3754
PS: for those of us without the secret Apple sauce, would implementing https://github.com/CNugteren/CLBlast be of any use on integrated GPUs?
OpenBLAS helps on Windows AMD64 with MSVC:
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Ryzen 5 PRO 2400GE | Windows 10 | AVX2 | medium | 4 | 4259.10 | 116609.75 |
Ryzen 5 PRO 2400GE | Windows 10 | AVX2 BLAS | medium | 4 | 4259.58 | 75312.90 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
rk3588 | Debian11 | NEON | tiny | 8 | 232.45 | 2768.78 |
rk3588 | Debian11 | NEON | base | 8 | 308.36 | 6374.82 |
rk3588 | Debian11 | NEON | small | 8 | 626.23 | 25784.05 |
rk3588 | Debian11 | NEON | medium | 8 | 1667.23 | 86026.82 |
rk3588 | Debian11 | NEON | large | 8 | 4307.16 | 161328.59 |
CFLAGS = -I. -O3 -std=c11 -ffast-math -march=native
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
rk3588 | Debian11 | NEON | tiny | 8 | 230.69 | 2078.40 |
rk3588 | Debian11 | NEON | base | 8 | 299.10 | 4379.62 |
rk3588 | Debian11 | NEON | small | 8 | 621.43 | 18565.42 |
rk3588 | Debian11 | NEON | medium | 8 | 1532.61 | 65504.91 |
rk3588 | Debian11 | NEON | large | 8 | 3618.18 | 121710.31 |
If I try to compile with OpenBLAS in a separate build, Encode becomes approximately 2x slower, so either I am doing it wrong or with Armv8.2 it's just bad; it's -march=native that seems to make the above difference.
Results on an AWS mac2.metal instance:
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
mac2.metal | OSX Ventura | NEON BLAS | tiny | 4 | 64.39 | 184.98 |
mac2.metal | OSX Ventura | NEON BLAS | base | 4 | 87.93 | 368.04 |
mac2.metal | OSX Ventura | NEON BLAS | small | 4 | 198.80 | 1212.46 |
mac2.metal | OSX Ventura | NEON BLAS | medium | 4 | 551.49 | 3552.73 |
mac2.metal | OSX Ventura | NEON BLAS | large | 4 | 1042.91 | 6726.99 |
I tried disabling Accelerate and it makes a significant difference (i.e. very much slower without it!).
I assumed Accelerate was using the Neural Engine, but using both powermetrics and asitop I cannot see any utilization; both report 0 mW power usage. Can anyone confirm on an M1 machine?
EDIT: Possibly I was confused. Apple's Matrix Coprocessor (AMX) and the Neural Engine are different things; from @ggerganov's other issues and commits it appears Accelerate might be using the former.
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i9-13900k | WSL2 Ubuntu | AVX2 | tiny | 4 | 58.49 | 360.95 |
i9-13900k | WSL2 Ubuntu | AVX2 | base | 4 | 72.44 | 756.48 |
i9-13900k | WSL2 Ubuntu | AVX2 | small | 4 | 154.37 | 2676.12 |
i9-13900k | WSL2 Ubuntu | AVX2 | medium | 4 | 393.76 | 8924.90 |
i9-13900k | WSL2 Ubuntu | AVX2 | large | 4 | 698.69 | 15862.58 |
i9-13900k | WSL2 Ubuntu | AVX2 | tiny | 8 | 55.13 | 291.51 |
i9-13900k | WSL2 Ubuntu | AVX2 | base | 8 | 70.93 | 603.33 |
i9-13900k | WSL2 Ubuntu | AVX2 | small | 8 | 141.85 | 1800.05 |
i9-13900k | WSL2 Ubuntu | AVX2 | medium | 8 | 356.29 | 5946.78 |
i9-13900k | WSL2 Ubuntu | AVX2 | large | 8 | 658.83 | 10868.89 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | tiny | 4 | 301.22 | 872.27 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | base | 4 | 405.40 | 1705.58 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | small | 4 | 921.24 | 5419.73 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | medium | 4 | 2356.76 | 15188.90 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | large | 4 | 4457.29 | 26444.06 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | tiny | 8 | 299.89 | 540.47 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | base | 8 | 419.41 | 1129.01 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | small | 8 | 888.64 | 3632.89 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | medium | 8 | 2377.96 | 10525.92 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | large | 8 | 4412.20 | 18933.41 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | tiny | 4 | 307.20 | 570.86 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | base | 4 | 406.45 | 1183.90 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | small | 4 | 941.96 | 4156.69 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | medium | 4 | 3124.62 | 13072.06 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | large | 4 | 10090.85 | 36383.82 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | tiny | 8 | 299.42 | 487.26 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | base | 8 | 403.74 | 1113.54 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | small | 8 | 910.07 | 3955.48 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | medium | 8 | 2241.90 | 13076.31 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | large | 8 | 5620.87 | 25562.17 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | tiny | 4 | 158.49 | 730.72 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | base | 4 | 205.93 | 1603.67 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | small | 4 | 426.62 | 5630.58 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | medium | 4 | 1080.15 | 18748.66 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | large | 4 | 1976.77 | 37188.47 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | tiny | 8 | 159.00 | 662.07 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | base | 8 | 206.62 | 1436.59 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | small | 8 | 428.20 | 5345.27 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | medium | 8 | 1108.97 | 16780.53 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | large | 8 | 1965.67 | 32019.44 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | tiny | 12 | 157.60 | 585.65 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | base | 12 | 216.74 | 1696.32 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | small | 12 | 428.51 | 4504.18 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | medium | 12 | 1081.65 | 15442.25 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | large | 12 | 1969.63 | 28108.55 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i3-9100F | Ubuntu 20.04.4 LTS | AVX2 | tiny | 4 | 164.71 | 726.05 |
i3-9100F | Ubuntu 20.04.4 LTS | AVX2 | base | 4 | 214.56 | 1806.20 |
i3-9100F | Ubuntu 20.04.4 LTS | AVX2 | small | 4 | 445.48 | 6613.19 |
i3-9100F | Ubuntu 20.04.4 LTS | AVX2 | medium | 4 | 1131.80 | 22667.64 |
i3-9100F | Ubuntu 20.04.4 LTS | AVX2 | large | 4 | 7615.74 | 42137.29 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
E3-1220 V2 | Ubuntu 20.04.3 LTS | | tiny | 4 | 227.41 | 1757.56 |
E3-1220 V2 | Ubuntu 20.04.3 LTS | | base | 4 | 297.67 | 3801.48 |
E3-1220 V2 | Ubuntu 20.04.3 LTS | | small | 4 | 625.18 | 14544.59 |
E3-1220 V2 | Ubuntu 20.04.3 LTS | | medium | 4 | 9618.55 | 49937.12 |
E3-1220 V2 | Ubuntu 20.04.3 LTS | | large | 4 | 40399.48 | 71661.48 |
Has anyone tried benchmarking on WASM? It seems like the encoder takes much longer than on other platforms.
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 BLAS | ggml-tiny.en | 4 | 258.59 | 2934.34 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 BLAS | ggml-tiny | 4 | 255.46 | 2906.67 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 BLAS | ggml-base.en | 4 | 316.73 | 6197.29 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 BLAS | ggml-base | 4 | 319.93 | 5825.65 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 | ggml-tiny.en | 4 | 217.28 | 1548.92 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 | ggml-tiny | 4 | 215.59 | 1625.69 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 | ggml-base.en | 4 | 275.62 | 3823.34 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 | ggml-base | 4 | 275.72 | 3740.50 |
Cortex-A53 | Android 10 | NEON | ggml-tiny.en | 8 | 399.05 | 5841.70 |
Cortex-A53 | Android 10 | NEON | ggml-tiny | 8 | 376.25 | 5548.72 |
Cortex-A53 | Android 10 | NEON | ggml-base.en | 8 | 492.92 | 12728.42 |
Cortex-A53 | Android 10 | NEON | ggml-base | 8 | 1034.48 | 13365.86 |
Test-bench properties
- Commit 3996ecc156486fb93ff505c01090d13192e72aa2
- cmake for building (mkdir build && cd build, cmake .. && make)
- gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
- clang version 15.0.2 (aarch64-unknown-linux-android24)
- fish shell snippet to run the benchmarks:
# cwd is whisper.cpp/build
# Adding `-t 8` to `bench` for aarch64
$ for model in "ggml-tiny" "ggml-base"
for suffix in "en.bin" "bin"
./bin/bench -m "../models/$model.$suffix"
end
end
Remarks
- OpenBLAS (-DWHISPER_SUPPORT_OPENBLAS=ON) deteriorates the performance!

Quite the difference between the 2017 Intel i3 4C/4T and the 2019 Ryzen Zen+ 6C/12T. And not looking good for AVX2 on the old AMD Zen+. I must admit, all in all I really envy the M1 for having that accelerator.
gcc vs clang doesn't seem to make a difference; at least it's not distinguishable from noise.
This is my home server. Tested while it was doing home server things (load 0.7). I can see this machine acting as a "whisper server" in a 2C configuration.
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Commit | Compiler |
---|---|---|---|---|---|---|---|---|
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | tiny | 1 | 88.38 | 2013.67 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | base | 1 | 113.58 | 4692.04 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | small | 1 | 225.74 | 18469.62 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | tiny | 2 | 89.55 | 1189.92 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | base | 2 | 119.97 | 2756.52 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | small | 2 | 238.71 | 10491.67 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | tiny | 4 | 201.37 | 695.39 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | base | 4 | 262.76 | 2023.16 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | small | 4 | 526.66 | 6788.01 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | medium | 4 | 3836.26 | 21889.30 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | large | 4 | 26819.67 | 60880.62 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | tiny | 4 | 89.05 | 696.08 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | base | 4 | 114.65 | 1711.15 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | small | 4 | 309.30 | 6995.25 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | medium | 4 | 4854.02 | 23570.42 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | large | 4 | 21415.07 | 60547.99 | 832b4f3 | clang 14.0.6 |
Just my desktop. The difference to the 5950X at 8C is really massive, but luckily it has no impact on daily usage, so I'm glad I can still hold off on upgrading to the last AM4 CPU generation :joy: Looking forward to benching CUDA on this machine (3080 Ti).
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Commit | Compiler |
---|---|---|---|---|---|---|---|---|
Ryzen 1600AF | Manjaro | AVX2 | tiny | 1 | 104.04 | 4691.38 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | base | 1 | 134.54 | 11092.84 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | small | 1 | 254.71 | 43923.42 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | tiny | 4 | 107.40 | 1336.49 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | base | 4 | 132.69 | 3062.12 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | small | 4 | 262.27 | 11655.22 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | medium | 4 | 662.81 | 38829.74 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | large | 4 | 1365.09 | 77063.30 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | tiny | 6 | 100.82 | 1007.36 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | base | 6 | 130.20 | 2472.55 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | small | 6 | 256.83 | 9311.54 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | medium | 6 | 657.89 | 28051.40 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | large | 6 | 1190.62 | 54292.72 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | tiny | 6 | 104.77 | 1012.70 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | base | 6 | 137.00 | 2212.20 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | small | 6 | 257.97 | 9296.33 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | medium | 6 | 624.04 | 28524.38 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | large | 6 | 1189.10 | 56445.31 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | tiny | 12 | 101.41 | 898.96 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | base | 12 | 139.26 | 2200.78 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | small | 12 | 256.50 | 8125.48 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | medium | 12 | 623.59 | 29255.08 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | large | 12 | 1192.90 | 51902.81 | 832b4f3 | gcc 12.2.0 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Commit | Compiler |
---|---|---|---|---|---|---|---|---|
POWER9v2 | Gentoo | -Ofast -mcpu=native | base.en | 4/64 | 144.84 | 42708.33 | 85c9ac18b59125b988cda40f40d8687e1ba88a7a | clang 15.0.3 |
POWER9v2 | Gentoo | -Ofast -mcpu=native | base.en | 16/64 | 161.95 | 22302.28 | 85c9ac18b59125b988cda40f40d8687e1ba88a7a | clang 15.0.3 |
POWER9v2 | Gentoo | -Ofast -mcpu=native | base.en | 32/64 | 142.06 | 20263.56 | 85c9ac18b59125b988cda40f40d8687e1ba88a7a | clang 15.0.3 |
POWER9v2 | Gentoo | -Ofast -mcpu=native | base.en | 64/64 | 160.51 | 12645.79 | 85c9ac18b59125b988cda40f40d8687e1ba88a7a | clang 15.0.3 |
@Xavier-i WASM performance is much worse compared to native - this is expected. Today I added bench.wasm, which can be used to benchmark the performance in the browser.
Redo of my OpenVoiceOS Raspberry Pi 4 benchmark
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny.en | 4 | 735 | 9486 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base.en | 4 | 950 | 25402 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny.en | 4 | 752 | 9178 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base.en | 4 | 969 | 19642 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
And just (and only) because we can, the same on a Raspberry Pi 3B+ running the same codebase / OS
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Raspberry Pi 3B+ - 1GB | OpenVoiceOS | NEON | tiny.en | 4 | 1331 | 22573 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
Raspberry Pi 3B+ - 1GB | OpenVoiceOS | NEON | base.en | 4 | 5886 | 58733 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
Raspberry Pi 3B+ - 1GB | OpenVoiceOS | NEON BLAS | tiny.en | 4 | 1333 | 21184 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
Raspberry Pi 3B+ - 1GB | OpenVoiceOS | NEON BLAS | base.en | 4 | 4605 | 47877 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
I hope this isn't misplaced but I thought it interesting to share ...
I have recently finished some tests comparing whisper.cpp runtime performance against the original PyTorch version on various GPUs and CPUs.
We test against a fixed set of long form audio files (UK TV, each file ~1 hour long, mixed speech and noise) and record the runtime as a factor of real audio time.
Depending on the software and environment, transcription can take anywhere from around 5x real-time down to 0.14x real-time to complete.
ARM-based whisper.cpp runtime is very impressive; in particular, the Apple M1 performance can match that of the original PyTorch version on NVIDIA V100 and T4 GPUs ...
CPU / GPU | OS | Config | Model | Threads | xRT Transcribe |
---|---|---|---|---|---|
Intel Xeon | Ubuntu 22.04 | whisper original - pytorch cpu | medium.en | 8 | 4.78 |
Intel Xeon | Ubuntu 22.04 | whisper.cpp - AVX2 | medium.en | 8 | 4.44 |
Graviton 3 | Ubuntu 22.04 | whisper.cpp - NEON | medium.en | 8 | 0.63 |
mac2.metal | OSX Ventura | whisper.cpp - NEON BLAS | medium.en | 4 | 0.26 |
NVIDIA V100 | Ubuntu 22.04 | whisper original - pytorch cuda | medium.en | N/A | 0.25 |
NVIDIA T4 | Ubuntu 22.04 | whisper original - pytorch cuda | medium.en | N/A | 0.25 |
NVIDIA A10G | Ubuntu 22.04 | whisper original - pytorch cuda | medium.en | N/A | 0.16 |
NVIDIA A100 | Ubuntu 22.04 | whisper original - pytorch cuda | medium.en | N/A | 0.14 |
Additionally, I did some very rough power consumption tests; again, whisper.cpp on the M1 is really impressive against PyTorch on the GPU.
Platform | Whisper Type | Model | Avg Power | Peak Power |
---|---|---|---|---|
Apple M1 | whisper.cpp | ggml-medium.en | 13202 mW | 18412 mW |
Nvidia T4 | pytorch | medium.en | 69587 mW | 85650 mW |
Thanks for the fantastic work @ggerganov - this is a really inspiring project and demonstrates the ARM FP16 functionality wonderfully. Off to buy some more Apple Macs now ;)
@matth @rgerganov I've been thinking myself that the perf/watt for ML is truly outstanding, and I just wondered whether the 8 GB model can squeeze the medium model in, as I'm not sure how memory is shared on the M1 - or is it really a case of needing the 16 GB?
@matth Thanks for the data - it's interesting to see.
However, there are some caveats that are important to be considered when benchmarking the 2 implementations that I've been meaning to discuss, so here are my thoughts on this:
At a high level, the Whisper transcription is a combination of 2 main parts:
- evaluating the transformer (the Encoder and Decoder networks)
- applying a decoding strategy on top of the transformer's output
The first part is branchless and does not depend on the audio input or the parameters that you use. For a given model, evaluating the transformer requires the same amount of operations every time. This is easy to benchmark.
The second part (decoding strategy) is different. The number of operations here depends both on the audio input contents and the decoding parameters / strategy that you use. For example, two different audio recordings with the same time length generally result in different decoded text based on the speech content and hence can take a different amount of processing (even with the same decoding parameters). Also, the decoded timestamp tokens affect how the 30s sliding window of the transcription is updated and therefore can lead to a different number of transformer evaluations in total.
My understanding is that there is no "correct" decoding strategy. The OpenAI implementation generally offers 2 different strategies - Greedy and BeamSearch. Both of them are combinations of various heuristics that aim to improve the text coherency and reduce the number of catastrophic failures.
In whisper.cpp we currently have a Greedy strategy which is similar to the one in the OpenAI repo, but is not exactly the same.
So all of this means that there is no point in comparing the 2 implementations by measuring the total time to transcribe an audio, because the decoding strategy is not the same and therefore the variation will be very large due to the factors outlined above. It only makes sense to benchmark the transformer evaluation in isolation, because it is well-defined.
That is why in the benchmarks in this issue, I chose to run the Encoder on some random input buffer. The Encoder is the heavy part of the transformer and being able to evaluate it efficiently is very important and is the most defining factor for the efficiency of the implementation. It's the "engine" of the transcription. You can then put on top of it any decoding strategy that you like and this will define how accurate your transcription is. But it does not make sense to benchmark the performance of that anymore.
I think if we want to make a fair comparison with PyTorch, we need to have the bench tool implemented in Python using PyTorch. Any other comparison will be flawed to some extent.
But in any case, your results are interesting - thanks for sharing them. What parameters did you use for the PyTorch runs?
Regarding the power consumption - I think there is more we can do in whisper.cpp. Currently, the thread synchronization uses busy loops, which is very power inefficient because it keeps the CPU at 100%, but it gives a slight performance edge. I am thinking of adding an option that uses condition variable synchronization, which will likely reduce the power usage at the cost of some performance. For some use cases, it could be beneficial to have lower power consumption.
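To illustrate the trade-off being described (a sketch only, not the actual ggml synchronization code; the names are made up): a busy-wait keeps the waiting core at 100% for the lowest wake-up latency, while a condition variable lets the thread sleep in the kernel and saves power at the cost of a slower wake-up.

```cpp
// sync_sketch.cpp - illustrative comparison of the two approaches.
#include <atomic>
#include <condition_variable>
#include <mutex>

// 1) Busy loop: lowest latency, but the waiting core spins at 100% load.
std::atomic<bool> work_ready{false};

void worker_busy_wait() {
    while (!work_ready.load(std::memory_order_acquire)) {
        // spin - burns power for the whole wait
    }
    // ... process the work ...
}

// 2) Condition variable: the thread sleeps until notified, saving power,
//    but the wake-up goes through the kernel and costs extra microseconds.
std::mutex mtx;
std::condition_variable cv;
bool work_ready_cv = false;

void worker_cv_wait() {
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait(lock, [] { return work_ready_cv; });
    // ... process the work ...
}

void signal_work() {
    {
        std::lock_guard<std::mutex> lock(mtx);
        work_ready_cv = true;
    }
    cv.notify_all();
}
```

An option to switch between the two would let always-on or battery-powered setups trade a little encode speed for a much lower power draw.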
Thanks @ggerganov, we are using PyTorch Whisper with default settings in that benchmark, so I believe that is a beam search decoder. I will see if I can test again with the greedy decoder for a more similar comparison. I think I understand your point though - these are not like-for-like implementations, so at a certain level the comparison is flawed.
I also neglected to measure the PyTorch version on the M1 & Graviton which was a huge oversight!
There's a motivation behind these benchmarks. Looking at various solutions as improvements to existing transcription capabilities - each solution in my mind is a balance of accuracy, completeness, runtime, financial cost and energy efficiency.
On one end you have paying humans to do the transcription, slow and expensive but very accurate and something that is still done at a massive scale in my industry. At the other end there are existing Kaldi models that are less accurate but incredibly fast for inference on the CPU and very cheap to run.
I feel larger transformer models like Whisper sit somewhat in the middle of all this - closer to human accuracy but increased associated costs over existing software.
But whisper.cpp adds to this: if we can get similar or even just acceptable accuracy and runtime, but on commodity hardware, the choice can start to become more about cost, efficiency and functionality. E.g. you could buy 30+ Apple Macs for the price of an NVIDIA A100 server, being able to run Whisper on a laptop enables a different set of use cases, you can cut power consumption by a huge margin, etc.
I think for me this is one of the many exciting outcomes of this project :)
@matth Yeah - the default in PyTorch when running from the command line is BeamSearch. I haven't measured it exactly, but it is significantly slower compared to Greedy.
I think regarding the total-time benchmark - it can make sense once whisper.cpp reaches the accuracy of OpenAI. Currently, due to the inferior decoding, whisper.cpp has lower transcription accuracy (based on some results I saw floating around). But when the decoding gets improved and we have comparable accuracy, then we can make a benchmark that says:
"for a given word error rate (WER) the 2 implementation take this amount of processing time on average, over some large set of audio"
And another thing I was thinking is that even if today whisper.cpp is more efficient on Apple Macs - it is not always going to be the case. If I understand correctly, it's just a matter of time for the proper Apple Silicon frameworks (Metal, MPS, etc.) to become supported in PyTorch, Tensorflow, etc., and when this happens (probably very soon), the performance of whisper.cpp will be the same or possibly worse.
So yeah - just trying to adjust expectations :) Will probably write some more on this in the F.A.Q. discussion.
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | tiny.en | 4 | 175 | 360 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | base.en | 4 | 233 | 736 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | small.en | 4 | 507 | 2400 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | medium.en | 4 | 1333 | 6860 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
Using 8 threads is slightly slower to load, faster to encode:
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | tiny.en | 8 | 185 | 283 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | base.en | 8 | 241 | 579 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | small.en | 8 | 526 | 1959 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | medium.en | 8 | 1390 | 6271 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
MacBookPro M1 Max | macOS 12.6 | NEON BLAS | tiny | 8 | 65 | 108 | a593b93 |
MacBookPro M1 Max | macOS 12.6 | NEON BLAS | base | 8 | 86 | 250 | a593b93 |
MacBookPro M1 Max | macOS 12.6 | NEON BLAS | small | 8 | 185 | 789 | a593b93 |
MacBookPro M1 Max | macOS 12.6 | NEON BLAS | medium | 8 | 493 | 2126 | a593b93 |
MacBookPro M1 Max | macOS 12.6 | NEON BLAS | large | 8 | 955 | 3860 | a593b93 |
There are actually 10 threads, but when using -t 10 the performance goes down. Lower numbers (such as -t 4) result in similar load performance, but slower encode (although not linear).
AMD Ryzen 5 3400G (4 CPU cores, 8 threads) on Ubuntu 22.10 with 5.19.0-26-generic Kernel
4 threads
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
3400G | Ubuntu 22.10 | AVX2 | tiny | 4 | 163 | 1415 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | tiny.en | 4 | 175 | 1351 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | base.en | 4 | 200 | 3095 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | base | 4 | 205 | 3241 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | small.en | 4 | 412 | 12343 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | small | 4 | 421 | 11983 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | medium.en | 4 | 995 | 38818 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | medium | 4 | 1006 | 38573 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | large-v1 | 4 | | | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | large | 4 | 1870 | 77302 | 0be6a1a |
8 threads is just marginally better
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
3400G | Ubuntu 22.10 | AVX2 | tiny.en | 8 | 191 | 1275 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | tiny | 8 | 183 | 1258 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | base.en | 8 | 232 | 2894 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | base | 8 | 231 | 2927 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | small.en | 8 | 435 | 11299 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | small | 8 | 414 | 11511 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | medium.en | 8 | 1011 | 37557 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | medium | 8 | 1049 | 37306 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | large-v1 | 8 | | | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | large | 8 | 3237 | 77396 | 0be6a1a |
Someone mentioned BLAS?
Encoder
Collection of bench results for various platforms and devices. If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh script and report the results in the comments below.
Suggestions for a better summary of the results are welcome.
memcpy
MacBook M1 Pro
Ryzen 9 5950X
ggml_mul_mat
MacBook M1 Pro
Ryzen 9 5950X