ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Benchmark results #89

Open ggerganov opened 1 year ago

ggerganov commented 1 year ago

Encoder

Collection of bench results for various platforms and devices. If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh script and report the results in the comments below.

Suggestions for a better summary of the results are welcome.
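
For reference, the numbers below come from the bench tool. A typical set of invocations looks like the following (the model path and thread counts are only examples):

make -j bench
./bench -m ./models/ggml-base.en.bin -t 4   # Encoder benchmark for a single model
./bench -w 1 -t 1                           # memcpy bandwidth
./bench -w 2 -t 1                           # ggml_mul_mat throughput
./extra/bench-all.sh 8                      # all models with 8 threads, prints a results table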

CPU OS Config Model Th Load Enc. Commit
MacBook M1 Pro MacOS 13.0.1 NEON BLAS tiny 8 71 102 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS base 8 96 220 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 8 233 685 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS medium 8 603 1928 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS large 8 1158 3350 206fc93
---
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 1 251 2605 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 4 255 884 206fc93
---
Mac Mini M1 MacOS NEON BLAS tiny 4 62 194 fcf515d
Mac Mini M1 MacOS NEON BLAS base 4 81 380 fcf515d
Mac Mini M1 MacOS NEON BLAS small 4 204 1249 fcf515d
Mac Mini M1 MacOS NEON BLAS medium 4 876 3980 fcf515d
Mac Mini M1 MacOS NEON BLAS large 4 1876 7979 fcf515d
---
Ryzen 9 3900X Ubuntu 20.04 AVX2 tiny 8 107 422 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 base 8 137 880 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 small 8 280 2874 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 medium 8 692 9610 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 large 8 1317 16917 fcf515d
---
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS tiny 4 120 780 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS base 4 151 1173 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS small 4 289 3062 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS medium 4 711 9175 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS large 4 1282 16050 fcf515d
---
Ryzen 9 5950X Ubuntu 22.04 AVX2 tiny 8 135 197 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 base 8 176 421 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 small 8 357 1393 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 medium 8 855 4404 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 large 8 1576 8118 fcf515d
---
Raspberry Pi 4 NEON tiny 4 1436 13839 fcf515d
Raspberry Pi 4 NEON base 4 1894 30552 fcf515d
---
iPhone 13 Mini iOS 16.0 NEON BLAS base 4 97 1091 fcf515d
---
MacBook M1 Pro Vivaldi WASM tiny 8 133 3785 fcf515d
MacBook M1 Pro Vivaldi WASM base 8 172 8253 fcf515d
---
MacBook M1 Pro Chrome WASM tiny 8 134 3776 fcf515d
MacBook M1 Pro Chrome WASM base 8 168 8200 fcf515d
---
MacBook M1 Pro Firefox WASM tiny 8 137 2626 fcf515d
MacBook M1 Pro Firefox WASM base 8 183 6226 fcf515d

memcpy

MacBook M1 Pro

./bench -w 1 -t 1
memcpy: 37.59 GB/s

Ryzen 9 5950X

./bench -w 1 -t 1
memcpy: 16.74 GB/s

ggml_mul_mat

MacBook M1 Pro

./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16    330.6 GFLOPS (128 runs) / F32    466.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16    737.5 GFLOPS (128 runs) / F32    838.9 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    938.6 GFLOPS (128 runs) / F32   1062.3 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16   1312.5 GFLOPS (128 runs) / F32   1835.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1765.1 GFLOPS (128 runs) / F32   2041.4 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1784.3 GFLOPS (104 runs) / F32   1859.2 GFLOPS (109 runs)
ggml_mul_mat:  4096 x  4096: F16   1855.1 GFLOPS ( 14 runs) / F32   1873.3 GFLOPS ( 14 runs)

Ryzen 9 5950X

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     56.3 GFLOPS (128 runs) / F32     70.2 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     47.8 GFLOPS (128 runs) / F32     67.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    185.1 GFLOPS (128 runs) / F32    332.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    386.4 GFLOPS (128 runs) / F32    658.6 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    636.2 GFLOPS (128 runs) / F32   1012.0 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    950.9 GFLOPS ( 56 runs) / F32   1296.8 GFLOPS ( 76 runs)
ggml_mul_mat:  4096 x  4096: F16   1168.6 GFLOPS (  9 runs) / F32   1403.1 GFLOPS ( 11 runs)
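
As a rough sanity check on these numbers (assuming the benchmark counts 2·N³ floating-point operations per N x N multiplication, which is the usual convention): at 4096 x 4096 that is 2·4096³ ≈ 1.37·10¹¹ FLOPs, so the M1 Pro's ~1855 GFLOPS corresponds to roughly 74 ms per F16 matrix multiplication, and the 5950X's ~1169 GFLOPS to roughly 118 ms.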
cdosoftei commented 1 year ago

Results for Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-4790K Debian   tiny.en 4 165 808
i7-4790K Debian   tiny.en 8 165 783
i7-4790K Debian   base.en 4 212 1813
i7-4790K Debian   base.en 8 214 1746
rjwilmsi commented 1 year ago

Results for Ryzen 5 4500U 6C/6T laptop CPU (I've just included one result for 8 threads as Encode time is much higher when threads > CPU cores).

CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 5 4500U (6C/6T) Opensuse Leap tiny.en 4 170.00 829.43
Ryzen 5 4500U (6C/6T) Opensuse Leap tiny.en 6 143.03 671.74
Ryzen 5 4500U (6C/6T) Opensuse Leap base.en 4 305.92 2,092.39
Ryzen 5 4500U (6C/6T) Opensuse Leap base.en 6 188.05 1,495.61
Ryzen 5 4500U (6C/6T) Opensuse Leap small.en 4 408.03 6,919.31
Ryzen 5 4500U (6C/6T) Opensuse Leap small.en 6 359.23 6,370.83
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 4 2,238.11 25,863.28
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 6 1,113.04 19,672.63
Ryzen 5 4500U (6C/6T) Opensuse Leap medium.en 8 973.65 39,619.20
ArtyomZemlyak commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 tiny 2 164.35 1087.61
i7-11800H WSL2 Ubuntu AVX2 tiny 4 128.94 733.24
i7-11800H WSL2 Ubuntu AVX2 tiny 8 137.57 619.88
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 2 143.02 1087.15
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 4 127.60 730.57
i7-11800H WSL2 Ubuntu AVX2 AVX512 tiny 8 125.62 616.27
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 2 132.59 1511.38
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 4 132.48 1407.49
i7-11800H WSL2 Ubuntu AVX2 AVX512 BLAS tiny 8 133.82 1458.27
ArtyomZemlyak commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 base 2 174.34 2533.79
i7-11800H WSL2 Ubuntu AVX2 base 4 166.68 1830.67
i7-11800H WSL2 Ubuntu AVX2 base 8 165.53 1478.73
i7-11800H WSL2 Ubuntu AVX2 small 2 340.12 8714.24
i7-11800H WSL2 Ubuntu AVX2 small 4 394.32 6021.41
i7-11800H WSL2 Ubuntu AVX2 small 8 305.98 4828.84
i7-11800H WSL2 Ubuntu AVX2 large 2 3205.36 57109.10
i7-11800H WSL2 Ubuntu AVX2 large 4 2720.25 38519.89
i7-11800H WSL2 Ubuntu AVX2 large 8 3716.34 27739.99
ArtyomZemlyak commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 2 1954.21 54966.84
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 4 1455.40 37320.62
i7-11800H WSL2 Ubuntu AVX2 AVX512 large 8 1372.58 27937.64
ArtyomZemlyak commented 1 year ago

This performance is impressive!

M1 Pro | MacOS |   | large | 8 | 1973 | 4208

ggerganov commented 1 year ago

This performance is impressive!

Yes, there is a huge performance boost due to using the built-in BLAS implementation on these devices. I will soon add OpenBLAS support for x86 architectures and see how this compares.

By the way, AVX-512 is not supported on master. I have added initial support here, but I am not sure if it works: https://github.com/ggerganov/whisper.cpp/pull/95

cristianglezm commented 1 year ago
CPU OS Config Model Threads Load[ms] encode[ms]
Intel® Core™ i5-8250U Win11 Home AVX2 Large 8 2226.85 61547.61

compiled with MinGW64 gcc 11.3

tazz4843 commented 1 year ago

Valve Jupiter (AMD Custom APU 0405, Zen 2 microarch, 4c8t, 16GB DDR5 @ 5200 MT/s)

CPU OS Config Model Threads Load[ms] encode[ms]
AMD Custom APU 0405 SteamOS 3.2 AVX2 Base 8 326.32 2592.96

Compiled with cc (GCC) 11.3.0

The performance gains on jfk.wav since last test (two weeks or so ago) are extremely impressive, ~10-20x speedup from 40 to 2-4 seconds.

yujinqiu commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms]
MacBook M1 Max macOS Ventura BLAS small 1 299.09 4166.00
MacBook M1 Max macOS Ventura BLAS small 4 329.45 1304.32
MacBook M1 Max macOS Ventura BLAS base 1 139.10 1302.17
MacBook M1 Max macOS Ventura BLAS base 4 135.96 399.45
trholding commented 1 year ago

On an AMD EPYC cloud instance with 64 cores / 240 threads, it gets stuck like this with 240 threads. I noticed that above a certain number of threads it is slow, or the cloud provider is CPU-limiting. Can anyone else with real hardware check if this is the case?

time ./main -m models/ggml-base.en.bin -f elon.wav -t 240
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 670.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 240 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'elon.wav' (34466688 samples, 2154.2 sec), 240 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ..
trholding commented 1 year ago

So I have tried various numbers of threads with the above-mentioned cloud provider.

I found that anything above 64 threads gets slower, and it remains usable up to 120 threads. Anything above that hangs. Either the cloud provider is throttling the free trial, or too many threads actually slow things down.

...
...
processor       : 239
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD EPYC 7742 64-Core Processor
stepping        : 0
microcode       : 0x830104d
cpu MHz         : 2245.780
cache size      : 512 KB
physical id     : 1
siblings        : 120
core id         : 59
cpu cores       : 60
apicid          : 247
initial apicid  : 247
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 4491.56
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:
time ./main -m models/ggml-base.en.bin -f elon.wav -t 64
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 670.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 64 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'elon.wav' (34466688 samples, 2154.2 sec), 64 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:03.960]   [MUSIC PLAYING]
[00:00:03.960 --> 00:00:18.240]   In life, we've seen within this part of the world
...
...
[00:35:40.320 --> 00:35:41.920]   Thank you, and have a great day.
[00:35:41.920 --> 00:35:43.920]   [APPLAUSE]
[00:35:43.920 --> 00:35:45.920]   [MUSIC PLAYING]
[00:35:45.920 --> 00:35:56.240]   [VIDEO PLAYBACK]

whisper_print_timings:     load time =   249.61 ms
whisper_print_timings:      mel time =  1267.11 ms
whisper_print_timings:   sample time =  1718.69 ms
whisper_print_timings:   encode time = 63702.25 ms / 10617.04 ms per layer
whisper_print_timings:   decode time = 381317.66 ms / 63552.94 ms per layer
whisper_print_timings:    total time = 448362.19 ms

real    7m28.411s
user    347m2.230s
sys     22m42.511s

32 threads was faster than 64 threads; I think the 32-thread run took around 7 minutes.

trholding commented 1 year ago

Env: Restricted Cloud / Throttled Maybe

CPU: AMD EPYC 7742 64-Core Processor

OS:

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:        20.04
Codename:       focal
Linux XXXX 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Compiler:

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04.1' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1) 
$ ./bench -m ./models/ggml-small.en.bin -t 4
whisper_model_load: loading model from './models/ggml-small.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1588.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

whisper_print_timings:     load time =   515.02 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6878.32 ms / 573.19 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  7393.42 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler
$ ./bench -m ./models/ggml-small.en.bin -t 240
whisper_model_load: loading model from './models/ggml-small.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1588.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 240 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

whisper_print_timings:     load time =   528.66 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 12898.34 ms / 1074.86 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 13427.03 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler

I'll remove the above posts if they're too much clutter.

ggerganov commented 1 year ago

@trholding Thanks for the results.

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Regarding the threads - yes, it seems that going beyond 8 threads does not help regardless of how many cores you have. My guess is that the computation is memory-bound so that's why using more threads does not improve the performance.

trholding commented 1 year ago

Okay, 8 threads max. So for a large file, is there a possibility of splitting the file into chunks with silences as terminators, dividing the conversion across ((total threads/cores)/8) workers, while also keeping track of timestamps? This could be awesome for batch conversion.
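
Something along those lines might look like the rough sketch below (untested; it uses fixed-length ffmpeg segments rather than silence detection, the file names are hypothetical, and each chunk's timestamps would still need to be offset by the chunk's start time):

# split a long recording into 10-minute WAV chunks (boundaries may cut words)
ffmpeg -i elon.wav -f segment -segment_time 600 -c copy chunk_%03d.wav

# naive parallel run: one 8-thread instance per chunk, all launched at once
for f in chunk_*.wav; do
  ./main -m models/ggml-base.en.bin -f "$f" -t 8 > "$f.txt" &
done
wait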

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Oh, I didn't know, I'll update with tables soon and remove my previous comments in a few hours.

trholding commented 1 year ago

You can generate a table with performance results by simply running the extra/bench-all.sh script.

Hey, sorry. That didn't pan out well: I ran the benchmark three times and my account got deleted without notice. I couldn't get the logs, as it was a web terminal. On the other hand, I am happy that this happened; I was giving serious thought to purchasing a GPU+CPU plan there, so checking CPU performance was equally important. Technically it was probably my fault - I shouldn't have used a reverse shell and run benchmarks on a free trial, but how else does one know if a service is really good or all just vapor...

rgerganov commented 1 year ago

Dell Precision 5560 laptop results:

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-11850H Ubuntu AVX2 tiny 4 115.87 538.43
i7-11850H Ubuntu AVX2 base 4 145.14 1241.84
i7-11850H Ubuntu AVX2 small 4 299.30 4343.57
i7-11850H Ubuntu AVX2 medium 4 760.98 15238.31
i7-11850H Ubuntu AVX2 large 4 1404.32 27476.86
i7-11850H Ubuntu AVX2 tiny 8 131.96 358.81
i7-11850H Ubuntu AVX2 base 8 166.61 839.31
i7-11850H Ubuntu AVX2 small 8 320.29 2854.86
i7-11850H Ubuntu AVX2 medium 8 756.20 9829.62
i7-11850H Ubuntu AVX2 large 8 1382.38 19872.81
jaybinks commented 1 year ago
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms]
-- | -- | -- | -- | -- | -- | --
i9-9900K | WSL2 Ubuntu (GCC) | AVX2 | tiny.en | 4 | 85.71 | 601.56
i9-9900K | WSL2 Ubuntu (GCC) | AVX2 | small.en | 4 | 212.59 | 5146.23
i9-9900K | OSX 10.14.1 (hackintosh - GCC) | AVX2 | tiny.en | 4 | 198.17 | 455.12
i9-9900K | OSX 10.14.1 (hackintosh - GCC) | AVX2 | base.en | 4 | 272.62 | 909.71
i9-9900K | OSX 10.14.1 (hackintosh - GCC) | AVX2 | small.en | 4 | 598.75 | 2968.75
Xeon(R) Silver 4210R CPU @ 2.40GHz | **Virtual Machine** - Debian Stretch (GCC - master branch) | AVX2 avx512f avx512dq avx512cd avx512bw avx512vl | small.en | 4 | 776.56 | 12340.41
Xeon(R) Silver 4210R CPU @ 2.40GHz | **Virtual Machine** - Debian Stretch (GCC - master branch) | AVX2 avx512f avx512dq avx512cd avx512bw avx512vl | tiny.en | 4 | 295.54 | 1710.46
mark-beeby commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms]
i9-11950H Pop!_OS 22.04 LTS AVX2 Tiny 4 124.28 656.41
i9-11950H Pop!_OS 22.04 LTS AVX2 Tiny 8 123.70 696.41
i9-11950H Pop!_OS 22.04 LTS AVX2 Base 4 159.91 1754.44
i9-11950H Pop!_OS 22.04 LTS AVX2 Base 8 164.47 1658.55
i9-11950H Pop!_OS 22.04 LTS AVX2 Small 4 330.91 6161.86
i9-11950H Pop!_OS 22.04 LTS AVX2 Small 8 346.22 5187.85
niksedk commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-1065G7 Windows 11 - small.en 4 1,314.25 294,168.09

Compiled with VS 2022

Something is off, right?

ggerganov commented 1 year ago

Yup - you are missing the AVX2 flag. See if some of the comments in https://github.com/ggerganov/whisper.cpp/issues/5 can help you resolve this.
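
For the MSVC/CMake route, a sketch of one way to force it through generic CMake flags (this is not the project's documented switch set; /arch:AVX2 and /O2 are standard MSVC options, not whisper.cpp-specific ones):

cmake -S . -B build -DCMAKE_C_FLAGS="/O2 /arch:AVX2" -DCMAKE_CXX_FLAGS="/O2 /arch:AVX2"
cmake --build build --config Release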

niksedk commented 1 year ago

OK, the AVX2 flag seems to help :)

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-1065G7 Windows 11 AVX2 small.en 4 527.59 9,648.67

Compiled with VS 2022

j1nx commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms] Remarks
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 1 861.34 29428.21 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 1 843.80 16145.62 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 4 835.68 21509.08 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 824.24 13187.96 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 1 1146.02 87615.00 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 1 1103.39 52228.30 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 4 1183.47 55256.20 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 1161.32 29851.40 With OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 1 752.64 24018.10 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 1 751.96 13082.95 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny 4 743.37 10122.80 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 742.90 9564.89 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 1 974.46 71587.61 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 1 979.65 43852.07 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base 4 982.24 24814.62 Without OVOS services running
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 982.80 19910.19 Without OVOS services running
StuartIanNaylor commented 1 year ago

From the stream repo


CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny.en 4 243.54 ms 779.49 ms
RK3588 Ubuntu20.04 NEON base.en 4 316.52 ms 1821.06 ms
RK3588 Ubuntu20.04 NEON small.en 4 618.93 ms 7117.69 ms
RK3588 Ubuntu20.04 NEON medium.en 4 1514.88 ms 24139.92 ms
CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 4 233.86 ms 791.01 ms
RK3588 Ubuntu20.04 NEON base 4 297.93 ms 1813.69 ms
RK3588 Ubuntu20.04 NEON small 4 592.18 ms 7102.28 ms
RK3588 Ubuntu20.04 NEON medium 4 1587.36 ms 24147.87 ms
CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 8 226.48 ms 740.34 ms
RK3588 Ubuntu20.04 NEON base 8 300.48 ms 1723.42 ms
RK3588 Ubuntu20.04 NEON small 8 620.58 ms 6392.47 ms
RK3588 Ubuntu20.04 NEON medium 8 1533.75 ms 21899.08 ms

I still haven't worked out the little (0-3) / big (4-7) core arrangement on this thing. If I pin to the big cores with taskset -c 4-7 (full invocation sketched after this table):

CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny.en 4 234.14 ms 681.53 ms
RK3588 Ubuntu20.04 NEON base.en 4 297.08 ms 1679.75 ms
RK3588 Ubuntu20.04 NEON small.en 4 599.98 ms 6867.66 ms
RK3588 Ubuntu20.04 NEON medium.en 4 1492.73 ms 23600.45 ms
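
For the record, the pinned run above would be launched with something along these lines (assuming the big cores really are 4-7):

taskset -c 4-7 ./extra/bench-all.sh 4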

I tried to compile with OpenBLAS, but it seemed to kill the make.


From the master repo (I didn't think about which repo I was on after trying the streaming input):

CPU OS Config Model Threads Load [ms] Encode [ms]
RK3588 Ubuntu20.04 NEON tiny 8 226.48 ms 2681.05 ms
RK3588 Ubuntu20.04 NEON base 8 283.56 ms 6132.44 ms
RK3588 Ubuntu20.04 NEON small 8 583.39 ms 24397.78 ms
RK3588 Ubuntu20.04 NEON medium 8 1490.98 ms 85099.45 ms
dodysw commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny.en 8 136.29 454.52
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny 8 134.64 486.01
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base 8 180.22 1184.80
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base.en 8 192.86 1197.85
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small 8 367.55 4179.00
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small.en 8 378.27 4557.73
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium 8 923.48 15552.61
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium.en 8 952.48 15708.63
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 large 8 1650.28 28357.09

8 threads seemed to be the fastest. However, I managed to squeeze a bit more performance by pinning CPUs:

$ taskset -c 0-15 ./extra/bench-all.sh 16
CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 tiny 16 143.17 437.73
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 base 16 184.10 1061.14
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 small 16 374.41 3645.64
Ryzen 7 PRO 4750G Ubuntu 22.04 AVX2 medium 16 935.45 13029.54
matth commented 1 year ago

Results for AWS Graviton 3 Processor (c7g.4xlarge instance type).

Compiled with -march=native -ffast-math.

./extra/bench-all.sh 8

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 8 125.92 230.33
Graviton 3 Ubuntu 22.04 NEON base 8 160.17 547.88
Graviton 3 Ubuntu 22.04 NEON small 8 299.59 2138.86
Graviton 3 Ubuntu 22.04 NEON medium 8 741.49 6999.33
Graviton 3 Ubuntu 22.04 NEON large 8 1313.95 14174.00

./extra/bench-all.sh 16

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 16 121.92 158.61
Graviton 3 Ubuntu 22.04 NEON base 16 156.01 386.78
Graviton 3 Ubuntu 22.04 NEON small 16 299.85 1596.38
Graviton 3 Ubuntu 22.04 NEON medium 16 750.93 5351.24
Graviton 3 Ubuntu 22.04 NEON large 16 1313.82 11115.69
ggerganov commented 1 year ago

@matth Do you observe a significant performance difference with / without -march=native -ffast-math?

matth commented 1 year ago

@ggerganov -ffast-math seems to make only a very small difference that could be noise between runs

-march=native does seem to make a big difference; without it, FP16_VA is not reported as being enabled (I can get this with -march=armv8.4-a+bf16+fp16fml). I think -march=native is enabling more intrinsics than this, though.

Results without any -march or -ffast-math flags ...

./extra/bench-all.sh 16

CPU OS Config Model Threads Load [ms] Encode [ms]
Graviton 3 Ubuntu 22.04 NEON tiny 16 124.25 320.53
Graviton 3 Ubuntu 22.04 NEON base 16 156.91 734.22
Graviton 3 Ubuntu 22.04 NEON small 16 301.78 2812.75
Graviton 3 Ubuntu 22.04 NEON medium 16 714.23 9139.86
Graviton 3 Ubuntu 22.04 NEON large 16 1298.33 18147.47

I have tried to improve things by using OpenBLAS and armpl.h, but they both slow it down considerably - I'll keep trying with the latter.

Are there any possibilities for further optimisations in ggml.c that can take advantage of the situation where you have bf16 functions but not BLAS or Accelerate?

maltoze commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms]
E5-2640 Ubuntu 18.04 AVX2 tiny 8 235.10 1094.45
E5-2640 Ubuntu 18.04 AVX2 base 8 326.11 2307.32
E5-2640 Ubuntu 18.04 AVX2 small 8 669.31 7706.24
ggerganov commented 1 year ago

@matth My experiments with OpenBLAS on x86 showed that it is not faster compared to hand-written AVX2 + FP16: https://github.com/ggerganov/whisper.cpp/commit/fbd513b813ea42a500ba92be3dcfea0b6b6a4fa3

It seems this is also the case for Arm based on your experiments. My guess is that we don't see improvement because the computation is memory-bound and OpenBLAS works with FP32.

The reason that CBLAS is so fast on Apple Silicon is that it utilizes the matrix co-processor, which somehow is very efficient even for FP32. At least this is how I explain the results that I am seeing.

Interesting if armpl.h can provide some more insight - I haven't used it.

The heaviest stuff in ggml.c is the mul_mat_f16 and flash_attn_f16 calls. I think the conv_1d_... calls could probably be optimized more, but they are called only once at the start of the Encoder, so the improvement would be marginal.

Also, I am just looking at whisper.cpp and I realize I have forgotten why I use Flash Attention only in the Encoder and not also in the Decoder. Maybe this can help, because Flash Attention reduces memory transfers and improves cache locality.

Not sure about bf16 compared to fp16. I don't expect it to provide a big improvement, based on a quick search through some articles about the difference between the 2 data types.

StuartIanNaylor commented 1 year ago

https://medium.com/swlh/apples-m1-secret-coprocessor-6599492fc1e1

Gives a good write-up, if Medium doesn't try to charge you.

https://nod.ai/comparing-apple-m1-with-amx2-m1-with-neon/

Maybe after the m3 comes out I might be able to pickup a bargain m1 mini.

I think fp16 is coming though and may help a bit

https://github.com/xianyi/OpenBLAS/pull/3754

PS: for those of us without the secret Apple sauce, would implementing https://github.com/CNugteren/CLBlast be of any use on integrated GPUs?

tamo commented 1 year ago

OpenBLAS helps Windows AMD64 MSVC

CPU OS Config Model Threads Load [ms] Encode [ms]
Ryzen 5 PRO 2400GE Windows 10 AVX2 medium 4 4259.10 116609.75
Ryzen 5 PRO 2400GE Windows 10 AVX2 BLAS medium 4 4259.58 75312.90
StuartIanNaylor commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms]
rk3588 Debian11 NEON tiny 8 232.45 2768.78
rk3588 Debian11 NEON base 8 308.36 6374.82
rk3588 Debian11 NEON small 8 626.23 25784.05
rk3588 Debian11 NEON medium 8 1667.23 86026.82
rk3588 Debian11 NEON large 8 4307.16 161328.59

CFLAGS = -I. -O3 -std=c11 -ffast-math -march=native

CPU OS Config Model Threads Load [ms] Encode [ms]
rk3588 Debian11 NEON tiny 8 230.69 2078.40
rk3588 Debian11 NEON base 8 299.10 4379.62
rk3588 Debian11 NEON small 8 621.43 18565.42
rk3588 Debian11 NEON medium 8 1532.61 65504.91
rk3588 Debian11 NEON large 8 3618.18 121710.31

If I try to compile with OpenBLAS in a separate build, Encode becomes approx 2x slower, so either I am doing something wrong or with Armv8.2 it's just bad. It's -march=native that seems to make the above difference.
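
For anyone reproducing the -march=native build, a sketch of the rebuild, assuming make lets you override the Makefile's CFLAGS from the command line (otherwise edit the CFLAGS line as quoted above); the hot loops live in ggml.c, so the C flags are the ones that matter most:

make clean
make -j bench CFLAGS="-I. -O3 -std=c11 -ffast-math -march=native"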

matth commented 1 year ago

Results on AWS mac2.metal instance:

CPU OS Config Model Threads Load [ms] Encode [ms]
mac2.metal OSX Ventura NEON BLAS tiny 4 64.39 184.98
mac2.metal OSX Ventura NEON BLAS base 4 87.93 368.04
mac2.metal OSX Ventura NEON BLAS small 4 198.80 1212.46
mac2.metal OSX Ventura NEON BLAS medium 4 551.49 3552.73
mac2.metal OSX Ventura NEON BLAS large 4 1042.91 6726.99

I tried disabling Accelerate and it makes a significant difference (i.e. very much slower without it!).

I assumed Accelerate was using the Neural Engine, but using both powermetrics and asitop I cannot see any utilization; both report 0 mW power usage. Can anyone confirm on an M1 machine?

EDIT: Possibly I was confused. Apple's Matrix Coprocessor (AMX) and the Neural Engine are different things; from @ggerganov's other issues and commits it appears Accelerate might be using the former.
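
For anyone wanting to repeat the power check, the macOS tool below reports per-subsystem power while ./bench is running (needs root; exact sampler names can vary between macOS versions):

sudo powermetrics --samplers cpu_power,gpu_power -i 1000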

tienshiao commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms]
i9-13900k WSL2 Ubuntu AVX2 tiny 4 58.49 360.95
i9-13900k WSL2 Ubuntu AVX2 base 4 72.44 756.48
i9-13900k WSL2 Ubuntu AVX2 small 4 154.37 2676.12
i9-13900k WSL2 Ubuntu AVX2 medium 4 393.76 8924.90
i9-13900k WSL2 Ubuntu AVX2 large 4 698.69 15862.58
i9-13900k WSL2 Ubuntu AVX2 tiny 8 55.13 291.51
i9-13900k WSL2 Ubuntu AVX2 base 8 70.93 603.33
i9-13900k WSL2 Ubuntu AVX2 small 8 141.85 1800.05
i9-13900k WSL2 Ubuntu AVX2 medium 8 356.29 5946.78
i9-13900k WSL2 Ubuntu AVX2 large 8 658.83 10868.89
CPU OS Config Model Threads Load [ms] Encode [ms]
E5-2697 V2 MacOS Monterey 12.6.1 BLAS tiny 4 301.22 872.27
E5-2697 V2 MacOS Monterey 12.6.1 BLAS base 4 405.40 1705.58
E5-2697 V2 MacOS Monterey 12.6.1 BLAS small 4 921.24 5419.73
E5-2697 V2 MacOS Monterey 12.6.1 BLAS medium 4 2356.76 15188.90
E5-2697 V2 MacOS Monterey 12.6.1 BLAS large 4 4457.29 26444.06
E5-2697 V2 MacOS Monterey 12.6.1 BLAS tiny 8 299.89 540.47
E5-2697 V2 MacOS Monterey 12.6.1 BLAS base 8 419.41 1129.01
E5-2697 V2 MacOS Monterey 12.6.1 BLAS small 8 888.64 3632.89
E5-2697 V2 MacOS Monterey 12.6.1 BLAS medium 8 2377.96 10525.92
E5-2697 V2 MacOS Monterey 12.6.1 BLAS large 8 4412.20 18933.41
peressinoto commented 1 year ago

Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS tiny 4 307.20 570.86
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS base 4 406.45 1183.90
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS small 4 941.96 4156.69
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS medium 4 3124.62 13072.06
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS large 4 10090.85 36383.82
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS tiny 8 299.42 487.26
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS base 8 403.74 1113.54
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS small 8 910.07 3955.48
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS medium 8 2241.90 13076.31
i7-8750H macOS Ventura 13.0.1 AVX2 BLAS large 8 5620.87 25562.17

Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (12)

CPU OS Config Model Threads Load [ms] Encode [ms]
i7-8700 Ubuntu 20.04.4 LTS AVX2 tiny 4 158.49 730.72
i7-8700 Ubuntu 20.04.4 LTS AVX2 base 4 205.93 1603.67
i7-8700 Ubuntu 20.04.4 LTS AVX2 small 4 426.62 5630.58
i7-8700 Ubuntu 20.04.4 LTS AVX2 medium 4 1080.15 18748.66
i7-8700 Ubuntu 20.04.4 LTS AVX2 large 4 1976.77 37188.47
i7-8700 Ubuntu 20.04.4 LTS AVX2 tiny 8 159.00 662.07
i7-8700 Ubuntu 20.04.4 LTS AVX2 base 8 206.62 1436.59
i7-8700 Ubuntu 20.04.4 LTS AVX2 small 8 428.20 5345.27
i7-8700 Ubuntu 20.04.4 LTS AVX2 medium 8 1108.97 16780.53
i7-8700 Ubuntu 20.04.4 LTS AVX2 large 8 1965.67 32019.44
i7-8700 Ubuntu 20.04.4 LTS AVX2 tiny 12 157.60 585.65
i7-8700 Ubuntu 20.04.4 LTS AVX2 base 12 216.74 1696.32
i7-8700 Ubuntu 20.04.4 LTS AVX2 small 12 428.51 4504.18
i7-8700 Ubuntu 20.04.4 LTS AVX2 medium 12 1081.65 15442.25
i7-8700 Ubuntu 20.04.4 LTS AVX2 large 12 1969.63 28108.55

Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz (4)

CPU OS Config Model Threads Load [ms] Encode [ms]
i3-9100F Ubuntu 20.04.4 LTS AVX2 tiny 4 164.71 726.05
i3-9100F Ubuntu 20.04.4 LTS AVX2 base 4 214.56 1806.20
i3-9100F Ubuntu 20.04.4 LTS AVX2 small 4 445.48 6613.19
i3-9100F Ubuntu 20.04.4 LTS AVX2 medium 4 1131.80 22667.64
i3-9100F Ubuntu 20.04.4 LTS AVX2 large 4 7615.74 42137.29

Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz (4)

CPU OS Config Model Threads Load [ms] Encode [ms]
E3-1220 V2 Ubuntu 20.04.3 LTS tiny 4 227.41 1757.56
E3-1220 V2 Ubuntu 20.04.3 LTS base 4 297.67 3801.48
E3-1220 V2 Ubuntu 20.04.3 LTS small 4 625.18 14544.59
E3-1220 V2 Ubuntu 20.04.3 LTS medium 4 9618.55 49937.12
E3-1220 V2 Ubuntu 20.04.3 LTS large 4 40399.48 71661.48
Xavier-i commented 1 year ago

Has anyone tried benchmarking on WASM? It seems like the encoder takes much longer than on other platforms.

(screenshot omitted)
pvonmoradi commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms]
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 BLAS ggml-tiny.en 4 258.59 2934.34
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 BLAS ggml-tiny 4 255.46 2906.67
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 BLAS ggml-base.en 4 316.73 6197.29
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 BLAS ggml-base 4 319.93 5825.65
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 ggml-tiny.en 4 217.28 1548.92
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 ggml-tiny 4 215.59 1625.69
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 ggml-base.en 4 275.62 3823.34
i7-5600U @2.60GHz Xubuntu 18.04 AVX2 ggml-base 4 275.72 3740.50
Cortex-A53 Android 10 NEON ggml-tiny.en 8 399.05 5841.70
Cortex-A53 Android 10 NEON ggml-tiny 8 376.25 5548.72
Cortex-A53 Android 10 NEON ggml-base.en 8 492.92 12728.42
Cortex-A53 Android 10 NEON ggml-base 8 1034.48 13365.86

(collapsed "Test-bench properties" and "Remarks" sections omitted)

archi commented 1 year ago

Quite the difference between the 2017 Intel i3 4C/4T and the 2019 Ryzen Zen+ 6C/12T. And not looking good for AVX2 on the old AMD Zen+. I must admit, all in all I really envy the M1 for having that accelerator.

gcc vs clang doesn't seem to make a difference, at least it's not distinguishable from noise.

i3-8100

This is my home server. Tested while it was doing home server things (load 0.7). I can see this machine acting as a "whisper server" in a 2C configuration.

CPU OS Config Model Threads Load [ms] Encode [ms] Commit Compiler
i3-8100 @ 3.60GHz Arch Linux AVX2 tiny 1 88.38 2013.67 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 base 1 113.58 4692.04 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 small 1 225.74 18469.62 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 tiny 2 89.55 1189.92 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 base 2 119.97 2756.52 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 small 2 238.71 10491.67 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 tiny 4 201.37 695.39 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 base 4 262.76 2023.16 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 small 4 526.66 6788.01 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 medium 4 3836.26 21889.30 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 large 4 26819.67 60880.62 832b4f3 gcc 12.2.0
i3-8100 @ 3.60GHz Arch Linux AVX2 tiny 4 89.05 696.08 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 base 4 114.65 1711.15 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 small 4 309.30 6995.25 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 medium 4 4854.02 23570.42 832b4f3 clang 14.0.6
i3-8100 @ 3.60GHz Arch Linux AVX2 large 4 21415.07 60547.99 832b4f3 clang 14.0.6

Ryzen 1600AF

Just my desktop. The difference from the 5950X at 8C is really massive; but luckily it has no impact on daily usage, so I'm glad I can still wait to upgrade to the last AM4 CPU generation :joy: Looking forward to benching CUDA on this machine (3080 Ti).

CPU OS Config Model Threads Load [ms] Encode [ms] Commit Compiler
Ryzen 1600AF Manjaro AVX2 tiny 1 104.04 4691.38 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 base 1 134.54 11092.84 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 small 1 254.71 43923.42 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 tiny 4 107.40 1336.49 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 base 4 132.69 3062.12 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 small 4 262.27 11655.22 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 medium 4 662.81 38829.74 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 large 4 1365.09 77063.30 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 tiny 6 100.82 1007.36 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 base 6 130.20 2472.55 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 small 6 256.83 9311.54 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 medium 6 657.89 28051.40 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 large 6 1190.62 54292.72 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 tiny 6 104.77 1012.70 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 base 6 137.00 2212.20 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 small 6 257.97 9296.33 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 medium 6 624.04 28524.38 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 large 6 1189.10 56445.31 832b4f3 clang 14.0.6
Ryzen 1600AF Manjaro AVX2 tiny 12 101.41 898.96 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 base 12 139.26 2200.78 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 small 12 256.50 8125.48 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 medium 12 623.59 29255.08 832b4f3 gcc 12.2.0
Ryzen 1600AF Manjaro AVX2 large 12 1192.90 51902.81 832b4f3 gcc 12.2.0
luke-jr commented 1 year ago
CPU OS Config Model Threads Load [ms] Encode [ms] Commit Compiler
POWER9v2 Gentoo -Ofast -mcpu=native base.en 4/64 144.84 42708.33 85c9ac18b59125b988cda40f40d8687e1ba88a7a clang 15.0.3
POWER9v2 Gentoo -Ofast -mcpu=native base.en 16/64 161.95 22302.28 85c9ac18b59125b988cda40f40d8687e1ba88a7a clang 15.0.3
POWER9v2 Gentoo -Ofast -mcpu=native base.en 32/64 142.06 20263.56 85c9ac18b59125b988cda40f40d8687e1ba88a7a clang 15.0.3
POWER9v2 Gentoo -Ofast -mcpu=native base.en 64/64 160.51 12645.79 85c9ac18b59125b988cda40f40d8687e1ba88a7a clang 15.0.3
ggerganov commented 1 year ago

@Xavier-i WASM performance is much worse compared to native - this is expected. Today I added bench.wasm, which can be used to benchmark performance in the browser.

Link: https://whisper.ggerganov.com/bench/

j1nx commented 1 year ago

Redo of my OpenVoiceOS Raspberry Pi 4 benchmark

CPU OS Config Model Th Load Enc. Commit
Raspberry Pi 4 - 2GB OpenVoiceOS NEON tiny.en 4 735 9486 aa6adda26e1ee9843dddb013890e3312bee52cfe
Raspberry Pi 4 - 2GB OpenVoiceOS NEON base.en 4 950 25402 aa6adda26e1ee9843dddb013890e3312bee52cfe
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny.en 4 752 9178 aa6adda26e1ee9843dddb013890e3312bee52cfe
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base.en 4 969 19642 aa6adda26e1ee9843dddb013890e3312bee52cfe

And just (and only) because we can, the same on a Raspberry Pi 3B+ running the same codebase / OS

CPU OS Config Model Th Load Enc. Commit
Raspberry Pi 3B+ - 1GB OpenVoiceOS NEON tiny.en 4 1331 22573 aa6adda26e1ee9843dddb013890e3312bee52cfe
Raspberry Pi 3B+ - 1GB OpenVoiceOS NEON base.en 4 5886 58733 aa6adda26e1ee9843dddb013890e3312bee52cfe
Raspberry Pi 3B+ - 1GB OpenVoiceOS NEON BLAS tiny.en 4 1333 21184 aa6adda26e1ee9843dddb013890e3312bee52cfe
Raspberry Pi 3B+ - 1GB OpenVoiceOS NEON BLAS base.en 4 4605 47877 aa6adda26e1ee9843dddb013890e3312bee52cfe
matth commented 1 year ago

I hope this isn't misplaced but I thought it interesting to share ...

I have recently finished some tests comparing whisper.cpp runtime performance against the original PyTorch version on various GPUs and CPUs.

We test against a fixed set of long form audio files (UK TV, each file ~1 hour long, mixed speech and noise) and record the runtime as a factor of real audio time.

Depending on the software and environment, transcription can take anywhere from around 5x real-time down to 0.14x real-time to complete.

ARM-based whisper.cpp runtime is very impressive; in particular, the Apple M1 performance can match that of the original PyTorch version on NVIDIA V100 and T4 GPUs ...

CPU / GPU OS Config Model Threads xRT Transcribe
Intel Xeon Ubuntu 22.04 whisper original - pytorch cpu medium.en 8 4.78
Intel Xeon Ubuntu 22.04 whisper.cpp - AVX2 medium.en 8 4.44
Graviton 3 Ubuntu 22.04 whisper.cpp - NEON medium.en 8 0.63
mac2.metal OSX Ventura whisper.cpp - NEON BLAS medium.en 4 0.26
NVIDIA V100 Ubuntu 22.04 whisper original - pytorch cuda medium.en N/A 0.25
NVIDIA T4 Ubuntu 22.04 whisper original - pytorch cuda medium.en N/A 0.25
NVIDIA A10G Ubuntu 22.04 whisper original - pytorch cuda medium.en N/A 0.16
NVIDIA A100 Ubuntu 22.04 whisper original - pytorch cuda medium.en N/A 0.14
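
To put the xRT numbers in concrete terms: 0.26x real-time means a 1-hour recording finishes in roughly 0.26 x 60 ≈ 16 minutes, while 4.78x real-time means the same hour of audio takes almost 5 hours to transcribe.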

Additionally I did some very rough power consumption tests, again whisper.cpp on the M1 is really impressive against PyTorch on the GPU.

Platform Whisper Type Model Avg Power Peak Power
Apple M1 whisper.cpp ggml-medium.en 13202 mW 18412 mW
Nvidia T4 pytorch medium.en 69587 mW 85650 mW

Thanks for the fantastic work @ggerganov - this is a really inspiring project and demonstrates the ARM FP16 functionality wonderfully. Off to buy some more Apple Macs now ;)

StuartIanNaylor commented 1 year ago

@matth @rgerganov I've been thinking myself that the perf/watt for ML is truly outstanding, and I just wondered if the 8 GB M1 can squeeze the medium model in, as I'm not sure how memory is shared on the M1 - or is it really a case of needing the 16 GB?

ggerganov commented 1 year ago

@matth Thanks for the data - it's interesting to see.

However, there are some important caveats to consider when benchmarking the 2 implementations, which I've been meaning to discuss, so here are my thoughts on this:

At a high-level, the Whisper transcription is a combination of 2 main parts:

The first part (the transformer evaluation) is branchless and does not depend on the audio input or the parameters that you use. For a given model, evaluating the transformer requires the same number of operations every time. This is easy to benchmark.

The second part (decoding strategy) is different. The number of operations here depends both on the audio input contents and the decoding parameters / strategy that you use. For example, two different audio recordings with the same time length generally result in different decoded text based on the speech content and hence can take a different amount of processing (even with the same decoding parameters). Also, the decoded timestamp tokens affect how the 30s sliding window of the transcription is updated and therefore can lead to a different number of transformer evaluations in total.

My understanding is that there is no "correct" decoding strategy. The OpenAI implementation generally offers 2 different strategies - Greedy and BeamSearch. Both of them are combinations of various heuristics that aim to improve the text coherency and reduce the number of catastrophic failures.

In whisper.cpp we currently have a Greedy strategy which is similar to the one in the OpenAI repo, but is not exactly the same.

So all of this means that there is no point in comparing the 2 implementations by measuring the total time to transcribe an audio file, because the decoding strategy is not the same and therefore the variation will be very large due to the factors outlined above. It only makes sense to benchmark the transformer evaluation in isolation, because it is well-defined.

That is why in the benchmarks in this issue, I chose to run the Encoder on some random input buffer. The Encoder is the heavy part of the transformer and being able to evaluate it efficiently is very important and is the most defining factor for the efficiency of the implementation. It's the "engine" of the transcription. You can then put on top of it any decoding strategy that you like and this will define how accurate your transcription is. But it does not make sense to benchmark the performance of that anymore.

I think if we want to make a fair comparison with PyTorch, we need to have the bench tool implemented in python using PyTorch. Any other comparison will be flawed to some extent.

But in any case, your results are interesting - thanks for sharing them. What parameters did you use for the PyTorch runs?


Regarding the power consumption - I think there is more we can do in whisper.cpp. Currently, the thread synchronization uses busy loops which is very power inefficient because it keeps the CPU at 100%, but it gives a slight performance edge. I am thinking of adding an option that uses condition variable synchronization which will likely reduce the power usage at the cost of some performance. For some use cases, it could be beneficial to have lower power consumption.

matth commented 1 year ago

Thanks @ggerganov, we are using PyTorch Whisper with default settings in that benchmark, so I believe that is a beam search decoder. I will see if I can test again with the greedy decoder for a more similar comparison. I think I understand your point though - these are not like-for-like implementations, so at a certain level the comparison is flawed.

I also neglected to measure the PyTorch version on the M1 & Graviton which was a huge oversight!

There's a motivation behind these benchmarks. Looking at various solutions as improvements to existing transcription capabilities - each solution in my mind is a balance of accuracy, completeness, runtime, financial cost and energy efficiency.

On one end you have paying humans to do the transcription, slow and expensive but very accurate and something that is still done at a massive scale in my industry. At the other end there are existing Kaldi models that are less accurate but incredibly fast for inference on the CPU and very cheap to run.

I feel larger transformer models like Whisper sit somewhat in the middle of all this - closer to human accuracy but increased associated costs over existing software.

But whisper.cpp adds to this: if we can get similar, or even just acceptable, accuracy and runtime on commodity hardware, then the choice can start to become more about cost, efficiency and functionality. E.g. you could buy 30+ Apple Macs for the price of an NVIDIA A100 server, being able to run Whisper on a laptop enables a different set of use cases, you can cut power consumption by a huge margin, etc.

I think for me this is one of the many exciting outcomes of this project :)

ggerganov commented 1 year ago

@matth Yeah - the default in PyTorch when running from the command line is BeamSearch. I haven't measured it exactly, but it is significantly slower compared to Greedy.

I think regarding the total-time benchmark - it can make sense once whisper.cpp reaches the accuracy of OpenAI. Currently, due to the inferior decoding, whisper.cpp has lower transcription accuracy (based on some results I saw floating around). But when the decoding gets improved and we have comparable accuracy, then we can make a benchmark that says:

"for a given word error rate (WER) the 2 implementation take this amount of processing time on average, over some large set of audio"

And another thing I was thinking is that even if today whisper.cpp is more efficient on Apple Macs, it is not always going to be the case. If I understand correctly, it's just a matter of time for the proper Apple Silicon frameworks (Metal, MPS, etc.) to become supported in PyTorch, TensorFlow, etc., and when this happens (probably very soon), the performance of whisper.cpp will be the same or possibly worse.

So yeah - just trying to adjust expectations :) Will probably write some more on this in the F.A.Q. discussion.

asmaloney commented 1 year ago
CPU OS Config Model Th Load Enc. Commit
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS tiny.en 4 175 360 7282e2109e0748421ee73271496f5911ca2b89a7
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS base.en 4 233 736 7282e2109e0748421ee73271496f5911ca2b89a7
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS small.en 4 507 2400 7282e2109e0748421ee73271496f5911ca2b89a7
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS medium.en 4 1333 6860 7282e2109e0748421ee73271496f5911ca2b89a7
Using 8 threads is slightly slower to load, faster to encode:

CPU OS Config Model Th Load Enc. Commit
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS tiny.en 8 185 283 7282e2109e0748421ee73271496f5911ca2b89a7
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS base.en 8 241 579 7282e2109e0748421ee73271496f5911ca2b89a7
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS small.en 8 526 1959 7282e2109e0748421ee73271496f5911ca2b89a7
i9-9900K @ 3.60GHz macOS 12.6.2 AVX2 BLAS medium.en 8 1390 6271 7282e2109e0748421ee73271496f5911ca2b89a7
mgc8 commented 1 year ago
CPU OS Config Model Th Load Enc. Commit
MacBookPro M1 Max macOS 12.6 NEON BLAS tiny 8 65 108 a593b93
MacBookPro M1 Max macOS 12.6 NEON BLAS base 8 86 250 a593b93
MacBookPro M1 Max macOS 12.6 NEON BLAS small 8 185 789 a593b93
MacBookPro M1 Max macOS 12.6 NEON BLAS medium 8 493 2126 a593b93
MacBookPro M1 Max macOS 12.6 NEON BLAS large 8 955 3860 a593b93

There are actually 10 threads, but when using -t 10 the performance goes down. Lower numbers (such as -t 4) result in similar load performance, but slower encode (although not linear).

kha84 commented 1 year ago

AMD Ryzen 5 3400G (4 CPU cores, 8 threads) on Ubuntu 22.10 with 5.19.0-26-generic Kernel

4 threads

CPU OS Config Model Th Load Enc. Commit
3400G Ubuntu 22.10 AVX2 tiny 4 163 1415 0be6a1a
3400G Ubuntu 22.10 AVX2 tiny.en 4 175 1351 0be6a1a
3400G Ubuntu 22.10 AVX2 base.en 4 200 3095 0be6a1a
3400G Ubuntu 22.10 AVX2 base 4 205 3241 0be6a1a
3400G Ubuntu 22.10 AVX2 small.en 4 412 12343 0be6a1a
3400G Ubuntu 22.10 AVX2 small 4 421 11983 0be6a1a
3400G Ubuntu 22.10 AVX2 medium.en 4 995 38818 0be6a1a
3400G Ubuntu 22.10 AVX2 medium 4 1006 38573 0be6a1a
3400G Ubuntu 22.10 AVX2 large-v1 4 0be6a1a
3400G Ubuntu 22.10 AVX2 large 4 1870 77302 0be6a1a

8 threads is just marginally better

CPU OS Config Model Th Load Enc. Commit
3400G Ubuntu 22.10 AVX2 tiny.en 8 191 1275 0be6a1a
3400G Ubuntu 22.10 AVX2 tiny 8 183 1258 0be6a1a
3400G Ubuntu 22.10 AVX2 base.en 8 232 2894 0be6a1a
3400G Ubuntu 22.10 AVX2 base 8 231 2927 0be6a1a
3400G Ubuntu 22.10 AVX2 small.en 8 435 11299 0be6a1a
3400G Ubuntu 22.10 AVX2 small 8 414 11511 0be6a1a
3400G Ubuntu 22.10 AVX2 medium.en 8 1011 37557 0be6a1a
3400G Ubuntu 22.10 AVX2 medium 8 1049 37306 0be6a1a
3400G Ubuntu 22.10 AVX2 large-v1 8 0be6a1a
3400G Ubuntu 22.10 AVX2 large 8 3237 77396 0be6a1a

Someone mentioned BLAS?