Results for Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-4790K | Debian | | tiny.en | 4 | 165 | 808 |
i7-4790K | Debian | | tiny.en | 8 | 165 | 783 |
i7-4790K | Debian | | base.en | 4 | 212 | 1813 |
i7-4790K | Debian | | base.en | 8 | 214 | 1746 |
Results for Ryzen 5 4500U 6C/6T laptop CPU (I've just included one result for 8 threads as Encode time is much higher when threads > CPU cores).
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Ryzen 5 4500U (6C/6T) | Opensuse Leap | tiny.en | 4 | 170.00 | 829.43 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | tiny.en | 6 | 143.03 | 671.74 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | base.en | 4 | 305.92 | 2,092.39 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | base.en | 6 | 188.05 | 1,495.61 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | small.en | 4 | 408.03 | 6,919.31 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | small.en | 6 | 359.23 | 6,370.83 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | medium.en | 4 | 2,238.11 | 25,863.28 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | medium.en | 6 | 1,113.04 | 19,672.63 | |
Ryzen 5 4500U (6C/6T) | Opensuse Leap | medium.en | 8 | 973.65 | 39,619.20 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-11800H | WSL2 Ubuntu | AVX2 | tiny | 2 | 164.35 | 1087.61 |
i7-11800H | WSL2 Ubuntu | AVX2 | tiny | 4 | 128.94 | 733.24 |
i7-11800H | WSL2 Ubuntu | AVX2 | tiny | 8 | 137.57 | 619.88 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | tiny | 2 | 143.02 | 1087.15 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | tiny | 4 | 127.60 | 730.57 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | tiny | 8 | 125.62 | 616.27 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 BLAS | tiny | 2 | 132.59 | 1511.38 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 BLAS | tiny | 4 | 132.48 | 1407.49 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 BLAS | tiny | 8 | 133.82 | 1458.27 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-11800H | WSL2 Ubuntu | AVX2 | base | 2 | 174.34 | 2533.79 |
i7-11800H | WSL2 Ubuntu | AVX2 | base | 4 | 166.68 | 1830.67 |
i7-11800H | WSL2 Ubuntu | AVX2 | base | 8 | 165.53 | 1478.73 |
i7-11800H | WSL2 Ubuntu | AVX2 | small | 2 | 340.12 | 8714.24 |
i7-11800H | WSL2 Ubuntu | AVX2 | small | 4 | 394.32 | 6021.41 |
i7-11800H | WSL2 Ubuntu | AVX2 | small | 8 | 305.98 | 4828.84 |
i7-11800H | WSL2 Ubuntu | AVX2 | large | 2 | 3205.36 | 57109.10 |
i7-11800H | WSL2 Ubuntu | AVX2 | large | 4 | 2720.25 | 38519.89 |
i7-11800H | WSL2 Ubuntu | AVX2 | large | 8 | 3716.34 | 27739.99 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | large | 2 | 1954.21 | 54966.84 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | large | 4 | 1455.40 | 37320.62 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | large | 8 | 1372.58 | 27937.64 |
This performance is impressive!
M1 Pro | MacOS | | large | 8 | 1973 | 4208
This performance is impressive!
Yes, there is a huge performance boost due to using the built-in BLAS implementation on these devices. I will soon add OpenBLAS support for x86 architectures and see how this compares.
By the way, AVX-512 is not supported on master. I have added initial support here, but I am not sure if it works: https://github.com/ggerganov/whisper.cpp/pull/95
CPU | OS | Config | Model | Threads | Load[ms] | encode[ms] |
---|---|---|---|---|---|---|
Intel® Core™ i5-8250U | Win11 Home | AVX2 | Large | 8 | 2226.85 | 61547.61 |
compiled with MinGW64 gcc 11.3
Valve Jupiter (AMD Custom APU 0405, Zen 2 microarch, 4c8t, 16GB DDR5 @ 5200 MT/s)
CPU | OS | Config | Model | Threads | Load[ms] | encode[ms] |
---|---|---|---|---|---|---|
AMD Custom APU 0405 | SteamOS 3.2 | AVX2 | Base | 8 | 326.32 | 2592.96 |
Compiled with cc (GCC) 11.3.0
The performance gains on jfk.wav since the last test (two weeks or so ago) are extremely impressive - a ~10-20x speedup, from ~40 seconds down to 2-4 seconds.
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
MacBook M1 Max | macOS Ventura | BLAS | small | 1 | 299.09 | 4166.00 |
MacBook M1 Max | macOS Ventura | BLAS | small | 4 | 329.45 | 1304.32 |
MacBook M1 Max | macOS Ventura | BLAS | base | 1 | 139.10 | 1302.17 |
MacBook M1 Max | macOS Ventura | BLAS | base | 4 | 135.96 | 399.45 |
On an AMD EPYC 64-core, 240-thread cloud instance it is stuck like this with 240 threads. I noticed that above a certain number of threads it is slow, or the cloud provider is CPU-limiting. Can anyone else with real hardware check if this is the case?
time ./main -m models/ggml-base.en.bin -f elon.wav -t 240
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 670.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 240 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing 'elon.wav' (34466688 samples, 2154.2 sec), 240 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ..
So I have tried various numbers of threads with the above-mentioned cloud provider.
I found that anything above 64 threads gets slower, and it stays usable up to 120 threads. Anything above that hangs. It must be that the cloud provider is throttling the free trial, or too many threads could actually slow things down.
...
...
processor : 239
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7742 64-Core Processor
stepping : 0
microcode : 0x830104d
cpu MHz : 2245.780
cache size : 512 KB
physical id : 1
siblings : 120
core id : 59
cpu cores : 60
apicid : 247
initial apicid : 247
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4491.56
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
time ./main -m models/ggml-base.en.bin -f elon.wav -t 64
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 670.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 64 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing 'elon.wav' (34466688 samples, 2154.2 sec), 64 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:03.960] [MUSIC PLAYING]
[00:00:03.960 --> 00:00:18.240] In life, we've seen within this part of the world
...
...
[00:35:40.320 --> 00:35:41.920] Thank you, and have a great day.
[00:35:41.920 --> 00:35:43.920] [APPLAUSE]
[00:35:43.920 --> 00:35:45.920] [MUSIC PLAYING]
[00:35:45.920 --> 00:35:56.240] [VIDEO PLAYBACK]
whisper_print_timings: load time = 249.61 ms
whisper_print_timings: mel time = 1267.11 ms
whisper_print_timings: sample time = 1718.69 ms
whisper_print_timings: encode time = 63702.25 ms / 10617.04 ms per layer
whisper_print_timings: decode time = 381317.66 ms / 63552.94 ms per layer
whisper_print_timings: total time = 448362.19 ms
real 7m28.411s
user 347m2.230s
sys 22m42.511s
32 threads was faster than 64 threads. I think 32 threads took around 7 minutes or so.
Env: Restricted Cloud / Throttled Maybe
CPU: AMD EPYC 7742 64-Core Processor
OS:
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
Linux XXXX 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Compiler:
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04.1' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
$ ./bench -m ./models/ggml-small.en.bin -t 4
whisper_model_load: loading model from './models/ggml-small.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 768
whisper_model_load: n_text_head = 12
whisper_model_load: n_text_layer = 12
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 3
whisper_model_load: mem_required = 1588.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size = 68.48 MB
whisper_model_load: model size = 464.44 MB
system_info: n_threads = 4 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
whisper_print_timings: load time = 515.02 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 6878.32 ms / 573.19 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 7393.42 ms
If you wish, you can submit these results here:
https://github.com/ggerganov/whisper.cpp/issues/89
Please include the following information:
- CPU model
- Operating system
- Compiler
$ ./bench -m ./models/ggml-small.en.bin -t 240
whisper_model_load: loading model from './models/ggml-small.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 768
whisper_model_load: n_text_head = 12
whisper_model_load: n_text_layer = 12
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 3
whisper_model_load: mem_required = 1588.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size = 68.48 MB
whisper_model_load: model size = 464.44 MB
system_info: n_threads = 240 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
whisper_print_timings: load time = 528.66 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 12898.34 ms / 1074.86 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 13427.03 ms
If you wish, you can submit these results here:
https://github.com/ggerganov/whisper.cpp/issues/89
Please include the following information:
- CPU model
- Operating system
- Compiler
I'll remove the above posts if they're too much clutter.
@trholding Thanks for the results.
You can generate a table with performance results by simply running the extra/bench_all.sh script.
Regarding the threads - yes, it seems that going beyond 8 threads does not help regardless of how many cores you have. My guess is that the computation is memory-bound so that's why using more threads does not improve the performance.
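If anyone wants to sanity-check the memory-bound hypothesis on their own machine, below is a rough, self-contained C++ sketch (not part of whisper.cpp; the file name, buffer size and thread counts are arbitrary choices) that measures aggregate memcpy bandwidth as the thread count grows. If the aggregate number plateaus well before the core count, that is consistent with the computation being limited by memory bandwidth rather than by compute.

```cpp
// bandwidth_probe.cpp - hypothetical helper, build with: g++ -O2 -pthread bandwidth_probe.cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>

int main() {
    const size_t buf_size = 64 * 1024 * 1024; // 64 MiB per thread (two buffers each, arbitrary)
    const int    reps     = 8;                // repeat copies to reduce timing noise

    for (int n_threads : {1, 2, 4, 8, 16}) {
        // one private source/destination buffer pair per thread
        std::vector<std::vector<char>> src(n_threads, std::vector<char>(buf_size, 1));
        std::vector<std::vector<char>> dst(n_threads, std::vector<char>(buf_size, 0));

        const auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int i = 0; i < n_threads; ++i) {
            pool.emplace_back([&, i] {
                for (int r = 0; r < reps; ++r) {
                    std::memcpy(dst[i].data(), src[i].data(), buf_size);
                }
            });
        }
        for (auto &t : pool) t.join();
        const double sec = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();

        const double gb = double(n_threads) * reps * buf_size / 1e9;
        std::printf("%2d threads: %6.1f GB/s aggregate copy bandwidth\n", n_threads, gb / sec);
    }
    return 0;
}
```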
Okay, 8 threads max - so for a large file, is there a possibility of splitting the file into chunks, using silences as terminators, and dividing the conversion across ((total threads/cores)/8) processes while also keeping track of timestamps? This could be awesome for batch conversion.
You can generate a table with performance results by simply running the extra/bench_all.sh script.
Oh, I didn't know, I'll update with tables soon and remove my previous comments in a few hours.
You can generate a table with performance results by simply running the extra/bench_all.sh script.
Hey, sorry. That didn't pan out well: I did the benchmark three times and my account got deleted without notice. I couldn't get the logs as it was a web terminal. On the other hand, I am happy that this happened - I was giving serious thought to purchasing a GPU+CPU plan there, so a performance check of the CPU was equally important. Technically it was probably my fault - I probably shouldn't have used a reverse shell and done benchmarks on a free trial, but how does one know if a service is really good or all just vapor...
Dell Precision 5560 laptop results:
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-11850H | Ubuntu | AVX2 | tiny | 4 | 115.87 | 538.43 |
i7-11850H | Ubuntu | AVX2 | base | 4 | 145.14 | 1241.84 |
i7-11850H | Ubuntu | AVX2 | small | 4 | 299.30 | 4343.57 |
i7-11850H | Ubuntu | AVX2 | medium | 4 | 760.98 | 15238.31 |
i7-11850H | Ubuntu | AVX2 | large | 4 | 1404.32 | 27476.86 |
i7-11850H | Ubuntu | AVX2 | tiny | 8 | 131.96 | 358.81 |
i7-11850H | Ubuntu | AVX2 | base | 8 | 166.61 | 839.31 |
i7-11850H | Ubuntu | AVX2 | small | 8 | 320.29 | 2854.86 |
i7-11850H | Ubuntu | AVX2 | medium | 8 | 756.20 | 9829.62 |
i7-11850H | Ubuntu | AVX2 | large | 8 | 1382.38 | 19872.81 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Tiny | 4 | 124.28 | 656.41 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Tiny | 8 | 123.70 | 696.41 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Base | 4 | 159.91 | 1754.44 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Base | 8 | 164.47 | 1658.55 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Small | 4 | 330.91 | 6161.86 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Small | 8 | 346.22 | 5187.85 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-1065G7 | Windows 11 | - | small.en | 4 | 1,314.25 | 294,168.09 |
Compiled with VS 2022
Something is off, right?
Yup - you are missing the AVX2 flag. See if some of the comments in https://github.com/ggerganov/whisper.cpp/issues/5 can help you resolve this.
OK, the AVX2 flag seems to help :)
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-1065G7 | Windows 11 | AVX2 | small.en | 4 | 527.59 | 9,648.67 |
Compiled with VS 2022
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Remarks |
---|---|---|---|---|---|---|---|
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 861.34 | 29428.21 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 1 | 843.80 | 16145.62 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 835.68 | 21509.08 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 4 | 824.24 | 13187.96 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 1146.02 | 87615.00 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 1 | 1103.39 | 52228.30 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 1183.47 | 55256.20 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 4 | 1161.32 | 29851.40 | With OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 752.64 | 24018.10 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 1 | 751.96 | 13082.95 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 743.37 | 10122.80 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 4 | 742.90 | 9564.89 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 974.46 | 71587.61 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 1 | 979.65 | 43852.07 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 982.24 | 24814.62 | Without OVOS services running |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 4 | 982.80 | 19910.19 | Without OVOS services running |
From the stream repo
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
RK3588 | Ubuntu20.04 | NEON | tiny.en | 4 | 243.54 ms | 779.49 ms |
RK3588 | Ubuntu20.04 | NEON | base.en | 4 | 316.52 ms | 1821.06 ms |
RK3588 | Ubuntu20.04 | NEON | small.en | 4 | 618.93 ms | 7117.69 ms |
RK3588 | Ubuntu20.04 | NEON | medium.en | 4 | 1514.88 ms | 24139.92 ms |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
RK3588 | Ubuntu20.04 | NEON | tiny | 4 | 233.86 ms | 791.01 ms |
RK3588 | Ubuntu20.04 | NEON | base | 4 | 297.93 ms | 1813.69 ms |
RK3588 | Ubuntu20.04 | NEON | small | 4 | 592.18 ms | 7102.28 ms |
RK3588 | Ubuntu20.04 | NEON | medium | 4 | 1587.36 ms | 24147.87 ms |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
RK3588 | Ubuntu20.04 | NEON | tiny | 8 | 226.48 ms | 740.34 ms |
RK3588 | Ubuntu20.04 | NEON | base | 8 | 300.48 ms | 1723.42 ms |
RK3588 | Ubuntu20.04 | NEON | small | 8 | 620.58 ms | 6392.47 ms |
RK3588 | Ubuntu20.04 | NEON | medium | 8 | 1533.75 ms | 21899.08 ms |
I still haven't worked out the little (0-3) / big (4-7) core arrangement on this thing, because if I pin to the big cores with taskset -c 4-7:
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
RK3588 | Ubuntu20.04 | NEON | tiny.en | 4 | 234.14 ms | 681.53 ms |
RK3588 | Ubuntu20.04 | NEON | base.en | 4 | 297.08 ms | 1679.75 ms |
RK3588 | Ubuntu20.04 | NEON | small.en | 4 | 599.98 ms | 6867.66 ms |
RK3588 | Ubuntu20.04 | NEON | medium.en | 4 | 1492.73 ms | 23600.45 ms |
I tried to compile with OpenBLAS, but it seemed to kill the make.
From the master repo, as I didn't think about which repo I was using after trying the streaming input:
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
RK3588 | Ubuntu20.04 | NEON | tiny | 8 | 226.48 ms | 2681.05 ms |
RK3588 | Ubuntu20.04 | NEON | base | 8 | 283.56 ms | 6132.44 ms |
RK3588 | Ubuntu20.04 | NEON | small | 8 | 583.39 ms | 24397.78 ms |
RK3588 | Ubuntu20.04 | NEON | medium | 8 | 1490.98 ms | 85099.45 ms |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | tiny.en | 8 | 136.29 | 454.52 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | tiny | 8 | 134.64 | 486.01 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | base | 8 | 180.22 | 1184.80 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | base.en | 8 | 192.86 | 1197.85 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | small | 8 | 367.55 | 4179.00 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | small.en | 8 | 378.27 | 4557.73 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | medium | 8 | 923.48 | 15552.61 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | medium.en | 8 | 952.48 | 15708.63 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | large | 8 | 1650.28 | 28357.09 |
8 threads seemed to be the fastest. However, I managed to squeeze out a bit more performance by pinning the CPUs:
$ taskset -c 0-15 ./extra/bench-all.sh 16
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | tiny | 16 | 143.17 | 437.73 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | base | 16 | 184.10 | 1061.14 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | small | 16 | 374.41 | 3645.64 |
Ryzen 7 PRO 4750G | Ubuntu 22.04 | AVX2 | medium | 16 | 935.45 | 13029.54 |
Results for the AWS Graviton 3 processor (c7g.4xlarge instance type).
Compiled with -march=native -ffast-math.
./extra/bench-all.sh 8
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Graviton 3 | Ubuntu 22.04 | NEON | tiny | 8 | 125.92 | 230.33 |
Graviton 3 | Ubuntu 22.04 | NEON | base | 8 | 160.17 | 547.88 |
Graviton 3 | Ubuntu 22.04 | NEON | small | 8 | 299.59 | 2138.86 |
Graviton 3 | Ubuntu 22.04 | NEON | medium | 8 | 741.49 | 6999.33 |
Graviton 3 | Ubuntu 22.04 | NEON | large | 8 | 1313.95 | 14174.00 |
./extra/bench-all.sh 16
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Graviton 3 | Ubuntu 22.04 | NEON | tiny | 16 | 121.92 | 158.61 |
Graviton 3 | Ubuntu 22.04 | NEON | base | 16 | 156.01 | 386.78 |
Graviton 3 | Ubuntu 22.04 | NEON | small | 16 | 299.85 | 1596.38 |
Graviton 3 | Ubuntu 22.04 | NEON | medium | 16 | 750.93 | 5351.24 |
Graviton 3 | Ubuntu 22.04 | NEON | large | 16 | 1313.82 | 11115.69 |
@matth Do you observe a significant performance difference with / without -march=native -ffast-math?
@ggerganov -ffast-math seems to make only a very small difference that could be noise between runs.
-march=native does seem to make a big difference; without it, FP16_VA is not reported as being enabled (I can get this with -march=armv8.4-a+bf16+fp16fml) - I think -march=native is enabling more intrinsics than this, though.
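One way to see part of what -march changes is to query the ARM feature macros the compiler defines. Here is a minimal sketch, assuming the FP16_VA entry in system_info corresponds to the __ARM_FEATURE_FP16_VECTOR_ARITHMETIC macro (the file name is made up); compile it with the different -march values and compare:

```cpp
// fp16_check.cpp - illustrative only; try e.g.:
//   g++ fp16_check.cpp && ./a.out
//   g++ -march=native fp16_check.cpp && ./a.out
//   g++ -march=armv8.4-a+bf16+fp16fml fp16_check.cpp && ./a.out
#include <cstdio>

int main() {
#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    std::printf("FP16 vector arithmetic: enabled\n");     // what FP16_VA is assumed to reflect
#else
    std::printf("FP16 vector arithmetic: not enabled\n");
#endif
    return 0;
}
```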
Results without any -march or -ffast-math flags ...
./extra/bench-all.sh 16
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Graviton 3 | Ubuntu 22.04 | NEON | tiny | 16 | 124.25 | 320.53 |
Graviton 3 | Ubuntu 22.04 | NEON | base | 16 | 156.91 | 734.22 |
Graviton 3 | Ubuntu 22.04 | NEON | small | 16 | 301.78 | 2812.75 |
Graviton 3 | Ubuntu 22.04 | NEON | medium | 16 | 714.23 | 9139.86 |
Graviton 3 | Ubuntu 22.04 | NEON | large | 16 | 1298.33 | 18147.47 |
I have tried to improve things by using OpenBLAS and armpl.h, but they both slow it down considerably - I'll keep trying with the latter.
Are there any possibilities for further optimisations in ggml.c that can take advantage of the situation where you have bf16 functions but not BLAS or Accelerate?
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
E5-2640 | Ubuntu 18.04 | AVX2 | tiny | 8 | 235.10 | 1094.45 |
E5-2640 | Ubuntu 18.04 | AVX2 | base | 8 | 326.11 | 2307.32 |
E5-2640 | Ubuntu 18.04 | AVX2 | small | 8 | 669.31 | 7706.24 |
@matth My experiments with OpenBLAS on x86 showed that it is not faster compared to hand-written AVX2 + FP16: https://github.com/ggerganov/whisper.cpp/commit/fbd513b813ea42a500ba92be3dcfea0b6b6a4fa3
It seems this is also the case for Arm based on your experiments. My guess is that we don't see improvement because the computation is memory-bound and OpenBLAS works with FP32.
Using CBLAS on Apple Silicon is so fast because it utilizes the matrix co-processor, which is somehow very efficient even for FP32. At least this is how I explain the results that I am seeing.
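For readers wondering what the "hand-written AVX2 + FP16" path means in practice, here is a minimal illustrative sketch (not the actual ggml kernel; the function name dot_f16_f32 is made up): the weights stay in FP16 so only half the bytes cross the memory bus, and they are widened to FP32 registers with F16C right before the FMA. A BLAS sgemm would instead need the weights expanded to FP32 in memory first, doubling the bytes read - which is exactly where a memory-bound kernel loses time.

```cpp
// dot_f16.cpp - illustrative only; build with: g++ -O3 -mavx2 -mfma -mf16c -c dot_f16.cpp
#include <immintrin.h>
#include <cstdint>

// Dot product of FP16-stored weights with an FP32 activation vector.
// Assumes n is a multiple of 8 and both pointers are valid.
float dot_f16_f32(const uint16_t *w_f16, const float *x, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        // load 8 half-precision weights and widen them to single precision
        const __m256 w = _mm256_cvtph_ps(
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(w_f16 + i)));
        const __m256 v = _mm256_loadu_ps(x + i);
        acc = _mm256_fmadd_ps(w, v, acc); // acc += w * v
    }
    // horizontal sum of the 8 partial sums
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    float sum = 0.0f;
    for (int i = 0; i < 8; ++i) {
        sum += tmp[i];
    }
    return sum;
}
```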
Interesting if armpl.h can provide some more insight - I haven't used it.
The heaviest stuff in ggml.c is the mul_mat_f16 and flash_attn_f16 calls. I think the conv_1d_... calls could probably be optimized more, but they are called only once at the start of the Encoder, so the improvement would be marginal.
Also, I am just looking at whisper.cpp and I realize I have forgotten why I use Flash Attention only in the Encoder and don't also use it in the Decoder. Maybe this can help, because Flash Attention reduces the memory transfers and improves cache locality.
Not sure about bf16 compared to fp16. I don't expect it to provide a big improvement, based on a quick search through some articles about the difference between the 2 data types.
https://medium.com/swlh/apples-m1-secret-coprocessor-6599492fc1e1 gives a good write-up, if Medium doesn't try to charge you.
https://nod.ai/comparing-apple-m1-with-amx2-m1-with-neon/
Maybe after the M3 comes out I might be able to pick up a bargain M1 Mini.
I think fp16 support is coming though and may help a bit: https://github.com/xianyi/OpenBLAS/pull/3754
PS: for those of us without the secret Apple sauce, would implementing https://github.com/CNugteren/CLBlast be of any use on integrated GPUs?
OpenBLAS helps on Windows AMD64 with MSVC:
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Ryzen 5 PRO 2400GE | Windows 10 | AVX2 | medium | 4 | 4259.10 | 116609.75 |
Ryzen 5 PRO 2400GE | Windows 10 | AVX2 BLAS | medium | 4 | 4259.58 | 75312.90 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
rk3588 | Debian11 | NEON | tiny | 8 | 232.45 | 2768.78 |
rk3588 | Debian11 | NEON | base | 8 | 308.36 | 6374.82 |
rk3588 | Debian11 | NEON | small | 8 | 626.23 | 25784.05 |
rk3588 | Debian11 | NEON | medium | 8 | 1667.23 | 86026.82 |
rk3588 | Debian11 | NEON | large | 8 | 4307.16 | 161328.59 |
CFLAGS = -I. -O3 -std=c11 -ffast-math -march=native
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
rk3588 | Debian11 | NEON | tiny | 8 | 230.69 | 2078.40 |
rk3588 | Debian11 | NEON | base | 8 | 299.10 | 4379.62 |
rk3588 | Debian11 | NEON | small | 8 | 621.43 | 18565.42 |
rk3588 | Debian11 | NEON | medium | 8 | 1532.61 | 65504.91 |
rk3588 | Debian11 | NEON | large | 8 | 3618.18 | 121710.31 |
If I try to compile with OpenBLAS in a separate build, Encode becomes approximately 2x slower, so either I am doing it wrong or with Armv8.2 it's just bad; it's -march=native that seems to make the above difference.
Results on an AWS mac2.metal instance:
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
mac2.metal | OSX Ventura | NEON BLAS | tiny | 4 | 64.39 | 184.98 |
mac2.metal | OSX Ventura | NEON BLAS | base | 4 | 87.93 | 368.04 |
mac2.metal | OSX Ventura | NEON BLAS | small | 4 | 198.80 | 1212.46 |
mac2.metal | OSX Ventura | NEON BLAS | medium | 4 | 551.49 | 3552.73 |
mac2.metal | OSX Ventura | NEON BLAS | large | 4 | 1042.91 | 6726.99 |
I tried disabling Accelerate and it makes a significant difference (i.e. very much slower without it!).
I assumed Accelerate was using the Neural Engine, but using both powermetrics and asitop I cannot see any utilization; both report 0 mW power usage. Can anyone confirm on an M1 machine?
EDIT: Possibly I was confused. Apple's Matrix Coprocessor (AMX) and the Neural Engine are different things; from @ggerganov's other issues and commits it appears Accelerate might be using the former.
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i9-13900k | WSL2 Ubuntu | AVX2 | tiny | 4 | 58.49 | 360.95 |
i9-13900k | WSL2 Ubuntu | AVX2 | base | 4 | 72.44 | 756.48 |
i9-13900k | WSL2 Ubuntu | AVX2 | small | 4 | 154.37 | 2676.12 |
i9-13900k | WSL2 Ubuntu | AVX2 | medium | 4 | 393.76 | 8924.90 |
i9-13900k | WSL2 Ubuntu | AVX2 | large | 4 | 698.69 | 15862.58 |
i9-13900k | WSL2 Ubuntu | AVX2 | tiny | 8 | 55.13 | 291.51 |
i9-13900k | WSL2 Ubuntu | AVX2 | base | 8 | 70.93 | 603.33 |
i9-13900k | WSL2 Ubuntu | AVX2 | small | 8 | 141.85 | 1800.05 |
i9-13900k | WSL2 Ubuntu | AVX2 | medium | 8 | 356.29 | 5946.78 |
i9-13900k | WSL2 Ubuntu | AVX2 | large | 8 | 658.83 | 10868.89 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | tiny | 4 | 301.22 | 872.27 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | base | 4 | 405.40 | 1705.58 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | small | 4 | 921.24 | 5419.73 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | medium | 4 | 2356.76 | 15188.90 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | large | 4 | 4457.29 | 26444.06 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | tiny | 8 | 299.89 | 540.47 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | base | 8 | 419.41 | 1129.01 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | small | 8 | 888.64 | 3632.89 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | medium | 8 | 2377.96 | 10525.92 |
E5-2697 V2 | MacOS Monterey 12.6.1 | BLAS | large | 8 | 4412.20 | 18933.41 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | tiny | 4 | 307.20 | 570.86 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | base | 4 | 406.45 | 1183.90 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | small | 4 | 941.96 | 4156.69 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | medium | 4 | 3124.62 | 13072.06 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | large | 4 | 10090.85 | 36383.82 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | tiny | 8 | 299.42 | 487.26 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | base | 8 | 403.74 | 1113.54 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | small | 8 | 910.07 | 3955.48 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | medium | 8 | 2241.90 | 13076.31 |
i7-8750H | macOS Ventura 13.0.1 | AVX2 BLAS | large | 8 | 5620.87 | 25562.17 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | tiny | 4 | 158.49 | 730.72 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | base | 4 | 205.93 | 1603.67 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | small | 4 | 426.62 | 5630.58 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | medium | 4 | 1080.15 | 18748.66 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | large | 4 | 1976.77 | 37188.47 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | tiny | 8 | 159.00 | 662.07 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | base | 8 | 206.62 | 1436.59 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | small | 8 | 428.20 | 5345.27 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | medium | 8 | 1108.97 | 16780.53 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | large | 8 | 1965.67 | 32019.44 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | tiny | 12 | 157.60 | 585.65 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | base | 12 | 216.74 | 1696.32 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | small | 12 | 428.51 | 4504.18 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | medium | 12 | 1081.65 | 15442.25 |
i7-8700 | Ubuntu 20.04.4 LTS | AVX2 | large | 12 | 1969.63 | 28108.55 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i3-9100F | Ubuntu 20.04.4 LTS | AVX2 | tiny | 4 | 164.71 | 726.05 |
i3-9100F | Ubuntu 20.04.4 LTS | AVX2 | base | 4 | 214.56 | 1806.20 |
i3-9100F | Ubuntu 20.04.4 LTS | AVX2 | small | 4 | 445.48 | 6613.19 |
i3-9100F | Ubuntu 20.04.4 LTS | AVX2 | medium | 4 | 1131.80 | 22667.64 |
i3-9100F | Ubuntu 20.04.4 LTS | AVX2 | large | 4 | 7615.74 | 42137.29 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
E3-1220 V2 | Ubuntu 20.04.3 LTS | | tiny | 4 | 227.41 | 1757.56 |
E3-1220 V2 | Ubuntu 20.04.3 LTS | | base | 4 | 297.67 | 3801.48 |
E3-1220 V2 | Ubuntu 20.04.3 LTS | | small | 4 | 625.18 | 14544.59 |
E3-1220 V2 | Ubuntu 20.04.3 LTS | | medium | 4 | 9618.55 | 49937.12 |
E3-1220 V2 | Ubuntu 20.04.3 LTS | | large | 4 | 40399.48 | 71661.48 |
Has anyone tried benchmarking on WASM? It seems like the encoder takes much longer than on other platforms.
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 BLAS | ggml-tiny.en | 4 | 258.59 | 2934.34 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 BLAS | ggml-tiny | 4 | 255.46 | 2906.67 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 BLAS | ggml-base.en | 4 | 316.73 | 6197.29 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 BLAS | ggml-base | 4 | 319.93 | 5825.65 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 | ggml-tiny.en | 4 | 217.28 | 1548.92 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 | ggml-tiny | 4 | 215.59 | 1625.69 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 | ggml-base.en | 4 | 275.62 | 3823.34 |
i7-5600U @2.60GHz | Xubuntu 18.04 | AVX2 | ggml-base | 4 | 275.72 | 3740.50 |
Cortex-A53 | Android 10 | NEON | ggml-tiny.en | 8 | 399.05 | 5841.70 |
Cortex-A53 | Android 10 | NEON | ggml-tiny | 8 | 376.25 | 5548.72 |
Cortex-A53 | Android 10 | NEON | ggml-base.en | 8 | 492.92 | 12728.42 |
Cortex-A53 | Android 10 | NEON | ggml-base | 8 | 1034.48 | 13365.86 |
Test-bench properties
- Commit 3996ecc156486fb93ff505c01090d13192e72aa2
- cmake for building (mkdir build && cd build, cmake .. && make)
- gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
- clang version 15.0.2 (aarch64-unknown-linux-android24)
- fish shell snippet to run the benchmarks:
# cwd is whisper.cpp/build
# Adding `-t 8` to `bench` for aarch64
$ for model in "ggml-tiny" "ggml-base"
for suffix in "en.bin" "bin"
./bin/bench -m "../models/$model.$suffix"
end
end
Remarks
- OpenBLAS (-DWHISPER_SUPPORT_OPENBLAS=ON) deteriorates the performance!

Quite the difference between the 2017 Intel i3 4C/4T and the 2019 Ryzen Zen+ 6C/12T. And not looking good for AVX2 on the old AMD Zen+. I must admit, all in all I really envy the M1 for having that accelerator.
gcc vs clang doesn't seem to make a difference; at least it's not distinguishable from noise.
This is my home server. Tested while it was doing home server things (load 0.7). I can see this machine acting as a "whisper server" in a 2C configuration.
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Commit | Compiler |
---|---|---|---|---|---|---|---|---|
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | tiny | 1 | 88.38 | 2013.67 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | base | 1 | 113.58 | 4692.04 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | small | 1 | 225.74 | 18469.62 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | tiny | 2 | 89.55 | 1189.92 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | base | 2 | 119.97 | 2756.52 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | small | 2 | 238.71 | 10491.67 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | tiny | 4 | 201.37 | 695.39 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | base | 4 | 262.76 | 2023.16 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | small | 4 | 526.66 | 6788.01 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | medium | 4 | 3836.26 | 21889.30 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | large | 4 | 26819.67 | 60880.62 | 832b4f3 | gcc 12.2.0 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | tiny | 4 | 89.05 | 696.08 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | base | 4 | 114.65 | 1711.15 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | small | 4 | 309.30 | 6995.25 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | medium | 4 | 4854.02 | 23570.42 | 832b4f3 | clang 14.0.6 |
i3-8100 @ 3.60GHz | Arch Linux | AVX2 | large | 4 | 21415.07 | 60547.99 | 832b4f3 | clang 14.0.6 |
Just my desktop. The difference to the 5950X at 8C is really massive, but luckily it has no impact on daily usage, so I'm glad I can still hold off on upgrading to the last AM4 CPU generation :joy: Looking forward to benching CUDA on this machine (3080 Ti).
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Commit | Compiler |
---|---|---|---|---|---|---|---|---|
Ryzen 1600AF | Manjaro | AVX2 | tiny | 1 | 104.04 | 4691.38 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | base | 1 | 134.54 | 11092.84 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | small | 1 | 254.71 | 43923.42 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | tiny | 4 | 107.40 | 1336.49 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | base | 4 | 132.69 | 3062.12 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | small | 4 | 262.27 | 11655.22 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | medium | 4 | 662.81 | 38829.74 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | large | 4 | 1365.09 | 77063.30 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | tiny | 6 | 100.82 | 1007.36 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | base | 6 | 130.20 | 2472.55 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | small | 6 | 256.83 | 9311.54 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | medium | 6 | 657.89 | 28051.40 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | large | 6 | 1190.62 | 54292.72 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | tiny | 6 | 104.77 | 1012.70 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | base | 6 | 137.00 | 2212.20 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | small | 6 | 257.97 | 9296.33 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | medium | 6 | 624.04 | 28524.38 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | large | 6 | 1189.10 | 56445.31 | 832b4f3 | clang 14.0.6 |
Ryzen 1600AF | Manjaro | AVX2 | tiny | 12 | 101.41 | 898.96 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | base | 12 | 139.26 | 2200.78 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | small | 12 | 256.50 | 8125.48 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | medium | 12 | 623.59 | 29255.08 | 832b4f3 | gcc 12.2.0 |
Ryzen 1600AF | Manjaro | AVX2 | large | 12 | 1192.90 | 51902.81 | 832b4f3 | gcc 12.2.0 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Commit | Compiler |
---|---|---|---|---|---|---|---|---|
POWER9v2 | Gentoo | -Ofast -mcpu=native | base.en | 4/64 | 144.84 | 42708.33 | 85c9ac18b59125b988cda40f40d8687e1ba88a7a | clang 15.0.3 |
POWER9v2 | Gentoo | -Ofast -mcpu=native | base.en | 16/64 | 161.95 | 22302.28 | 85c9ac18b59125b988cda40f40d8687e1ba88a7a | clang 15.0.3 |
POWER9v2 | Gentoo | -Ofast -mcpu=native | base.en | 32/64 | 142.06 | 20263.56 | 85c9ac18b59125b988cda40f40d8687e1ba88a7a | clang 15.0.3 |
POWER9v2 | Gentoo | -Ofast -mcpu=native | base.en | 64/64 | 160.51 | 12645.79 | 85c9ac18b59125b988cda40f40d8687e1ba88a7a | clang 15.0.3 |
@Xavier-i WASM performance is much worse compared to native - this is expected. Today I added bench.wasm, which can be used to benchmark the performance in the browser.
Redo of my OpenVoiceOS Raspberry Pi 4 benchmark
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny.en | 4 | 735 | 9486 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base.en | 4 | 950 | 25402 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny.en | 4 | 752 | 9178 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base.en | 4 | 969 | 19642 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
And just (and only) because we can, the same on a Raspberry Pi 3B+ running the same codebase / OS
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Raspberry Pi 3B+ - 1GB | OpenVoiceOS | NEON | tiny.en | 4 | 1331 | 22573 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
Raspberry Pi 3B+ - 1GB | OpenVoiceOS | NEON | base.en | 4 | 5886 | 58733 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
Raspberry Pi 3B+ - 1GB | OpenVoiceOS | NEON BLAS | tiny.en | 4 | 1333 | 21184 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
Raspberry Pi 3B+ - 1GB | OpenVoiceOS | NEON BLAS | base.en | 4 | 4605 | 47877 | aa6adda26e1ee9843dddb013890e3312bee52cfe |
I hope this isn't misplaced but I thought it interesting to share ...
I have recently finished some tests comparing whisper.cpp runtime performance against the original PyTorch version on various GPUs and CPUs.
We test against a fixed set of long form audio files (UK TV, each file ~1 hour long, mixed speech and noise) and record the runtime as a factor of real audio time.
Depending on the software and environment, transcription can take anywhere from around 5x real-time down to 0.14x real-time to complete.
ARM-based whisper.cpp runtime is very impressive; in particular, the Apple M1 performance can match that of the original PyTorch version on NVIDIA V100 and T4 GPUs ...
CPU / GPU | OS | Config | Model | Threads | xRT Transcribe |
---|---|---|---|---|---|
Intel Xeon | Ubuntu 22.04 | whisper original - pytorch cpu | medium.en | 8 | 4.78 |
Intel Xeon | Ubuntu 22.04 | whisper.cpp - AVX2 | medium.en | 8 | 4.44 |
Graviton 3 | Ubuntu 22.04 | whisper.cpp - NEON | medium.en | 8 | 0.63 |
mac2.metal | OSX Ventura | whisper.cpp - NEON BLAS | medium.en | 4 | 0.26 |
NVIDIA V100 | Ubuntu 22.04 | whisper original - pytorch cuda | medium.en | N/A | 0.25 |
NVIDIA T4 | Ubuntu 22.04 | whisper original - pytorch cuda | medium.en | N/A | 0.25 |
NVIDIA A10G | Ubuntu 22.04 | whisper original - pytorch cuda | medium.en | N/A | 0.16 |
NVIDIA A100 | Ubuntu 22.04 | whisper original - pytorch cuda | medium.en | N/A | 0.14 |
Additionally, I did some very rough power consumption tests; again, whisper.cpp on the M1 is really impressive against PyTorch on the GPU.
Platform | Whisper Type | Model | Avg Power | Peak Power |
---|---|---|---|---|
Apple M1 | whisper.cpp | ggml-medium.en | 13202 mW | 18412 mW |
Nvidia T4 | pytorch | medium.en | 69587 mW | 85650 mW |
Thanks for the fantastic work @ggerganov - this is a really inspiring project and demonstrates the ARM FP16 functionality wonderfully. Off to buy some more Apple Macs now ;)
@matth @rgerganov I've been thinking myself that the perf/watt for ML is truly outstanding, and I just wondered whether the 8 GB model can squeeze the medium model in, as I'm not sure how memory is shared on the M1 - or is it really a case of needing the 16 GB?
@matth Thanks for the data - it's interesting to see.
However, there are some caveats that are important to be considered when benchmarking the 2 implementations that I've been meaning to discuss, so here are my thoughts on this:
At a high level, the Whisper transcription is a combination of 2 main parts:
- evaluating the transformer (the Encoder and Decoder networks)
- applying a decoding strategy on top of the transformer's output
The first part is branchless and does not depend on the audio input or the parameters that you use. For a given model, evaluating the transformer requires the same amount of operations every time. This is easy to benchmark.
The second part (decoding strategy) is different. The number of operations here depends both on the audio input contents and the decoding parameters / strategy that you use. For example, two different audio recordings with the same time length generally result in different decoded text based on the speech content and hence can take a different amount of processing (even with the same decoding parameters). Also, the decoded timestamp tokens affect how the 30s sliding window of the transcription is updated and therefore can lead to a different number of transformer evaluations in total.
My understanding is that there is no "correct" decoding strategy. The OpenAI implementation generally offers 2 different strategies - Greedy and BeamSearch. Both of them are combinations of various heuristics that aim to improve the text coherency and reduce the number of catastrophic failures.
In whisper.cpp we currently have a Greedy strategy which is similar to the one in the OpenAI repo, but is not exactly the same.
So all of this means that there is no point in comparing the 2 implementations by measuring the total time to transcribe an audio, because the decoding strategy is not the same and therefore the variation will be very large due to the factors outlined above. It only makes sense to benchmark the transformer evaluation in isolation, because it is well-defined.
That is why in the benchmarks in this issue, I chose to run the Encoder on some random input buffer. The Encoder is the heavy part of the transformer and being able to evaluate it efficiently is very important and is the most defining factor for the efficiency of the implementation. It's the "engine" of the transcription. You can then put on top of it any decoding strategy that you like and this will define how accurate your transcription is. But it does not make sense to benchmark the performance of that anymore.
I think if we want to make a fair comparison with PyTorch, we need to have the bench tool implemented in Python using PyTorch. Any other comparison will be flawed to some extent.
But in any case, your results are interesting - thanks for sharing them. What parameters did you use for the PyTorch runs?
Regarding the power consumption - I think there is more we can do in whisper.cpp. Currently, the thread synchronization uses busy loops, which is very power inefficient because it keeps the CPU at 100%, but it gives a slight performance edge. I am thinking of adding an option that uses condition variable synchronization, which will likely reduce the power usage at the cost of some performance. For some use cases, it could be beneficial to have lower power consumption.
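To illustrate the trade-off being described (a sketch only, not the actual ggml synchronization code; the names are made up): a busy-wait keeps the waiting core at 100% for the lowest wake-up latency, while a condition variable lets the thread sleep in the kernel and saves power at the cost of a slower wake-up.

```cpp
// sync_sketch.cpp - illustrative comparison of the two approaches.
#include <atomic>
#include <condition_variable>
#include <mutex>

// 1) Busy loop: lowest latency, but the waiting core spins at 100% load.
std::atomic<bool> work_ready{false};

void worker_busy_wait() {
    while (!work_ready.load(std::memory_order_acquire)) {
        // spin - burns power for the whole wait
    }
    // ... process the work ...
}

// 2) Condition variable: the thread sleeps until notified, saving power,
//    but the wake-up goes through the kernel and costs extra microseconds.
std::mutex mtx;
std::condition_variable cv;
bool work_ready_cv = false;

void worker_cv_wait() {
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait(lock, [] { return work_ready_cv; });
    // ... process the work ...
}

void signal_work() {
    {
        std::lock_guard<std::mutex> lock(mtx);
        work_ready_cv = true;
    }
    cv.notify_all();
}
```

An option to switch between the two would let always-on or battery-powered setups trade a little encode speed for a much lower power draw.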
Thanks @ggerganov, we are using PyTorch Whisper with default settings in that benchmark, so I believe that is a beam search decoder. I will see if I can test again with the greedy decoder for a more similar comparison. I think I understand your point though - these are not like-for-like implementations, so at a certain level the comparison is flawed.
I also neglected to measure the PyTorch version on the M1 & Graviton which was a huge oversight!
There's a motivation behind these benchmarks. Looking at various solutions as improvements to existing transcription capabilities - each solution in my mind is a balance of accuracy, completeness, runtime, financial cost and energy efficiency.
On one end you have paying humans to do the transcription, slow and expensive but very accurate and something that is still done at a massive scale in my industry. At the other end there are existing Kaldi models that are less accurate but incredibly fast for inference on the CPU and very cheap to run.
I feel larger transformer models like Whisper sit somewhat in the middle of all this - closer to human accuracy but increased associated costs over existing software.
But whisper.cpp adds to this: if we can get similar or even just acceptable accuracy and runtime, but on commodity hardware, the choice can start to become more about cost, efficiency and functionality. E.g. you could buy 30+ Apple Macs for the price of an NVIDIA A100 server, being able to run Whisper on a laptop enables a different set of use cases, you can cut power consumption by a huge margin, etc.
I think for me this is one of the many exciting outcomes of this project :)
@matth Yeah - the default in PyTorch when running from the command line is BeamSearch. I haven't measured it exactly, but it is significantly slower compared to Greedy.
I think regarding the total-time benchmark - it can make sense once whisper.cpp reaches the accuracy of OpenAI. Currently, due to the inferior decoding, whisper.cpp has lower transcription accuracy (based on some results I saw floating around). But when the decoding gets improved and we have comparable accuracy, then we can make a benchmark that says:
"for a given word error rate (WER) the 2 implementation take this amount of processing time on average, over some large set of audio"
And another thing I was thinking is that even if today whisper.cpp is more efficient on Apple Macs - it is not always going to be the case. If I understand correctly, it's just a matter of time for the proper Apple Silicon frameworks (Metal, MPS, etc.) to become supported in PyTorch, Tensorflow, etc., and when this happens (probably very soon), the performance of whisper.cpp will be the same or possibly worse.
So yeah - just trying to adjust expectations :) Will probably write some more on this in the F.A.Q. discussion.
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | tiny.en | 4 | 175 | 360 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | base.en | 4 | 233 | 736 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | small.en | 4 | 507 | 2400 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | medium.en | 4 | 1333 | 6860 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
Using 8 threads is slightly slower to load, faster to encode:
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | tiny.en | 8 | 185 | 283 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | base.en | 8 | 241 | 579 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | small.en | 8 | 526 | 1959 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
i9-9900K @ 3.60GHz | macOS 12.6.2 | AVX2 BLAS | medium.en | 8 | 1390 | 6271 | 7282e2109e0748421ee73271496f5911ca2b89a7 |
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
MacBookPro M1 Max | macOS 12.6 | NEON BLAS | tiny | 8 | 65 | 108 | a593b93 |
MacBookPro M1 Max | macOS 12.6 | NEON BLAS | base | 8 | 86 | 250 | a593b93 |
MacBookPro M1 Max | macOS 12.6 | NEON BLAS | small | 8 | 185 | 789 | a593b93 |
MacBookPro M1 Max | macOS 12.6 | NEON BLAS | medium | 8 | 493 | 2126 | a593b93 |
MacBookPro M1 Max | macOS 12.6 | NEON BLAS | large | 8 | 955 | 3860 | a593b93 |
There are actually 10 threads, but when using -t 10 the performance goes down. Lower numbers (such as -t 4) result in similar load performance, but slower encode (although not linear).
AMD Ryzen 5 3400G (4 CPU cores, 8 threads) on Ubuntu 22.10 with 5.19.0-26-generic Kernel
4 threads
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
3400G | Ubuntu 22.10 | AVX2 | tiny | 4 | 163 | 1415 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | tiny.en | 4 | 175 | 1351 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | base.en | 4 | 200 | 3095 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | base | 4 | 205 | 3241 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | small.en | 4 | 412 | 12343 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | small | 4 | 421 | 11983 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | medium.en | 4 | 995 | 38818 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | medium | 4 | 1006 | 38573 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | large-v1 | 4 | | | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | large | 4 | 1870 | 77302 | 0be6a1a |
8 threads is just marginally better
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
3400G | Ubuntu 22.10 | AVX2 | tiny.en | 8 | 191 | 1275 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | tiny | 8 | 183 | 1258 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | base.en | 8 | 232 | 2894 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | base | 8 | 231 | 2927 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | small.en | 8 | 435 | 11299 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | small | 8 | 414 | 11511 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | medium.en | 8 | 1011 | 37557 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | medium | 8 | 1049 | 37306 | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | large-v1 | 8 | | | 0be6a1a |
3400G | Ubuntu 22.10 | AVX2 | large | 8 | 3237 | 77396 | 0be6a1a |
Someone mentioned BLAS?
Encoder
Collection of bench results for various platforms and devices. If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh script and report the results in the comments below.
Suggestions for a better summary of the results are welcome.
memcpy
MacBook M1 Pro
Ryzen 9 5950X
ggml_mul_mat
MacBook M1 Pro
Ryzen 9 5950X