ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Benchmark results #89

Open ggerganov opened 1 year ago

ggerganov commented 1 year ago

Encoder

Collection of bench results for various platforms and devices. If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh script and report the results in the comments below.

Suggestions for a better summary of the results are welcome.
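For anyone running this for the first time, the flags used throughout this thread are -t for threads and -w to pick a micro-benchmark (-w 1 is memcpy, -w 2 is ggml_mul_mat; with no -w it benchmarks the encoder on the default base.en model), while bench-all.sh takes the thread count as its only argument:

make bench
./bench -w 1 -t 1          # memcpy bandwidth
./bench -w 2 -t 1          # ggml_mul_mat GFLOPS
./bench -t 4               # encoder benchmark on the default model
./extra/bench-all.sh 4     # all models, 4 threads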

CPU OS Config Model Th Load Enc. Commit
MacBook M1 Pro MacOS 13.0.1 NEON BLAS tiny 8 71 102 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS base 8 96 220 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 8 233 685 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS medium 8 603 1928 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS large 8 1158 3350 206fc93
---
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 1 251 2605 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 4 255 884 206fc93
---
Mac Mini M1 MacOS NEON BLAS tiny 4 62 194 fcf515d
Mac Mini M1 MacOS NEON BLAS base 4 81 380 fcf515d
Mac Mini M1 MacOS NEON BLAS small 4 204 1249 fcf515d
Mac Mini M1 MacOS NEON BLAS medium 4 876 3980 fcf515d
Mac Mini M1 MacOS NEON BLAS large 4 1876 7979 fcf515d
---
Ryzen 9 3900X Ubuntu 20.04 AVX2 tiny 8 107 422 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 base 8 137 880 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 small 8 280 2874 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 medium 8 692 9610 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 large 8 1317 16917 fcf515d
---
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS tiny 4 120 780 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS base 4 151 1173 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS small 4 289 3062 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS medium 4 711 9175 fcf515d
Ryzen 9 3900X Ubuntu 20.04 AVX2 BLAS large 4 1282 16050 fcf515d
---
Ryzen 9 5950X Ubuntu 22.04 AVX2 tiny 8 135 197 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 base 8 176 421 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 small 8 357 1393 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 medium 8 855 4404 fcf515d
Ryzen 9 5950X Ubuntu 22.04 AVX2 large 8 1576 8118 fcf515d
---
Raspberry Pi 4 NEON tiny 4 1436 13839 fcf515d
Raspberry Pi 4 NEON base 4 1894 30552 fcf515d
---
iPhone 13 Mini iOS 16.0 NEON BLAS base 4 97 1091 fcf515d
---
MacBook M1 Pro Vivaldi WASM tiny 8 133 3785 fcf515d
MacBook M1 Pro Vivaldi WASM base 8 172 8253 fcf515d
---
MacBook M1 Pro Chrome WASM tiny 8 134 3776 fcf515d
MacBook M1 Pro Chrome WASM base 8 168 8200 fcf515d
---
MacBook M1 Pro Firefox WASM tiny 8 137 2626 fcf515d
MacBook M1 Pro Firefox WASM base 8 183 6226 fcf515d

memcpy

MacBook M1 Pro

./bench -w 1 -t 1
memcpy: 37.59 GB/s

Ryzen 9 5950X

./bench -w 1 -t 1
memcpy: 16.74 GB/s

ggml_mul_mat

MacBook M1 Pro

./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16    330.6 GFLOPS (128 runs) / F32    466.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16    737.5 GFLOPS (128 runs) / F32    838.9 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    938.6 GFLOPS (128 runs) / F32   1062.3 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16   1312.5 GFLOPS (128 runs) / F32   1835.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1765.1 GFLOPS (128 runs) / F32   2041.4 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1784.3 GFLOPS (104 runs) / F32   1859.2 GFLOPS (109 runs)
ggml_mul_mat:  4096 x  4096: F16   1855.1 GFLOPS ( 14 runs) / F32   1873.3 GFLOPS ( 14 runs)

Ryzen 9 5950X

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     56.3 GFLOPS (128 runs) / F32     70.2 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     47.8 GFLOPS (128 runs) / F32     67.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    185.1 GFLOPS (128 runs) / F32    332.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    386.4 GFLOPS (128 runs) / F32    658.6 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    636.2 GFLOPS (128 runs) / F32   1012.0 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    950.9 GFLOPS ( 56 runs) / F32   1296.8 GFLOPS ( 76 runs)
ggml_mul_mat:  4096 x  4096: F16   1168.6 GFLOPS (  9 runs) / F32   1403.1 GFLOPS ( 11 runs)
bmilde commented 1 year ago

What's the performance gain of this compared to the original PyTorch implementation compiled with AVX support, or to the PyTorch M1 backend?

Does this implementation use beam decoding? (The original PyTorch implementation has n=5 as the default and is 100% faster with n=1.)

Edit: the README already mentions it's greedy decoding:

Very basic greedy sampling scheme - always pick up the token with highest probability. This should be similar to the GreedyDecoder from the original python implementation, so in order to make a fair comparison between the 2 implementations, make sure to run the python code with the following parameters:

whisper --best_of None --beam_size None ...

Greedy decoding is also 2x faster in the original implementation (on a GPU).
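For reference, an apples-to-apples greedy comparison might look like this (paths are the repo's default sample and model; illustrative only):

./main -m models/ggml-base.en.bin -f samples/jfk.wav                      # whisper.cpp, greedy by default
whisper samples/jfk.wav --model base.en --best_of None --beam_size None   # reference implementation, same greedy decoder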

StuartIanNaylor commented 1 year ago

Orange Pi 5, 4 GB, running from micro-SD (not NVMe).

It starts to touch zram swap on medium, and then hits file swap pretty hard on large.

CPU OS Config Model Th Load Enc. Commit
rk3588s Bullseye 5.10.110 NEON tiny 8 352 2876 0be6a1a
rk3588s Bullseye 5.10.110 NEON base 8 346 6213 0be6a1a
rk3588s Bullseye 5.10.110 NEON small 8 690 25808 0be6a1a
rk3588s Bullseye 5.10.110 NEON medium 8 23987 93995 0be6a1a
rk3588s Bullseye 5.10.110 NEON large 8 49633 190601 0be6a1a

Even though it's a 4:4 big:LITTLE part, it's a touch faster pinned to the big cores with taskset -c 4-7 ./extra/bench-all.sh

CPU OS Config Model Th Load Enc. Commit
rk3588s Bullseye 5.10.110 NEON tiny 4 356 2716 0be6a1a
rk3588s Bullseye 5.10.110 NEON base 4 417 6661 0be6a1a
rk3588s Bullseye 5.10.110 NEON small 4 943 25357 0be6a1a
rk3588s Bullseye 5.10.110 NEON medium 4 17748 90187 0be6a1a
rk3588s Bullseye 5.10.110 NEON large 4 48793 182800 0be6a1a

Compiling on the rk3588 with -march=native -ffast-math seems to give a big boost, again with taskset -c 4-7 ./extra/bench-all.sh (a reproduction sketch follows the table).

CPU OS Config Model Th Load Enc. Commit
rk3588s Bullseye 5.10.110 NEON tiny 4 280 1074 0be6a1a
rk3588s Bullseye 5.10.110 NEON base 4 466 3491 0be6a1a
rk3588s Bullseye 5.10.110 NEON small 4 780 11052 0be6a1a
rk3588s Bullseye 5.10.110 NEON medium 4 15361 42252 0be6a1a
rk3588s Bullseye 5.10.110 NEON large 4 49331 91892 0be6a1a
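A rough sketch of the pinned, natively-tuned run above; the flag change itself generally means editing CFLAGS/CXXFLAGS in the Makefile, since a plain environment override may be ignored:

make clean && make main bench           # after adding -march=native -ffast-math to the Makefile
taskset -c 4-7 ./extra/bench-all.sh     # cores 4-7 are the big Cortex-A76 cores on RK3588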
abitofevrything commented 1 year ago

Intel Celeron N4120 (4 cores, 4 threads) on Artix Linux 6.0.12-artix1-1.

CPU OS Config Model Th Load Enc. Commit
N4120 Artix 6.0.12-artix1-1 BLAS tiny 4 330 12272 65fdcbb
N4120 Artix 6.0.12-artix1-1 BLAS base 4 65fdcbb
N4120 Artix 6.0.12-artix1-1 BLAS small 4 892 83209 65fdcbb
N4120 Artix 6.0.12-artix1-1 BLAS medium 4 5478 237677 65fdcbb
JKeddo95 commented 1 year ago

Base 14-inch M1 MacBook Pro with NEON enabled:

CPU OS Config RAM (GB) Th Model Load (ms) Enc. (ms) Total
M1 Pro OSX 12.5.1 NEON 16 8 Tiny.en 107 269.72 376.91
M1 Pro OSX 12.5.1 NEON 16 8 Base.en 92 321 413.77
M1 Pro OSX 12.5.1 NEON 16 8 Small.en 264 978 1243.24

16-inch base Apple M2 Pro results

CPU OS Config RAM (GB) Th Model Load (ms) Enc. (ms) Total (ms)
M2 Pro OSX 13.2 NEON 16 8 Tiny.en 118 143 261
M2 Pro OSX 13.2 NEON 16 8 Tiny 118 143 261
M2 Pro OSX 13.2 NEON 16 8 Base.en 173 235 408
M2 Pro OSX 13.2 NEON 16 8 Base 148 266 414
M2 Pro OSX 13.2 NEON 16 8 Small.en 304 739 1042
M2 Pro OSX 13.2 NEON 16 8 Small 277(?) 720 997
M2 Pro OSX 13.2 NEON 16 8 Medium.en 747 2057 2804
M2 Pro OSX 13.2 NEON 16 8 Medium 657 2055 2712
M2 Pro OSX 13.2 NEON 16 8 Large 2126 4223 6349

I couldn't get bench to run on my iPhone 12, so I have attached my ad-hoc results below with the input audio "I love transcriber apps":

CPU DGGML_USE_ACCELERATE OS Model Load Mel Sample Enc. Dec. Total (ms)
A14 Release IOS 16.1 Base.en 150 23 2 2447 112 2584

--

This might appear obvious to some, but it wasn't to me, so I'll note it here: I saw much better results using larger step lengths and sample sizes with "./stream". I feel like, under the hood, Whisper relies heavily on whole-sentence context to infer individual words.
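For the curious, a longer-window invocation might look like the following; the flags are the stream example's documented --step/--length (in milliseconds), and the values here are illustrative:

./stream -m ./models/ggml-base.en.bin -t 8 --step 4000 --length 8000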

j1nx commented 1 year ago

With the new 1.1.0 beta release. At first glance, not too much difference. I will not rebuild without OpenBLAS, as it was slightly better with it on the RPi 4.

CPU OS Config Model Th Load Enc. Commit
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny 4 751 9506 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS tiny.en 4 748 9295 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base 4 971 23512 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS base.en 4 958 24263 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS small 4 2238 84720 ecda7f786a
Raspberry Pi 4 - 2GB OpenVoiceOS NEON BLAS small.en 4 3880 86031 ecda7f786a
fquirin commented 1 year ago

Results on 12th Gen Intel(R) Core(TM) i3-12300T:

CPU OS Config Model Th Load Enc. Commit
Core i3-12300T Debian 11 (Docker on Win11) AVX2 tiny.en 4 97 679 49b529b
Core i3-12300T Debian 11 (Docker on Win11) AVX2 tiny 4 90 580 49b529b
Core i3-12300T Debian 11 (Docker on Win11) AVX2 base 4 138 1478 49b529b

With OpenBLAS (considerably worse):

CPU OS Config Model Th Load Enc. Commit
Core i3-12300T Debian 11 (Docker on Win11) AVX2 BLAS tiny 4 117 1644 49b529b
Core i3-12300T Debian 11 (Docker on Win11) AVX2 BLAS base 4 122 2890 49b529b
johtso commented 1 year ago

The benchmarks for the MacBook Pro M1 use 8 threads, but in my experience it runs nearly twice as fast with 4 threads. Am I missing something?

Edit: I just ran the benchmark with the large model, and it actually made almost no difference whether 8 or 4 threads were used. But with real-world workloads it makes a huge difference. Interesting.
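(A quick way to check this yourself, since bench-all.sh takes the thread count as its only argument:)

./extra/bench-all.sh 4
./extra/bench-all.sh 8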

StuartIanNaylor commented 1 year ago
Running memcpy benchmark with 1 thread
memcpy: 8.66 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat:    64 x    64: F16      4.2 GFLOPS (128 runs) / F32      3.5 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     10.1 GFLOPS (128 runs) / F32      6.3 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     13.0 GFLOPS (128 runs) / F32      7.2 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     14.0 GFLOPS ( 53 runs) / F32      7.1 GFLOPS ( 27 runs)
ggml_mul_mat:  1024 x  1024: F16     29.8 GFLOPS ( 15 runs) / F32     17.8 GFLOPS (  9 runs)
ggml_mul_mat:  2048 x  2048: F16     37.8 GFLOPS (  3 runs) / F32     19.6 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     40.0 GFLOPS (  3 runs) / F32     17.4 GFLOPS (  3 runs)
Running benchmark for all models
CPU OS Config Model Th Load Enc. Commit
rk3588s Ubuntu 22.04 NEON tiny 4 257 1179 21c569b
rk3588s Ubuntu 22.04 NEON base 4 326 2967 21c569b
rk3588s Ubuntu 22.04 NEON small 4 661 10560 21c569b
rk3588s Ubuntu 22.04 NEON medium 4 23188 35867 21c569b
mscdex commented 1 year ago

Compiler: gcc version 12.2.0 (Ubuntu 12.2.0-3ubuntu1)

memcpy: 16.74 GB/s
sum:    error -536870997.000000
Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16     16.2 GFLOPS (128 runs) / F32     16.4 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     70.1 GFLOPS (128 runs) / F32     66.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    133.9 GFLOPS (128 runs) / F32    105.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    161.2 GFLOPS (128 runs) / F32    109.3 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    204.4 GFLOPS ( 96 runs) / F32    121.9 GFLOPS ( 57 runs)
ggml_mul_mat:  2048 x  2048: F16    254.4 GFLOPS ( 15 runs) / F32    149.3 GFLOPS (  9 runs)
ggml_mul_mat:  4096 x  4096: F16    184.2 GFLOPS (  3 runs) / F32     54.1 GFLOPS (  3 runs)

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat:    64 x    64: F16      8.4 GFLOPS (128 runs) / F32      9.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     58.1 GFLOPS (128 runs) / F32     57.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    170.3 GFLOPS (128 runs) / F32    159.9 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    315.7 GFLOPS (128 runs) / F32    230.8 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    356.0 GFLOPS (128 runs) / F32    224.9 GFLOPS (105 runs)
ggml_mul_mat:  2048 x  2048: F16    499.5 GFLOPS ( 30 runs) / F32    292.4 GFLOPS ( 18 runs)
ggml_mul_mat:  4096 x  4096: F16    265.9 GFLOPS (  3 runs) / F32     66.2 GFLOPS (  3 runs)

Running ggml_mul_mat benchmark with 16 threads

ggml_mul_mat:    64 x    64: F16      3.6 GFLOPS (128 runs) / F32      3.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     16.7 GFLOPS (128 runs) / F32     27.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     88.1 GFLOPS (128 runs) / F32    126.7 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    263.5 GFLOPS (128 runs) / F32    229.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    396.1 GFLOPS (128 runs) / F32    272.8 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    498.6 GFLOPS ( 30 runs) / F32    314.9 GFLOPS ( 19 runs)
ggml_mul_mat:  4096 x  4096: F16    337.7 GFLOPS (  3 runs) / F32    112.0 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 tiny.en 4 104 247 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 base.en 4 130 585 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 small.en 4 264 1940 78f1661
--- -- ------ ----- -- ---- ---- ------
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 tiny.en 8 99 166 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 base.en 8 123 329 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 small.en 8 262 1148 78f1661
--- -- ------ ----- -- ---- ---- ------
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 tiny.en 16 100 160 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 base.en 16 123 338 78f1661
Ryzen 7700X (8C/16T 65W Eco Mode) Ubuntu 22.10 (6.0.9 Kernel) AVX2 small.en 16 262 1139 78f1661
braydenm commented 1 year ago

Tested on my M2 MacBook Air:

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

./extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 31.42 GB/s
sum:    ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16     11.8 GFLOPS (128 runs) / F32     10.6 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     89.9 GFLOPS (128 runs) / F32     74.7 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    434.5 GFLOPS (128 runs) / F32    419.9 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    885.4 GFLOPS (128 runs) / F32    913.2 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1023.4 GFLOPS (128 runs) / F32   1037.7 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    971.6 GFLOPS ( 57 runs) / F32    950.1 GFLOPS ( 56 runs)
ggml_mul_mat:  4096 x  4096: F16    914.9 GFLOPS (  7 runs) / F32    820.7 GFLOPS (  6 runs)

CPU OS Config Model Th Load Enc. Commit
M2 OSX 13.0.1 NEON BLAS tiny 4 63 153 1a91c19
M2 OSX 13.0.1 NEON BLAS base 4 92 329 1a91c19
M2 OSX 13.0.1 NEON BLAS small 4 198 1014 1a91c19
M2 OSX 13.0.1 NEON BLAS medium 4 564 3042 1a91c19
M2 OSX 13.0.1 NEON BLAS large 4 1152 5466 1a91c19

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat:    64 x    64: F16      5.7 GFLOPS (128 runs) / F32      3.9 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     45.0 GFLOPS (128 runs) / F32     25.8 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    272.7 GFLOPS (128 runs) / F32    166.1 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    747.6 GFLOPS (128 runs) / F32    748.8 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    998.7 GFLOPS (128 runs) / F32    895.8 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    716.0 GFLOPS ( 42 runs) / F32    717.2 GFLOPS ( 42 runs)
ggml_mul_mat:  4096 x  4096: F16    790.4 GFLOPS (  6 runs) / F32    726.3 GFLOPS (  6 runs)

CPU OS Config Model Th Load Enc. Commit
M2 OSX 13.0.1 NEON BLAS tiny 8 66 154 1a91c19
M2 OSX 13.0.1 NEON BLAS base 8 92 346 1a91c19
M2 OSX 13.0.1 NEON BLAS small 8 211 1171 1a91c19
M2 OSX 13.0.1 NEON BLAS medium 8 562 3848 1a91c19
M2 OSX 13.0.1 NEON BLAS large 8 1079 6230 1a91c19
febriansasi commented 1 year ago

This is the bench result:

whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  = 500.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size  = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     = 140.60 MB
whisper_model_load: model size    = 140.54 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

whisper_print_timings: fallbacks   = 0 p / 0 h
whisper_print_timings: load time   = 1245.39 ms
whisper_print_timings: mel time    = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 88596.32 ms / 1 runs (88596.32 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time  = 89841.85 ms

This is the cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping        : 7
microcode       : 0x2f
cpu MHz         : 2990.383
cache size      : 3072 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags       : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips        : 4983.97
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

(Processors 1-3 report identical values apart from apicid, initial apicid, core id, and cpu MHz.)

./bench -w 1 -t 1

memcpy: 3.35 GB/s
sum:    error -536870997.000000

./bench -w 2 -t 1

ggml_mul_mat:    64 x    64: F16      0.7 GFLOPS (128 runs) / F32      3.3 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16      0.7 GFLOPS (128 runs) / F32      3.7 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16      0.6 GFLOPS ( 18 runs) / F32      3.3 GFLOPS ( 99 runs)
ggml_mul_mat:   512 x   512: F16      0.6 GFLOPS (  3 runs) / F32      3.6 GFLOPS ( 14 runs)
ggml_mul_mat:  1024 x  1024: F16      0.7 GFLOPS (  3 runs) / F32      2.3 GFLOPS (  3 runs)
ggml_mul_mat:  2048 x  2048: F16      0.7 GFLOPS (  3 runs) / F32      2.4 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16      1.2 GFLOPS (  3 runs) / F32      3.0 GFLOPS (  3 runs)

ThinkPad T520, on Linux Mint Debian Edition, with the AVX1 flag commented out in the Makefile.

rainmanjam commented 1 year ago

Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 38.84 GB/s
sum:    ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      9.8 GFLOPS (128 runs) / F32      8.4 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     69.4 GFLOPS (128 runs) / F32     62.1 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    455.3 GFLOPS (128 runs) / F32    383.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16   1141.1 GFLOPS (128 runs) / F32   1550.2 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   2302.0 GFLOPS (128 runs) / F32   2962.9 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   3035.6 GFLOPS (128 runs) / F32   3217.5 GFLOPS (128 runs)
ggml_mul_mat:  4096 x  4096: F16   3431.7 GFLOPS ( 25 runs) / F32   3510.6 GFLOPS ( 26 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
M1 Ultra 13.2 NEON BLAS tiny 4 71 139 2bee265
M1 Ultra 13.2 NEON BLAS base 4 95 266 2bee265
M1 Ultra 13.2 NEON BLAS small 4 222 806 2bee265
M1 Ultra 13.2 NEON BLAS medium 4 598 2175 2bee265
M1 Ultra 13.2 NEON BLAS large 4 1165 3895 2bee265
fitzsim commented 1 year ago

Here are new results for POWER9, now that #300 is closed.

Running memcpy benchmark with 1 thread

memcpy: 6.32 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 32 threads

ggml_mul_mat:    64 x    64: F16      0.4 GFLOPS (128 runs) / F32      0.4 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16      2.8 GFLOPS (128 runs) / F32      2.8 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     13.4 GFLOPS (128 runs) / F32     23.0 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     32.9 GFLOPS (123 runs) / F32     87.9 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16     47.9 GFLOPS ( 23 runs) / F32    127.4 GFLOPS ( 60 runs)
ggml_mul_mat:  2048 x  2048: F16     58.5 GFLOPS (  4 runs) / F32     67.3 GFLOPS (  4 runs)
ggml_mul_mat:  4096 x  4096: F16     23.8 GFLOPS (  3 runs) / F32     21.2 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!
CPU OS Config Model Th Load Enc. Commit Compiler
POWER9 Debian 11 tiny 32 75 1283 3b010f9 GCC 10.2.1
POWER9 Debian 11 base 32 96 2786 3b010f9 GCC 10.2.1
POWER9 Debian 11 small 32 182 8534 3b010f9 GCC 10.2.1
POWER9 Debian 11 medium 32 463 22282 3b010f9 GCC 10.2.1
POWER9 Debian 11 large 32 838 41106 3b010f9 GCC 10.2.1
FlippFuzz commented 1 year ago

I got referred here from https://github.com/openai/whisper/discussions/978#discussioncomment-5093839. This seems really interesting.

I'm running on Oracle Cloud's free tier, which provides 4 Ampere A1 cores and 24 GB of RAM.


Compiler:

I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

Default (no changes)

~/whisper.cpp$ extra/bench-all.sh
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 10.92 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      1.0 GFLOPS (128 runs) / F32      0.7 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     16.8 GFLOPS (128 runs) / F32     13.2 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     18.5 GFLOPS (128 runs) / F32     41.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     21.5 GFLOPS ( 81 runs) / F32     35.4 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16     23.2 GFLOPS ( 11 runs) / F32     41.4 GFLOPS ( 20 runs)
ggml_mul_mat:  2048 x  2048: F16     23.4 GFLOPS (  3 runs) / F32     32.6 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     22.5 GFLOPS (  3 runs) / F32     21.4 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 83 1832 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 120 4767 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 273 17529 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 739 59794 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1436 115771 ca21f7a

With the changes mentioned in https://github.com/openai/whisper/discussions/978#discussioncomment-5093839. Thanks again @jan-grzybek-ampere!

~/whisper.cpp$ extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 10.88 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      2.0 GFLOPS (128 runs) / F32      1.7 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     14.3 GFLOPS (128 runs) / F32     33.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     40.7 GFLOPS (128 runs) / F32     54.3 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     97.5 GFLOPS (128 runs) / F32     31.4 GFLOPS (117 runs)
ggml_mul_mat:  1024 x  1024: F16     87.1 GFLOPS ( 41 runs) / F32     41.0 GFLOPS ( 20 runs)
ggml_mul_mat:  2048 x  2048: F16     74.3 GFLOPS (  5 runs) / F32     33.4 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     50.4 GFLOPS (  3 runs) / F32     21.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 84 619 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 124 2036 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 293 5872 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 817 22064 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1446 37996 ca21f7a
FlippFuzz commented 1 year ago

I've done a bit of reading and ran several more tests.

According to https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu, the recommendation is to use -mcpu=native, and I did indeed get the best performance with it. I will put in a pull request to use -mcpu=native for aarch64. There was no significant difference between GCC 11.3 and GCC 12.1 on Ubuntu 22.04.
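For context, the change amounts to roughly this Makefile fragment (a sketch; the actual pull request may differ):

ifneq ($(filter aarch64%,$(UNAME_M)),)
	CFLAGS   += -mcpu=native
	CXXFLAGS += -mcpu=native
endif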


-march=armv8.2-a+fp16, gcc-11.3

Performance seems slightly worse compared to yesterday's test in https://github.com/ggerganov/whisper.cpp/issues/89#issuecomment-1443688585. I re-ran all of the following tests one after another to hopefully obtain comparable figures. This is a free instance on Oracle Cloud, and perhaps others are using the other cores on the CPU.

make clean
make main bench
./extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 10.82 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      1.8 GFLOPS (128 runs) / F32      2.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     40.7 GFLOPS (128 runs) / F32     12.7 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     52.9 GFLOPS (128 runs) / F32     32.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     97.3 GFLOPS (128 runs) / F32     32.1 GFLOPS (120 runs)
ggml_mul_mat:  1024 x  1024: F16     77.0 GFLOPS ( 36 runs) / F32     35.1 GFLOPS ( 17 runs)
ggml_mul_mat:  2048 x  2048: F16     64.0 GFLOPS (  4 runs) / F32     25.9 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     45.8 GFLOPS (  3 runs) / F32     21.0 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 85 662 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 121 2039 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 281 6667 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 760 25355 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1456 45563 ca21f7a

-mcpu=native, gcc-11.3

make clean
make main bench
./extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 10.85 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      7.9 GFLOPS (128 runs) / F32      1.8 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16      7.5 GFLOPS (128 runs) / F32     12.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     51.8 GFLOPS (128 runs) / F32     54.4 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     96.3 GFLOPS (128 runs) / F32     31.2 GFLOPS (117 runs)
ggml_mul_mat:  1024 x  1024: F16     74.1 GFLOPS ( 35 runs) / F32     33.5 GFLOPS ( 16 runs)
ggml_mul_mat:  2048 x  2048: F16     67.1 GFLOPS (  4 runs) / F32     27.0 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     49.3 GFLOPS (  3 runs) / F32     21.7 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 85 655 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 121 2002 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 283 6923 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 762 24085 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1459 43846 ca21f7a

-mcpu=native, gcc-12.1

make clean
make CC=gcc-12 CXX=g++-12 main bench
./extra/bench-all.sh

Running memcpy benchmark with 1 thread

memcpy: 11.01 GB/s
sum:    error 136902082731.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      8.0 GFLOPS (128 runs) / F32      8.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     12.0 GFLOPS (128 runs) / F32     12.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     55.7 GFLOPS (128 runs) / F32     41.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     95.1 GFLOPS (128 runs) / F32     30.2 GFLOPS (113 runs)
ggml_mul_mat:  1024 x  1024: F16     67.1 GFLOPS ( 32 runs) / F32     33.0 GFLOPS ( 16 runs)
ggml_mul_mat:  2048 x  2048: F16     64.2 GFLOPS (  4 runs) / F32     26.8 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     46.1 GFLOPS (  3 runs) / F32     21.4 GFLOPS (  3 runs)
CPU OS Config Model Th Load Enc. Commit
Ampere A1 Ubuntu 22.04 NEON tiny 4 84 613 ca21f7a
Ampere A1 Ubuntu 22.04 NEON base 4 122 2086 ca21f7a
Ampere A1 Ubuntu 22.04 NEON small 4 286 6375 ca21f7a
Ampere A1 Ubuntu 22.04 NEON medium 4 761 24667 ca21f7a
Ampere A1 Ubuntu 22.04 NEON large 4 1457 43826 ca21f7a
jaybinks commented 1 year ago

I confirmed your findings, and interestingly enough, I found the performance worse with OpenBLAS.


NathanSweet commented 1 year ago

whisper-bin-x64

>bench.exe
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   109.45 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =   919.30 ms /     1 runs (  919.30 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1032.75 ms
>bench -w 1 -t 1
memcpy: 24.58 GB/s
sum:    error -536870819.000000
>bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     22.7 GFLOPS (128 runs) / F32     38.7 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     34.6 GFLOPS (128 runs) / F32     45.6 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     44.2 GFLOPS (128 runs) / F32     54.5 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     50.5 GFLOPS (128 runs) / F32     55.3 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16     53.2 GFLOPS ( 25 runs) / F32     65.7 GFLOPS ( 31 runs)
ggml_mul_mat:  2048 x  2048: F16     54.9 GFLOPS (  4 runs) / F32     61.8 GFLOPS (  4 runs)
ggml_mul_mat:  4096 x  4096: F16     50.7 GFLOPS (  3 runs) / F32     19.9 GFLOPS (  3 runs)

That last one is slower than the 5950X above, which is weird. OpenBLAS results below:

whisper-blas-bin-x64

>bench
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   101.76 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =   602.63 ms /     1 runs (  602.63 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   705.80 ms
>bench -w 1 -t 1
memcpy: 24.30 GB/s
sum:    error -536870819.000000
>bench -w 2 -t 1
ggml_mul_mat:    64 x    64: F16     89.4 GFLOPS (128 runs) / F32    119.6 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     27.6 GFLOPS (128 runs) / F32     31.0 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    172.9 GFLOPS (128 runs) / F32    222.0 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    596.8 GFLOPS (128 runs) / F32    926.4 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1257.0 GFLOPS (128 runs) / F32   1887.7 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1726.5 GFLOPS (101 runs) / F32   2193.9 GFLOPS (128 runs)
ggml_mul_mat:  4096 x  4096: F16   2109.8 GFLOPS ( 16 runs) / F32   2237.5 GFLOPS ( 17 runs)
tim-gromeyer commented 1 year ago

memcpy: 7.20 GB/s
sum:    error -536870997.000000

CPU OS Config Model Th Load Enc. Commit
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 tiny 4 109 3417 09e9068
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 base 4 180 7907 09e9068
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 small 4 419 30899 09e9068
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 medium 4 1851 106542 09e9068
AMD Ryzen 3 3200U Linux Mint 21.1 AVX2 large 4 4715 203455 09e9068
Karl-Han commented 1 year ago

memcpy: 15.57 GB/s

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat:    64 x    64: F16      6.1 GFLOPS (128 runs) / F32      6.2 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     40.1 GFLOPS (128 runs) / F32     38.7 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    147.9 GFLOPS (128 runs) / F32    110.1 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    264.9 GFLOPS (128 runs) / F32    134.4 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    289.5 GFLOPS (128 runs) / F32    151.9 GFLOPS ( 71 runs)
ggml_mul_mat:  2048 x  2048: F16    290.6 GFLOPS ( 17 runs) / F32     70.7 GFLOPS (  5 runs)
ggml_mul_mat:  4096 x  4096: F16    114.0 GFLOPS (  3 runs) / F32     62.7 GFLOPS (  3 runs)

CPU OS Config Model Th Load Enc. Commit
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 tiny 8 50 361 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 base 8 70 1000 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 small 8 185 2264 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 medium 8 587 8421 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 large 8 2296 15759 09e9068

Running ggml_mul_mat benchmark with 16 threads

ggml_mul_mat:    64 x    64: F16      2.1 GFLOPS (128 runs) / F32      1.9 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     19.6 GFLOPS (128 runs) / F32     14.8 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     68.1 GFLOPS (128 runs) / F32     84.5 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    200.5 GFLOPS (128 runs) / F32    141.4 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    271.0 GFLOPS (127 runs) / F32    163.7 GFLOPS ( 77 runs)
ggml_mul_mat:  2048 x  2048: F16    205.5 GFLOPS ( 12 runs) / F32     71.6 GFLOPS (  5 runs)
ggml_mul_mat:  4096 x  4096: F16    142.3 GFLOPS (  3 runs) / F32     63.0 GFLOPS (  3 runs)

CPU OS Config Model Th Load Enc. Commit
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 tiny 16 52 329 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 base 16 72 723 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 small 16 188 2214 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 medium 16 698 10889 09e9068
AMD Ryzen 7 5800HS Linux RHEL8.7 AVX2 large 16 1619 16640 09e9068
owengaspard commented 1 year ago

MacBook Pro 14" with M2 Pro

CPU OS Config Model Th Load Enc. Commit
Apple M2 Pro macOS 13.2 NEON BLAS tiny 8 76 161 09e9068
Apple M2 Pro macOS 13.2 NEON BLAS base 8 104 318 09e9068
Apple M2 Pro macOS 13.2 NEON BLAS small 8 221 975 09e9068
Apple M2 Pro macOS 13.2 NEON BLAS medium 8 969 2692 09e9068
Apple M2 Pro macOS 13.2 NEON BLAS large 8 1939 4959 09e9068
oceancloud82 commented 1 year ago

NVIDIA Jetson Nano, without GPU optimization: base-en

 ./bin/main -f samples/jfk.wav 
whisper_init_from_file_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

whisper_print_timings:     load time =   354.49 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   712.86 ms
whisper_print_timings:   sample time =    79.37 ms /    27 runs (    2.94 ms per run)
whisper_print_timings:   encode time = 24406.28 ms /     1 runs (24406.28 ms per run)
whisper_print_timings:   decode time =  1284.84 ms /    27 runs (   47.59 ms per run)
whisper_print_timings:    total time = 26908.31 ms

tiny-en

./bin/main -m ./models/ggml-tiny.en.bin  -f ./samples/jfk.wav 
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  127.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:07.740]   And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740]   ask what you can do for your country

whisper_print_timings:     load time =   204.60 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   564.90 ms
whisper_print_timings:   sample time =    72.13 ms /    26 runs (    2.77 ms per run)
whisper_print_timings:   encode time =  9232.34 ms /     1 runs ( 9232.34 ms per run)
whisper_print_timings:   decode time =   616.00 ms /    26 runs (   23.69 ms per run)
whisper_print_timings:    total time = 10745.65 ms
gkovacsp commented 1 year ago

MacBook Pro 14" with M2 Pro, 10 cores, 32 GB RAM, macOS Ventura 13.2
Benchmarks running at 8 threads
memcpy: 40.68 GB/s

| CPU          | OS     | Config     | Model    | Th | Load | Enc. | Commit  |
| ------------ | ------ | ---------- | -------- | -- | ---- | ---- | ------- |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | tiny     | 8  | 45   | 93   | 09e9068 |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | base     | 8  | 68   | 187  | 09e9068 |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | small    | 8  | 179  | 702  | 09e9068 |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | medium   | 8  | 496  | 2227 | 09e9068 |
| Apple M1 Pro | 13.2.1 |  NEON BLAS | large    | 8  | 1037 | 3796 | 09e9068 |

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat:    64 x    64: F16      4.6 GFLOPS (128 runs) / F32      4.1 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     46.6 GFLOPS (128 runs) / F32     36.4 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    294.2 GFLOPS (128 runs) / F32    238.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    611.0 GFLOPS (128 runs) / F32    712.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    770.9 GFLOPS (128 runs) / F32    700.3 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    902.8 GFLOPS ( 53 runs) / F32    906.9 GFLOPS ( 53 runs)
ggml_mul_mat:  4096 x  4096: F16   1521.2 GFLOPS ( 12 runs) / F32   1469.3 GFLOPS ( 11 runs)
clarsen commented 1 year ago

MacBook Pro 16" with M2 Max, 12 cores, 96 GB RAM, macOS Ventura 13.3
Benchmarks running at 4 threads (4 threads were faster than 8 threads for ggml_mul_mat, but about the same for model load/encode)
memcpy: 49.94 GB/s
sum:    ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16     11.2 GFLOPS (128 runs) / F32      9.3 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     83.0 GFLOPS (128 runs) / F32     73.7 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    505.2 GFLOPS (128 runs) / F32    488.2 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16   1018.0 GFLOPS (128 runs) / F32   1196.3 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16   1796.2 GFLOPS (128 runs) / F32   2087.4 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1638.8 GFLOPS ( 96 runs) / F32   1673.7 GFLOPS ( 98 runs)
ggml_mul_mat:  4096 x  4096: F16   1995.2 GFLOPS ( 15 runs) / F32   2037.8 GFLOPS ( 15 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Apple M2 Max 13.3 NEON BLAS tiny 4 41 118 0a2d121
Apple M2 Max 13.3 NEON BLAS base 4 61 230 0a2d121
Apple M2 Max 13.3 NEON BLAS small 4 153 734 0a2d121
Apple M2 Max 13.3 NEON BLAS medium 4 448 1979 0a2d121
Apple M2 Max 13.3 NEON BLAS large 4 882 3553 0a2d121
patsevanton commented 1 year ago

Running memcpy benchmark with 1 thread

memcpy: 7.03 GB/s
sum:    error -536870997.000000 (how do I fix this?)

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      8.9 GFLOPS (128 runs) / F32     10.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     53.3 GFLOPS (128 runs) / F32     47.9 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     91.7 GFLOPS (128 runs) / F32     99.4 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    134.2 GFLOPS (128 runs) / F32     94.8 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    182.9 GFLOPS ( 86 runs) / F32    121.2 GFLOPS ( 57 runs)
ggml_mul_mat:  2048 x  2048: F16    180.0 GFLOPS ( 11 runs) / F32     42.4 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     59.1 GFLOPS (  3 runs) / F32     31.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 tiny 4 69 495 0a2d121
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 base 4 111 1128 0a2d121
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 small 4 264 3992 0a2d121
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 medium 4 806 12230 0a2d121
Ryzen 7 PRO 5850U Ubuntu 22.04.2 AVX2 large 4 1919 25574 0a2d121
patsevanton commented 1 year ago

memcpy: 9.49 GB/s
sum:    error -536870997.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:    64 x    64: F16      8.8 GFLOPS (128 runs) / F32     10.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     35.4 GFLOPS (128 runs) / F32     49.2 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     61.9 GFLOPS (128 runs) / F32     95.1 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     64.3 GFLOPS (128 runs) / F32     86.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16     74.4 GFLOPS ( 35 runs) / F32     39.9 GFLOPS ( 19 runs)
ggml_mul_mat:  2048 x  2048: F16     56.9 GFLOPS (  4 runs) / F32     31.1 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     56.9 GFLOPS (  3 runs) / F32     30.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

CPU OS Config Model Th Load Enc. Commit
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 tiny 4 67 761 0a2d121
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 base 4 96 2040 0a2d121
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 small 4 239 7639 0a2d121
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 medium 4 657 23735 0a2d121
Ryzen 5 5500U Ubuntu 22.04.2 AVX2 large 4 1302 45006 0a2d121
espegro commented 1 year ago

HP Z440, Xeon E5-2690v4, 64 GB, Rocky Linux 9.1

memcpy: 10.94 GB/s
sum:    error -536870997.000000

./bench -w 2

ggml_mul_mat:    64 x    64: F16      4.8 GFLOPS (128 runs) / F32      4.8 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     23.1 GFLOPS (128 runs) / F32     18.7 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     52.5 GFLOPS (128 runs) / F32     35.1 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     69.6 GFLOPS (128 runs) / F32     44.4 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16     78.8 GFLOPS ( 37 runs) / F32     49.2 GFLOPS ( 23 runs)
ggml_mul_mat:  2048 x  2048: F16     83.6 GFLOPS (  5 runs) / F32     50.8 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     64.5 GFLOPS (  3 runs) / F32     21.8 GFLOPS (  3 runs)

system_info: n_threads = 28 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

whisper_print_timings: load time   = 1031.43 ms
whisper_print_timings: fallbacks   = 0 p / 0 h
whisper_print_timings: mel time    = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 13121.63 ms / 1 runs (13121.63 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time  = 14219.33 ms

model: large

montagao commented 1 year ago

very impressed

CPU OS Config Model Th Load Enc. Commit
MacBook M1 Max macOS 13.0 beta (22A5321d) NEON BLAS medium 8 488 2344 0a2d121
MacBook M1 Max macOS 13.0 beta (22A5321d) NEON BLAS large 8 1070 3209 0a2d121
jon-chuang commented 1 year ago

What am I doing wrong? Only 17.6 GFLOPS on a Ryzen 6850H.

WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
I whisper.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:  -lopenblas
I CC:       cc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
I CXX:      g++ (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0

make: 'bench' is up to date.
ggml_mul_mat:    64 x    64: F16     12.6 GFLOPS (128 runs) / F32      9.8 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     19.4 GFLOPS (128 runs) / F32     12.5 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     27.0 GFLOPS (128 runs) / F32     18.4 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     50.3 GFLOPS (128 runs) / F32     28.1 GFLOPS (105 runs)
ggml_mul_mat:  1024 x  1024: F16     59.0 GFLOPS ( 28 runs) / F32     27.0 GFLOPS ( 13 runs)
ggml_mul_mat:  2048 x  2048: F16     43.0 GFLOPS (  3 runs) / F32     11.4 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     17.6 GFLOPS (  3 runs) / F32      6.6 GFLOPS (  3 runs)
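Note the make: 'bench' is up to date. line above: the binary may not actually have been rebuilt with the OpenBLAS flags. A clean rebuild rules that out and also gives the non-BLAS baseline for comparison (a minimal sketch, assuming a clean checkout):

make clean
make -j bench          # default build, no OpenBLAS
./bench -w 2 -t 4      # try 4 threads instead of 1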
flexchar commented 1 year ago

MacBook Pro M2 Max 96 GB 16-inch, 2023 13.3.1 (22E261)

I tried running 8 and 12 threads; they were a few ms slower than 4 threads, so the default of 4 threads seems to be the sweet spot. I also have not compiled anything Apple-specific, just git clone and make.

> ./extra/bench-all.sh 8
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 50.22 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 8 threads

ggml_mul_mat:   64 x   64: F16      5.0 GFLOPS (128 runs) / F32      4.7 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: F16     46.1 GFLOPS (128 runs) / F32     38.3 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: F16    294.0 GFLOPS (128 runs) / F32    243.7 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: F16    574.5 GFLOPS (128 runs) / F32    272.9 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16    736.6 GFLOPS (128 runs) / F32    750.8 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16    973.7 GFLOPS ( 57 runs) / F32    993.7 GFLOPS ( 58 runs)
ggml_mul_mat: 4096 x 4096: F16   1554.5 GFLOPS ( 12 runs) / F32   1553.6 GFLOPS ( 12 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| | | NEON BLAS | tiny | 8 | 40 | 101 | c23588c |
| | | NEON BLAS | base | 8 | 61 | 223 | c23588c |
| | | NEON BLAS | small | 8 | 154 | 961 | c23588c |
| | | NEON BLAS | medium | 8 | 436 | 2534 | c23588c |
| | | NEON BLAS | large | 8 | 867 | 4100 | c23588c |
flexchar commented 1 year ago

Same hardware as in the post before. I've just tried converting the models to CoreML and here are the results. Subjectively, running STT with them felt very good, much faster.
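For anyone reproducing this, the conversion roughly follows the Core ML route from the README (a sketch; assumes the listed Python packages install cleanly):

pip install ane_transformers openai-whisper coremltools
./models/generate-coreml-model.sh base.en   # repeat per model
make clean
WHISPER_COREML=1 make -j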


./extra/bench-all.sh 4
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 49.33 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:   64 x   64: F16      9.1 GFLOPS (128 runs) / F32      8.2 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: F16     70.7 GFLOPS (128 runs) / F32     77.0 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: F16    350.7 GFLOPS (128 runs) / F32    435.9 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: F16   1060.0 GFLOPS (128 runs) / F32   1254.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16   1611.0 GFLOPS (128 runs) / F32   1652.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16   1887.2 GFLOPS (110 runs) / F32   1900.9 GFLOPS (111 runs)
ggml_mul_mat: 4096 x 4096: F16   1806.0 GFLOPS ( 14 runs) / F32   1849.3 GFLOPS ( 14 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| | | NEON BLAS COREML | tiny | 4 | 42 | 30 | c23588c |
| | | NEON BLAS COREML | base | 4 | 60 | 49 | c23588c |
| | | NEON BLAS COREML | small | 4 | 151 | 169 | c23588c |
| | | NEON BLAS COREML | medium | 4 | 430 | 737 | c23588c |
| | | NEON BLAS COREML | large | 4 | 885 | 1672 | c23588c |
StuartIanNaylor commented 1 year ago

Dell 3050 Micro

Running memcpy benchmark with 1 thread

memcpy: 11.49 GB/s
sum: error -536870997.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:   64 x   64: F16      7.7 GFLOPS (128 runs) / F32      3.3 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: F16     27.7 GFLOPS (128 runs) / F32      7.5 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: F16     50.8 GFLOPS (128 runs) / F32      8.8 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: F16     59.4 GFLOPS (128 runs) / F32      9.0 GFLOPS ( 34 runs)
ggml_mul_mat: 1024 x 1024: F16     51.5 GFLOPS ( 24 runs) / F32      8.4 GFLOPS (  4 runs)
ggml_mul_mat: 2048 x 2048: F16     46.3 GFLOPS (  3 runs) / F32      8.1 GFLOPS (  3 runs)
ggml_mul_mat: 4096 x 4096: F16     47.3 GFLOPS (  3 runs) / F32      8.1 GFLOPS (  3 runs)

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| i3-7100t | Ubuntu 22.04 | AVX2 | tiny | 4 | 84 | 1125 | c23588c |
| i3-7100t | Ubuntu 22.04 | AVX2 | base | 4 | 128 | 2616 | c23588c |
| i3-7100t | Ubuntu 22.04 | AVX2 | small | 4 | 339 | 10127 | c23588c |
| i3-7100t | Ubuntu 22.04 | AVX2 | medium | 4 | 991 | 39383 | c23588c |
| i3-7100t | Ubuntu 22.04 | AVX2 | large | 4 | 2922 | 74488 | c23588c |
j1nx commented 1 year ago

Lenovo thinkcentre m720q

Running memcpy benchmark with 1 thread

memcpy: 6.54 GB/s
sum: error -536870997.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:   64 x   64: F16      8.6 GFLOPS (128 runs) / F32      4.5 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: F16     38.8 GFLOPS (128 runs) / F32      7.9 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: F16     76.2 GFLOPS (128 runs) / F32      9.6 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: F16     87.4 GFLOPS (128 runs) / F32     10.0 GFLOPS ( 38 runs)
ggml_mul_mat: 1024 x 1024: F16     89.7 GFLOPS ( 42 runs) / F32     10.1 GFLOPS (  5 runs)
ggml_mul_mat: 2048 x 2048: F16     67.7 GFLOPS (  4 runs) / F32      9.1 GFLOPS (  3 runs)
ggml_mul_mat: 4096 x 4096: F16     54.7 GFLOPS (  3 runs) / F32      8.6 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| i5-8500T | OpenVoiceOS | AVX2 | tiny.en | 4 | 79 | 686 | 70567ef |
| i5-8500T | OpenVoiceOS | AVX2 | base.en | 4 | 121 | 1600 | 70567ef |
| i5-8500T | OpenVoiceOS | AVX2 | small.en | 4 | 320 | 6197 | 70567ef |
| i5-8500T | OpenVoiceOS | AVX2 | medium.en | 4 | 928 | 20276 | 70567ef |

Running memcpy benchmark with 1 thread

memcpy: 7.16 GB/s
sum: error -536870997.000000

Running ggml_mul_mat benchmark with 6 threads

ggml_mul_mat:   64 x   64: F16      1.9 GFLOPS (128 runs) / F32      1.8 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: F16     29.7 GFLOPS (128 runs) / F32      7.3 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: F16     65.5 GFLOPS (128 runs) / F32     14.5 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: F16    123.4 GFLOPS (128 runs) / F32     15.2 GFLOPS ( 57 runs)
ggml_mul_mat: 1024 x 1024: F16    127.5 GFLOPS ( 60 runs) / F32     14.7 GFLOPS (  7 runs)
ggml_mul_mat: 2048 x 2048: F16     93.3 GFLOPS (  6 runs) / F32     13.3 GFLOPS (  3 runs)
ggml_mul_mat: 4096 x 4096: F16     70.0 GFLOPS (  3 runs) / F32     12.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| i5-8500T | OpenVoiceOS | AVX2 | tiny.en | 6 | 78 | 511 | 70567ef |
| i5-8500T | OpenVoiceOS | AVX2 | base.en | 6 | 118 | 1264 | 70567ef |
| i5-8500T | OpenVoiceOS | AVX2 | small.en | 6 | 320 | 4587 | 70567ef |
| i5-8500T | OpenVoiceOS | AVX2 | medium.en | 6 | 928 | 16303 | 70567ef |
emcodem commented 1 year ago

Yet another M1 Ultra, but look at the bottom for a comparison to the Const-me GPU version.

memcpy: 42.66 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:   64 x   64: F16      9.1 GFLOPS (128 runs) / F32      7.1 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: F16     68.2 GFLOPS (128 runs) / F32     68.5 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: F16    465.0 GFLOPS (128 runs) / F32    386.2 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: F16   1131.9 GFLOPS (128 runs) / F32   1437.0 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16   2188.9 GFLOPS (128 runs) / F32   2519.6 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16   2938.8 GFLOPS (128 runs) / F32   2996.5 GFLOPS (128 runs)
ggml_mul_mat: 4096 x 4096: F16   3074.7 GFLOPS ( 23 runs) / F32   3167.2 GFLOPS ( 24 runs)

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| M1 Ultra | Ventura 13.3.1 | NEON BLAS | large | 4 | 858 | 3649 | 70567ef |

Much more interesting is the comparison I did against a Win10 Core i9 9900K with an Nvidia A4000 running the Const-me version. I used a 10-minute portion of a "real" TV show (-l de, about 56k tokens known in the model). Note that the power consumption was actually measured, not just estimated.

Const-me whisper GPU (~450-550 W real power consumption at 100% GPU utilisation, CPU mostly idle):

A4000 1x parallel: 93 s
A4000 2x parallel: both finish at 180 s
A4000 4x parallel: 3 finish after 317 s, 1 finishes at 453 s

macOS, M1 Ultra (70-90 W real power consumption at 100% "CPU" utilisation), whisper.cpp with default settings, 1 core, 4 threads:

macOS 1x: 155 s
macOS 2x parallel: 196 s, all finish at the same time
macOS 4x parallel: 274 s, all finish at the same time
macOS 6x parallel: 462 s, all finish at the same time

Also some other tests with different command-line params, on the M1 only, with one file (a sketch of a full invocation follows this list):

-p 8 (threads default 4): 120.3 s, system unresponsive while processing
-p 4 (default threads 4, ~80% CPU utilisation): 79.37545 s
-bs 2 -p 4: 101.01730 s
-t 16 (processors default 1): 148.713 s
-p 8 -t 2: 98.91152 s
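For reference, a single run with these kinds of settings looks roughly like this (a sketch; the model path and audio file are placeholders, -l de as in the test above):

./main -m models/ggml-large.bin -f show.wav -l de -p 4 -t 4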

We currently use the Const-me GPU version on an Nvidia A5000 because, on an Intel CPU, it delivers much faster results than this cpp version can. On the other hand, the Const-me version does not seem to be going anywhere, while this repository is vibrant.

As a conclusion I can say that, even if I hate it, we are buying this Mac because it delivers faster results and more throughput while consuming only 20% of the power. It also distributes processing power between multiple parallel processes much better; I bet I can even use nice to assign priorities, whereas on the GPU no priorities are possible at all.

At our usage volume that means we will have saved the full cost of the Mac (~4000 euros) after 2-3 years of operation (due to lower power and A/C costs) compared to running it on the Windows/GPU box, which we bought for about the same initial price. Even though I could now safely say we don't need an A5000 but just some gamer card for 600 euros, looking at power costs these days I'd still prefer the Mac. (Thank god I don't need to put it into Active Directory or the like, so I have an easy time just using it as a headless processing machine.)

StuartIanNaylor commented 1 year ago

It would be great if idle/peak watts could be posted. I have been posting benches for RK3588 devices, which probably give the minimum usable results, and even then a tad slow. In that price range I just posted an i3-7100T that was picked up for £64 off eBay and runs at approx 8 watts idle / 30 peak. I used to be a bit of an Apple hater in terms of bling tech, but bang for buck the M1 Mini is surprisingly good value, and in a race-till-idle setup it could likely process quite a number of zones, especially given the diversification of use.

I am on disability, so even though it is relatively cheap, the £849.00 for the 16 GB version could probably be the basis of the ultimate home assistant, something similar to https://github.com/ggerganov/whisper.cpp/blob/master/examples/talk-llama/talk-llama.cpp. So likely I will continue posting in the £64 range :)

But what Apple/Arm currently provide per watt is pretty special, and for 24/365 operation in an energy-expensive world that is pretty important. I don't know how many people could also post idle and peak wattages, but it would be really interesting, especially for CPU vs GPU, rather than just outright speed.

StuartIanNaylor commented 1 year ago

Rock 5b

memcpy: 8.78 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.2 GFLOPS (128 runs) | Q4_1     7.6 GFLOPS (128 runs) | Q4_2     6.9 GFLOPS (128 runs)
  64 x   64: Q5_0     6.8 GFLOPS (128 runs) | Q5_1     7.0 GFLOPS (128 runs) | Q8_0     7.1 GFLOPS (128 runs)
  64 x   64: F16      8.6 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 128 x  128: Q4_0    22.8 GFLOPS (128 runs) | Q4_1    22.4 GFLOPS (128 runs) | Q4_2    19.6 GFLOPS (128 runs)
 128 x  128: Q5_0    19.5 GFLOPS (128 runs) | Q5_1    20.7 GFLOPS (128 runs) | Q8_0    22.7 GFLOPS (128 runs)
 128 x  128: F16     28.3 GFLOPS (128 runs) | F32     29.4 GFLOPS (128 runs)
 256 x  256: Q4_0    40.6 GFLOPS (128 runs) | Q4_1    37.6 GFLOPS (128 runs) | Q4_2    30.5 GFLOPS (128 runs)
 256 x  256: Q5_0    31.2 GFLOPS (128 runs) | Q5_1    31.9 GFLOPS (128 runs) | Q8_0    49.1 GFLOPS (128 runs)
 256 x  256: F16     51.8 GFLOPS (128 runs) | F32     36.9 GFLOPS (128 runs)
 512 x  512: Q4_0    52.0 GFLOPS (128 runs) | Q4_1    45.4 GFLOPS (128 runs) | Q4_2    35.7 GFLOPS (128 runs)
 512 x  512: Q5_0    37.4 GFLOPS (128 runs) | Q5_1    36.9 GFLOPS (128 runs) | Q8_0    64.9 GFLOPS (128 runs)
 512 x  512: F16     76.9 GFLOPS (128 runs) | F32     30.7 GFLOPS (115 runs)
1024 x 1024: Q4_0    56.6 GFLOPS ( 27 runs) | Q4_1    47.5 GFLOPS ( 23 runs) | Q4_2    37.5 GFLOPS ( 18 runs)
1024 x 1024: Q5_0    39.5 GFLOPS ( 19 runs) | Q5_1    37.7 GFLOPS ( 18 runs) | Q8_0    71.1 GFLOPS ( 34 runs)
1024 x 1024: F16     49.0 GFLOPS ( 23 runs) | F32     22.4 GFLOPS ( 11 runs)
2048 x 2048: Q4_0    54.2 GFLOPS (  4 runs) | Q4_1    44.6 GFLOPS (  3 runs) | Q4_2    38.5 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.4 GFLOPS (  3 runs) | Q5_1    35.5 GFLOPS (  3 runs) | Q8_0    61.0 GFLOPS (  4 runs)
2048 x 2048: F16     41.3 GFLOPS (  3 runs) | F32     19.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    56.2 GFLOPS (  3 runs) | Q4_1    45.4 GFLOPS (  3 runs) | Q4_2    38.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.7 GFLOPS (  3 runs) | Q5_1    37.3 GFLOPS (  3 runs) | Q8_0    63.2 GFLOPS (  3 runs)
4096 x 4096: F16     40.0 GFLOPS (  3 runs) | F32     17.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | tiny | 4 | 102 | 1191 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | base | 4 | 140 | 2861 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | small | 4 | 393 | 10576 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | medium | 4 | 10289 | 36042 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS |  NEON | large | 4 | 2099 | 70740 | be5911a |
fquirin commented 1 year ago

How do you get these numbers @StuartIanNaylor ? 😲 Isn't the Rock 5b basically the same as the Orange Pi 5?

Orange Pi 5 8GB:

Running memcpy benchmark

memcpy: 10.14 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     4.7 GFLOPS (128 runs) | Q4_1     4.8 GFLOPS (128 runs) | Q4_2     4.6 GFLOPS (128 runs)
  64 x   64: Q5_0     4.2 GFLOPS (128 runs) | Q5_1     4.4 GFLOPS (128 runs) | Q8_0     4.4 GFLOPS (128 runs)
  64 x   64: F16      4.8 GFLOPS (128 runs) | F32      4.4 GFLOPS (128 runs)
 128 x  128: Q4_0     4.2 GFLOPS (128 runs) | Q4_1     9.8 GFLOPS (128 runs) | Q4_2    10.0 GFLOPS (128 runs)
 128 x  128: Q5_0     8.4 GFLOPS (128 runs) | Q5_1     8.2 GFLOPS (128 runs) | Q8_0    10.3 GFLOPS (128 runs)
 128 x  128: F16     10.3 GFLOPS (128 runs) | F32     10.7 GFLOPS (128 runs)
 256 x  256: Q4_0    34.7 GFLOPS (128 runs) | Q4_1    34.9 GFLOPS (128 runs) | Q4_2    33.9 GFLOPS (128 runs)
 256 x  256: Q5_0    26.2 GFLOPS (128 runs) | Q5_1    24.9 GFLOPS (128 runs) | Q8_0    36.1 GFLOPS (128 runs)
 256 x  256: F16     36.4 GFLOPS (128 runs) | F32     38.4 GFLOPS (128 runs)
 512 x  512: Q4_0    22.2 GFLOPS ( 83 runs) | Q4_1    26.1 GFLOPS ( 98 runs) | Q4_2    35.5 GFLOPS (128 runs)
 512 x  512: Q5_0    42.4 GFLOPS (128 runs) | Q5_1    26.8 GFLOPS (100 runs) | Q8_0    35.8 GFLOPS (128 runs)
 512 x  512: F16     21.6 GFLOPS ( 81 runs) | F32     31.5 GFLOPS (118 runs)
1024 x 1024: Q4_0    32.4 GFLOPS ( 16 runs) | Q4_1    44.1 GFLOPS ( 21 runs) | Q4_2    39.7 GFLOPS ( 19 runs)
1024 x 1024: Q5_0    42.3 GFLOPS ( 20 runs) | Q5_1    40.4 GFLOPS ( 20 runs) | Q8_0    41.2 GFLOPS ( 20 runs)
1024 x 1024: F16     46.8 GFLOPS ( 22 runs) | F32     42.1 GFLOPS ( 20 runs)
2048 x 2048: Q4_0    50.9 GFLOPS (  4 runs) | Q4_1    48.6 GFLOPS (  3 runs) | Q4_2    48.0 GFLOPS (  3 runs)
2048 x 2048: Q5_0    46.7 GFLOPS (  3 runs) | Q5_1    47.8 GFLOPS (  3 runs) | Q8_0    46.4 GFLOPS (  3 runs)
2048 x 2048: F16     46.1 GFLOPS (  3 runs) | F32     44.8 GFLOPS (  3 runs)
4096 x 4096: Q4_0    42.2 GFLOPS (  3 runs) | Q4_1    36.7 GFLOPS (  3 runs) | Q4_2    33.0 GFLOPS (  3 runs)
4096 x 4096: Q5_0    38.5 GFLOPS (  3 runs) | Q5_1    44.7 GFLOPS (  3 runs) | Q8_0    44.7 GFLOPS (  3 runs)
4096 x 4096: F16     44.4 GFLOPS (  3 runs) | F32     44.5 GFLOPS (  3 runs)

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| RK3588S | Armbian 11 - 5.10.110 | NEON BLAS | tiny | 4 | 193 | 3748 | be5911a |
| RK3588S | Armbian 11 - 5.10.110 | NEON BLAS | tiny-q5_0 | 4 | 156 | 3341 | be5911a |
| RK3588S | Armbian 11 - 5.10.110 | NEON BLAS | base | 4 | 253 | 7359 | be5911a |
| RK3588S | Armbian 11 - 5.10.110 | NEON BLAS | base-q5_0 | 4 | 178 | 7307 | be5911a |

[EDIT: a bit better without OpenBLAS although the GFLOPS are considerably lower O_o]

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| RK3588S | Armbian 11 - 5.10.110 | NEON | tiny | 4 | 111 | 3170 | be5911a |
| RK3588S | Armbian 11 - 5.10.110 | NEON | tiny-q5_0 | 4 | 205 | 2817 | be5911a |
| RK3588S | Armbian 11 - 5.10.110 | NEON | base | 4 | 248 | 6385 | be5911a |
| RK3588S | Armbian 11 - 5.10.110 | NEON | base-q5_0 | 4 | 140 | 6198 | be5911a |

[EDIT2: getting very unstable results right now 🤔 ]

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| RK3588S | Armbian 11 - 5.10.110 | NEON | tiny | 4 | 269 | 1722 | be5911a |
| RK3588S | Armbian 11 - 5.10.110 | NEON | tiny-q5_0 | 4 | 104 | 2746 | be5911a |
| RK3588S | Armbian 11 - 5.10.110 | NEON | base | 4 | 243 | 7063 | be5911a |
| RK3588S | Armbian 11 - 5.10.110 | NEON | base-q5_0 | 4 | 135 | 6516 | be5911a |
StuartIanNaylor commented 1 year ago

Likely because I don't use Armbian but the server image supplied by Radxa, and likewise the OPi version. Generally I stay clear of Armbian due to a pet hate of their epic init script, which replaces standard installs and /etc and often blindsides me.

I added some tricks and tips I gathered when Radxa did a community board bring-up. I have changed my preference for the scheduler and set it to performance, and also, I don't know why, but using taskset to make sure it only uses the big cores gives a slight perf boost.

So running again I get

memcpy: 8.56 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.3 GFLOPS (128 runs) | Q4_1     7.8 GFLOPS (128 runs) | Q4_2     6.9 GFLOPS (128 runs)
  64 x   64: Q5_0     6.2 GFLOPS (128 runs) | Q5_1     6.7 GFLOPS (128 runs) | Q8_0     7.0 GFLOPS (128 runs)
  64 x   64: F16      2.4 GFLOPS (128 runs) | F32      8.5 GFLOPS (128 runs)
 128 x  128: Q4_0    23.2 GFLOPS (128 runs) | Q4_1    24.1 GFLOPS (128 runs) | Q4_2    19.9 GFLOPS (128 runs)
 128 x  128: Q5_0    15.4 GFLOPS (128 runs) | Q5_1    21.0 GFLOPS (128 runs) | Q8_0    26.6 GFLOPS (128 runs)
 128 x  128: F16     35.0 GFLOPS (128 runs) | F32     28.6 GFLOPS (128 runs)
 256 x  256: Q4_0    41.2 GFLOPS (128 runs) | Q4_1    38.7 GFLOPS (128 runs) | Q4_2    30.5 GFLOPS (128 runs)
 256 x  256: Q5_0    31.2 GFLOPS (128 runs) | Q5_1    31.9 GFLOPS (128 runs) | Q8_0    49.1 GFLOPS (128 runs)
 256 x  256: F16     65.0 GFLOPS (128 runs) | F32     43.5 GFLOPS (128 runs)
 512 x  512: Q4_0    52.0 GFLOPS (128 runs) | Q4_1    45.4 GFLOPS (128 runs) | Q4_2    35.3 GFLOPS (128 runs)
 512 x  512: Q5_0    37.4 GFLOPS (128 runs) | Q5_1    36.8 GFLOPS (128 runs) | Q8_0    64.9 GFLOPS (128 runs)
 512 x  512: F16     78.1 GFLOPS (128 runs) | F32     30.6 GFLOPS (114 runs)
1024 x 1024: Q4_0    56.4 GFLOPS ( 27 runs) | Q4_1    47.4 GFLOPS ( 23 runs) | Q4_2    37.5 GFLOPS ( 18 runs)
1024 x 1024: Q5_0    39.5 GFLOPS ( 19 runs) | Q5_1    37.7 GFLOPS ( 18 runs) | Q8_0    70.8 GFLOPS ( 33 runs)
1024 x 1024: F16     47.2 GFLOPS ( 22 runs) | F32     21.8 GFLOPS ( 11 runs)
2048 x 2048: Q4_0    54.4 GFLOPS (  4 runs) | Q4_1    45.3 GFLOPS (  3 runs) | Q4_2    38.6 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.4 GFLOPS (  3 runs) | Q5_1    35.6 GFLOPS (  3 runs) | Q8_0    59.8 GFLOPS (  4 runs)
2048 x 2048: F16     41.2 GFLOPS (  3 runs) | F32     20.6 GFLOPS (  3 runs)
4096 x 4096: Q4_0    56.9 GFLOPS (  3 runs) | Q4_1    46.6 GFLOPS (  3 runs) | Q4_2    38.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    41.1 GFLOPS (  3 runs) | Q5_1    37.4 GFLOPS (  3 runs) | Q8_0    62.9 GFLOPS (  3 runs)
4096 x 4096: F16     39.8 GFLOPS (  3 runs) | F32     17.6 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 96 | 1199 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 137 | 2875 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 343 | 10635 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 1013 | 35174 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 2019 | 71678 | be5911a |

If I run without first doing echo performance | tee /sys/bus/cpu/devices/cpu[046]/cpufreq/scaling_governor /sys/class/devfreq/dmc/governor (the rk3588[x] is a tri-cluster 4-2-2 part; I don't know what the dmc governor does exactly, but it was something we were using at the time), and without the taskset -c 4-7 prefix that further enforces not using the efficiency cores, performance drops; the tuned sequence is spelled out just below.
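For clarity, the full tuned invocation (a sketch; sudo added for the sysfs writes, core numbering as on this board):

echo performance | sudo tee /sys/bus/cpu/devices/cpu[046]/cpufreq/scaling_governor /sys/class/devfreq/dmc/governor
taskset -c 4-7 ./extra/bench-all.sh 4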

The ondemand governor seems to load-balance, whereas for Whisper.cpp at least, a race-till-idle setup, more like how Android is configured, does seem to give a perf boost with little or no loss in efficiency.

Without the tuning, bench gives:

memcpy: 7.82 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.1 GFLOPS (128 runs) | Q4_1     2.8 GFLOPS (128 runs) | Q4_2     2.4 GFLOPS (128 runs)
  64 x   64: Q5_0     2.3 GFLOPS (128 runs) | Q5_1     2.2 GFLOPS (128 runs) | Q8_0     2.7 GFLOPS (128 runs)
  64 x   64: F16      3.1 GFLOPS (128 runs) | F32      2.6 GFLOPS (128 runs)
 128 x  128: Q4_0     7.1 GFLOPS (128 runs) | Q4_1     7.0 GFLOPS (128 runs) | Q4_2     6.2 GFLOPS (128 runs)
 128 x  128: Q5_0     5.4 GFLOPS (128 runs) | Q5_1     5.4 GFLOPS (128 runs) | Q8_0     7.2 GFLOPS (128 runs)
 128 x  128: F16      9.3 GFLOPS (128 runs) | F32      5.9 GFLOPS (128 runs)
 256 x  256: Q4_0    10.1 GFLOPS (128 runs) | Q4_1     9.5 GFLOPS (128 runs) | Q4_2     8.4 GFLOPS (128 runs)
 256 x  256: Q5_0     7.4 GFLOPS (128 runs) | Q5_1     6.9 GFLOPS (128 runs) | Q8_0    10.9 GFLOPS (128 runs)
 256 x  256: F16     13.4 GFLOPS (128 runs) | F32      7.9 GFLOPS (128 runs)
 512 x  512: Q4_0    10.9 GFLOPS ( 41 runs) | Q4_1    10.4 GFLOPS ( 39 runs) | Q4_2     8.5 GFLOPS ( 32 runs)
 512 x  512: Q5_0     8.9 GFLOPS ( 34 runs) | Q5_1     8.2 GFLOPS ( 31 runs) | Q8_0    12.1 GFLOPS ( 46 runs)
 512 x  512: F16     14.5 GFLOPS ( 54 runs) | F32      8.7 GFLOPS ( 33 runs)
1024 x 1024: Q4_0    26.9 GFLOPS ( 13 runs) | Q4_1    24.9 GFLOPS ( 12 runs) | Q4_2    21.7 GFLOPS ( 11 runs)
1024 x 1024: Q5_0    23.0 GFLOPS ( 11 runs) | Q5_1    22.0 GFLOPS ( 11 runs) | Q8_0    29.1 GFLOPS ( 14 runs)
1024 x 1024: F16     28.2 GFLOPS ( 14 runs) | F32     17.9 GFLOPS (  9 runs)
2048 x 2048: Q4_0    50.1 GFLOPS (  3 runs) | Q4_1    41.3 GFLOPS (  3 runs) | Q4_2    36.7 GFLOPS (  3 runs)
2048 x 2048: Q5_0    36.0 GFLOPS (  3 runs) | Q5_1    33.2 GFLOPS (  3 runs) | Q8_0    53.7 GFLOPS (  4 runs)
2048 x 2048: F16     37.5 GFLOPS (  3 runs) | F32     19.3 GFLOPS (  3 runs)
4096 x 4096: Q4_0    55.7 GFLOPS (  3 runs) | Q4_1    43.7 GFLOPS (  3 runs) | Q4_2    39.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.5 GFLOPS (  3 runs) | Q5_1    36.1 GFLOPS (  3 runs) | Q8_0    65.8 GFLOPS (  3 runs)
4096 x 4096: F16     36.8 GFLOPS (  3 runs) | F32     18.5 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 171 | 1817 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 255 | 3529 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 433 | 11208 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 1814 | 36829 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 36647 | 71393 | be5911a |

I will tack the OPi 5 on next, as I think it is a smidge faster. So, without the tuning again:

memcpy: 8.26 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     3.1 GFLOPS (128 runs) | Q4_1     3.3 GFLOPS (128 runs) | Q4_2     3.4 GFLOPS (128 runs)
  64 x   64: Q5_0     1.7 GFLOPS (128 runs) | Q5_1     3.1 GFLOPS (128 runs) | Q8_0     2.9 GFLOPS (128 runs)
  64 x   64: F16      4.0 GFLOPS (128 runs) | F32      3.5 GFLOPS (128 runs)
 128 x  128: Q4_0     7.8 GFLOPS (128 runs) | Q4_1     6.6 GFLOPS (128 runs) | Q4_2     6.7 GFLOPS (128 runs)
 128 x  128: Q5_0     5.6 GFLOPS (128 runs) | Q5_1     5.4 GFLOPS (128 runs) | Q8_0     8.7 GFLOPS (128 runs)
 128 x  128: F16     10.1 GFLOPS (128 runs) | F32      6.3 GFLOPS (128 runs)
 256 x  256: Q4_0    10.5 GFLOPS (128 runs) | Q4_1     9.1 GFLOPS (128 runs) | Q4_2     7.9 GFLOPS (128 runs)
 256 x  256: Q5_0     7.0 GFLOPS (128 runs) | Q5_1     6.7 GFLOPS (128 runs) | Q8_0    12.6 GFLOPS (128 runs)
 256 x  256: F16     12.6 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 512 x  512: Q4_0    11.9 GFLOPS ( 45 runs) | Q4_1    10.8 GFLOPS ( 41 runs) | Q4_2    10.0 GFLOPS ( 38 runs)
 512 x  512: Q5_0     8.5 GFLOPS ( 32 runs) | Q5_1     7.9 GFLOPS ( 30 runs) | Q8_0    14.5 GFLOPS ( 54 runs)
 512 x  512: F16     14.2 GFLOPS ( 53 runs) | F32      8.3 GFLOPS ( 32 runs)
1024 x 1024: Q4_0    30.4 GFLOPS ( 15 runs) | Q4_1    28.9 GFLOPS ( 14 runs) | Q4_2    23.6 GFLOPS ( 11 runs)
1024 x 1024: Q5_0    23.0 GFLOPS ( 11 runs) | Q5_1    23.5 GFLOPS ( 12 runs) | Q8_0    37.4 GFLOPS ( 18 runs)
1024 x 1024: F16     33.9 GFLOPS ( 16 runs) | F32     18.0 GFLOPS (  9 runs)
2048 x 2048: Q4_0    51.4 GFLOPS (  4 runs) | Q4_1    42.5 GFLOPS (  3 runs) | Q4_2    36.5 GFLOPS (  3 runs)
2048 x 2048: Q5_0    36.0 GFLOPS (  3 runs) | Q5_1    32.7 GFLOPS (  3 runs) | Q8_0    59.0 GFLOPS (  4 runs)
2048 x 2048: F16     39.4 GFLOPS (  3 runs) | F32     17.5 GFLOPS (  3 runs)
4096 x 4096: Q4_0    58.8 GFLOPS (  3 runs) | Q4_1    47.0 GFLOPS (  3 runs) | Q4_2    39.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.8 GFLOPS (  3 runs) | Q5_1    37.3 GFLOPS (  3 runs) | Q8_0    65.1 GFLOPS (  3 runs)
4096 x 4096: F16     40.6 GFLOPS (  3 runs) | F32     18.6 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 133 | 1235 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 232 | 2941 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 470 | 10870 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 23195 | 36162 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 46511 | 90187 | be5911a |

Then, via sudo orangepi-config, I set the performance governor (no dmc here) and ran taskset -c 4-7 ./extra/bench-all.sh:

memcpy: 8.22 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     0.7 GFLOPS (128 runs) | Q4_1     1.6 GFLOPS (128 runs) | Q4_2     1.0 GFLOPS (128 runs)
  64 x   64: Q5_0     0.6 GFLOPS (128 runs) | Q5_1     0.8 GFLOPS (128 runs) | Q8_0     1.4 GFLOPS (128 runs)
  64 x   64: F16      1.9 GFLOPS (128 runs) | F32      0.8 GFLOPS (128 runs)
 128 x  128: Q4_0     8.9 GFLOPS (128 runs) | Q4_1     3.8 GFLOPS (128 runs) | Q4_2     3.1 GFLOPS (128 runs)
 128 x  128: Q5_0     5.8 GFLOPS (128 runs) | Q5_1     3.8 GFLOPS (128 runs) | Q8_0     7.8 GFLOPS (128 runs)
 128 x  128: F16      5.2 GFLOPS (128 runs) | F32      3.6 GFLOPS (128 runs)
 256 x  256: Q4_0    13.1 GFLOPS (128 runs) | Q4_1    12.1 GFLOPS (128 runs) | Q4_2    12.1 GFLOPS (128 runs)
 256 x  256: Q5_0    12.8 GFLOPS (128 runs) | Q5_1    13.4 GFLOPS (128 runs) | Q8_0    17.9 GFLOPS (128 runs)
 256 x  256: F16     17.6 GFLOPS (128 runs) | F32     11.0 GFLOPS (128 runs)
 512 x  512: Q4_0    33.3 GFLOPS (125 runs) | Q4_1    34.7 GFLOPS (128 runs) | Q4_2    21.9 GFLOPS ( 82 runs)
 512 x  512: Q5_0    21.4 GFLOPS ( 80 runs) | Q5_1    22.4 GFLOPS ( 84 runs) | Q8_0    35.2 GFLOPS (128 runs)
 512 x  512: F16     37.1 GFLOPS (128 runs) | F32     23.2 GFLOPS ( 87 runs)
1024 x 1024: Q4_0    54.9 GFLOPS ( 26 runs) | Q4_1    44.3 GFLOPS ( 21 runs) | Q4_2    31.4 GFLOPS ( 15 runs)
1024 x 1024: Q5_0    35.7 GFLOPS ( 17 runs) | Q5_1    32.1 GFLOPS ( 15 runs) | Q8_0    66.5 GFLOPS ( 31 runs)
1024 x 1024: F16     45.0 GFLOPS ( 21 runs) | F32     19.6 GFLOPS ( 10 runs)
2048 x 2048: Q4_0    54.6 GFLOPS (  4 runs) | Q4_1    45.2 GFLOPS (  3 runs) | Q4_2    38.4 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.9 GFLOPS (  3 runs) | Q5_1    34.7 GFLOPS (  3 runs) | Q8_0    59.9 GFLOPS (  4 runs)
2048 x 2048: F16     40.5 GFLOPS (  3 runs) | F32     20.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    59.5 GFLOPS (  3 runs) | Q4_1    47.7 GFLOPS (  3 runs) | Q4_2    40.1 GFLOPS (  3 runs)
4096 x 4096: Q5_0    42.7 GFLOPS (  3 runs) | Q5_1    39.6 GFLOPS (  3 runs) | Q8_0    60.7 GFLOPS (  3 runs)
4096 x 4096: F16     35.5 GFLOPS (  3 runs) | F32     20.8 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 119 | 1178 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 168 | 2910 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 399 | 10784 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 23469 | 35952 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 47147 | 76405 | be5911a |

I ran that again, as I think transformers do bounce around a bit to end up with the same tokens.

memcpy: 9.46 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     7.1 GFLOPS (128 runs) | Q4_1     7.6 GFLOPS (128 runs) | Q4_2     6.6 GFLOPS (128 runs)
  64 x   64: Q5_0     6.3 GFLOPS (128 runs) | Q5_1     6.9 GFLOPS (128 runs) | Q8_0     6.6 GFLOPS (128 runs)
  64 x   64: F16      7.8 GFLOPS (128 runs) | F32      7.3 GFLOPS (128 runs)
 128 x  128: Q4_0    23.8 GFLOPS (128 runs) | Q4_1    25.0 GFLOPS (128 runs) | Q4_2     8.5 GFLOPS (128 runs)
 128 x  128: Q5_0    19.1 GFLOPS (128 runs) | Q5_1    20.8 GFLOPS (128 runs) | Q8_0    26.4 GFLOPS (128 runs)
 128 x  128: F16     34.8 GFLOPS (128 runs) | F32     28.6 GFLOPS (128 runs)
 256 x  256: Q4_0    43.4 GFLOPS (128 runs) | Q4_1    42.0 GFLOPS (128 runs) | Q4_2    31.3 GFLOPS (128 runs)
 256 x  256: Q5_0    30.5 GFLOPS (128 runs) | Q5_1    32.0 GFLOPS (128 runs) | Q8_0    41.7 GFLOPS (128 runs)
 256 x  256: F16     60.0 GFLOPS (128 runs) | F32     42.9 GFLOPS (128 runs)
 512 x  512: Q4_0    56.5 GFLOPS (128 runs) | Q4_1    49.5 GFLOPS (128 runs) | Q4_2    36.6 GFLOPS (128 runs)
 512 x  512: Q5_0    36.7 GFLOPS (128 runs) | Q5_1    36.8 GFLOPS (128 runs) | Q8_0    69.9 GFLOPS (128 runs)
 512 x  512: F16     78.5 GFLOPS (128 runs) | F32     30.1 GFLOPS (113 runs)
1024 x 1024: Q4_0    62.7 GFLOPS ( 30 runs) | Q4_1    52.2 GFLOPS ( 25 runs) | Q4_2    38.9 GFLOPS ( 19 runs)
1024 x 1024: Q5_0    39.2 GFLOPS ( 19 runs) | Q5_1    38.2 GFLOPS ( 18 runs) | Q8_0    76.2 GFLOPS ( 36 runs)
1024 x 1024: F16     46.7 GFLOPS ( 22 runs) | F32     21.6 GFLOPS ( 11 runs)
2048 x 2048: Q4_0    60.4 GFLOPS (  4 runs) | Q4_1    50.3 GFLOPS (  3 runs) | Q4_2    39.6 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.9 GFLOPS (  3 runs) | Q5_1    35.4 GFLOPS (  3 runs) | Q8_0    66.5 GFLOPS (  4 runs)
2048 x 2048: F16     33.8 GFLOPS (  3 runs) | F32     15.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    64.2 GFLOPS (  3 runs) | Q4_1    51.2 GFLOPS (  3 runs) | Q4_2    40.2 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.7 GFLOPS (  3 runs) | Q5_1    37.2 GFLOPS (  3 runs) | Q8_0    71.5 GFLOPS (  3 runs)
4096 x 4096: F16     38.5 GFLOPS (  3 runs) | F32     20.3 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 103 | 1166 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 152 | 2888 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 379 | 10892 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 22649 | 35767 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 45427 | 73967 | be5911a |

But I don't seem to get that much variance; race-till-idle is just a preference.

fquirin commented 1 year ago

> Prefix (taskset -c 4-7) to further enforce not using the efficiency cores.

I tried that and played with the CPU settings (performance mode etc.), even added some better cooling, but it still keeps jumping all over the place, with the tiny model at ~2 s in the good runs, while 'htop' shows a consistent 100% load on the performance cores. Q5 models are sometimes a few ms faster, sometimes slower. When I do the same tests with the CTranslate2 Whisper version, the results are pretty stable and always about twice as fast.
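One way to quantify the jitter is to repeat the benchmark a few times on the pinned big cores (a minimal sketch; the model path is an assumption, core numbering as suggested above):

for i in 1 2 3; do taskset -c 4-7 ./bench -m models/ggml-tiny.bin -t 4; done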

StuartIanNaylor commented 1 year ago

Dunno, just to show that the next run is very consistent and considerably faster... ?

memcpy: 10.52 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     2.5 GFLOPS (128 runs) | Q4_1     2.5 GFLOPS (128 runs) | Q4_2     1.3 GFLOPS (128 runs)
  64 x   64: Q5_0     1.0 GFLOPS (128 runs) | Q5_1     0.6 GFLOPS (128 runs) | Q8_0     0.8 GFLOPS (128 runs)
  64 x   64: F16      1.0 GFLOPS (128 runs) | F32      1.8 GFLOPS (128 runs)
 128 x  128: Q4_0     2.8 GFLOPS (128 runs) | Q4_1     2.2 GFLOPS (128 runs) | Q4_2     6.7 GFLOPS (128 runs)
 128 x  128: Q5_0     3.2 GFLOPS (128 runs) | Q5_1     5.5 GFLOPS (128 runs) | Q8_0     3.0 GFLOPS (128 runs)
 128 x  128: F16     11.2 GFLOPS (128 runs) | F32      8.5 GFLOPS (128 runs)
 256 x  256: Q4_0    13.5 GFLOPS (128 runs) | Q4_1     8.8 GFLOPS (128 runs) | Q4_2     9.9 GFLOPS (128 runs)
 256 x  256: Q5_0    10.7 GFLOPS (128 runs) | Q5_1     6.7 GFLOPS (128 runs) | Q8_0     7.3 GFLOPS (128 runs)
 256 x  256: F16     18.3 GFLOPS (128 runs) | F32     10.1 GFLOPS (128 runs)
 512 x  512: Q4_0    36.4 GFLOPS (128 runs) | Q4_1    31.2 GFLOPS (117 runs) | Q4_2    19.0 GFLOPS ( 71 runs)
 512 x  512: Q5_0    18.5 GFLOPS ( 69 runs) | Q5_1    20.4 GFLOPS ( 77 runs) | Q8_0    30.7 GFLOPS (115 runs)
 512 x  512: F16     33.8 GFLOPS (126 runs) | F32     20.7 GFLOPS ( 79 runs)
1024 x 1024: Q4_0    40.0 GFLOPS ( 19 runs) | Q4_1    36.4 GFLOPS ( 18 runs) | Q4_2    29.6 GFLOPS ( 14 runs)
1024 x 1024: Q5_0    32.9 GFLOPS ( 16 runs) | Q5_1    30.6 GFLOPS ( 15 runs) | Q8_0    54.2 GFLOPS ( 26 runs)
1024 x 1024: F16     44.1 GFLOPS ( 21 runs) | F32     20.0 GFLOPS ( 10 runs)
2048 x 2048: Q4_0    57.7 GFLOPS (  4 runs) | Q4_1    47.7 GFLOPS (  3 runs) | Q4_2    38.7 GFLOPS (  3 runs)
2048 x 2048: Q5_0    37.8 GFLOPS (  3 runs) | Q5_1    35.1 GFLOPS (  3 runs) | Q8_0    63.6 GFLOPS (  4 runs)
2048 x 2048: F16     33.6 GFLOPS (  3 runs) | F32     14.8 GFLOPS (  3 runs)
4096 x 4096: Q4_0    61.9 GFLOPS (  3 runs) | Q4_1    50.2 GFLOPS (  3 runs) | Q4_2    38.8 GFLOPS (  3 runs)
4096 x 4096: Q5_0    40.6 GFLOPS (  3 runs) | Q5_1    37.9 GFLOPS (  3 runs) | Q8_0    70.4 GFLOPS (  3 runs)
4096 x 4096: F16     38.0 GFLOPS (  3 runs) | F32     20.8 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 134 | 1176 | be5911a |
| <todo> | <todo> |  NEON | base | 4 | 179 | 2964 | be5911a |
| <todo> | <todo> |  NEON | small | 4 | 416 | 11037 | be5911a |
| <todo> | <todo> |  NEON | medium | 4 | 23462 | 36469 | be5911a |
| <todo> | <todo> |  NEON | large | 4 | 47286 | 77494 | be5911a |
tazz4843 commented 1 year ago

System76 Pangolin (pang12) w/ Ryzen 7 6800U (8c16t) @ 2.7 GHz + 32 GB DDR5 at 6400 MT/s. Models stored on a Samsung 970 Evo Plus.

Running memcpy benchmark with 1 thread

memcpy: 11.18 GB/s
sum:    error -536870997.000000

Running ggml_mul_mat benchmark with 16 threads

ggml_mul_mat:   64 x   64: Q4_0     0.9 GFLOPS (128 runs) / Q4_1     0.4 GFLOPS (128 runs) / F16     1.2 GFLOPS (128 runs) / F32     1.2 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: Q4_0     6.1 GFLOPS (128 runs) / Q4_1     7.5 GFLOPS (128 runs) / F16     4.6 GFLOPS (128 runs) / F32    10.0 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: Q4_0    26.2 GFLOPS (128 runs) / Q4_1    42.3 GFLOPS (128 runs) / F16    19.9 GFLOPS (128 runs) / F32    47.9 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: Q4_0    66.6 GFLOPS (128 runs) / Q4_1    98.6 GFLOPS (128 runs) / F16    90.1 GFLOPS (128 runs) / F32   110.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: Q4_0    97.8 GFLOPS ( 46 runs) / Q4_1   154.3 GFLOPS ( 72 runs) / F16   158.7 GFLOPS ( 74 runs) / F32   132.2 GFLOPS ( 62 runs)
ggml_mul_mat: 2048 x 2048: Q4_0   126.7 GFLOPS (  8 runs) / Q4_1   164.8 GFLOPS ( 10 runs) / F16   164.1 GFLOPS ( 10 runs) / F32    96.4 GFLOPS (  6 runs)
ggml_mul_mat: 4096 x 4096: Q4_0   138.6 GFLOPS (  3 runs) / Q4_1   166.9 GFLOPS (  3 runs) / F16   136.0 GFLOPS (  3 runs) / F32    62.8 GFLOPS (  3 runs)

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| Ryzen 7 6800U | Arch Linux | AVX2 | tiny | 16 | 37 | 510 | 9c61f5f |
| Ryzen 7 6800U | Arch Linux | AVX2 | base | 16 | 51 | 1222 | 9c61f5f |
| Ryzen 7 6800U | Arch Linux | AVX2 | small | 16 | 123 | 4283 | 9c61f5f |
| Ryzen 7 6800U | Arch Linux | AVX2 | medium | 16 | 341 | 14178 | 9c61f5f |
| Ryzen 7 6800U | Arch Linux | AVX2 | large | 16 | 650 | 25801 | 9c61f5f |
Tetsuya81 commented 1 year ago

MacBook Air M2 24GB 2022 (CoreML model)

It is interesting that, when converted to a CoreML model, even a MacBook Air M2 reaches a processing speed close to that of a high-spec Mac, perhaps because the Neural Engine specification is the same across a given generation of Apple Silicon.

./extra/bench-all.sh 4
Usage: ./bench.sh [n_threads]

Running memcpy benchmark with 1 thread

memcpy: 34.33 GB/s
sum: ok -536870910.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_mul_mat:   64 x   64: F16     11.4 GFLOPS (128 runs) / F32     10.5 GFLOPS (128 runs)
ggml_mul_mat:  128 x  128: F16     89.0 GFLOPS (128 runs) / F32     74.8 GFLOPS (128 runs)
ggml_mul_mat:  256 x  256: F16    422.6 GFLOPS (128 runs) / F32    419.4 GFLOPS (128 runs)
ggml_mul_mat:  512 x  512: F16    793.4 GFLOPS (128 runs) / F32    801.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16    827.0 GFLOPS (128 runs) / F32    849.3 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16    821.8 GFLOPS ( 48 runs) / F32    773.4 GFLOPS ( 46 runs)
ggml_mul_mat: 4096 x 4096: F16    765.2 GFLOPS (  6 runs) / F32    743.6 GFLOPS (  6 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| | | NEON BLAS COREML | tiny | 4 | | | c23588c |
| | | NEON BLAS COREML | base | 4 | | | c23588c |
| M2 | 13.3.1 (a)(22E772610a) | NEON BLAS COREML | small | 4 | 153 | 199 | c23588c |
| M2 | 13.3.1 (a)(22E772610a) | NEON BLAS COREML | medium | 4 | 450 | 746 | c23588c |
| M2 | 13.3.1 (a)(22E772610a) | NEON BLAS COREML | large | 4 | 1053 | 1439 | c23588c |
nickovs commented 1 year ago

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| Raspberry Pi 4 2GB | Bullseye 6.1.21-v8+ | OPENBLAS | tiny.en | 4 | 393 | 7882 | 14bee39 |
| Raspberry Pi 4 2GB | Bullseye 6.1.21-v8+ | OPENBLAS | tiny.en-q5 | 4 | 265 | 8564 | 14bee39 |
| Raspberry Pi 4 2GB | Bullseye 6.1.21-v8+ | OPENBLAS | base.en | 4 | 571 | 16328 | 14bee39 |
| Raspberry Pi 4 2GB | Bullseye 6.1.21-v8+ | OPENBLAS | base.en-q5 | 4 | 306 | 16169 | 14bee39 |

Tests performed using Raspberry Pi OS libopenblas-dev package (version 0.3.13+ds-3).
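For anyone reproducing this, the OpenBLAS route is roughly (a sketch; package name as on Raspberry Pi OS):

sudo apt install libopenblas-dev
WHISPER_OPENBLAS=1 make -j bench
./extra/bench-all.sh 4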

StuartIanNaylor commented 1 year ago

Ryzen 3 2200GE (Lenovo M715q)

Running memcpy benchmark

memcpy: 12.14 GB/s (1 thread)
sum:    -536869898.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     5.3 GFLOPS (128 runs) | Q4_1     1.6 GFLOPS (128 runs) | Q4_2     5.2 GFLOPS (128 runs)
  64 x   64: Q5_0     5.5 GFLOPS (128 runs) | Q5_1     1.7 GFLOPS (128 runs) | Q8_0     1.7 GFLOPS (128 runs)
  64 x   64: F16      1.1 GFLOPS (128 runs) | F32      2.0 GFLOPS (128 runs)
 128 x  128: Q4_0     9.9 GFLOPS (128 runs) | Q4_1    10.8 GFLOPS (128 runs) | Q4_2     9.8 GFLOPS (128 runs)
 128 x  128: Q5_0    16.7 GFLOPS (128 runs) | Q5_1    19.0 GFLOPS (128 runs) | Q8_0    20.6 GFLOPS (128 runs)
 128 x  128: F16      9.4 GFLOPS (128 runs) | F32     29.8 GFLOPS (128 runs)
 256 x  256: Q4_0    26.1 GFLOPS (128 runs) | Q4_1    29.4 GFLOPS (128 runs) | Q4_2    31.2 GFLOPS (128 runs)
 256 x  256: Q5_0    28.4 GFLOPS (128 runs) | Q5_1    31.0 GFLOPS (128 runs) | Q8_0    32.5 GFLOPS (128 runs)
 256 x  256: F16     21.5 GFLOPS (128 runs) | F32     41.6 GFLOPS (128 runs)
 512 x  512: Q4_0    41.4 GFLOPS (128 runs) | Q4_1    42.7 GFLOPS (128 runs) | Q4_2    43.2 GFLOPS (128 runs)
 512 x  512: Q5_0    39.2 GFLOPS (128 runs) | Q5_1    37.2 GFLOPS (128 runs) | Q8_0    56.7 GFLOPS (128 runs)
 512 x  512: F16     29.3 GFLOPS (110 runs) | F32     56.0 GFLOPS (128 runs)
1024 x 1024: Q4_0    52.5 GFLOPS ( 25 runs) | Q4_1    51.6 GFLOPS ( 25 runs) | Q4_2    48.3 GFLOPS ( 23 runs)
1024 x 1024: Q5_0    44.1 GFLOPS ( 21 runs) | Q5_1    41.9 GFLOPS ( 20 runs) | Q8_0    71.4 GFLOPS ( 34 runs)
1024 x 1024: F16     30.4 GFLOPS ( 15 runs) | F32     35.5 GFLOPS ( 17 runs)
2048 x 2048: Q4_0    54.6 GFLOPS (  4 runs) | Q4_1    50.6 GFLOPS (  3 runs) | Q4_2    49.8 GFLOPS (  3 runs)
2048 x 2048: Q5_0    44.8 GFLOPS (  3 runs) | Q5_1    40.8 GFLOPS (  3 runs) | Q8_0    67.1 GFLOPS (  4 runs)
2048 x 2048: F16     29.1 GFLOPS (  3 runs) | F32     20.0 GFLOPS (  3 runs)
4096 x 4096: Q4_0    54.3 GFLOPS (  3 runs) | Q4_1    50.0 GFLOPS (  3 runs) | Q4_2    49.5 GFLOPS (  3 runs)
4096 x 4096: Q5_0    44.7 GFLOPS (  3 runs) | Q5_1    40.2 GFLOPS (  3 runs) | Q8_0    64.0 GFLOPS (  3 runs)
4096 x 4096: F16     28.3 GFLOPS (  3 runs) | F32     19.7 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | tiny | 4 | 68 | 1676 | 2b6a074 |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | base | 4 | 96 | 3850 | 2b6a074 |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | small | 4 | 235 | 14734 | 2b6a074 |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | medium | 4 | 660 | 49288 | 2b6a074 |
| Ryzen 3 2200GE |  Ubuntu 22.04.2 |  AVX2 | large | 4 | 1302 | 105757 | 2b6a074 |
kaspar030 commented 1 year ago

This is what I get with clblast on an AMD RX6700XT:
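(Presumably built roughly along these lines; the exact invocation wasn't given, and the flag assumes CLBlast headers/libraries are installed:)

WHISPER_CLBLAST=1 make -j bench
./extra/bench-all.sh 16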

Running memcpy benchmark

memcpy: 11.94 GB/s (1 thread)
sum: -536869898.000000

Running ggml_mul_mat benchmark with 16 threads

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx1031
64 x 64: Q4_0 0.8 GFLOPS (128 runs) | Q4_1 0.8 GFLOPS (128 runs)
  64 x   64: Q5_0     0.8 GFLOPS (128 runs) | Q5_1     0.8 GFLOPS (128 runs) | Q8_0     0.8 GFLOPS (128 runs)
  64 x   64: F16      0.8 GFLOPS (128 runs) | F32      0.8 GFLOPS (128 runs)
 128 x  128: Q4_0     5.6 GFLOPS (128 runs) | Q4_1     5.6 GFLOPS (128 runs)
 128 x  128: Q5_0     6.1 GFLOPS (128 runs) | Q5_1     5.7 GFLOPS (128 runs) | Q8_0     6.1 GFLOPS (128 runs)
 128 x  128: F16      5.8 GFLOPS (128 runs) | F32      6.0 GFLOPS (128 runs)
 256 x  256: Q4_0    43.4 GFLOPS (128 runs) | Q4_1    40.3 GFLOPS (128 runs)
 256 x  256: Q5_0    38.2 GFLOPS (128 runs) | Q5_1    39.2 GFLOPS (128 runs) | Q8_0    39.0 GFLOPS (128 runs)
 256 x  256: F16     38.3 GFLOPS (128 runs) | F32     38.6 GFLOPS (128 runs)
 512 x  512: Q4_0   210.9 GFLOPS (128 runs) | Q4_1   212.8 GFLOPS (128 runs)
 512 x  512: Q5_0   212.0 GFLOPS (128 runs) | Q5_1   213.2 GFLOPS (128 runs) | Q8_0   210.2 GFLOPS (128 runs)
 512 x  512: F16    195.5 GFLOPS (128 runs) | F32    208.7 GFLOPS (128 runs)
1024 x 1024: Q4_0  1280.6 GFLOPS (128 runs) | Q4_1  1289.0 GFLOPS (128 runs)
1024 x 1024: Q5_0  1292.2 GFLOPS (128 runs) | Q5_1  1287.4 GFLOPS (128 runs) | Q8_0  1271.0 GFLOPS (128 runs)
1024 x 1024: F16   1025.9 GFLOPS (128 runs) | F32   1227.8 GFLOPS (128 runs)
2048 x 2048: Q4_0  3423.2 GFLOPS (128 runs) | Q4_1  3414.1 GFLOPS (128 runs)
2048 x 2048: Q5_0  3393.6 GFLOPS (128 runs) | Q5_1  3385.8 GFLOPS (128 runs) | Q8_0  3385.2 GFLOPS (128 runs)
2048 x 2048: F16   2434.4 GFLOPS (128 runs) | F32   3045.8 GFLOPS (128 runs)
4096 x 4096: Q4_0  4187.6 GFLOPS ( 31 runs) | Q4_1  4193.6 GFLOPS ( 31 runs)
4096 x 4096: Q5_0  4204.3 GFLOPS ( 31 runs) | Q5_1  4187.1 GFLOPS ( 31 runs) | Q8_0  4135.0 GFLOPS ( 31 runs)
4096 x 4096: F16   3491.1 GFLOPS ( 26 runs) | F32   3911.3 GFLOPS ( 29 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | tiny | 16 | 382 | 603 | 95b02d7 |
| Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | base | 16 | 371 | 717 | 95b02d7 |
| Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | small | 16 | 427 | 1271 | 95b02d7 |
| Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | medium | 16 | 636 | 2784 | 95b02d7 |
| Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | large | 16 | 868 | 4308 | 95b02d7 |
randomshinichi commented 1 year ago

Thinkpad T480, Core i7 8550U

Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 12.67 GB/s (1 thread)
sum:    -536869898.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     6.1 GFLOPS (128 runs) | Q4_1     6.4 GFLOPS (128 runs)
  64 x   64: Q5_0     6.6 GFLOPS (128 runs) | Q5_1     6.7 GFLOPS (128 runs) | Q8_0     6.3 GFLOPS (128 runs)
  64 x   64: F16      7.8 GFLOPS (128 runs) | F32      5.4 GFLOPS (128 runs)
 128 x  128: Q4_0    25.3 GFLOPS (128 runs) | Q4_1    25.5 GFLOPS (128 runs)
 128 x  128: Q5_0    29.6 GFLOPS (128 runs) | Q5_1    26.9 GFLOPS (128 runs) | Q8_0    31.7 GFLOPS (128 runs)
 128 x  128: F16     34.8 GFLOPS (128 runs) | F32     13.8 GFLOPS (128 runs)
 256 x  256: Q4_0    49.9 GFLOPS (128 runs) | Q4_1    43.3 GFLOPS (128 runs)
 256 x  256: Q5_0    46.6 GFLOPS (128 runs) | Q5_1    45.4 GFLOPS (128 runs) | Q8_0    64.0 GFLOPS (128 runs)
 256 x  256: F16     61.2 GFLOPS (128 runs) | F32     18.7 GFLOPS (128 runs)
 512 x  512: Q4_0    66.7 GFLOPS (128 runs) | Q4_1    54.7 GFLOPS (128 runs)
 512 x  512: Q5_0    53.5 GFLOPS (128 runs) | Q5_1    57.9 GFLOPS (128 runs) | Q8_0    80.6 GFLOPS (128 runs)
 512 x  512: F16     65.5 GFLOPS (128 runs) | F32     22.2 GFLOPS ( 83 runs)
1024 x 1024: Q4_0    77.7 GFLOPS ( 37 runs) | Q4_1    66.9 GFLOPS ( 32 runs)
1024 x 1024: Q5_0    66.3 GFLOPS ( 31 runs) | Q5_1    60.2 GFLOPS ( 29 runs) | Q8_0    91.6 GFLOPS ( 44 runs)
1024 x 1024: F16     63.8 GFLOPS ( 30 runs) | F32     21.2 GFLOPS ( 10 runs)
2048 x 2048: Q4_0    74.3 GFLOPS (  5 runs) | Q4_1    71.1 GFLOPS (  5 runs)
2048 x 2048: Q5_0    59.5 GFLOPS (  4 runs) | Q5_1    56.4 GFLOPS (  4 runs) | Q8_0    90.2 GFLOPS (  6 runs)
2048 x 2048: F16     49.9 GFLOPS (  3 runs) | F32     15.9 GFLOPS (  3 runs)
4096 x 4096: Q4_0    61.1 GFLOPS (  3 runs) | Q4_1    54.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0    48.4 GFLOPS (  3 runs) | Q5_1    45.1 GFLOPS (  3 runs) | Q8_0    62.7 GFLOPS (  3 runs)
4096 x 4096: F16     38.4 GFLOPS (  3 runs) | F32     12.9 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |

I don't know why it stopped when it was about to run the benchmark for all models. I have ggml-base.en.bin, and I have for-tests-ggml*.bin.

StuartIanNaylor commented 1 year ago

@randomshinichi That is what it does when the non-en models are not available.
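A minimal sketch of fetching the multilingual models the script walks (names per models/download-ggml-model.sh):

./models/download-ggml-model.sh tiny
./models/download-ggml-model.sh base
./models/download-ggml-model.sh small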

mark-beeby commented 1 year ago

Jetson Orin Nano (Developer Kit) - Unoptimised install (no CLBlast, CUBLAS etc)

Running memcpy benchmark

memcpy: 6.28 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     4.1 GFLOPS (128 runs) | Q4_1     4.2 GFLOPS (128 runs)
  64 x   64: Q5_0     4.2 GFLOPS (128 runs) | Q5_1     4.1 GFLOPS (128 runs) | Q8_0     4.6 GFLOPS (128 runs)
  64 x   64: F16      4.0 GFLOPS (128 runs) | F32      5.2 GFLOPS (128 runs)
 128 x  128: Q4_0    12.9 GFLOPS (128 runs) | Q4_1    13.2 GFLOPS (128 runs)
 128 x  128: Q5_0    12.7 GFLOPS (128 runs) | Q5_1    12.5 GFLOPS (128 runs) | Q8_0    14.1 GFLOPS (128 runs)
 128 x  128: F16      9.3 GFLOPS (128 runs) | F32     20.9 GFLOPS (128 runs)
 256 x  256: Q4_0    17.9 GFLOPS (128 runs) | Q4_1    17.5 GFLOPS (128 runs)
 256 x  256: Q5_0    17.8 GFLOPS (128 runs) | Q5_1    16.2 GFLOPS (128 runs) | Q8_0    20.3 GFLOPS (128 runs)
 256 x  256: F16     10.4 GFLOPS (128 runs) | F32     28.8 GFLOPS (128 runs)
 512 x  512: Q4_0    21.1 GFLOPS ( 79 runs) | Q4_1    20.0 GFLOPS ( 75 runs)
 512 x  512: Q5_0    18.6 GFLOPS ( 70 runs) | Q5_1    19.1 GFLOPS ( 72 runs) | Q8_0    22.0 GFLOPS ( 83 runs)
 512 x  512: F16     10.5 GFLOPS ( 40 runs) | F32     25.7 GFLOPS ( 97 runs)
1024 x 1024: Q4_0    20.6 GFLOPS ( 10 runs) | Q4_1    20.4 GFLOPS ( 10 runs)
1024 x 1024: Q5_0    20.2 GFLOPS ( 10 runs) | Q5_1    18.7 GFLOPS (  9 runs) | Q8_0    23.2 GFLOPS ( 11 runs)
1024 x 1024: F16     11.4 GFLOPS (  6 runs) | F32     16.6 GFLOPS (  8 runs)
2048 x 2048: Q4_0    22.3 GFLOPS (  3 runs) | Q4_1    22.4 GFLOPS (  3 runs)
2048 x 2048: Q5_0    22.0 GFLOPS (  3 runs) | Q5_1    20.9 GFLOPS (  3 runs) | Q8_0    25.8 GFLOPS (  3 runs)
2048 x 2048: F16     11.9 GFLOPS (  3 runs) | F32     11.5 GFLOPS (  3 runs)
4096 x 4096: Q4_0    22.7 GFLOPS (  3 runs) | Q4_1    22.6 GFLOPS (  3 runs)
4096 x 4096: Q5_0    22.2 GFLOPS (  3 runs) | Q5_1    21.0 GFLOPS (  3 runs) | Q8_0    26.2 GFLOPS (  3 runs)
4096 x 4096: F16     12.0 GFLOPS (  3 runs) | F32     13.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| 6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | tiny | 4 | 117 | 3631 | 5e2b340 |
| 6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | base | 4 | 153 | 8603 | 5e2b340 |
| 6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | small | 4 | 323 | 33605 | 5e2b340 |
| 6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | medium | 4 | 1059 | 111404 | 5e2b340 |
| 6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | large | 4 | 3187 | 222130 | 5e2b340 |
StuartIanNaylor commented 1 year ago

> Jetson Orin Nano (Developer Kit) results, quoted from the previous comment.

@mark-beeby Are you sure everything is correct with your distro? Your results are really bad compared to what I was expecting; I've been looking forward to seeing what an Orin Nano can do.

Check out an rk3588 https://github.com/ggerganov/whisper.cpp/issues/89#issuecomment-1529989153 as that is an A76x4 with DDR4 not DDR5...

Also interested in what you get with cuBlas https://github.com/ggerganov/whisper.cpp#opencl-gpu-support-via-clblast

mark-beeby commented 1 year ago

Jetson Orin Nano (Developer Kit) - CUBLAS

Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 6.26 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     1.0 GFLOPS (128 runs) | Q4_1     0.9 GFLOPS (128 runs)
  64 x   64: Q5_0     0.7 GFLOPS (128 runs) | Q5_1     0.9 GFLOPS (128 runs) | Q8_0     1.0 GFLOPS (128 runs)
  64 x   64: F16      1.0 GFLOPS (128 runs) | F32      0.9 GFLOPS (128 runs)
 128 x  128: Q4_0     6.8 GFLOPS (128 runs) | Q4_1     7.3 GFLOPS (128 runs)
 128 x  128: Q5_0     7.8 GFLOPS (128 runs) | Q5_1     7.8 GFLOPS (128 runs) | Q8_0     7.8 GFLOPS (128 runs)
 128 x  128: F16      8.0 GFLOPS (128 runs) | F32      7.7 GFLOPS (128 runs)
 256 x  256: Q4_0    57.1 GFLOPS (128 runs) | Q4_1    62.5 GFLOPS (128 runs)
 256 x  256: Q5_0    62.3 GFLOPS (128 runs) | Q5_1    62.8 GFLOPS (128 runs) | Q8_0    64.6 GFLOPS (128 runs)
 256 x  256: F16     38.7 GFLOPS (128 runs) | F32     38.6 GFLOPS (128 runs)
 512 x  512: Q4_0   248.6 GFLOPS (128 runs) | Q4_1   250.9 GFLOPS (128 runs)
 512 x  512: Q5_0   250.2 GFLOPS (128 runs) | Q5_1   248.7 GFLOPS (128 runs) | Q8_0   247.8 GFLOPS (128 runs)
 512 x  512: F16    215.2 GFLOPS (128 runs) | F32    210.5 GFLOPS (128 runs)
1024 x 1024: Q4_0   884.6 GFLOPS (128 runs) | Q4_1   882.7 GFLOPS (128 runs)
1024 x 1024: Q5_0   879.2 GFLOPS (128 runs) | Q5_1   872.7 GFLOPS (128 runs) | Q8_0   632.0 GFLOPS (128 runs)
1024 x 1024: F16    651.2 GFLOPS (128 runs) | F32    627.2 GFLOPS (128 runs)
2048 x 2048: Q4_0  1349.9 GFLOPS ( 79 runs) | Q4_1  1337.1 GFLOPS ( 78 runs)
2048 x 2048: Q5_0  1332.3 GFLOPS ( 78 runs) | Q5_1  1327.7 GFLOPS ( 78 runs) | Q8_0  1304.8 GFLOPS ( 76 runs)
2048 x 2048: F16   1401.6 GFLOPS ( 82 runs) | F32   1140.0 GFLOPS ( 67 runs)
4096 x 4096: Q4_0  1967.6 GFLOPS ( 15 runs) | Q4_1  1962.9 GFLOPS ( 15 runs)
4096 x 4096: Q5_0  1956.3 GFLOPS ( 15 runs) | Q5_1  1952.7 GFLOPS ( 15 runs) | Q8_0  1929.9 GFLOPS ( 15 runs)
4096 x 4096: F16   2603.2 GFLOPS ( 19 runs) | F32   1742.4 GFLOPS ( 13 runs)

Running benchmark for all models
This can take a while!
CPU OS Config Model Th Load Enc. Commit
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS tiny 4 1296 544 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS base 4 1350 1015 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS small 4 1557 2901 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS medium 4 2303 7977 5e2b340
6-core Arm Cortex-A78AE Ubuntu 20.04 NEON BLAS large 4 6716 12913 5e2b340

@StuartIanNaylor I've struggled to get CLBlast installed, so I moved back to a CUDA install. After a few hiccups, and after setting export CUDA_VISIBLE_DEVICES=0, I got the much more favourable results above. Hope that helps!
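
(For anyone reproducing this, the sequence was roughly the following; the 4 is n_threads per the script's usage line above, and the device index assumes the Orin's single GPU:

# pin CUDA to the first (only) device, then run the full benchmark suite
export CUDA_VISIBLE_DEVICES=0
./extra/bench-all.sh 4
)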

tazz4843 commented 1 year ago

New desktop I built - CPU: i7-13700K (+200 MHz turbo overclock), RAM: DDR5 @ 5600 MT/s, GPU: Intel Arc A770 LE

I tried various thread counts before settling on 20. Anything past 20 resulted in a drop in performance, which is expected given the i7-13700K's mix of performance and efficiency cores (8P + 8E, 24 threads).
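
A quick way to find that sweet spot is to sweep thread counts over the encoder benchmark; a minimal sketch, assuming the bench binary is built and the base.en model is in the repo's default download location:

# sweep thread counts on the encoder benchmark
for t in 8 12 16 20 24; do
  echo "== $t threads =="
  ./bench -m models/ggml-base.en.bin -t $t
done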

Running memcpy benchmark

memcpy: 23.16 GB/s (1 thread)
sum:    -536869898.000000

Running ggml_mul_mat benchmark with 20 threads

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: Intel(R) OpenCL HD Graphics Device: Intel(R) Arc(TM) A770 Graphics
  64 x   64: Q4_0     0.9 GFLOPS (128 runs) | Q4_1     1.0 GFLOPS (128 runs)
  64 x   64: Q5_0     1.0 GFLOPS (128 runs) | Q5_1     1.0 GFLOPS (128 runs) | Q8_0     1.0 GFLOPS (128 runs)
  64 x   64: F16      1.0 GFLOPS (128 runs) | F32      1.0 GFLOPS (128 runs)
 128 x  128: Q4_0     5.6 GFLOPS (128 runs) | Q4_1     5.8 GFLOPS (128 runs)
 128 x  128: Q5_0     5.7 GFLOPS (128 runs) | Q5_1     5.4 GFLOPS (128 runs) | Q8_0     5.0 GFLOPS (128 runs)
 128 x  128: F16      5.6 GFLOPS (128 runs) | F32      5.5 GFLOPS (128 runs)
 256 x  256: Q4_0    40.4 GFLOPS (128 runs) | Q4_1    38.9 GFLOPS (128 runs)
 256 x  256: Q5_0    40.7 GFLOPS (128 runs) | Q5_1    40.3 GFLOPS (128 runs) | Q8_0    38.5 GFLOPS (128 runs)
 256 x  256: F16     40.8 GFLOPS (128 runs) | F32     40.8 GFLOPS (128 runs)
 512 x  512: Q4_0   260.5 GFLOPS (128 runs) | Q4_1   264.6 GFLOPS (128 runs)
 512 x  512: Q5_0   234.3 GFLOPS (128 runs) | Q5_1   254.8 GFLOPS (128 runs) | Q8_0   260.2 GFLOPS (128 runs)
 512 x  512: F16    223.7 GFLOPS (128 runs) | F32    261.0 GFLOPS (128 runs)
1024 x 1024: Q4_0  1158.0 GFLOPS (128 runs) | Q4_1  1158.2 GFLOPS (128 runs)
1024 x 1024: Q5_0  1119.2 GFLOPS (128 runs) | Q5_1  1157.4 GFLOPS (128 runs) | Q8_0  1125.5 GFLOPS (128 runs)
1024 x 1024: F16    871.3 GFLOPS (128 runs) | F32   1029.7 GFLOPS (128 runs)
2048 x 2048: Q4_0  2847.7 GFLOPS (128 runs) | Q4_1  2749.8 GFLOPS (128 runs)
2048 x 2048: Q5_0  2752.3 GFLOPS (128 runs) | Q5_1  2879.4 GFLOPS (128 runs) | Q8_0  2770.3 GFLOPS (128 runs)
2048 x 2048: F16   2061.0 GFLOPS (120 runs) | F32   2504.5 GFLOPS (128 runs)
4096 x 4096: Q4_0  4681.2 GFLOPS ( 35 runs) | Q4_1  4637.2 GFLOPS ( 34 runs)
4096 x 4096: Q5_0  4646.7 GFLOPS ( 34 runs) | Q5_1  4586.6 GFLOPS ( 34 runs) | Q8_0  4589.7 GFLOPS ( 34 runs)
4096 x 4096: F16   3444.7 GFLOPS ( 26 runs) | F32   4128.2 GFLOPS ( 31 runs)
CPU OS Config Model Th Load Enc. Commit
Intel Core i7-13700K Arch Linux AVX2 BLAS tiny 20 145 417 5e2b340
Intel Core i7-13700K Arch Linux AVX2 BLAS base 20 161 560 5e2b340
Intel Core i7-13700K Arch Linux AVX2 BLAS small 20 281 1072 5e2b340
Intel Core i7-13700K Arch Linux AVX2 BLAS medium 20 606 2771 5e2b340
Intel Core i7-13700K Arch Linux AVX2 BLAS large 20 1116 4105 5e2b340

CPU power draw during these last tests averaged 140 watts, peaking at 141. GPU metrics are currently not exposed in Linux for Arc, so I'm unable to check what that was drawing.
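
(A minimal way to reproduce the CPU power figure without a meter, assuming a Linux box that exposes Intel RAPL counters under /sys/class/powercap; the energy counter wraps periodically, which this sketch ignores:

# average package power over a 10 s window; energy_uj is in microjoules
E0=$(cat /sys/class/powercap/intel-rapl:0/energy_uj)
sleep 10
E1=$(cat /sys/class/powercap/intel-rapl:0/energy_uj)
echo "avg package power: $(( (E1 - E0) / 10000000 )) W"
)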