ggerganov opened 1 year ago
What's the performance gain of this against the original implementation, with PyTorch compiled with AVX support or with the PyTorch M1 backend?
Does this implementation use beam decoding? (The original PyTorch implementation has n=5 as default and is ~100% faster with n=1.)
Edit: README already mentions it's greedy decoding:
Very basic greedy sampling scheme - always pick up the token with highest probability. This should be similar to the GreedyDecoder from the original python implementation, so in order to make a fair comparison between the 2 implementations, make sure to run the python code with the following parameters:
whisper --best_of None --beam_size None ...
Greedy decoding is also 2x faster in the original implementation (on a GPU).
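For readers unfamiliar with the term: greedy decoding is just an argmax over the token logits at each step. A minimal sketch of the idea (my own illustration, not the actual whisper.cpp sampler):

```cpp
#include <algorithm>
#include <vector>

// Greedy sampling: always pick the token with the highest probability
// (equivalently, the highest logit). Illustrative only.
int greedy_sample(const std::vector<float> & logits) {
    return (int) std::distance(logits.begin(),
                               std::max_element(logits.begin(), logits.end()));
}
```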
Orange Pi 5, 4 GB, microSD (not NVMe).
Starts to touch zram swap on medium, then hits file swap pretty hard on large.
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
rk3588s | Bullseye 5.10.110 | NEON | tiny | 8 | 352 | 2876 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | base | 8 | 346 | 6213 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | small | 8 | 690 | 25808 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | medium | 8 | 23987 | 93995 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | large | 8 | 49633 | 190601 | 0be6a1a |
Even with the 4:4 big.LITTLE layout, pinning to the big cores is a touch faster: `taskset -c 4-7 ./extra/bench-all.sh` (see the affinity sketch after the table below).
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
rk3588s | Bullseye 5.10.110 | NEON | tiny | 4 | 356 | 2716 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | base | 4 | 417 | 6661 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | small | 4 | 943 | 25357 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | medium | 4 | 17748 | 90187 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | large | 4 | 48793 | 182800 | 0be6a1a |
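For reference, here is roughly what that `taskset` pinning does, as a minimal sketch (assumes Linux, and that cores 4-7 are the Cortex-A76 big cores on the rk3588s):

```cpp
// Minimal sketch: pin the process to cores 4-7 before running the
// benchmark, equivalent to `taskset -c 4-7`. Linux-specific.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 4; cpu <= 7; ++cpu) {
        CPU_SET(cpu, &set); // assumed big-core IDs on the rk3588s
    }
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    // ... exec the benchmark or call into whisper.cpp from here ...
    return 0;
}
```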
Compiling on an rk3588 with `-march=native -ffast-math` seems to give a big boost: `taskset -c 4-7 ./extra/bench-all.sh`
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
rk3588s | Bullseye 5.10.110 | NEON | tiny | 4 | 280 | 1074 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | base | 4 | 466 | 3491 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | small | 4 | 780 | 11052 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | medium | 4 | 15361 | 42252 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | large | 4 | 49331 | 91892 | 0be6a1a |
Intel Celeron N4120 (4 cores, 4 threads) on Artix Linux 6.0.12-artix1-1.
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
N4120 | Artix 6.0.12-artix1-1 | BLAS | tiny | 4 | 330 | 12272 | 65fdcbb |
N4120 | Artix 6.0.12-artix1-1 | BLAS | base | 4 | | | 65fdcbb |
N4120 | Artix 6.0.12-artix1-1 | BLAS | small | 4 | 892 | 83209 | 65fdcbb |
N4120 | Artix 6.0.12-artix1-1 | BLAS | medium | 4 | 5478 | 237677 | 65fdcbb |
Base 14-inch M1 MacBook Pro with NEON enabled:
CPU | OS | Config | RAM (GB) | Th | Model | Load (ms) | Enc. (ms) | Total (ms) |
---|---|---|---|---|---|---|---|---|
M1 Pro | OSX 12.5.1 | NEON | 16 | 8 | Tiny.en | 107 | 269.72 | 376.91 |
M1 Pro | OSX 12.5.1 | NEON | 16 | 8 | Base.en | 92 | 321 | 413.77 |
M1 Pro | OSX 12.5.1 | NEON | 16 | 8 | Small.en | 264 | 978 | 1243.24 |
16-inch base Apple M2 Pro results
CPU | OS | Config | RAM (GB) | Th | Model | Load (ms) | Enc. (ms) | Total (ms) |
---|---|---|---|---|---|---|---|---|
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Tiny.en | 118 | 143 | 261 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Tiny | 118 | 143 | 261 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Base.en | 173 | 235 | 408 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Base | 148 | 266 | 414 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Small.en | 304 | 739 | 1042 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Small | 277(?) | 720 | 997 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Medium.en | 747 | 2057 | 2804 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Medium | 657 | 2055 | 2712 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Large | 2126 | 4223 | 6349 |
I couldn't get bench to run on my iPhone 12, so I have attached my ad-hoc results below with the input audio "I love transcriber apps":
CPU | DGGML_USE_ACCELERATE | OS | Model | Load | Mel | Sample | Enc. | Dec. | Total (ms) |
---|---|---|---|---|---|---|---|---|---|
A14 | Release | IOS 16.1 | Base.en | 150 | 23 | 2 | 2447 | 112 | 2584 |
This might seem obvious to some, but it wasn't to me, so I'll note it here: I saw much better results using larger step lengths and sample sizes with ./stream. I suspect that under the hood, Whisper relies heavily on whole-sentence context to infer individual words.
With the new beta 1.1.0 release: at first glance, not too much difference. I will not rebuild without OpenBLAS, as it was slightly better with it on the RPi 4.
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 4 | 751 | 9506 | ecda7f786a |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny.en | 4 | 748 | 9295 | ecda7f786a |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 4 | 971 | 23512 | ecda7f786a |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base.en | 4 | 958 | 24263 | ecda7f786a |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | small | 4 | 2238 | 84720 | ecda7f786a |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | small.en | 4 | 3880 | 86031 | ecda7f786a |
Results on 12th Gen Intel(R) Core(TM) i3-12300T:
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Core i3-12300T | Debian 11 (Docker on Win11) | AVX2 | tiny.en | 4 | 97 | 679 | 49b529b |
Core i3-12300T | Debian 11 (Docker on Win11) | AVX2 | tiny | 4 | 90 | 580 | 49b529b |
Core i3-12300T | Debian 11 (Docker on Win11) | AVX2 | base | 4 | 138 | 1478 | 49b529b |
With OpenBLAS (considerably worse):
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Core i3-12300T | Debian 11 (Docker on Win11) | AVX2 BLAS | tiny | 4 | 117 | 1644 | 49b529b |
Core i3-12300T | Debian 11 (Docker on Win11) | AVX2 BLAS | base | 4 | 122 | 2890 | 49b529b |
The benchmarks for the MacBook Pro M1 use 8 threads, but in my experience it runs nearly twice as fast with 4 threads. Am I missing something?
Edit: I just ran the benchmark with the large model, and it actually made almost no difference whether 8 or 4 threads were used. But with real-world workloads it makes a huge difference. Interesting.
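For anyone reproducing the thread comparison from the C API instead of ./extra/bench-all.sh: n_threads is a field on whisper_full_params. A hedged sketch against the API as seen at these commits (names may have changed in later versions):

```cpp
#include "whisper.h"
#include <vector>

int main() {
    // whisper_init_from_file() matches the loader seen in the logs in this thread.
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-large.bin");
    if (!ctx) return 1;

    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    wparams.n_threads = 4; // the knob being compared here: 4 vs 8 threads

    std::vector<float> pcmf32; // 16 kHz mono f32 samples; load your audio here

    whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size());
    whisper_print_timings(ctx);
    whisper_free(ctx);
    return 0;
}
```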
Running memcpy benchmark with 1 thread
memcpy: 8.66 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 4.2 GFLOPS (128 runs) / F32 3.5 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 10.1 GFLOPS (128 runs) / F32 6.3 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 13.0 GFLOPS (128 runs) / F32 7.2 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 14.0 GFLOPS ( 53 runs) / F32 7.1 GFLOPS ( 27 runs)
ggml_mul_mat: 1024 x 1024: F16 29.8 GFLOPS ( 15 runs) / F32 17.8 GFLOPS ( 9 runs)
ggml_mul_mat: 2048 x 2048: F16 37.8 GFLOPS ( 3 runs) / F32 19.6 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 40.0 GFLOPS ( 3 runs) / F32 17.4 GFLOPS ( 3 runs)
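For context on how to read these numbers: an N x N matrix multiply costs 2N^3 FLOPs, and the GFLOPS figure is just that count divided by wall time. A small sketch of the arithmetic (my own illustration, not bench's code):

```cpp
#include <cstdio>

// A dense N x N x N multiply does N^3 multiply-adds = 2*N^3 FLOPs.
// At N = 4096 that is ~137 GFLOP per multiplication, so the 40 GFLOPS
// above corresponds to roughly 3.4 s per 4096 x 4096 matmul.
double gflops(int n, double seconds) {
    return 2.0 * (double) n * n * n / seconds / 1e9;
}

int main() {
    printf("%.1f GFLOPS\n", gflops(4096, 3.4)); // prints ~40.4
    return 0;
}
```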
Running benchmark for all models
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
rk3588s | Ubuntu 22.04 | NEON | tiny | 4 | 257 | 1179 | 21c569b |
rk3588s | Ubuntu 22.04 | NEON | base | 4 | 326 | 2967 | 21c569b |
rk3588s | Ubuntu 22.04 | NEON | small | 4 | 661 | 10560 | 21c569b |
rk3588s | Ubuntu 22.04 | NEON | medium | 4 | 23188 | 35867 | 21c569b |
Compiler: gcc version 12.2.0 (Ubuntu 12.2.0-3ubuntu1)
memcpy: 16.74 GB/s
sum: error -536870997.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 16.2 GFLOPS (128 runs) / F32 16.4 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 70.1 GFLOPS (128 runs) / F32 66.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 133.9 GFLOPS (128 runs) / F32 105.7 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 161.2 GFLOPS (128 runs) / F32 109.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 204.4 GFLOPS ( 96 runs) / F32 121.9 GFLOPS ( 57 runs)
ggml_mul_mat: 2048 x 2048: F16 254.4 GFLOPS ( 15 runs) / F32 149.3 GFLOPS ( 9 runs)
ggml_mul_mat: 4096 x 4096: F16 184.2 GFLOPS ( 3 runs) / F32 54.1 GFLOPS ( 3 runs)
Running ggml_mul_mat benchmark with 8 threads
ggml_mul_mat: 64 x 64: F16 8.4 GFLOPS (128 runs) / F32 9.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 58.1 GFLOPS (128 runs) / F32 57.6 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 170.3 GFLOPS (128 runs) / F32 159.9 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 315.7 GFLOPS (128 runs) / F32 230.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 356.0 GFLOPS (128 runs) / F32 224.9 GFLOPS (105 runs)
ggml_mul_mat: 2048 x 2048: F16 499.5 GFLOPS ( 30 runs) / F32 292.4 GFLOPS ( 18 runs)
ggml_mul_mat: 4096 x 4096: F16 265.9 GFLOPS ( 3 runs) / F32 66.2 GFLOPS ( 3 runs)
Running ggml_mul_mat benchmark with 16 threads
ggml_mul_mat: 64 x 64: F16 3.6 GFLOPS (128 runs) / F32 3.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 16.7 GFLOPS (128 runs) / F32 27.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 88.1 GFLOPS (128 runs) / F32 126.7 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 263.5 GFLOPS (128 runs) / F32 229.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 396.1 GFLOPS (128 runs) / F32 272.8 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 498.6 GFLOPS ( 30 runs) / F32 314.9 GFLOPS ( 19 runs)
ggml_mul_mat: 4096 x 4096: F16 337.7 GFLOPS ( 3 runs) / F32 112.0 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | tiny.en | 4 | 104 | 247 | 78f1661 |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | base.en | 4 | 130 | 585 | 78f1661 |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | small.en | 4 | 264 | 1940 | 78f1661 |
--- | -- | ------ | ----- | -- | ---- | ---- | ------ |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | tiny.en | 8 | 99 | 166 | 78f1661 |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | base.en | 8 | 123 | 329 | 78f1661 |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | small.en | 8 | 262 | 1148 | 78f1661 |
--- | -- | ------ | ----- | -- | ---- | ---- | ------ |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | tiny.en | 16 | 100 | 160 | 78f1661 |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | base.en | 16 | 123 | 338 | 78f1661 |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | small.en | 16 | 262 | 1139 | 78f1661 |
Tested on my M2 MacBook Air:
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
./extra/bench-all.sh
Running memcpy benchmark with 1 thread
memcpy: 31.42 GB/s
sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 11.8 GFLOPS (128 runs) / F32 10.6 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 89.9 GFLOPS (128 runs) / F32 74.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 434.5 GFLOPS (128 runs) / F32 419.9 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 885.4 GFLOPS (128 runs) / F32 913.2 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1023.4 GFLOPS (128 runs) / F32 1037.7 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 971.6 GFLOPS ( 57 runs) / F32 950.1 GFLOPS ( 56 runs)
ggml_mul_mat: 4096 x 4096: F16 914.9 GFLOPS ( 7 runs) / F32 820.7 GFLOPS ( 6 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
M2 | OSX 13.0.1 | NEON BLAS | tiny | 4 | 63 | 153 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | base | 4 | 92 | 329 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | small | 4 | 198 | 1014 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | medium | 4 | 564 | 3042 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | large | 4 | 1152 | 5466 | 1a91c19 |
Running ggml_mul_mat benchmark with 8 threads
ggml_mul_mat: 64 x 64: F16 5.7 GFLOPS (128 runs) / F32 3.9 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 45.0 GFLOPS (128 runs) / F32 25.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 272.7 GFLOPS (128 runs) / F32 166.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 747.6 GFLOPS (128 runs) / F32 748.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 998.7 GFLOPS (128 runs) / F32 895.8 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 716.0 GFLOPS ( 42 runs) / F32 717.2 GFLOPS ( 42 runs)
ggml_mul_mat: 4096 x 4096: F16 790.4 GFLOPS ( 6 runs) / F32 726.3 GFLOPS ( 6 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
M2 | OSX 13.0.1 | NEON BLAS | tiny | 8 | 66 | 154 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | base | 8 | 92 | 346 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | small | 8 | 211 | 1171 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | medium | 8 | 562 | 3848 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | large | 8 | 1079 | 6230 | 1a91c19 |
This is the bench result:
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 500.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 1245.39 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 88596.32 ms / 1 runs (88596.32 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 89841.85 ms
This is the cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping : 7
microcode : 0x2f
cpu MHz : 2990.383
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips : 4983.97
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
(processors 1-3 report the same values; only core id and apicid differ)
./bench -w 1 -t 1
memcpy: 3.35 GB/s
sum: error -536870997.000000
./bench -w 2 -t 1
ggml_mul_mat: 64 x 64: F16 0.7 GFLOPS (128 runs) / F32 3.3 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 0.7 GFLOPS (128 runs) / F32 3.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 0.6 GFLOPS ( 18 runs) / F32 3.3 GFLOPS ( 99 runs)
ggml_mul_mat: 512 x 512: F16 0.6 GFLOPS ( 3 runs) / F32 3.6 GFLOPS ( 14 runs)
ggml_mul_mat: 1024 x 1024: F16 0.7 GFLOPS ( 3 runs) / F32 2.3 GFLOPS ( 3 runs)
ggml_mul_mat: 2048 x 2048: F16 0.7 GFLOPS ( 3 runs) / F32 2.4 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 1.2 GFLOPS ( 3 runs) / F32 3.0 GFLOPS ( 3 runs)
ThinkPad T520, on Linux Mint Debian Edition, with AVX1 commented out in the Makefile.
Usage: ./bench.sh [n_threads]
Running memcpy benchmark with 1 thread
memcpy: 38.84 GB/s
sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 9.8 GFLOPS (128 runs) / F32 8.4 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 69.4 GFLOPS (128 runs) / F32 62.1 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 455.3 GFLOPS (128 runs) / F32 383.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1141.1 GFLOPS (128 runs) / F32 1550.2 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 2302.0 GFLOPS (128 runs) / F32 2962.9 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 3035.6 GFLOPS (128 runs) / F32 3217.5 GFLOPS (128 runs)
ggml_mul_mat: 4096 x 4096: F16 3431.7 GFLOPS ( 25 runs) / F32 3510.6 GFLOPS ( 26 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
M1 Ultra | 13.2 | NEON BLAS | tiny | 4 | 71 | 139 | 2bee265 |
M1 Ultra | 13.2 | NEON BLAS | base | 4 | 95 | 266 | 2bee265 |
M1 Ultra | 13.2 | NEON BLAS | small | 4 | 222 | 806 | 2bee265 |
M1 Ultra | 13.2 | NEON BLAS | medium | 4 | 598 | 2175 | 2bee265 |
M1 Ultra | 13.2 | NEON BLAS | large | 4 | 1165 | 3895 | 2bee265 |
Here are new results for POWER9, now that #300 is closed.
Running memcpy benchmark with 1 thread
memcpy: 6.32 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 32 threads
ggml_mul_mat: 64 x 64: F16 0.4 GFLOPS (128 runs) / F32 0.4 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 2.8 GFLOPS (128 runs) / F32 2.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 13.4 GFLOPS (128 runs) / F32 23.0 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 32.9 GFLOPS (123 runs) / F32 87.9 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 47.9 GFLOPS ( 23 runs) / F32 127.4 GFLOPS ( 60 runs)
ggml_mul_mat: 2048 x 2048: F16 58.5 GFLOPS ( 4 runs) / F32 67.3 GFLOPS ( 4 runs)
ggml_mul_mat: 4096 x 4096: F16 23.8 GFLOPS ( 3 runs) / F32 21.2 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit | Compiler |
---|---|---|---|---|---|---|---|---|
POWER9 | Debian 11 | | tiny | 32 | 75 | 1283 | 3b010f9 | GCC 10.2.1 |
POWER9 | Debian 11 | | base | 32 | 96 | 2786 | 3b010f9 | GCC 10.2.1 |
POWER9 | Debian 11 | | small | 32 | 182 | 8534 | 3b010f9 | GCC 10.2.1 |
POWER9 | Debian 11 | | medium | 32 | 463 | 22282 | 3b010f9 | GCC 10.2.1 |
POWER9 | Debian 11 | | large | 32 | 838 | 41106 | 3b010f9 | GCC 10.2.1 |
I got referred here from https://github.com/openai/whisper/discussions/978#discussioncomment-5093839. This seems really interesting.
I'm running on Oracle Cloud's free tier, which provides 4x Ampere A1 CPUs and 24 GB RAM.
Compiler:
I CC: cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX: g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Default (no changes)
~/whisper.cpp$ extra/bench-all.sh
Usage: ./bench.sh [n_threads]
Running memcpy benchmark with 1 thread
memcpy: 10.92 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 1.0 GFLOPS (128 runs) / F32 0.7 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 16.8 GFLOPS (128 runs) / F32 13.2 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 18.5 GFLOPS (128 runs) / F32 41.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 21.5 GFLOPS ( 81 runs) / F32 35.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 23.2 GFLOPS ( 11 runs) / F32 41.4 GFLOPS ( 20 runs)
ggml_mul_mat: 2048 x 2048: F16 23.4 GFLOPS ( 3 runs) / F32 32.6 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 22.5 GFLOPS ( 3 runs) / F32 21.4 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ampere A1 | Ubuntu 22.04 | NEON | tiny | 4 | 83 | 1832 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | base | 4 | 120 | 4767 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | small | 4 | 273 | 17529 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | medium | 4 | 739 | 59794 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | large | 4 | 1436 | 115771 | ca21f7a |
With the changes mentioned in https://github.com/openai/whisper/discussions/978#discussioncomment-5093839. Thanks again @jan-grzybek-ampere!
~/whisper.cpp$ extra/bench-all.sh
Running memcpy benchmark with 1 thread
memcpy: 10.88 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 2.0 GFLOPS (128 runs) / F32 1.7 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 14.3 GFLOPS (128 runs) / F32 33.6 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 40.7 GFLOPS (128 runs) / F32 54.3 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 97.5 GFLOPS (128 runs) / F32 31.4 GFLOPS (117 runs)
ggml_mul_mat: 1024 x 1024: F16 87.1 GFLOPS ( 41 runs) / F32 41.0 GFLOPS ( 20 runs)
ggml_mul_mat: 2048 x 2048: F16 74.3 GFLOPS ( 5 runs) / F32 33.4 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 50.4 GFLOPS ( 3 runs) / F32 21.5 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ampere A1 | Ubuntu 22.04 | NEON | tiny | 4 | 84 | 619 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | base | 4 | 124 | 2036 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | small | 4 | 293 | 5872 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | medium | 4 | 817 | 22064 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | large | 4 | 1446 | 37996 | ca21f7a |
I've done a bit of reading and run several more tests.
According to https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu , the recommendation is to use -mcpu=native, and I did indeed get the best performance with it. I will put in a pull request to use -mcpu=native for aarch64.
No significant difference between GCC 11.3 and GCC 12.1 on Ubuntu 22.04.
Performance seems slightly worse compared to yesterday's test in https://github.com/ggerganov/whisper.cpp/issues/89#issuecomment-1443688585. I re-ran all of the following tests one after another to hopefully obtain comparable figures. This is a free instance on Oracle Cloud, and perhaps others are using the other cores on the CPU.
-march=armv8.2-a+fp16, gcc-11.3
make clean
make main bench
./extra/bench-all.sh
Running memcpy benchmark with 1 thread
memcpy: 10.82 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 1.8 GFLOPS (128 runs) / F32 2.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 40.7 GFLOPS (128 runs) / F32 12.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 52.9 GFLOPS (128 runs) / F32 32.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 97.3 GFLOPS (128 runs) / F32 32.1 GFLOPS (120 runs)
ggml_mul_mat: 1024 x 1024: F16 77.0 GFLOPS ( 36 runs) / F32 35.1 GFLOPS ( 17 runs)
ggml_mul_mat: 2048 x 2048: F16 64.0 GFLOPS ( 4 runs) / F32 25.9 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 45.8 GFLOPS ( 3 runs) / F32 21.0 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ampere A1 | Ubuntu 22.04 | NEON | tiny | 4 | 85 | 662 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | base | 4 | 121 | 2039 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | small | 4 | 281 | 6667 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | medium | 4 | 760 | 25355 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | large | 4 | 1456 | 45563 | ca21f7a |
-mcpu=native, gcc-11.3
make clean
make main bench
./extra/bench-all.sh
Running memcpy benchmark with 1 thread
memcpy: 10.85 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 7.9 GFLOPS (128 runs) / F32 1.8 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 7.5 GFLOPS (128 runs) / F32 12.6 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 51.8 GFLOPS (128 runs) / F32 54.4 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 96.3 GFLOPS (128 runs) / F32 31.2 GFLOPS (117 runs)
ggml_mul_mat: 1024 x 1024: F16 74.1 GFLOPS ( 35 runs) / F32 33.5 GFLOPS ( 16 runs)
ggml_mul_mat: 2048 x 2048: F16 67.1 GFLOPS ( 4 runs) / F32 27.0 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 49.3 GFLOPS ( 3 runs) / F32 21.7 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ampere A1 | Ubuntu 22.04 | NEON | tiny | 4 | 85 | 655 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | base | 4 | 121 | 2002 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | small | 4 | 283 | 6923 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | medium | 4 | 762 | 24085 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | large | 4 | 1459 | 43846 | ca21f7a |
-mcpu=native, gcc-12.1
make clean
make CC=gcc-12 CXX=g++-12 main bench
./extra/bench-all.sh
Running memcpy benchmark with 1 thread
memcpy: 11.01 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 8.0 GFLOPS (128 runs) / F32 8.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 12.0 GFLOPS (128 runs) / F32 12.6 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 55.7 GFLOPS (128 runs) / F32 41.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 95.1 GFLOPS (128 runs) / F32 30.2 GFLOPS (113 runs)
ggml_mul_mat: 1024 x 1024: F16 67.1 GFLOPS ( 32 runs) / F32 33.0 GFLOPS ( 16 runs)
ggml_mul_mat: 2048 x 2048: F16 64.2 GFLOPS ( 4 runs) / F32 26.8 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 46.1 GFLOPS ( 3 runs) / F32 21.4 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ampere A1 | Ubuntu 22.04 | NEON | tiny | 4 | 84 | 613 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | base | 4 | 122 | 2086 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | small | 4 | 286 | 6375 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | medium | 4 | 761 | 24667 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | large | 4 | 1457 | 43826 | ca21f7a |
I confirmed your findings, and interestingly enough, I found the performance worse with OpenBLAS.
>bench.exe
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 109.45 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 919.30 ms / 1 runs ( 919.30 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 1032.75 ms
>bench -w 1 -t 1
memcpy: 24.58 GB/s
sum: error -536870819.000000
>bench -w 2 -t 1
ggml_mul_mat: 64 x 64: F16 22.7 GFLOPS (128 runs) / F32 38.7 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 34.6 GFLOPS (128 runs) / F32 45.6 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 44.2 GFLOPS (128 runs) / F32 54.5 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 50.5 GFLOPS (128 runs) / F32 55.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 53.2 GFLOPS ( 25 runs) / F32 65.7 GFLOPS ( 31 runs)
ggml_mul_mat: 2048 x 2048: F16 54.9 GFLOPS ( 4 runs) / F32 61.8 GFLOPS ( 4 runs)
ggml_mul_mat: 4096 x 4096: F16 50.7 GFLOPS ( 3 runs) / F32 19.9 GFLOPS ( 3 runs)
That last one is slower than the 5950X above, which is odd. OpenBLAS results below:
>bench
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 101.76 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 602.63 ms / 1 runs ( 602.63 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 705.80 ms
>bench -w 1 -t 1
memcpy: 24.30 GB/s
sum: error -536870819.000000
>bench -w 2 -t 1
ggml_mul_mat: 64 x 64: F16 89.4 GFLOPS (128 runs) / F32 119.6 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 27.6 GFLOPS (128 runs) / F32 31.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 172.9 GFLOPS (128 runs) / F32 222.0 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 596.8 GFLOPS (128 runs) / F32 926.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1257.0 GFLOPS (128 runs) / F32 1887.7 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 1726.5 GFLOPS (101 runs) / F32 2193.9 GFLOPS (128 runs)
ggml_mul_mat: 4096 x 4096: F16 2109.8 GFLOPS ( 16 runs) / F32 2237.5 GFLOPS ( 17 runs)
memcpy: 7.20 GB/s
sum: error -536870997.000000
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
AMD Ryzen 3 3200U | Linux Mint 21.1 | AVX2 | tiny | 4 | 109 | 3417 | 09e9068 |
AMD Ryzen 3 3200U | Linux Mint 21.1 | AVX2 | base | 4 | 180 | 7907 | 09e9068 |
AMD Ryzen 3 3200U | Linux Mint 21.1 | AVX2 | small | 4 | 419 | 30899 | 09e9068 |
AMD Ryzen 3 3200U | Linux Mint 21.1 | AVX2 | medium | 4 | 1851 | 106542 | 09e9068 |
AMD Ryzen 3 3200U | Linux Mint 21.1 | AVX2 | large | 4 | 4715 | 203455 | 09e9068 |
memcpy: 15.57 GB/s
Running ggml_mul_mat benchmark with 8 threads
ggml_mul_mat: 64 x 64: F16 6.1 GFLOPS (128 runs) / F32 6.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 40.1 GFLOPS (128 runs) / F32 38.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 147.9 GFLOPS (128 runs) / F32 110.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 264.9 GFLOPS (128 runs) / F32 134.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 289.5 GFLOPS (128 runs) / F32 151.9 GFLOPS ( 71 runs)
ggml_mul_mat: 2048 x 2048: F16 290.6 GFLOPS ( 17 runs) / F32 70.7 GFLOPS ( 5 runs)
ggml_mul_mat: 4096 x 4096: F16 114.0 GFLOPS ( 3 runs) / F32 62.7 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | tiny | 8 | 50 | 361 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | base | 8 | 70 | 1000 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | small | 8 | 185 | 2264 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | medium | 8 | 587 | 8421 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | large | 8 | 2296 | 15759 | 09e9068 |
Running ggml_mul_mat benchmark with 16 threads
ggml_mul_mat: 64 x 64: F16 2.1 GFLOPS (128 runs) / F32 1.9 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 19.6 GFLOPS (128 runs) / F32 14.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 68.1 GFLOPS (128 runs) / F32 84.5 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 200.5 GFLOPS (128 runs) / F32 141.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 271.0 GFLOPS (127 runs) / F32 163.7 GFLOPS ( 77 runs)
ggml_mul_mat: 2048 x 2048: F16 205.5 GFLOPS ( 12 runs) / F32 71.6 GFLOPS ( 5 runs)
ggml_mul_mat: 4096 x 4096: F16 142.3 GFLOPS ( 3 runs) / F32 63.0 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | tiny | 16 | 52 | 329 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | base | 16 | 72 | 723 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | small | 16 | 188 | 2214 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | medium | 16 | 698 | 10889 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | large | 16 | 1619 | 16640 | 09e9068 |
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Apple M2 Pro | macOS 13.2 | NEON BLAS | tiny | 8 | 76 | 161 | 09e9068 |
Apple M2 Pro | macOS 13.2 | NEON BLAS | base | 8 | 104 | 318 | 09e9068 |
Apple M2 Pro | macOS 13.2 | NEON BLAS | small | 8 | 221 | 975 | 09e9068 |
Apple M2 Pro | macOS 13.2 | NEON BLAS | medium | 8 | 969 | 2692 | 09e9068 |
Apple M2 Pro | macOS 13.2 | NEON BLAS | large | 8 | 1939 | 4959 | 09e9068 |
NVIDIA Jetson Nano, without GPU optimization: base-en
./bin/main -f samples/jfk.wav
whisper_init_from_file_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB
whisper_init_state: kv self size = 5.25 MB
whisper_init_state: kv cross size = 17.58 MB
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 354.49 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 712.86 ms
whisper_print_timings: sample time = 79.37 ms / 27 runs ( 2.94 ms per run)
whisper_print_timings: encode time = 24406.28 ms / 1 runs (24406.28 ms per run)
whisper_print_timings: decode time = 1284.84 ms / 27 runs ( 47.59 ms per run)
whisper_print_timings: total time = 26908.31 ms
tiny-en
./bin/main -m ./models/ggml-tiny.en.bin -f ./samples/jfk.wav
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem required = 127.00 MB (+ 3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 73.58 MB
whisper_model_load: model size = 73.54 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:07.740] And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740] ask what you can do for your country
whisper_print_timings: load time = 204.60 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 564.90 ms
whisper_print_timings: sample time = 72.13 ms / 26 runs ( 2.77 ms per run)
whisper_print_timings: encode time = 9232.34 ms / 1 runs ( 9232.34 ms per run)
whisper_print_timings: decode time = 616.00 ms / 26 runs ( 23.69 ms per run)
whisper_print_timings: total time = 10745.65 ms
MacBook Pro 14" with M2 Pro 10 cores, 32 GB RAM, macOS Ventura 13.2. Benchmarks running at 8 threads.
memcpy: 40.68 GB/s
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| ------------ | ------ | ---------- | -------- | -- | ---- | ---- | ------- |
| Apple M1 Pro | 13.2.1 | NEON BLAS | tiny | 8 | 45 | 93 | 09e9068 |
| Apple M1 Pro | 13.2.1 | NEON BLAS | base | 8 | 68 | 187 | 09e9068 |
| Apple M1 Pro | 13.2.1 | NEON BLAS | small | 8 | 179 | 702 | 09e9068 |
| Apple M1 Pro | 13.2.1 | NEON BLAS | medium | 8 | 496 | 2227 | 09e9068 |
| Apple M1 Pro | 13.2.1 | NEON BLAS | large | 8 | 1037 | 3796 | 09e9068 |
Running ggml_mul_mat benchmark with 8 threads
ggml_mul_mat: 64 x 64: F16 4.6 GFLOPS (128 runs) / F32 4.1 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 46.6 GFLOPS (128 runs) / F32 36.4 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 294.2 GFLOPS (128 runs) / F32 238.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 611.0 GFLOPS (128 runs) / F32 712.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 770.9 GFLOPS (128 runs) / F32 700.3 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 902.8 GFLOPS ( 53 runs) / F32 906.9 GFLOPS ( 53 runs)
ggml_mul_mat: 4096 x 4096: F16 1521.2 GFLOPS ( 12 runs) / F32 1469.3 GFLOPS ( 11 runs)
MacBook Pro 16" with M2 Max 12 cores, 96 GB RAM, macOS Ventura 13.3. Benchmarks running at 4 threads (4 threads were faster than 8 threads for ggml_mul_mat, but about the same for model load/encode).
memcpy: 49.94 GB/s
sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 11.2 GFLOPS (128 runs) / F32 9.3 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 83.0 GFLOPS (128 runs) / F32 73.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 505.2 GFLOPS (128 runs) / F32 488.2 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1018.0 GFLOPS (128 runs) / F32 1196.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1796.2 GFLOPS (128 runs) / F32 2087.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 1638.8 GFLOPS ( 96 runs) / F32 1673.7 GFLOPS ( 98 runs)
ggml_mul_mat: 4096 x 4096: F16 1995.2 GFLOPS ( 15 runs) / F32 2037.8 GFLOPS ( 15 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Apple M2 Max | 13.3 | NEON BLAS | tiny | 4 | 41 | 118 | 0a2d121 |
Apple M2 Max | 13.3 | NEON BLAS | base | 4 | 61 | 230 | 0a2d121 |
Apple M2 Max | 13.3 | NEON BLAS | small | 4 | 153 | 734 | 0a2d121 |
Apple M2 Max | 13.3 | NEON BLAS | medium | 4 | 448 | 1979 | 0a2d121 |
Apple M2 Max | 13.3 | NEON BLAS | large | 4 | 882 | 3553 | 0a2d121 |
Running memcpy benchmark with 1 thread
memcpy: 7.03 GB/s
sum: error -536870997.000000 (how do I fix this?)
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 8.9 GFLOPS (128 runs) / F32 10.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 53.3 GFLOPS (128 runs) / F32 47.9 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 91.7 GFLOPS (128 runs) / F32 99.4 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 134.2 GFLOPS (128 runs) / F32 94.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 182.9 GFLOPS ( 86 runs) / F32 121.2 GFLOPS ( 57 runs)
ggml_mul_mat: 2048 x 2048: F16 180.0 GFLOPS ( 11 runs) / F32 42.4 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 59.1 GFLOPS ( 3 runs) / F32 31.5 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ryzen 7 PRO 5850U | Ubuntu 22.04.2 | AVX2 | tiny | 4 | 69 | 495 | 0a2d121 |
Ryzen 7 PRO 5850U | Ubuntu 22.04.2 | AVX2 | base | 4 | 111 | 1128 | 0a2d121 |
Ryzen 7 PRO 5850U | Ubuntu 22.04.2 | AVX2 | small | 4 | 264 | 3992 | 0a2d121 |
Ryzen 7 PRO 5850U | Ubuntu 22.04.2 | AVX2 | medium | 4 | 806 | 12230 | 0a2d121 |
Ryzen 7 PRO 5850U | Ubuntu 22.04.2 | AVX2 | large | 4 | 1919 | 25574 | 0a2d121 |
memcpy: 9.49 GB/s
sum: error -536870997.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 8.8 GFLOPS (128 runs) / F32 10.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 35.4 GFLOPS (128 runs) / F32 49.2 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 61.9 GFLOPS (128 runs) / F32 95.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 64.3 GFLOPS (128 runs) / F32 86.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 74.4 GFLOPS ( 35 runs) / F32 39.9 GFLOPS ( 19 runs)
ggml_mul_mat: 2048 x 2048: F16 56.9 GFLOPS ( 4 runs) / F32 31.1 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 56.9 GFLOPS ( 3 runs) / F32 30.1 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ryzen 5 5500U | Ubuntu 22.04.2 | AVX2 | tiny | 4 | 67 | 761 | 0a2d121 |
Ryzen 5 5500U | Ubuntu 22.04.2 | AVX2 | base | 4 | 96 | 2040 | 0a2d121 |
Ryzen 5 5500U | Ubuntu 22.04.2 | AVX2 | small | 4 | 239 | 7639 | 0a2d121 |
Ryzen 5 5500U | Ubuntu 22.04.2 | AVX2 | medium | 4 | 657 | 23735 | 0a2d121 |
Ryzen 5 5500U | Ubuntu 22.04.2 | AVX2 | large | 4 | 1302 | 45006 | 0a2d121 |
HP Z440, Xeon E5-2690v4, 64Gb, Rocky Linux 9.1
memcpy: 10.94 GB/s
sum: error -536870997.000000
./bench -w 2
ggml_mul_mat: 64 x 64: F16 4.8 GFLOPS (128 runs) / F32 4.8 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 23.1 GFLOPS (128 runs) / F32 18.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 52.5 GFLOPS (128 runs) / F32 35.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 69.6 GFLOPS (128 runs) / F32 44.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 78.8 GFLOPS ( 37 runs) / F32 49.2 GFLOPS ( 23 runs)
ggml_mul_mat: 2048 x 2048: F16 83.6 GFLOPS ( 5 runs) / F32 50.8 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 64.5 GFLOPS ( 3 runs) / F32 21.8 GFLOPS ( 3 runs)
system_info: n_threads = 28 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
whisper_print_timings: load time = 1031.43 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 13121.63 ms / 1 runs (13121.63 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 14219.33 ms
model: large
Very impressed.
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
MacBook M1 Max | macOS 13.0 beta (22A5321d) | NEON BLAS | medium | 8 | 488 | 2344 | 0a2d121 |
MacBook M1 Max | macOS 13.0 beta (22A5321d) | NEON BLAS | large | 8 | 1070 | 3209 | 0a2d121 |
What am I doing wrong? 17.6 GFlops on a Ryzen 6850H
WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
I whisper.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -lopenblas
I CC: cc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
I CXX: g++ (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
make: 'bench' is up to date.
ggml_mul_mat: 64 x 64: F16 12.6 GFLOPS (128 runs) / F32 9.8 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 19.4 GFLOPS (128 runs) / F32 12.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 27.0 GFLOPS (128 runs) / F32 18.4 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 50.3 GFLOPS (128 runs) / F32 28.1 GFLOPS (105 runs)
ggml_mul_mat: 1024 x 1024: F16 59.0 GFLOPS ( 28 runs) / F32 27.0 GFLOPS ( 13 runs)
ggml_mul_mat: 2048 x 2048: F16 43.0 GFLOPS ( 3 runs) / F32 11.4 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 17.6 GFLOPS ( 3 runs) / F32 6.6 GFLOPS ( 3 runs)
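One way to narrow this down is to time a raw sgemm against the same OpenBLAS install, bypassing ggml entirely; if that is also slow, the problem is the BLAS build (or its thread settings) rather than whisper.cpp. A hedged sketch (assumes cblas.h from OpenBLAS; link with -lopenblas):

```cpp
#include <cblas.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4096;
    std::vector<float> a((size_t) n * n, 1.0f), b((size_t) n * n, 1.0f), c((size_t) n * n, 0.0f);

    const auto t0 = std::chrono::steady_clock::now();
    // C = 1.0 * A * B + 0.0 * C, all n x n, row-major
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, a.data(), n, b.data(), n, 0.0f, c.data(), n);
    const auto t1 = std::chrono::steady_clock::now();

    const double s = std::chrono::duration<double>(t1 - t0).count();
    printf("sgemm %d x %d: %.1f GFLOPS\n", n, n, 2.0 * (double) n * n * n / s / 1e9);
    return 0;
}
```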
I tried running 8 and 12 threads. They were a few ms slower than 4 threads, so the default of 4 threads seems to be the key. I also have not compiled anything Apple-specific, just git clone and make.
> ./extra/bench-all.sh 8
Usage: ./bench.sh [n_threads]
Running memcpy benchmark with 1 thread
memcpy: 50.22 GB/s
sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 8 threads
ggml_mul_mat: 64 x 64: F16 5.0 GFLOPS (128 runs) / F32 4.7 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 46.1 GFLOPS (128 runs) / F32 38.3 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 294.0 GFLOPS (128 runs) / F32 243.7 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 574.5 GFLOPS (128 runs) / F32 272.9 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 736.6 GFLOPS (128 runs) / F32 750.8 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 973.7 GFLOPS ( 57 runs) / F32 993.7 GFLOPS ( 58 runs)
ggml_mul_mat: 4096 x 4096: F16 1554.5 GFLOPS ( 12 runs) / F32 1553.6 GFLOPS ( 12 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
 | | NEON BLAS | tiny | 8 | 40 | 101 | c23588c |
 | | NEON BLAS | base | 8 | 61 | 223 | c23588c |
 | | NEON BLAS | small | 8 | 154 | 961 | c23588c |
 | | NEON BLAS | medium | 8 | 436 | 2534 | c23588c |
 | | NEON BLAS | large | 8 | 867 | 4100 | c23588c |
Same hardware as in the post before. I've just tried converting the models to Core ML, and here are the results. My personal impression of running STT with them was very good: much faster.
./extra/bench-all.sh 4
Usage: ./bench.sh [n_threads]
Running memcpy benchmark with 1 thread
memcpy: 49.33 GB/s
sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 9.1 GFLOPS (128 runs) / F32 8.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 70.7 GFLOPS (128 runs) / F32 77.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 350.7 GFLOPS (128 runs) / F32 435.9 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1060.0 GFLOPS (128 runs) / F32 1254.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1611.0 GFLOPS (128 runs) / F32 1652.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 1887.2 GFLOPS (110 runs) / F32 1900.9 GFLOPS (111 runs)
ggml_mul_mat: 4096 x 4096: F16 1806.0 GFLOPS ( 14 runs) / F32 1849.3 GFLOPS ( 14 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
 | | NEON BLAS COREML | tiny | 4 | 42 | 30 | c23588c |
 | | NEON BLAS COREML | base | 4 | 60 | 49 | c23588c |
 | | NEON BLAS COREML | small | 4 | 151 | 169 | c23588c |
 | | NEON BLAS COREML | medium | 4 | 430 | 737 | c23588c |
 | | NEON BLAS COREML | large | 4 | 885 | 1672 | c23588c |
Dell 3050 Micro
Running memcpy benchmark with 1 thread
memcpy: 11.49 GB/s sum: error -536870997.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 7.7 GFLOPS (128 runs) / F32 3.3 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 27.7 GFLOPS (128 runs) / F32 7.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 50.8 GFLOPS (128 runs) / F32 8.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 59.4 GFLOPS (128 runs) / F32 9.0 GFLOPS ( 34 runs)
ggml_mul_mat: 1024 x 1024: F16 51.5 GFLOPS ( 24 runs) / F32 8.4 GFLOPS ( 4 runs)
ggml_mul_mat: 2048 x 2048: F16 46.3 GFLOPS ( 3 runs) / F32 8.1 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 47.3 GFLOPS ( 3 runs) / F32 8.1 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
i3-7100t | Ubuntu 22.04 | AVX2 | tiny | 4 | 84 | 1125 | c23588c |
i3-7100t | Ubuntu 22.04 | AVX2 | base | 4 | 128 | 2616 | c23588c |
i3-7100t | Ubuntu 22.04 | AVX2 | small | 4 | 339 | 10127 | c23588c |
i3-7100t | Ubuntu 22.04 | AVX2 | medium | 4 | 991 | 39383 | c23588c |
i3-7100t | Ubuntu 22.04 | AVX2 | large | 4 | 2922 | 74488 | c23588c |
Lenovo ThinkCentre M720q
Running memcpy benchmark with 1 thread
memcpy: 6.54 GB/s sum: error -536870997.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 8.6 GFLOPS (128 runs) / F32 4.5 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 38.8 GFLOPS (128 runs) / F32 7.9 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 76.2 GFLOPS (128 runs) / F32 9.6 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 87.4 GFLOPS (128 runs) / F32 10.0 GFLOPS ( 38 runs)
ggml_mul_mat: 1024 x 1024: F16 89.7 GFLOPS ( 42 runs) / F32 10.1 GFLOPS ( 5 runs)
ggml_mul_mat: 2048 x 2048: F16 67.7 GFLOPS ( 4 runs) / F32 9.1 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 54.7 GFLOPS ( 3 runs) / F32 8.6 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
i5-8500T | OpenVoiceOS | AVX2 | tiny.en | 4 | 79 | 686 | 70567ef |
i5-8500T | OpenVoiceOS | AVX2 | base.en | 4 | 121 | 1600 | 70567ef |
i5-8500T | OpenVoiceOS | AVX2 | small.en | 4 | 320 | 6197 | 70567ef |
i5-8500T | OpenVoiceOS | AVX2 | medium.en | 4 | 928 | 20276 | 70567ef |
Running memcpy benchmark with 1 thread
memcpy: 7.16 GB/s sum: error -536870997.000000
Running ggml_mul_mat benchmark with 6 threads
ggml_mul_mat: 64 x 64: F16 1.9 GFLOPS (128 runs) / F32 1.8 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 29.7 GFLOPS (128 runs) / F32 7.3 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 65.5 GFLOPS (128 runs) / F32 14.5 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 123.4 GFLOPS (128 runs) / F32 15.2 GFLOPS ( 57 runs)
ggml_mul_mat: 1024 x 1024: F16 127.5 GFLOPS ( 60 runs) / F32 14.7 GFLOPS ( 7 runs)
ggml_mul_mat: 2048 x 2048: F16 93.3 GFLOPS ( 6 runs) / F32 13.3 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 70.0 GFLOPS ( 3 runs) / F32 12.5 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
i5-8500T | OpenVoiceOS | AVX2 | tiny.en | 6 | 78 | 511 | 70567ef |
i5-8500T | OpenVoiceOS | AVX2 | base.en | 6 | 118 | 1264 | 70567ef |
i5-8500T | OpenVoiceOS | AVX2 | small.en | 6 | 320 | 4587 | 70567ef |
i5-8500T | OpenVoiceOS | AVX2 | medium.en | 6 | 928 | 16303 | 70567ef |
Yet another M1 Ultra, but look at the bottom: a comparison to the Const-Me GPU version.
memcpy: 42.66 GB/s sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 9.1 GFLOPS (128 runs) / F32 7.1 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 68.2 GFLOPS (128 runs) / F32 68.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 465.0 GFLOPS (128 runs) / F32 386.2 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1131.9 GFLOPS (128 runs) / F32 1437.0 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 2188.9 GFLOPS (128 runs) / F32 2519.6 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 2938.8 GFLOPS (128 runs) / F32 2996.5 GFLOPS (128 runs)
ggml_mul_mat: 4096 x 4096: F16 3074.7 GFLOPS ( 23 runs) / F32 3167.2 GFLOPS ( 24 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
M1 Ultra | Ventura 13.3.1 | NEON BLAS | large | 4 | 858 | 3649 | 70567ef |
Much more interesting, I find, is the comparison I did against a Win10 Core i9 9900K with an Nvidia A4000, using the Const-Me version. I used a 10-minute portion of a "real" TV show (-l de, about 56k tokens known in the model). Note that the power consumption was actually measured, not just guessed.
Const-Me Whisper GPU (~450-550W real power consumption at 100% GPU utilisation; the CPU is mostly bored):
A4000 1x parallel 93s
A4000 2x parallel both finish at 180s
A4000 4x parallel 3 finish after 317s, 1 finishes at 453s
macOS, M1 Ultra (70-90W real power consumption at 100% "CPU" utilisation), whisper.cpp with default settings (1 core, 4 threads):
macOS 1x: 155 s
macOS 2x parallel: 196 s - all finish at the same time
macOS 4x parallel: 274 s - all finish at the same time
macOS 6x parallel: 462 s - all finish at the same time
Also some other tests with different command-line params, on the M1 only, with 1 file (a sketch for reproducing the parallel runs follows below):
-p 8 (threads default 4) - system unresponsive while processing: 120.3 s
-p 4 (default threads 4, ~80% CPU utilisation): 79.37545 s
-bs 2 -p 4: 101.01730 s
-t 16 (processors default 1): 148.713 s
-p 8 -t 2: 98.91152 s
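A minimal sketch for how the N-x parallel runs above can be launched and timed (model path, sample.wav and the output names are placeholders):

```bash
# launch N independent whisper.cpp processes on the same file and
# time until the last one finishes; file names are placeholders
N=4
time (
  for i in $(seq "$N"); do
    ./main -m models/ggml-large.bin -f sample.wav -l de > "out-$i.txt" &
  done
  wait
)
```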
We currently use the Const-Me GPU version on an Nvidia A5000 because, on an Intel CPU, it delivers much faster results than this cpp version can. On the other hand, the Const-Me version does not seem to be going anywhere, while this repository is vibrant.
As a conclusion I can say that, even if I hate it, we are buying this Mac because it delivers faster results and more throughput, all while consuming only 20% of the power. Also, it has much better processing-power distribution between multiple parallel processes; I bet I can even use nice to set priorities, while on the GPU no priorities whatsoever are possible.
At our usage level that means we will have saved the full cost of the Mac (~4000 euros) after 2-3 years of operation (due to lower power and A/C costs) compared to running it on the Windows/GPU box, which we bought at about the same initial price. Even if I could now safely say we don't need an A5000 but just some gamer card for 600 euros, looking at the power costs these days I'd still prefer the Mac. (Thank god I don't need to put it into Active Directory or such, so I have an easy time just using it as a slave processing machine.)
It would be great if idle/peak watts could be posted, as I have been posting benches for RK3588 devices that probably give the minimum usable results, and even then a tad slow. In that price range, I just posted an i3-7100T that was picked up for £64 off eBay, which is approx 8 watts idle / 30 peak. I used to be a bit of an Apple hater in terms of bling tech, but bang for buck the M1 Mini is surprisingly good value, and in that race-till-idle mode it likely could process quite a number of zones, especially because of diversification of use.
I am on disability, so even though it's cheap, the £849.00 for the 16GB version could probably be the basis of the ultimate home assistant, in something similar to https://github.com/ggerganov/whisper.cpp/blob/master/examples/talk-llama/talk-llama.cpp. So likely I will continue posting in the £64 range :)
But what Apple/Arm provide per watt currently is pretty special, and for 24/365 operation in an energy-expensive world that is pretty important. I don't know how many people could also post idle & peak wattages, but it would be really interesting, especially CPU vs GPU rather than just outright speed.
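For x86 boxes, one low-effort way to capture idle/peak watts alongside a bench run is RAPL sampling, e.g. with powerstat (a sketch; assumes powerstat is installed and the CPU exposes RAPL, which varies by vendor and kernel):

```bash
# sample CPU package power once per second for 60 samples via RAPL
# (run the whisper.cpp bench in another terminal at the same time)
sudo powerstat -R 1 60
```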
Rock 5b
memcpy: 8.78 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 7.2 GFLOPS (128 runs) | Q4_1 7.6 GFLOPS (128 runs) | Q4_2 6.9 GFLOPS (128 runs)
64 x 64: Q5_0 6.8 GFLOPS (128 runs) | Q5_1 7.0 GFLOPS (128 runs) | Q8_0 7.1 GFLOPS (128 runs)
64 x 64: F16 8.6 GFLOPS (128 runs) | F32 7.5 GFLOPS (128 runs)
128 x 128: Q4_0 22.8 GFLOPS (128 runs) | Q4_1 22.4 GFLOPS (128 runs) | Q4_2 19.6 GFLOPS (128 runs)
128 x 128: Q5_0 19.5 GFLOPS (128 runs) | Q5_1 20.7 GFLOPS (128 runs) | Q8_0 22.7 GFLOPS (128 runs)
128 x 128: F16 28.3 GFLOPS (128 runs) | F32 29.4 GFLOPS (128 runs)
256 x 256: Q4_0 40.6 GFLOPS (128 runs) | Q4_1 37.6 GFLOPS (128 runs) | Q4_2 30.5 GFLOPS (128 runs)
256 x 256: Q5_0 31.2 GFLOPS (128 runs) | Q5_1 31.9 GFLOPS (128 runs) | Q8_0 49.1 GFLOPS (128 runs)
256 x 256: F16 51.8 GFLOPS (128 runs) | F32 36.9 GFLOPS (128 runs)
512 x 512: Q4_0 52.0 GFLOPS (128 runs) | Q4_1 45.4 GFLOPS (128 runs) | Q4_2 35.7 GFLOPS (128 runs)
512 x 512: Q5_0 37.4 GFLOPS (128 runs) | Q5_1 36.9 GFLOPS (128 runs) | Q8_0 64.9 GFLOPS (128 runs)
512 x 512: F16 76.9 GFLOPS (128 runs) | F32 30.7 GFLOPS (115 runs)
1024 x 1024: Q4_0 56.6 GFLOPS ( 27 runs) | Q4_1 47.5 GFLOPS ( 23 runs) | Q4_2 37.5 GFLOPS ( 18 runs)
1024 x 1024: Q5_0 39.5 GFLOPS ( 19 runs) | Q5_1 37.7 GFLOPS ( 18 runs) | Q8_0 71.1 GFLOPS ( 34 runs)
1024 x 1024: F16 49.0 GFLOPS ( 23 runs) | F32 22.4 GFLOPS ( 11 runs)
2048 x 2048: Q4_0 54.2 GFLOPS ( 4 runs) | Q4_1 44.6 GFLOPS ( 3 runs) | Q4_2 38.5 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 37.4 GFLOPS ( 3 runs) | Q5_1 35.5 GFLOPS ( 3 runs) | Q8_0 61.0 GFLOPS ( 4 runs)
2048 x 2048: F16 41.3 GFLOPS ( 3 runs) | F32 19.0 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 56.2 GFLOPS ( 3 runs) | Q4_1 45.4 GFLOPS ( 3 runs) | Q4_2 38.7 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 40.7 GFLOPS ( 3 runs) | Q5_1 37.3 GFLOPS ( 3 runs) | Q8_0 63.2 GFLOPS ( 3 runs)
4096 x 4096: F16 40.0 GFLOPS ( 3 runs) | F32 17.5 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| rk3588 | Ubuntu 20.04.6 LTS | NEON | tiny | 4 | 102 | 1191 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS | NEON | base | 4 | 140 | 2861 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS | NEON | small | 4 | 393 | 10576 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS | NEON | medium | 4 | 10289 | 36042 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS | NEON | large | 4 | 2099 | 70740 | be5911a |
How do you get these numbers @StuartIanNaylor ? 😲 Isn't the Rock 5b basically the same as the Orange Pi 5?
Orange Pi 5 8GB:
Running memcpy benchmark
memcpy: 10.14 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 4.7 GFLOPS (128 runs) | Q4_1 4.8 GFLOPS (128 runs) | Q4_2 4.6 GFLOPS (128 runs)
64 x 64: Q5_0 4.2 GFLOPS (128 runs) | Q5_1 4.4 GFLOPS (128 runs) | Q8_0 4.4 GFLOPS (128 runs)
64 x 64: F16 4.8 GFLOPS (128 runs) | F32 4.4 GFLOPS (128 runs)
128 x 128: Q4_0 4.2 GFLOPS (128 runs) | Q4_1 9.8 GFLOPS (128 runs) | Q4_2 10.0 GFLOPS (128 runs)
128 x 128: Q5_0 8.4 GFLOPS (128 runs) | Q5_1 8.2 GFLOPS (128 runs) | Q8_0 10.3 GFLOPS (128 runs)
128 x 128: F16 10.3 GFLOPS (128 runs) | F32 10.7 GFLOPS (128 runs)
256 x 256: Q4_0 34.7 GFLOPS (128 runs) | Q4_1 34.9 GFLOPS (128 runs) | Q4_2 33.9 GFLOPS (128 runs)
256 x 256: Q5_0 26.2 GFLOPS (128 runs) | Q5_1 24.9 GFLOPS (128 runs) | Q8_0 36.1 GFLOPS (128 runs)
256 x 256: F16 36.4 GFLOPS (128 runs) | F32 38.4 GFLOPS (128 runs)
512 x 512: Q4_0 22.2 GFLOPS ( 83 runs) | Q4_1 26.1 GFLOPS ( 98 runs) | Q4_2 35.5 GFLOPS (128 runs)
512 x 512: Q5_0 42.4 GFLOPS (128 runs) | Q5_1 26.8 GFLOPS (100 runs) | Q8_0 35.8 GFLOPS (128 runs)
512 x 512: F16 21.6 GFLOPS ( 81 runs) | F32 31.5 GFLOPS (118 runs)
1024 x 1024: Q4_0 32.4 GFLOPS ( 16 runs) | Q4_1 44.1 GFLOPS ( 21 runs) | Q4_2 39.7 GFLOPS ( 19 runs)
1024 x 1024: Q5_0 42.3 GFLOPS ( 20 runs) | Q5_1 40.4 GFLOPS ( 20 runs) | Q8_0 41.2 GFLOPS ( 20 runs)
1024 x 1024: F16 46.8 GFLOPS ( 22 runs) | F32 42.1 GFLOPS ( 20 runs)
2048 x 2048: Q4_0 50.9 GFLOPS ( 4 runs) | Q4_1 48.6 GFLOPS ( 3 runs) | Q4_2 48.0 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 46.7 GFLOPS ( 3 runs) | Q5_1 47.8 GFLOPS ( 3 runs) | Q8_0 46.4 GFLOPS ( 3 runs)
2048 x 2048: F16 46.1 GFLOPS ( 3 runs) | F32 44.8 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 42.2 GFLOPS ( 3 runs) | Q4_1 36.7 GFLOPS ( 3 runs) | Q4_2 33.0 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 38.5 GFLOPS ( 3 runs) | Q5_1 44.7 GFLOPS ( 3 runs) | Q8_0 44.7 GFLOPS ( 3 runs)
4096 x 4096: F16 44.4 GFLOPS ( 3 runs) | F32 44.5 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
RK3588S | Armbian 11 - 5.10.110 | NEON BLAS | tiny | 4 | 193 | 3748 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON BLAS | tiny-q5_0 | 4 | 156 | 3341 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON BLAS | base | 4 | 253 | 7359 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON BLAS | base-q5_0 | 4 | 178 | 7307 | be5911a |
[EDIT: a bit better without OpenBLAS although the GFLOPS are considerably lower O_o]
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
RK3588S | Armbian 11 - 5.10.110 | NEON | tiny | 4 | 111 | 3170 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON | tiny-q5_0 | 4 | 205 | 2817 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON | base | 4 | 248 | 6385 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON | base-q5_0 | 4 | 140 | 6198 | be5911a |
[EDIT2: getting very unstable results right now 🤔 ]
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
RK3588S | Armbian 11 - 5.10.110 | NEON | tiny | 4 | 269 | 1722 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON | tiny-q5_0 | 4 | 104 | 2746 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON | base | 4 | 243 | 7063 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON | base-q5_0 | 4 | 135 | 6516 | be5911a |
Likely it's that I don't use Armbian but the server image supplied by Radxa, and also the OPi version. Generally I stay clear of Armbian due to a pet hate of their epic init script that replaces standard installs and /etc and often blindsides me.
I picked up some tricks and tips when Radxa did the community board bring-up. I have changed my preference for the scheduler and set it to performance, and also, I dunno why, but using taskset to make sure it just uses the big cores gives a slight perf boost.
So running again I get
memcpy: 8.56 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 7.3 GFLOPS (128 runs) | Q4_1 7.8 GFLOPS (128 runs) | Q4_2 6.9 GFLOPS (128 runs)
64 x 64: Q5_0 6.2 GFLOPS (128 runs) | Q5_1 6.7 GFLOPS (128 runs) | Q8_0 7.0 GFLOPS (128 runs)
64 x 64: F16 2.4 GFLOPS (128 runs) | F32 8.5 GFLOPS (128 runs)
128 x 128: Q4_0 23.2 GFLOPS (128 runs) | Q4_1 24.1 GFLOPS (128 runs) | Q4_2 19.9 GFLOPS (128 runs)
128 x 128: Q5_0 15.4 GFLOPS (128 runs) | Q5_1 21.0 GFLOPS (128 runs) | Q8_0 26.6 GFLOPS (128 runs)
128 x 128: F16 35.0 GFLOPS (128 runs) | F32 28.6 GFLOPS (128 runs)
256 x 256: Q4_0 41.2 GFLOPS (128 runs) | Q4_1 38.7 GFLOPS (128 runs) | Q4_2 30.5 GFLOPS (128 runs)
256 x 256: Q5_0 31.2 GFLOPS (128 runs) | Q5_1 31.9 GFLOPS (128 runs) | Q8_0 49.1 GFLOPS (128 runs)
256 x 256: F16 65.0 GFLOPS (128 runs) | F32 43.5 GFLOPS (128 runs)
512 x 512: Q4_0 52.0 GFLOPS (128 runs) | Q4_1 45.4 GFLOPS (128 runs) | Q4_2 35.3 GFLOPS (128 runs)
512 x 512: Q5_0 37.4 GFLOPS (128 runs) | Q5_1 36.8 GFLOPS (128 runs) | Q8_0 64.9 GFLOPS (128 runs)
512 x 512: F16 78.1 GFLOPS (128 runs) | F32 30.6 GFLOPS (114 runs)
1024 x 1024: Q4_0 56.4 GFLOPS ( 27 runs) | Q4_1 47.4 GFLOPS ( 23 runs) | Q4_2 37.5 GFLOPS ( 18 runs)
1024 x 1024: Q5_0 39.5 GFLOPS ( 19 runs) | Q5_1 37.7 GFLOPS ( 18 runs) | Q8_0 70.8 GFLOPS ( 33 runs)
1024 x 1024: F16 47.2 GFLOPS ( 22 runs) | F32 21.8 GFLOPS ( 11 runs)
2048 x 2048: Q4_0 54.4 GFLOPS ( 4 runs) | Q4_1 45.3 GFLOPS ( 3 runs) | Q4_2 38.6 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 37.4 GFLOPS ( 3 runs) | Q5_1 35.6 GFLOPS ( 3 runs) | Q8_0 59.8 GFLOPS ( 4 runs)
2048 x 2048: F16 41.2 GFLOPS ( 3 runs) | F32 20.6 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 56.9 GFLOPS ( 3 runs) | Q4_1 46.6 GFLOPS ( 3 runs) | Q4_2 38.9 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 41.1 GFLOPS ( 3 runs) | Q5_1 37.4 GFLOPS ( 3 runs) | Q8_0 62.9 GFLOPS ( 3 runs)
4096 x 4096: F16 39.8 GFLOPS ( 3 runs) | F32 17.6 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON | tiny | 4 | 96 | 1199 | be5911a |
| <todo> | <todo> | NEON | base | 4 | 137 | 2875 | be5911a |
| <todo> | <todo> | NEON | small | 4 | 343 | 10635 | be5911a |
| <todo> | <todo> | NEON | medium | 4 | 1013 | 35174 | be5911a |
| <todo> | <todo> | NEON | large | 4 | 2019 | 71678 | be5911a |
If I run without previously doing echo performance | tee /sys/bus/cpu/devices/cpu[046]/cpufreq/scaling_governor /sys/class/devfreq/dmc/governor
(the rk3588[x] is a tri-cluster 4-2-2; I don't know about the dmc exactly, but it was something we were using at that time),
and without the taskset -c 4-7 prefix to further enforce not using the efficiency cores:
The ondemand governor seems to load-balance, whilst (at least for whisper.cpp) a race-till-idle setup, more like how Android is configured, does seem to give a perf boost with little loss in efficiency, if any.
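Put together, the pre-bench setup described above looks roughly like this (a sketch; the sysfs paths are for rk3588-family boards and may differ per kernel):

```bash
# force the big-core clusters and the memory controller to performance
echo performance | sudo tee /sys/bus/cpu/devices/cpu[046]/cpufreq/scaling_governor
echo performance | sudo tee /sys/class/devfreq/dmc/governor

# pin the benchmark to the big cores only (4-7 on rk3588)
taskset -c 4-7 ./extra/bench-all.sh 4
```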
Without those settings, bench gives:
memcpy: 7.82 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 3.1 GFLOPS (128 runs) | Q4_1 2.8 GFLOPS (128 runs) | Q4_2 2.4 GFLOPS (128 runs)
64 x 64: Q5_0 2.3 GFLOPS (128 runs) | Q5_1 2.2 GFLOPS (128 runs) | Q8_0 2.7 GFLOPS (128 runs)
64 x 64: F16 3.1 GFLOPS (128 runs) | F32 2.6 GFLOPS (128 runs)
128 x 128: Q4_0 7.1 GFLOPS (128 runs) | Q4_1 7.0 GFLOPS (128 runs) | Q4_2 6.2 GFLOPS (128 runs)
128 x 128: Q5_0 5.4 GFLOPS (128 runs) | Q5_1 5.4 GFLOPS (128 runs) | Q8_0 7.2 GFLOPS (128 runs)
128 x 128: F16 9.3 GFLOPS (128 runs) | F32 5.9 GFLOPS (128 runs)
256 x 256: Q4_0 10.1 GFLOPS (128 runs) | Q4_1 9.5 GFLOPS (128 runs) | Q4_2 8.4 GFLOPS (128 runs)
256 x 256: Q5_0 7.4 GFLOPS (128 runs) | Q5_1 6.9 GFLOPS (128 runs) | Q8_0 10.9 GFLOPS (128 runs)
256 x 256: F16 13.4 GFLOPS (128 runs) | F32 7.9 GFLOPS (128 runs)
512 x 512: Q4_0 10.9 GFLOPS ( 41 runs) | Q4_1 10.4 GFLOPS ( 39 runs) | Q4_2 8.5 GFLOPS ( 32 runs)
512 x 512: Q5_0 8.9 GFLOPS ( 34 runs) | Q5_1 8.2 GFLOPS ( 31 runs) | Q8_0 12.1 GFLOPS ( 46 runs)
512 x 512: F16 14.5 GFLOPS ( 54 runs) | F32 8.7 GFLOPS ( 33 runs)
1024 x 1024: Q4_0 26.9 GFLOPS ( 13 runs) | Q4_1 24.9 GFLOPS ( 12 runs) | Q4_2 21.7 GFLOPS ( 11 runs)
1024 x 1024: Q5_0 23.0 GFLOPS ( 11 runs) | Q5_1 22.0 GFLOPS ( 11 runs) | Q8_0 29.1 GFLOPS ( 14 runs)
1024 x 1024: F16 28.2 GFLOPS ( 14 runs) | F32 17.9 GFLOPS ( 9 runs)
2048 x 2048: Q4_0 50.1 GFLOPS ( 3 runs) | Q4_1 41.3 GFLOPS ( 3 runs) | Q4_2 36.7 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 36.0 GFLOPS ( 3 runs) | Q5_1 33.2 GFLOPS ( 3 runs) | Q8_0 53.7 GFLOPS ( 4 runs)
2048 x 2048: F16 37.5 GFLOPS ( 3 runs) | F32 19.3 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 55.7 GFLOPS ( 3 runs) | Q4_1 43.7 GFLOPS ( 3 runs) | Q4_2 39.4 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 40.5 GFLOPS ( 3 runs) | Q5_1 36.1 GFLOPS ( 3 runs) | Q8_0 65.8 GFLOPS ( 3 runs)
4096 x 4096: F16 36.8 GFLOPS ( 3 runs) | F32 18.5 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON | tiny | 4 | 171 | 1817 | be5911a |
| <todo> | <todo> | NEON | base | 4 | 255 | 3529 | be5911a |
| <todo> | <todo> | NEON | small | 4 | 433 | 11208 | be5911a |
| <todo> | <todo> | NEON | medium | 4 | 1814 | 36829 | be5911a |
| <todo> | <todo> | NEON | large | 4 | 36647 | 71393 | be5911a |
I will tack on the OPi5 next as I think it is a smidge faster. So, again without the governor tweaks:
memcpy: 8.26 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 3.1 GFLOPS (128 runs) | Q4_1 3.3 GFLOPS (128 runs) | Q4_2 3.4 GFLOPS (128 runs)
64 x 64: Q5_0 1.7 GFLOPS (128 runs) | Q5_1 3.1 GFLOPS (128 runs) | Q8_0 2.9 GFLOPS (128 runs)
64 x 64: F16 4.0 GFLOPS (128 runs) | F32 3.5 GFLOPS (128 runs)
128 x 128: Q4_0 7.8 GFLOPS (128 runs) | Q4_1 6.6 GFLOPS (128 runs) | Q4_2 6.7 GFLOPS (128 runs)
128 x 128: Q5_0 5.6 GFLOPS (128 runs) | Q5_1 5.4 GFLOPS (128 runs) | Q8_0 8.7 GFLOPS (128 runs)
128 x 128: F16 10.1 GFLOPS (128 runs) | F32 6.3 GFLOPS (128 runs)
256 x 256: Q4_0 10.5 GFLOPS (128 runs) | Q4_1 9.1 GFLOPS (128 runs) | Q4_2 7.9 GFLOPS (128 runs)
256 x 256: Q5_0 7.0 GFLOPS (128 runs) | Q5_1 6.7 GFLOPS (128 runs) | Q8_0 12.6 GFLOPS (128 runs)
256 x 256: F16 12.6 GFLOPS (128 runs) | F32 7.5 GFLOPS (128 runs)
512 x 512: Q4_0 11.9 GFLOPS ( 45 runs) | Q4_1 10.8 GFLOPS ( 41 runs) | Q4_2 10.0 GFLOPS ( 38 runs)
512 x 512: Q5_0 8.5 GFLOPS ( 32 runs) | Q5_1 7.9 GFLOPS ( 30 runs) | Q8_0 14.5 GFLOPS ( 54 runs)
512 x 512: F16 14.2 GFLOPS ( 53 runs) | F32 8.3 GFLOPS ( 32 runs)
1024 x 1024: Q4_0 30.4 GFLOPS ( 15 runs) | Q4_1 28.9 GFLOPS ( 14 runs) | Q4_2 23.6 GFLOPS ( 11 runs)
1024 x 1024: Q5_0 23.0 GFLOPS ( 11 runs) | Q5_1 23.5 GFLOPS ( 12 runs) | Q8_0 37.4 GFLOPS ( 18 runs)
1024 x 1024: F16 33.9 GFLOPS ( 16 runs) | F32 18.0 GFLOPS ( 9 runs)
2048 x 2048: Q4_0 51.4 GFLOPS ( 4 runs) | Q4_1 42.5 GFLOPS ( 3 runs) | Q4_2 36.5 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 36.0 GFLOPS ( 3 runs) | Q5_1 32.7 GFLOPS ( 3 runs) | Q8_0 59.0 GFLOPS ( 4 runs)
2048 x 2048: F16 39.4 GFLOPS ( 3 runs) | F32 17.5 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 58.8 GFLOPS ( 3 runs) | Q4_1 47.0 GFLOPS ( 3 runs) | Q4_2 39.7 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 40.8 GFLOPS ( 3 runs) | Q5_1 37.3 GFLOPS ( 3 runs) | Q8_0 65.1 GFLOPS ( 3 runs)
4096 x 4096: F16 40.6 GFLOPS ( 3 runs) | F32 18.6 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON | tiny | 4 | 133 | 1235 | be5911a |
| <todo> | <todo> | NEON | base | 4 | 232 | 2941 | be5911a |
| <todo> | <todo> | NEON | small | 4 | 470 | 10870 | be5911a |
| <todo> | <todo> | NEON | medium | 4 | 23195 | 36162 | be5911a |
| <todo> | <todo> | NEON | large | 4 | 46511 | 90187 | be5911a |
Then via sudo orangepi-config
set the performance governor (no dmc)
taskset -c 4-7 ./extra/bench-all.sh
memcpy: 8.22 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 0.7 GFLOPS (128 runs) | Q4_1 1.6 GFLOPS (128 runs) | Q4_2 1.0 GFLOPS (128 runs)
64 x 64: Q5_0 0.6 GFLOPS (128 runs) | Q5_1 0.8 GFLOPS (128 runs) | Q8_0 1.4 GFLOPS (128 runs)
64 x 64: F16 1.9 GFLOPS (128 runs) | F32 0.8 GFLOPS (128 runs)
128 x 128: Q4_0 8.9 GFLOPS (128 runs) | Q4_1 3.8 GFLOPS (128 runs) | Q4_2 3.1 GFLOPS (128 runs)
128 x 128: Q5_0 5.8 GFLOPS (128 runs) | Q5_1 3.8 GFLOPS (128 runs) | Q8_0 7.8 GFLOPS (128 runs)
128 x 128: F16 5.2 GFLOPS (128 runs) | F32 3.6 GFLOPS (128 runs)
256 x 256: Q4_0 13.1 GFLOPS (128 runs) | Q4_1 12.1 GFLOPS (128 runs) | Q4_2 12.1 GFLOPS (128 runs)
256 x 256: Q5_0 12.8 GFLOPS (128 runs) | Q5_1 13.4 GFLOPS (128 runs) | Q8_0 17.9 GFLOPS (128 runs)
256 x 256: F16 17.6 GFLOPS (128 runs) | F32 11.0 GFLOPS (128 runs)
512 x 512: Q4_0 33.3 GFLOPS (125 runs) | Q4_1 34.7 GFLOPS (128 runs) | Q4_2 21.9 GFLOPS ( 82 runs)
512 x 512: Q5_0 21.4 GFLOPS ( 80 runs) | Q5_1 22.4 GFLOPS ( 84 runs) | Q8_0 35.2 GFLOPS (128 runs)
512 x 512: F16 37.1 GFLOPS (128 runs) | F32 23.2 GFLOPS ( 87 runs)
1024 x 1024: Q4_0 54.9 GFLOPS ( 26 runs) | Q4_1 44.3 GFLOPS ( 21 runs) | Q4_2 31.4 GFLOPS ( 15 runs)
1024 x 1024: Q5_0 35.7 GFLOPS ( 17 runs) | Q5_1 32.1 GFLOPS ( 15 runs) | Q8_0 66.5 GFLOPS ( 31 runs)
1024 x 1024: F16 45.0 GFLOPS ( 21 runs) | F32 19.6 GFLOPS ( 10 runs)
2048 x 2048: Q4_0 54.6 GFLOPS ( 4 runs) | Q4_1 45.2 GFLOPS ( 3 runs) | Q4_2 38.4 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 37.9 GFLOPS ( 3 runs) | Q5_1 34.7 GFLOPS ( 3 runs) | Q8_0 59.9 GFLOPS ( 4 runs)
2048 x 2048: F16 40.5 GFLOPS ( 3 runs) | F32 20.0 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 59.5 GFLOPS ( 3 runs) | Q4_1 47.7 GFLOPS ( 3 runs) | Q4_2 40.1 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 42.7 GFLOPS ( 3 runs) | Q5_1 39.6 GFLOPS ( 3 runs) | Q8_0 60.7 GFLOPS ( 3 runs)
4096 x 4096: F16 35.5 GFLOPS ( 3 runs) | F32 20.8 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON | tiny | 4 | 119 | 1178 | be5911a |
| <todo> | <todo> | NEON | base | 4 | 168 | 2910 | be5911a |
| <todo> | <todo> | NEON | small | 4 | 399 | 10784 | be5911a |
| <todo> | <todo> | NEON | medium | 4 | 23469 | 35952 | be5911a |
| <todo> | <todo> | NEON | large | 4 | 47147 | 76405 | be5911a |
I ran that again, as I think transformers do bounce around a bit to end up with the same tokens.
memcpy: 9.46 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 7.1 GFLOPS (128 runs) | Q4_1 7.6 GFLOPS (128 runs) | Q4_2 6.6 GFLOPS (128 runs)
64 x 64: Q5_0 6.3 GFLOPS (128 runs) | Q5_1 6.9 GFLOPS (128 runs) | Q8_0 6.6 GFLOPS (128 runs)
64 x 64: F16 7.8 GFLOPS (128 runs) | F32 7.3 GFLOPS (128 runs)
128 x 128: Q4_0 23.8 GFLOPS (128 runs) | Q4_1 25.0 GFLOPS (128 runs) | Q4_2 8.5 GFLOPS (128 runs)
128 x 128: Q5_0 19.1 GFLOPS (128 runs) | Q5_1 20.8 GFLOPS (128 runs) | Q8_0 26.4 GFLOPS (128 runs)
128 x 128: F16 34.8 GFLOPS (128 runs) | F32 28.6 GFLOPS (128 runs)
256 x 256: Q4_0 43.4 GFLOPS (128 runs) | Q4_1 42.0 GFLOPS (128 runs) | Q4_2 31.3 GFLOPS (128 runs)
256 x 256: Q5_0 30.5 GFLOPS (128 runs) | Q5_1 32.0 GFLOPS (128 runs) | Q8_0 41.7 GFLOPS (128 runs)
256 x 256: F16 60.0 GFLOPS (128 runs) | F32 42.9 GFLOPS (128 runs)
512 x 512: Q4_0 56.5 GFLOPS (128 runs) | Q4_1 49.5 GFLOPS (128 runs) | Q4_2 36.6 GFLOPS (128 runs)
512 x 512: Q5_0 36.7 GFLOPS (128 runs) | Q5_1 36.8 GFLOPS (128 runs) | Q8_0 69.9 GFLOPS (128 runs)
512 x 512: F16 78.5 GFLOPS (128 runs) | F32 30.1 GFLOPS (113 runs)
1024 x 1024: Q4_0 62.7 GFLOPS ( 30 runs) | Q4_1 52.2 GFLOPS ( 25 runs) | Q4_2 38.9 GFLOPS ( 19 runs)
1024 x 1024: Q5_0 39.2 GFLOPS ( 19 runs) | Q5_1 38.2 GFLOPS ( 18 runs) | Q8_0 76.2 GFLOPS ( 36 runs)
1024 x 1024: F16 46.7 GFLOPS ( 22 runs) | F32 21.6 GFLOPS ( 11 runs)
2048 x 2048: Q4_0 60.4 GFLOPS ( 4 runs) | Q4_1 50.3 GFLOPS ( 3 runs) | Q4_2 39.6 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 37.9 GFLOPS ( 3 runs) | Q5_1 35.4 GFLOPS ( 3 runs) | Q8_0 66.5 GFLOPS ( 4 runs)
2048 x 2048: F16 33.8 GFLOPS ( 3 runs) | F32 15.0 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 64.2 GFLOPS ( 3 runs) | Q4_1 51.2 GFLOPS ( 3 runs) | Q4_2 40.2 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 40.7 GFLOPS ( 3 runs) | Q5_1 37.2 GFLOPS ( 3 runs) | Q8_0 71.5 GFLOPS ( 3 runs)
4096 x 4096: F16 38.5 GFLOPS ( 3 runs) | F32 20.3 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON | tiny | 4 | 103 | 1166 | be5911a |
| <todo> | <todo> | NEON | base | 4 | 152 | 2888 | be5911a |
| <todo> | <todo> | NEON | small | 4 | 379 | 10892 | be5911a |
| <todo> | <todo> | NEON | medium | 4 | 22649 | 35767 | be5911a |
| <todo> | <todo> | NEON | large | 4 | 45427 | 73967 | be5911a |
But I don't seem to get that much variance; race-till-idle is just a preference.
> Prefix with taskset -c 4-7 to further enforce not using the efficiency cores.
Tried that, played with the CPU settings (performance mode etc.), even added some better cooling, but it still keeps jumping all over the place, with the tiny model at ~2 s (in the good runs) while htop shows consistent 100% load on the performance cores. Q5 models are sometimes a few ms faster, sometimes slower. When I do the same tests with the CTranslate2 Whisper version, the results are pretty stable and always about twice as fast.
Dunno; just to show, the next run is very consistent and considerably faster...?
memcpy: 10.52 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 2.5 GFLOPS (128 runs) | Q4_1 2.5 GFLOPS (128 runs) | Q4_2 1.3 GFLOPS (128 runs)
64 x 64: Q5_0 1.0 GFLOPS (128 runs) | Q5_1 0.6 GFLOPS (128 runs) | Q8_0 0.8 GFLOPS (128 runs)
64 x 64: F16 1.0 GFLOPS (128 runs) | F32 1.8 GFLOPS (128 runs)
128 x 128: Q4_0 2.8 GFLOPS (128 runs) | Q4_1 2.2 GFLOPS (128 runs) | Q4_2 6.7 GFLOPS (128 runs)
128 x 128: Q5_0 3.2 GFLOPS (128 runs) | Q5_1 5.5 GFLOPS (128 runs) | Q8_0 3.0 GFLOPS (128 runs)
128 x 128: F16 11.2 GFLOPS (128 runs) | F32 8.5 GFLOPS (128 runs)
256 x 256: Q4_0 13.5 GFLOPS (128 runs) | Q4_1 8.8 GFLOPS (128 runs) | Q4_2 9.9 GFLOPS (128 runs)
256 x 256: Q5_0 10.7 GFLOPS (128 runs) | Q5_1 6.7 GFLOPS (128 runs) | Q8_0 7.3 GFLOPS (128 runs)
256 x 256: F16 18.3 GFLOPS (128 runs) | F32 10.1 GFLOPS (128 runs)
512 x 512: Q4_0 36.4 GFLOPS (128 runs) | Q4_1 31.2 GFLOPS (117 runs) | Q4_2 19.0 GFLOPS ( 71 runs)
512 x 512: Q5_0 18.5 GFLOPS ( 69 runs) | Q5_1 20.4 GFLOPS ( 77 runs) | Q8_0 30.7 GFLOPS (115 runs)
512 x 512: F16 33.8 GFLOPS (126 runs) | F32 20.7 GFLOPS ( 79 runs)
1024 x 1024: Q4_0 40.0 GFLOPS ( 19 runs) | Q4_1 36.4 GFLOPS ( 18 runs) | Q4_2 29.6 GFLOPS ( 14 runs)
1024 x 1024: Q5_0 32.9 GFLOPS ( 16 runs) | Q5_1 30.6 GFLOPS ( 15 runs) | Q8_0 54.2 GFLOPS ( 26 runs)
1024 x 1024: F16 44.1 GFLOPS ( 21 runs) | F32 20.0 GFLOPS ( 10 runs)
2048 x 2048: Q4_0 57.7 GFLOPS ( 4 runs) | Q4_1 47.7 GFLOPS ( 3 runs) | Q4_2 38.7 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 37.8 GFLOPS ( 3 runs) | Q5_1 35.1 GFLOPS ( 3 runs) | Q8_0 63.6 GFLOPS ( 4 runs)
2048 x 2048: F16 33.6 GFLOPS ( 3 runs) | F32 14.8 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 61.9 GFLOPS ( 3 runs) | Q4_1 50.2 GFLOPS ( 3 runs) | Q4_2 38.8 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 40.6 GFLOPS ( 3 runs) | Q5_1 37.9 GFLOPS ( 3 runs) | Q8_0 70.4 GFLOPS ( 3 runs)
4096 x 4096: F16 38.0 GFLOPS ( 3 runs) | F32 20.8 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON | tiny | 4 | 134 | 1176 | be5911a |
| <todo> | <todo> | NEON | base | 4 | 179 | 2964 | be5911a |
| <todo> | <todo> | NEON | small | 4 | 416 | 11037 | be5911a |
| <todo> | <todo> | NEON | medium | 4 | 23462 | 36469 | be5911a |
| <todo> | <todo> | NEON | large | 4 | 47286 | 77494 | be5911a |
System76 Pangolin (pang12) w/ Ryzen 7 6800U (8c16t) @ 2.7GHz + 32GB DDR5 at 6400MT/s
Models stored on a Samsung 970 Evo Plus
Running memcpy benchmark with 1 thread
memcpy: 11.18 GB/s
sum: error -536870997.000000
Running ggml_mul_mat benchmark with 16 threads
ggml_mul_mat: 64 x 64: Q4_0 0.9 GFLOPS (128 runs) / Q4_1 0.4 GFLOPS (128 runs) / F16 1.2 GFLOPS (128 runs) / F32 1.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: Q4_0 6.1 GFLOPS (128 runs) / Q4_1 7.5 GFLOPS (128 runs) / F16 4.6 GFLOPS (128 runs) / F32 10.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: Q4_0 26.2 GFLOPS (128 runs) / Q4_1 42.3 GFLOPS (128 runs) / F16 19.9 GFLOPS (128 runs) / F32 47.9 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: Q4_0 66.6 GFLOPS (128 runs) / Q4_1 98.6 GFLOPS (128 runs) / F16 90.1 GFLOPS (128 runs) / F32 110.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: Q4_0 97.8 GFLOPS ( 46 runs) / Q4_1 154.3 GFLOPS ( 72 runs) / F16 158.7 GFLOPS ( 74 runs) / F32 132.2 GFLOPS ( 62 runs)
ggml_mul_mat: 2048 x 2048: Q4_0 126.7 GFLOPS ( 8 runs) / Q4_1 164.8 GFLOPS ( 10 runs) / F16 164.1 GFLOPS ( 10 runs) / F32 96.4 GFLOPS ( 6 runs)
ggml_mul_mat: 4096 x 4096: Q4_0 138.6 GFLOPS ( 3 runs) / Q4_1 166.9 GFLOPS ( 3 runs) / F16 136.0 GFLOPS ( 3 runs) / F32 62.8 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ryzen 7 6800U | Arch Linux | AVX2 | tiny | 16 | 37 | 510 | 9c61f5f |
Ryzen 7 6800U | Arch Linux | AVX2 | base | 16 | 51 | 1222 | 9c61f5f |
Ryzen 7 6800U | Arch Linux | AVX2 | small | 16 | 123 | 4283 | 9c61f5f |
Ryzen 7 6800U | Arch Linux | AVX2 | medium | 16 | 341 | 14178 | 9c61f5f |
Ryzen 7 6800U | Arch Linux | AVX2 | large | 16 | 650 | 25801 | 9c61f5f |
It is interesting to note that, when converted to a CoreML model and executed, even a MacBook Air M2 has a processing speed close to that of a high-spec Mac, perhaps because the Neural Engine specifications are the same within the same generation of Apple Silicon.
./extra/bench-all.sh 4
Usage: ./bench.sh [n_threads]
Running memcpy benchmark with 1 thread
memcpy: 34.33 GB/s sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 11.4 GFLOPS (128 runs) / F32 10.5 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 89.0 GFLOPS (128 runs) / F32 74.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 422.6 GFLOPS (128 runs) / F32 419.4 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 793.4 GFLOPS (128 runs) / F32 801.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 827.0 GFLOPS (128 runs) / F32 849.3 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 821.8 GFLOPS ( 48 runs) / F32 773.4 GFLOPS ( 46 runs)
ggml_mul_mat: 4096 x 4096: F16 765.2 GFLOPS ( 6 runs) / F32 743.6 GFLOPS ( 6 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
NEON BLAS COREML | tiny | 4 | c23588c | ||||
NEON BLAS COREML | base | 4 | c23588c | ||||
M2 | 13.3.1 (a)(22E772610a) | NEON BLAS COREML | small | 4 | 153 | 199 | c23588c |
M2 | 13.3.1 (a)(22E772610a) | NEON BLAS COREML | medium | 4 | 450 | 746 | c23588c |
M2 | 13.3.1 (a)(22E772610a) | NEON BLAS COREML | large | 4 | 1053 | 1439 | c23588c |
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Raspberry Pi 4 2GB | Bullseye 6.1.21-v8+ | OPENBLAS | tiny.en | 4 | 393 | 7882 | 14bee39 |
Raspberry Pi 4 2GB | Bullseye 6.1.21-v8+ | OPENBLAS | tiny.en-q5 | 4 | 265 | 8564 | 14bee39 |
Raspberry Pi 4 2GB | Bullseye 6.1.21-v8+ | OPENBLAS | base.en | 4 | 571 | 16328 | 14bee39 |
Raspberry Pi 4 2GB | Bullseye 6.1.21-v8+ | OPENBLAS | base.en-q5 | 4 | 306 | 16169 | 14bee39 |
Tests performed using the Raspberry Pi OS libopenblas-dev package (version 0.3.13+ds-3).
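For reference, an OpenBLAS-enabled build on Raspberry Pi OS is roughly as follows (a sketch; the WHISPER_OPENBLAS flag name is as of the Makefile of the time):

```bash
# install the distro OpenBLAS package and rebuild with BLAS enabled
sudo apt install libopenblas-dev
make clean
WHISPER_OPENBLAS=1 make -j
```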
Ryzen 3 2200GE (Lenovo M715q)
Running memcpy benchmark
memcpy: 12.14 GB/s (1 thread)
sum: -536869898.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 5.3 GFLOPS (128 runs) | Q4_1 1.6 GFLOPS (128 runs) | Q4_2 5.2 GFLOPS (128 runs)
64 x 64: Q5_0 5.5 GFLOPS (128 runs) | Q5_1 1.7 GFLOPS (128 runs) | Q8_0 1.7 GFLOPS (128 runs)
64 x 64: F16 1.1 GFLOPS (128 runs) | F32 2.0 GFLOPS (128 runs)
128 x 128: Q4_0 9.9 GFLOPS (128 runs) | Q4_1 10.8 GFLOPS (128 runs) | Q4_2 9.8 GFLOPS (128 runs)
128 x 128: Q5_0 16.7 GFLOPS (128 runs) | Q5_1 19.0 GFLOPS (128 runs) | Q8_0 20.6 GFLOPS (128 runs)
128 x 128: F16 9.4 GFLOPS (128 runs) | F32 29.8 GFLOPS (128 runs)
256 x 256: Q4_0 26.1 GFLOPS (128 runs) | Q4_1 29.4 GFLOPS (128 runs) | Q4_2 31.2 GFLOPS (128 runs)
256 x 256: Q5_0 28.4 GFLOPS (128 runs) | Q5_1 31.0 GFLOPS (128 runs) | Q8_0 32.5 GFLOPS (128 runs)
256 x 256: F16 21.5 GFLOPS (128 runs) | F32 41.6 GFLOPS (128 runs)
512 x 512: Q4_0 41.4 GFLOPS (128 runs) | Q4_1 42.7 GFLOPS (128 runs) | Q4_2 43.2 GFLOPS (128 runs)
512 x 512: Q5_0 39.2 GFLOPS (128 runs) | Q5_1 37.2 GFLOPS (128 runs) | Q8_0 56.7 GFLOPS (128 runs)
512 x 512: F16 29.3 GFLOPS (110 runs) | F32 56.0 GFLOPS (128 runs)
1024 x 1024: Q4_0 52.5 GFLOPS ( 25 runs) | Q4_1 51.6 GFLOPS ( 25 runs) | Q4_2 48.3 GFLOPS ( 23 runs)
1024 x 1024: Q5_0 44.1 GFLOPS ( 21 runs) | Q5_1 41.9 GFLOPS ( 20 runs) | Q8_0 71.4 GFLOPS ( 34 runs)
1024 x 1024: F16 30.4 GFLOPS ( 15 runs) | F32 35.5 GFLOPS ( 17 runs)
2048 x 2048: Q4_0 54.6 GFLOPS ( 4 runs) | Q4_1 50.6 GFLOPS ( 3 runs) | Q4_2 49.8 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 44.8 GFLOPS ( 3 runs) | Q5_1 40.8 GFLOPS ( 3 runs) | Q8_0 67.1 GFLOPS ( 4 runs)
2048 x 2048: F16 29.1 GFLOPS ( 3 runs) | F32 20.0 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 54.3 GFLOPS ( 3 runs) | Q4_1 50.0 GFLOPS ( 3 runs) | Q4_2 49.5 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 44.7 GFLOPS ( 3 runs) | Q5_1 40.2 GFLOPS ( 3 runs) | Q8_0 64.0 GFLOPS ( 3 runs)
4096 x 4096: F16 28.3 GFLOPS ( 3 runs) | F32 19.7 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| Ryzen 3 2200GE | Ubuntu 22.04.2 | AVX2 | tiny | 4 | 68 | 1676 | 2b6a074 |
| Ryzen 3 2200GE | Ubuntu 22.04.2 | AVX2 | base | 4 | 96 | 3850 | 2b6a074 |
| Ryzen 3 2200GE | Ubuntu 22.04.2 | AVX2 | small | 4 | 235 | 14734 | 2b6a074 |
| Ryzen 3 2200GE | Ubuntu 22.04.2 | AVX2 | medium | 4 | 660 | 49288 | 2b6a074 |
| Ryzen 3 2200GE | Ubuntu 22.04.2 | AVX2 | large | 4 | 1302 | 105757 | 2b6a074 |
This is what I get with clblast on an AMD RX6700XT:
Running memcpy benchmark
memcpy: 11.94 GB/s (1 thread)
sum: -536869898.000000
Running ggml_mul_mat benchmark with 16 threads
Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx1031
64 x 64: Q4_0 0.8 GFLOPS (128 runs) | Q4_1 0.8 GFLOPS (128 runs)
64 x 64: Q5_0 0.8 GFLOPS (128 runs) | Q5_1 0.8 GFLOPS (128 runs) | Q8_0 0.8 GFLOPS (128 runs)
64 x 64: F16 0.8 GFLOPS (128 runs) | F32 0.8 GFLOPS (128 runs)
128 x 128: Q4_0 5.6 GFLOPS (128 runs) | Q4_1 5.6 GFLOPS (128 runs)
128 x 128: Q5_0 6.1 GFLOPS (128 runs) | Q5_1 5.7 GFLOPS (128 runs) | Q8_0 6.1 GFLOPS (128 runs)
128 x 128: F16 5.8 GFLOPS (128 runs) | F32 6.0 GFLOPS (128 runs)
256 x 256: Q4_0 43.4 GFLOPS (128 runs) | Q4_1 40.3 GFLOPS (128 runs)
256 x 256: Q5_0 38.2 GFLOPS (128 runs) | Q5_1 39.2 GFLOPS (128 runs) | Q8_0 39.0 GFLOPS (128 runs)
256 x 256: F16 38.3 GFLOPS (128 runs) | F32 38.6 GFLOPS (128 runs)
512 x 512: Q4_0 210.9 GFLOPS (128 runs) | Q4_1 212.8 GFLOPS (128 runs)
512 x 512: Q5_0 212.0 GFLOPS (128 runs) | Q5_1 213.2 GFLOPS (128 runs) | Q8_0 210.2 GFLOPS (128 runs)
512 x 512: F16 195.5 GFLOPS (128 runs) | F32 208.7 GFLOPS (128 runs)
1024 x 1024: Q4_0 1280.6 GFLOPS (128 runs) | Q4_1 1289.0 GFLOPS (128 runs)
1024 x 1024: Q5_0 1292.2 GFLOPS (128 runs) | Q5_1 1287.4 GFLOPS (128 runs) | Q8_0 1271.0 GFLOPS (128 runs)
1024 x 1024: F16 1025.9 GFLOPS (128 runs) | F32 1227.8 GFLOPS (128 runs)
2048 x 2048: Q4_0 3423.2 GFLOPS (128 runs) | Q4_1 3414.1 GFLOPS (128 runs)
2048 x 2048: Q5_0 3393.6 GFLOPS (128 runs) | Q5_1 3385.8 GFLOPS (128 runs) | Q8_0 3385.2 GFLOPS (128 runs)
2048 x 2048: F16 2434.4 GFLOPS (128 runs) | F32 3045.8 GFLOPS (128 runs)
4096 x 4096: Q4_0 4187.6 GFLOPS ( 31 runs) | Q4_1 4193.6 GFLOPS ( 31 runs)
4096 x 4096: Q5_0 4204.3 GFLOPS ( 31 runs) | Q5_1 4187.1 GFLOPS ( 31 runs) | Q8_0 4135.0 GFLOPS ( 31 runs)
4096 x 4096: F16 3491.1 GFLOPS ( 26 runs) | F32 3911.3 GFLOPS ( 29 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | tiny | 16 | 382 | 603 | 95b02d7 |
Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | base | 16 | 371 | 717 | 95b02d7 |
Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | small | 16 | 427 | 1271 | 95b02d7 |
Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | medium | 16 | 636 | 2784 | 95b02d7 |
Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | large | 16 | 868 | 4308 | 95b02d7 |
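For reference, a CLBlast-enabled build and run like the above is roughly as follows (a sketch; the GGML_OPENCL_* variables select the Platform=0, Device=0 pair printed in the log):

```bash
# rebuild with CLBlast and select the OpenCL platform/device explicitly
make clean
WHISPER_CLBLAST=1 make -j
GGML_OPENCL_PLATFORM=0 GGML_OPENCL_DEVICE=0 ./extra/bench-all.sh 16
```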
Thinkpad T480, Core i7 8550U
Usage: ./bench.sh [n_threads] [encoder-only]
Running memcpy benchmark
memcpy: 12.67 GB/s (1 thread)
sum: -536869898.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 6.1 GFLOPS (128 runs) | Q4_1 6.4 GFLOPS (128 runs)
64 x 64: Q5_0 6.6 GFLOPS (128 runs) | Q5_1 6.7 GFLOPS (128 runs) | Q8_0 6.3 GFLOPS (128 runs)
64 x 64: F16 7.8 GFLOPS (128 runs) | F32 5.4 GFLOPS (128 runs)
128 x 128: Q4_0 25.3 GFLOPS (128 runs) | Q4_1 25.5 GFLOPS (128 runs)
128 x 128: Q5_0 29.6 GFLOPS (128 runs) | Q5_1 26.9 GFLOPS (128 runs) | Q8_0 31.7 GFLOPS (128 runs)
128 x 128: F16 34.8 GFLOPS (128 runs) | F32 13.8 GFLOPS (128 runs)
256 x 256: Q4_0 49.9 GFLOPS (128 runs) | Q4_1 43.3 GFLOPS (128 runs)
256 x 256: Q5_0 46.6 GFLOPS (128 runs) | Q5_1 45.4 GFLOPS (128 runs) | Q8_0 64.0 GFLOPS (128 runs)
256 x 256: F16 61.2 GFLOPS (128 runs) | F32 18.7 GFLOPS (128 runs)
512 x 512: Q4_0 66.7 GFLOPS (128 runs) | Q4_1 54.7 GFLOPS (128 runs)
512 x 512: Q5_0 53.5 GFLOPS (128 runs) | Q5_1 57.9 GFLOPS (128 runs) | Q8_0 80.6 GFLOPS (128 runs)
512 x 512: F16 65.5 GFLOPS (128 runs) | F32 22.2 GFLOPS ( 83 runs)
1024 x 1024: Q4_0 77.7 GFLOPS ( 37 runs) | Q4_1 66.9 GFLOPS ( 32 runs)
1024 x 1024: Q5_0 66.3 GFLOPS ( 31 runs) | Q5_1 60.2 GFLOPS ( 29 runs) | Q8_0 91.6 GFLOPS ( 44 runs)
1024 x 1024: F16 63.8 GFLOPS ( 30 runs) | F32 21.2 GFLOPS ( 10 runs)
2048 x 2048: Q4_0 74.3 GFLOPS ( 5 runs) | Q4_1 71.1 GFLOPS ( 5 runs)
2048 x 2048: Q5_0 59.5 GFLOPS ( 4 runs) | Q5_1 56.4 GFLOPS ( 4 runs) | Q8_0 90.2 GFLOPS ( 6 runs)
2048 x 2048: F16 49.9 GFLOPS ( 3 runs) | F32 15.9 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 61.1 GFLOPS ( 3 runs) | Q4_1 54.7 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 48.4 GFLOPS ( 3 runs) | Q5_1 45.1 GFLOPS ( 3 runs) | Q8_0 62.7 GFLOPS ( 3 runs)
4096 x 4096: F16 38.4 GFLOPS ( 3 runs) | F32 12.9 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
I don't know why it stopped when it wanted to run the benchmark for all models. I have ggml-base.en.bin, and I have for-tests-ggml*.bin.
@randomshinichi That is what it does when the non-.en models are not available.
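So the early stop can be avoided by fetching the multilingual models first, e.g. with the repo's download script (a sketch; bench-all.sh loops over the non-.en models):

```bash
# fetch the multilingual models that extra/bench-all.sh expects
for m in tiny base small medium large; do
  ./models/download-ggml-model.sh "$m"
done
```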
Jetson Orin Nano (Developer Kit) - Unoptimised install (no CLBlast, CUBLAS etc)
Running memcpy benchmark
memcpy: 6.28 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 4.1 GFLOPS (128 runs) | Q4_1 4.2 GFLOPS (128 runs)
64 x 64: Q5_0 4.2 GFLOPS (128 runs) | Q5_1 4.1 GFLOPS (128 runs) | Q8_0 4.6 GFLOPS (128 runs)
64 x 64: F16 4.0 GFLOPS (128 runs) | F32 5.2 GFLOPS (128 runs)
128 x 128: Q4_0 12.9 GFLOPS (128 runs) | Q4_1 13.2 GFLOPS (128 runs)
128 x 128: Q5_0 12.7 GFLOPS (128 runs) | Q5_1 12.5 GFLOPS (128 runs) | Q8_0 14.1 GFLOPS (128 runs)
128 x 128: F16 9.3 GFLOPS (128 runs) | F32 20.9 GFLOPS (128 runs)
256 x 256: Q4_0 17.9 GFLOPS (128 runs) | Q4_1 17.5 GFLOPS (128 runs)
256 x 256: Q5_0 17.8 GFLOPS (128 runs) | Q5_1 16.2 GFLOPS (128 runs) | Q8_0 20.3 GFLOPS (128 runs)
256 x 256: F16 10.4 GFLOPS (128 runs) | F32 28.8 GFLOPS (128 runs)
512 x 512: Q4_0 21.1 GFLOPS ( 79 runs) | Q4_1 20.0 GFLOPS ( 75 runs)
512 x 512: Q5_0 18.6 GFLOPS ( 70 runs) | Q5_1 19.1 GFLOPS ( 72 runs) | Q8_0 22.0 GFLOPS ( 83 runs)
512 x 512: F16 10.5 GFLOPS ( 40 runs) | F32 25.7 GFLOPS ( 97 runs)
1024 x 1024: Q4_0 20.6 GFLOPS ( 10 runs) | Q4_1 20.4 GFLOPS ( 10 runs)
1024 x 1024: Q5_0 20.2 GFLOPS ( 10 runs) | Q5_1 18.7 GFLOPS ( 9 runs) | Q8_0 23.2 GFLOPS ( 11 runs)
1024 x 1024: F16 11.4 GFLOPS ( 6 runs) | F32 16.6 GFLOPS ( 8 runs)
2048 x 2048: Q4_0 22.3 GFLOPS ( 3 runs) | Q4_1 22.4 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 22.0 GFLOPS ( 3 runs) | Q5_1 20.9 GFLOPS ( 3 runs) | Q8_0 25.8 GFLOPS ( 3 runs)
2048 x 2048: F16 11.9 GFLOPS ( 3 runs) | F32 11.5 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 22.7 GFLOPS ( 3 runs) | Q4_1 22.6 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 22.2 GFLOPS ( 3 runs) | Q5_1 21.0 GFLOPS ( 3 runs) | Q8_0 26.2 GFLOPS ( 3 runs)
4096 x 4096: F16 12.0 GFLOPS ( 3 runs) | F32 13.1 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | tiny | 4 | 117 | 3631 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | base | 4 | 153 | 8603 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | small | 4 | 323 | 33605 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | medium | 4 | 1059 | 111404 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | large | 4 | 3187 | 222130 | 5e2b340 |
@mark-beeby Are you sure everything is correct with your distro? Your results are really bad compared to what I was expecting, as I've been looking forward to seeing what an Orin Nano can do.
Check out an rk3588 https://github.com/ggerganov/whisper.cpp/issues/89#issuecomment-1529989153 as that is an A76x4 with DDR4 not DDR5...
Also interested in what you get with cuBLAS https://github.com/ggerganov/whisper.cpp#opencl-gpu-support-via-clblast
Jetson Orin Nano (Developer Kit) - CUBLAS
Usage: ./bench.sh [n_threads] [encoder-only]
Running memcpy benchmark
memcpy: 6.26 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 1.0 GFLOPS (128 runs) | Q4_1 0.9 GFLOPS (128 runs)
64 x 64: Q5_0 0.7 GFLOPS (128 runs) | Q5_1 0.9 GFLOPS (128 runs) | Q8_0 1.0 GFLOPS (128 runs)
64 x 64: F16 1.0 GFLOPS (128 runs) | F32 0.9 GFLOPS (128 runs)
128 x 128: Q4_0 6.8 GFLOPS (128 runs) | Q4_1 7.3 GFLOPS (128 runs)
128 x 128: Q5_0 7.8 GFLOPS (128 runs) | Q5_1 7.8 GFLOPS (128 runs) | Q8_0 7.8 GFLOPS (128 runs)
128 x 128: F16 8.0 GFLOPS (128 runs) | F32 7.7 GFLOPS (128 runs)
256 x 256: Q4_0 57.1 GFLOPS (128 runs) | Q4_1 62.5 GFLOPS (128 runs)
256 x 256: Q5_0 62.3 GFLOPS (128 runs) | Q5_1 62.8 GFLOPS (128 runs) | Q8_0 64.6 GFLOPS (128 runs)
256 x 256: F16 38.7 GFLOPS (128 runs) | F32 38.6 GFLOPS (128 runs)
512 x 512: Q4_0 248.6 GFLOPS (128 runs) | Q4_1 250.9 GFLOPS (128 runs)
512 x 512: Q5_0 250.2 GFLOPS (128 runs) | Q5_1 248.7 GFLOPS (128 runs) | Q8_0 247.8 GFLOPS (128 runs)
512 x 512: F16 215.2 GFLOPS (128 runs) | F32 210.5 GFLOPS (128 runs)
1024 x 1024: Q4_0 884.6 GFLOPS (128 runs) | Q4_1 882.7 GFLOPS (128 runs)
1024 x 1024: Q5_0 879.2 GFLOPS (128 runs) | Q5_1 872.7 GFLOPS (128 runs) | Q8_0 632.0 GFLOPS (128 runs)
1024 x 1024: F16 651.2 GFLOPS (128 runs) | F32 627.2 GFLOPS (128 runs)
2048 x 2048: Q4_0 1349.9 GFLOPS ( 79 runs) | Q4_1 1337.1 GFLOPS ( 78 runs)
2048 x 2048: Q5_0 1332.3 GFLOPS ( 78 runs) | Q5_1 1327.7 GFLOPS ( 78 runs) | Q8_0 1304.8 GFLOPS ( 76 runs)
2048 x 2048: F16 1401.6 GFLOPS ( 82 runs) | F32 1140.0 GFLOPS ( 67 runs)
4096 x 4096: Q4_0 1967.6 GFLOPS ( 15 runs) | Q4_1 1962.9 GFLOPS ( 15 runs)
4096 x 4096: Q5_0 1956.3 GFLOPS ( 15 runs) | Q5_1 1952.7 GFLOPS ( 15 runs) | Q8_0 1929.9 GFLOPS ( 15 runs)
4096 x 4096: F16 2603.2 GFLOPS ( 19 runs) | F32 1742.4 GFLOPS ( 13 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON BLAS | tiny | 4 | 1296 | 544 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON BLAS | base | 4 | 1350 | 1015 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON BLAS | small | 4 | 1557 | 2901 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON BLAS | medium | 4 | 2303 | 7977 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON BLAS | large | 4 | 6716 | 12913 | 5e2b340 |
@StuartIanNaylor I've struggled to get clblast installed, and moved back to a CUDA install, and after a few hiccups and setting export CUDA_VISIBLE_DEVICES=0
I got the much more favourable results above. Hope that helps!
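For reference, the cuBLAS route is roughly as follows (a sketch; assumes the CUDA toolkit is already set up on the Jetson):

```bash
# rebuild with cuBLAS and make sure the Orin's GPU is visible
make clean
WHISPER_CUBLAS=1 make -j
export CUDA_VISIBLE_DEVICES=0
./extra/bench-all.sh 4
```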
New desktop I built - CPU i7-13700K (turbo overclock +200MHz base), DDR5 @ 5600MT/s, GPU Intel Arc A770 LE
I tried differing thread counts before settling on 20. Anything past 20 resulted in a drop in performance, which is to be expected.
Running memcpy benchmark
memcpy: 23.16 GB/s (1 thread)
sum: -536869898.000000
Running ggml_mul_mat benchmark with 20 threads
Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: Intel(R) OpenCL HD Graphics Device: Intel(R) Arc(TM) A770 Graphics
64 x 64: Q4_0 0.9 GFLOPS (128 runs) | Q4_1 1.0 GFLOPS (128 runs)
64 x 64: Q5_0 1.0 GFLOPS (128 runs) | Q5_1 1.0 GFLOPS (128 runs) | Q8_0 1.0 GFLOPS (128 runs)
64 x 64: F16 1.0 GFLOPS (128 runs) | F32 1.0 GFLOPS (128 runs)
128 x 128: Q4_0 5.6 GFLOPS (128 runs) | Q4_1 5.8 GFLOPS (128 runs)
128 x 128: Q5_0 5.7 GFLOPS (128 runs) | Q5_1 5.4 GFLOPS (128 runs) | Q8_0 5.0 GFLOPS (128 runs)
128 x 128: F16 5.6 GFLOPS (128 runs) | F32 5.5 GFLOPS (128 runs)
256 x 256: Q4_0 40.4 GFLOPS (128 runs) | Q4_1 38.9 GFLOPS (128 runs)
256 x 256: Q5_0 40.7 GFLOPS (128 runs) | Q5_1 40.3 GFLOPS (128 runs) | Q8_0 38.5 GFLOPS (128 runs)
256 x 256: F16 40.8 GFLOPS (128 runs) | F32 40.8 GFLOPS (128 runs)
512 x 512: Q4_0 260.5 GFLOPS (128 runs) | Q4_1 264.6 GFLOPS (128 runs)
512 x 512: Q5_0 234.3 GFLOPS (128 runs) | Q5_1 254.8 GFLOPS (128 runs) | Q8_0 260.2 GFLOPS (128 runs)
512 x 512: F16 223.7 GFLOPS (128 runs) | F32 261.0 GFLOPS (128 runs)
1024 x 1024: Q4_0 1158.0 GFLOPS (128 runs) | Q4_1 1158.2 GFLOPS (128 runs)
1024 x 1024: Q5_0 1119.2 GFLOPS (128 runs) | Q5_1 1157.4 GFLOPS (128 runs) | Q8_0 1125.5 GFLOPS (128 runs)
1024 x 1024: F16 871.3 GFLOPS (128 runs) | F32 1029.7 GFLOPS (128 runs)
2048 x 2048: Q4_0 2847.7 GFLOPS (128 runs) | Q4_1 2749.8 GFLOPS (128 runs)
2048 x 2048: Q5_0 2752.3 GFLOPS (128 runs) | Q5_1 2879.4 GFLOPS (128 runs) | Q8_0 2770.3 GFLOPS (128 runs)
2048 x 2048: F16 2061.0 GFLOPS (120 runs) | F32 2504.5 GFLOPS (128 runs)
4096 x 4096: Q4_0 4681.2 GFLOPS ( 35 runs) | Q4_1 4637.2 GFLOPS ( 34 runs)
4096 x 4096: Q5_0 4646.7 GFLOPS ( 34 runs) | Q5_1 4586.6 GFLOPS ( 34 runs) | Q8_0 4589.7 GFLOPS ( 34 runs)
4096 x 4096: F16 3444.7 GFLOPS ( 26 runs) | F32 4128.2 GFLOPS ( 31 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Intel Core i7-13700K | Arch Linux | AVX2 BLAS | tiny | 20 | 145 | 417 | 5e2b340 |
Intel Core i7-13700K | Arch Linux | AVX2 BLAS | base | 20 | 161 | 560 | 5e2b340 |
Intel Core i7-13700K | Arch Linux | AVX2 BLAS | small | 20 | 281 | 1072 | 5e2b340 |
Intel Core i7-13700K | Arch Linux | AVX2 BLAS | medium | 20 | 606 | 2771 | 5e2b340 |
Intel Core i7-13700K | Arch Linux | AVX2 BLAS | large | 20 | 1116 | 4105 | 5e2b340 |
CPU power draw during these last tests averaged 140 watts, peaking at 141. GPU metrics are currently not exposed in Linux for Arc, so I'm unable to check what that was drawing.
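On the CPU side, package watts like these can be logged alongside a run with turbostat (a sketch; requires root and a CPU with RAPL support):

```bash
# print average CPU package power once per second during the bench
sudo turbostat --quiet --Summary --show PkgWatt --interval 1
```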
Collection of bench results for various platforms and devices. If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh and report the results in the comments below.
Suggestions for a better summary of the results are welcome.
[Charts: Encoder, memcpy and ggml_mul_mat results for MacBook M1 Pro and Ryzen 9 5950X]