ggerganov opened 1 year ago
What's the performance gain of this against the original implementation, with PyTorch compiled with AVX support or with the PyTorch M1 backend?
Does this implementation use beam decoding? (The original PyTorch implementation has n=5 as default and is ~100% faster with n=1.)
Edit: README already mentions it's greedy decoding:
Very basic greedy sampling scheme - always pick up the token with highest probability. This should be similar to the GreedyDecoder from the original python implementation, so in order to make a fair comparison between the 2 implementations, make sure to run the python code with the following parameters:
whisper --best_of None --beam_size None ...
Greedy decoding is also 2x faster in the original implementation (on a GPU).
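For readers unfamiliar with the term: greedy decoding is just an argmax over the token logits at each step. A minimal sketch of the idea (my own illustration, not the actual whisper.cpp sampler):

```cpp
#include <algorithm>
#include <vector>

// Greedy sampling: always pick the token with the highest probability
// (equivalently, the highest logit). Illustrative only.
int greedy_sample(const std::vector<float> & logits) {
    return (int) std::distance(logits.begin(),
                               std::max_element(logits.begin(), logits.end()));
}
```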
Orange Pi 5, 4 GB, microSD (not NVMe).
Starts to touch zram swap on medium, then hits file swap pretty hard on large.
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
rk3588s | Bullseye 5.10.110 | NEON | tiny | 8 | 352 | 2876 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | base | 8 | 346 | 6213 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | small | 8 | 690 | 25808 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | medium | 8 | 23987 | 93995 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | large | 8 | 49633 | 190601 | 0be6a1a |
Even with the 4:4 big.LITTLE layout, pinning to the big cores is a touch faster: `taskset -c 4-7 ./extra/bench-all.sh` (see the affinity sketch after the table below).
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
rk3588s | Bullseye 5.10.110 | NEON | tiny | 4 | 356 | 2716 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | base | 4 | 417 | 6661 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | small | 4 | 943 | 25357 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | medium | 4 | 17748 | 90187 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | large | 4 | 48793 | 182800 | 0be6a1a |
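For reference, here is roughly what that `taskset` pinning does, as a minimal sketch (assumes Linux, and that cores 4-7 are the Cortex-A76 big cores on the rk3588s):

```cpp
// Minimal sketch: pin the process to cores 4-7 before running the
// benchmark, equivalent to `taskset -c 4-7`. Linux-specific.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 4; cpu <= 7; ++cpu) {
        CPU_SET(cpu, &set); // assumed big-core IDs on the rk3588s
    }
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    // ... exec the benchmark or call into whisper.cpp from here ...
    return 0;
}
```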
Compiling on an rk3588 with `-march=native -ffast-math` seems to give a big boost: `taskset -c 4-7 ./extra/bench-all.sh`
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
rk3588s | Bullseye 5.10.110 | NEON | tiny | 4 | 280 | 1074 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | base | 4 | 466 | 3491 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | small | 4 | 780 | 11052 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | medium | 4 | 15361 | 42252 | 0be6a1a |
rk3588s | Bullseye 5.10.110 | NEON | large | 4 | 49331 | 91892 | 0be6a1a |
Intel Celeron N4120 (4 cores, 4 threads) on Artix Linux 6.0.12-artix1-1.
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
N4120 | Artix 6.0.12-artix1-1 | BLAS | tiny | 4 | 330 | 12272 | 65fdcbb |
N4120 | Artix 6.0.12-artix1-1 | BLAS | base | 4 | | | 65fdcbb |
N4120 | Artix 6.0.12-artix1-1 | BLAS | small | 4 | 892 | 83209 | 65fdcbb |
N4120 | Artix 6.0.12-artix1-1 | BLAS | medium | 4 | 5478 | 237677 | 65fdcbb |
Base 14-inch M1 MacBook Pro with NEON enabled:
CPU | OS | Config | RAM (GB) | Th | Model | Load (ms) | Enc. (ms) | Total (ms) |
---|---|---|---|---|---|---|---|---|
M1 Pro | OSX 12.5.1 | NEON | 16 | 8 | Tiny.en | 107 | 269.72 | 376.91 |
M1 Pro | OSX 12.5.1 | NEON | 16 | 8 | Base.en | 92 | 321 | 413.77 |
M1 Pro | OSX 12.5.1 | NEON | 16 | 8 | Small.en | 264 | 978 | 1243.24 |
16-inch base Apple M2 Pro results
CPU | OS | Config | RAM (GB) | Th | Model | Load (ms) | Enc. (ms) | Total (ms) |
---|---|---|---|---|---|---|---|---|
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Tiny.en | 118 | 143 | 261 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Tiny | 118 | 143 | 261 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Base.en | 173 | 235 | 408 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Base | 148 | 266 | 414 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Small.en | 304 | 739 | 1042 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Small | 277(?) | 720 | 997 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Medium.en | 747 | 2057 | 2804 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Medium | 657 | 2055 | 2712 |
M2 Pro | OSX 13.2 | NEON | 16 | 8 | Large | 2126 | 4223 | 6349 |
I couldn't get bench to run on my iPhone 12, so I have attached my ad-hoc results below with the input audio "I love transcriber apps":
CPU | DGGML_USE_ACCELERATE | OS | Model | Load | Mel | Sample | Enc. | Dec. | Total (ms) |
---|---|---|---|---|---|---|---|---|---|
A14 | Release | IOS 16.1 | Base.en | 150 | 23 | 2 | 2447 | 112 | 2584 |
This might seem obvious to some, but it wasn't to me, so I'll note it here: I saw much better results using larger step lengths and sample sizes with ./stream. I suspect that under the hood, Whisper relies heavily on whole-sentence context to infer individual words.
With the new beta 1.1.0 release: at first glance, not too much difference. I will not rebuild without OpenBLAS, as it was slightly better with it on the RPi 4.
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 4 | 751 | 9506 | ecda7f786a |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny.en | 4 | 748 | 9295 | ecda7f786a |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 4 | 971 | 23512 | ecda7f786a |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base.en | 4 | 958 | 24263 | ecda7f786a |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | small | 4 | 2238 | 84720 | ecda7f786a |
Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | small.en | 4 | 3880 | 86031 | ecda7f786a |
Results on 12th Gen Intel(R) Core(TM) i3-12300T:
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Core i3-12300T | Debian 11 (Docker on Win11) | AVX2 | tiny.en | 4 | 97 | 679 | 49b529b |
Core i3-12300T | Debian 11 (Docker on Win11) | AVX2 | tiny | 4 | 90 | 580 | 49b529b |
Core i3-12300T | Debian 11 (Docker on Win11) | AVX2 | base | 4 | 138 | 1478 | 49b529b |
With OpenBLAS (considerably worse):
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Core i3-12300T | Debian 11 (Docker on Win11) | AVX2 BLAS | tiny | 4 | 117 | 1644 | 49b529b |
Core i3-12300T | Debian 11 (Docker on Win11) | AVX2 BLAS | base | 4 | 122 | 2890 | 49b529b |
The benchmarks for the MacBook Pro M1 use 8 threads, but in my experience it runs nearly twice as fast with 4 threads. Am I missing something?
Edit: I just ran the benchmark with the large model, and it actually made almost no difference whether 8 or 4 threads were used. But with real-world workloads it makes a huge difference. Interesting.
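For anyone reproducing the thread comparison from the C API instead of ./extra/bench-all.sh: n_threads is a field on whisper_full_params. A hedged sketch against the API as seen at these commits (names may have changed in later versions):

```cpp
#include "whisper.h"
#include <vector>

int main() {
    // whisper_init_from_file() matches the loader seen in the logs in this thread.
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-large.bin");
    if (!ctx) return 1;

    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    wparams.n_threads = 4; // the knob being compared here: 4 vs 8 threads

    std::vector<float> pcmf32; // 16 kHz mono f32 samples; load your audio here

    whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size());
    whisper_print_timings(ctx);
    whisper_free(ctx);
    return 0;
}
```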
Running memcpy benchmark with 1 thread
memcpy: 8.66 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 4.2 GFLOPS (128 runs) / F32 3.5 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 10.1 GFLOPS (128 runs) / F32 6.3 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 13.0 GFLOPS (128 runs) / F32 7.2 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 14.0 GFLOPS ( 53 runs) / F32 7.1 GFLOPS ( 27 runs)
ggml_mul_mat: 1024 x 1024: F16 29.8 GFLOPS ( 15 runs) / F32 17.8 GFLOPS ( 9 runs)
ggml_mul_mat: 2048 x 2048: F16 37.8 GFLOPS ( 3 runs) / F32 19.6 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 40.0 GFLOPS ( 3 runs) / F32 17.4 GFLOPS ( 3 runs)
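For context on how to read these numbers: an N x N matrix multiply costs 2N^3 FLOPs, and the GFLOPS figure is just that count divided by wall time. A small sketch of the arithmetic (my own illustration, not bench's code):

```cpp
#include <cstdio>

// A dense N x N x N multiply does N^3 multiply-adds = 2*N^3 FLOPs.
// At N = 4096 that is ~137 GFLOP per multiplication, so the 40 GFLOPS
// above corresponds to roughly 3.4 s per 4096 x 4096 matmul.
double gflops(int n, double seconds) {
    return 2.0 * (double) n * n * n / seconds / 1e9;
}

int main() {
    printf("%.1f GFLOPS\n", gflops(4096, 3.4)); // prints ~40.4
    return 0;
}
```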
Running benchmark for all models
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
rk3588s | Ubuntu 22.04 | NEON | tiny | 4 | 257 | 1179 | 21c569b |
rk3588s | Ubuntu 22.04 | NEON | base | 4 | 326 | 2967 | 21c569b |
rk3588s | Ubuntu 22.04 | NEON | small | 4 | 661 | 10560 | 21c569b |
rk3588s | Ubuntu 22.04 | NEON | medium | 4 | 23188 | 35867 | 21c569b |
Compiler: gcc version 12.2.0 (Ubuntu 12.2.0-3ubuntu1)
memcpy: 16.74 GB/s
sum: error -536870997.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 16.2 GFLOPS (128 runs) / F32 16.4 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 70.1 GFLOPS (128 runs) / F32 66.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 133.9 GFLOPS (128 runs) / F32 105.7 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 161.2 GFLOPS (128 runs) / F32 109.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 204.4 GFLOPS ( 96 runs) / F32 121.9 GFLOPS ( 57 runs)
ggml_mul_mat: 2048 x 2048: F16 254.4 GFLOPS ( 15 runs) / F32 149.3 GFLOPS ( 9 runs)
ggml_mul_mat: 4096 x 4096: F16 184.2 GFLOPS ( 3 runs) / F32 54.1 GFLOPS ( 3 runs)
Running ggml_mul_mat benchmark with 8 threads
ggml_mul_mat: 64 x 64: F16 8.4 GFLOPS (128 runs) / F32 9.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 58.1 GFLOPS (128 runs) / F32 57.6 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 170.3 GFLOPS (128 runs) / F32 159.9 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 315.7 GFLOPS (128 runs) / F32 230.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 356.0 GFLOPS (128 runs) / F32 224.9 GFLOPS (105 runs)
ggml_mul_mat: 2048 x 2048: F16 499.5 GFLOPS ( 30 runs) / F32 292.4 GFLOPS ( 18 runs)
ggml_mul_mat: 4096 x 4096: F16 265.9 GFLOPS ( 3 runs) / F32 66.2 GFLOPS ( 3 runs)
Running ggml_mul_mat benchmark with 16 threads
ggml_mul_mat: 64 x 64: F16 3.6 GFLOPS (128 runs) / F32 3.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 16.7 GFLOPS (128 runs) / F32 27.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 88.1 GFLOPS (128 runs) / F32 126.7 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 263.5 GFLOPS (128 runs) / F32 229.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 396.1 GFLOPS (128 runs) / F32 272.8 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 498.6 GFLOPS ( 30 runs) / F32 314.9 GFLOPS ( 19 runs)
ggml_mul_mat: 4096 x 4096: F16 337.7 GFLOPS ( 3 runs) / F32 112.0 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | tiny.en | 4 | 104 | 247 | 78f1661 |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | base.en | 4 | 130 | 585 | 78f1661 |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | small.en | 4 | 264 | 1940 | 78f1661 |
--- | -- | ------ | ----- | -- | ---- | ---- | ------ |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | tiny.en | 8 | 99 | 166 | 78f1661 |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | base.en | 8 | 123 | 329 | 78f1661 |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | small.en | 8 | 262 | 1148 | 78f1661 |
--- | -- | ------ | ----- | -- | ---- | ---- | ------ |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | tiny.en | 16 | 100 | 160 | 78f1661 |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | base.en | 16 | 123 | 338 | 78f1661 |
Ryzen 7700X (8C/16T 65W Eco Mode) | Ubuntu 22.10 (6.0.9 Kernel) | AVX2 | small.en | 16 | 262 | 1139 | 78f1661 |
Tested on my M2 MacBook Air:
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
./extra/bench-all.sh
Running memcpy benchmark with 1 thread
memcpy: 31.42 GB/s
sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 11.8 GFLOPS (128 runs) / F32 10.6 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 89.9 GFLOPS (128 runs) / F32 74.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 434.5 GFLOPS (128 runs) / F32 419.9 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 885.4 GFLOPS (128 runs) / F32 913.2 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1023.4 GFLOPS (128 runs) / F32 1037.7 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 971.6 GFLOPS ( 57 runs) / F32 950.1 GFLOPS ( 56 runs)
ggml_mul_mat: 4096 x 4096: F16 914.9 GFLOPS ( 7 runs) / F32 820.7 GFLOPS ( 6 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
M2 | OSX 13.0.1 | NEON BLAS | tiny | 4 | 63 | 153 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | base | 4 | 92 | 329 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | small | 4 | 198 | 1014 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | medium | 4 | 564 | 3042 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | large | 4 | 1152 | 5466 | 1a91c19 |
Running ggml_mul_mat benchmark with 8 threads
ggml_mul_mat: 64 x 64: F16 5.7 GFLOPS (128 runs) / F32 3.9 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 45.0 GFLOPS (128 runs) / F32 25.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 272.7 GFLOPS (128 runs) / F32 166.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 747.6 GFLOPS (128 runs) / F32 748.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 998.7 GFLOPS (128 runs) / F32 895.8 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 716.0 GFLOPS ( 42 runs) / F32 717.2 GFLOPS ( 42 runs)
ggml_mul_mat: 4096 x 4096: F16 790.4 GFLOPS ( 6 runs) / F32 726.3 GFLOPS ( 6 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
M2 | OSX 13.0.1 | NEON BLAS | tiny | 8 | 66 | 154 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | base | 8 | 92 | 346 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | small | 8 | 211 | 1171 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | medium | 8 | 562 | 3848 | 1a91c19 |
M2 | OSX 13.0.1 | NEON BLAS | large | 8 | 1079 | 6230 | 1a91c19 |
This is the bench result:
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 500.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 1245.39 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 88596.32 ms / 1 runs (88596.32 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 89841.85 ms
This is the cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping : 7
microcode : 0x2f
cpu MHz : 2990.383
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips : 4983.97
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
(processors 1-3 report the same values; only core id and apicid differ)
./bench -w 1 -t 1
memcpy: 3.35 GB/s
sum: error -536870997.000000
./bench -w 2 -t 1
ggml_mul_mat: 64 x 64: F16 0.7 GFLOPS (128 runs) / F32 3.3 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 0.7 GFLOPS (128 runs) / F32 3.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 0.6 GFLOPS ( 18 runs) / F32 3.3 GFLOPS ( 99 runs)
ggml_mul_mat: 512 x 512: F16 0.6 GFLOPS ( 3 runs) / F32 3.6 GFLOPS ( 14 runs)
ggml_mul_mat: 1024 x 1024: F16 0.7 GFLOPS ( 3 runs) / F32 2.3 GFLOPS ( 3 runs)
ggml_mul_mat: 2048 x 2048: F16 0.7 GFLOPS ( 3 runs) / F32 2.4 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 1.2 GFLOPS ( 3 runs) / F32 3.0 GFLOPS ( 3 runs)
ThinkPad T520, on Linux Mint Debian Edition, with AVX1 commented out in the Makefile.
Usage: ./bench.sh [n_threads]
Running memcpy benchmark with 1 thread
memcpy: 38.84 GB/s
sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 9.8 GFLOPS (128 runs) / F32 8.4 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 69.4 GFLOPS (128 runs) / F32 62.1 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 455.3 GFLOPS (128 runs) / F32 383.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1141.1 GFLOPS (128 runs) / F32 1550.2 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 2302.0 GFLOPS (128 runs) / F32 2962.9 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 3035.6 GFLOPS (128 runs) / F32 3217.5 GFLOPS (128 runs)
ggml_mul_mat: 4096 x 4096: F16 3431.7 GFLOPS ( 25 runs) / F32 3510.6 GFLOPS ( 26 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
M1 Ultra | 13.2 | NEON BLAS | tiny | 4 | 71 | 139 | 2bee265 |
M1 Ultra | 13.2 | NEON BLAS | base | 4 | 95 | 266 | 2bee265 |
M1 Ultra | 13.2 | NEON BLAS | small | 4 | 222 | 806 | 2bee265 |
M1 Ultra | 13.2 | NEON BLAS | medium | 4 | 598 | 2175 | 2bee265 |
M1 Ultra | 13.2 | NEON BLAS | large | 4 | 1165 | 3895 | 2bee265 |
Here are new results for POWER9, now that #300 is closed.
Running memcpy benchmark with 1 thread
memcpy: 6.32 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 32 threads
ggml_mul_mat: 64 x 64: F16 0.4 GFLOPS (128 runs) / F32 0.4 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 2.8 GFLOPS (128 runs) / F32 2.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 13.4 GFLOPS (128 runs) / F32 23.0 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 32.9 GFLOPS (123 runs) / F32 87.9 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 47.9 GFLOPS ( 23 runs) / F32 127.4 GFLOPS ( 60 runs)
ggml_mul_mat: 2048 x 2048: F16 58.5 GFLOPS ( 4 runs) / F32 67.3 GFLOPS ( 4 runs)
ggml_mul_mat: 4096 x 4096: F16 23.8 GFLOPS ( 3 runs) / F32 21.2 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit | Compiler |
---|---|---|---|---|---|---|---|---|
POWER9 | Debian 11 | | tiny | 32 | 75 | 1283 | 3b010f9 | GCC 10.2.1 |
POWER9 | Debian 11 | | base | 32 | 96 | 2786 | 3b010f9 | GCC 10.2.1 |
POWER9 | Debian 11 | | small | 32 | 182 | 8534 | 3b010f9 | GCC 10.2.1 |
POWER9 | Debian 11 | | medium | 32 | 463 | 22282 | 3b010f9 | GCC 10.2.1 |
POWER9 | Debian 11 | | large | 32 | 838 | 41106 | 3b010f9 | GCC 10.2.1 |
I got referred here from https://github.com/openai/whisper/discussions/978#discussioncomment-5093839. This seems really interesting.
I'm running on Oracle Cloud's free tier, which provides 4x Ampere A1 CPUs and 24 GB RAM.
Compiler:
I CC: cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX: g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Default (no changes)
~/whisper.cpp$ extra/bench-all.sh
Usage: ./bench.sh [n_threads]
Running memcpy benchmark with 1 thread
memcpy: 10.92 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 1.0 GFLOPS (128 runs) / F32 0.7 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 16.8 GFLOPS (128 runs) / F32 13.2 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 18.5 GFLOPS (128 runs) / F32 41.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 21.5 GFLOPS ( 81 runs) / F32 35.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 23.2 GFLOPS ( 11 runs) / F32 41.4 GFLOPS ( 20 runs)
ggml_mul_mat: 2048 x 2048: F16 23.4 GFLOPS ( 3 runs) / F32 32.6 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 22.5 GFLOPS ( 3 runs) / F32 21.4 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ampere A1 | Ubuntu 22.04 | NEON | tiny | 4 | 83 | 1832 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | base | 4 | 120 | 4767 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | small | 4 | 273 | 17529 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | medium | 4 | 739 | 59794 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | large | 4 | 1436 | 115771 | ca21f7a |
With the changes mentioned in https://github.com/openai/whisper/discussions/978#discussioncomment-5093839. Thanks again @jan-grzybek-ampere!
~/whisper.cpp$ extra/bench-all.sh
Running memcpy benchmark with 1 thread
memcpy: 10.88 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 2.0 GFLOPS (128 runs) / F32 1.7 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 14.3 GFLOPS (128 runs) / F32 33.6 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 40.7 GFLOPS (128 runs) / F32 54.3 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 97.5 GFLOPS (128 runs) / F32 31.4 GFLOPS (117 runs)
ggml_mul_mat: 1024 x 1024: F16 87.1 GFLOPS ( 41 runs) / F32 41.0 GFLOPS ( 20 runs)
ggml_mul_mat: 2048 x 2048: F16 74.3 GFLOPS ( 5 runs) / F32 33.4 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 50.4 GFLOPS ( 3 runs) / F32 21.5 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ampere A1 | Ubuntu 22.04 | NEON | tiny | 4 | 84 | 619 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | base | 4 | 124 | 2036 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | small | 4 | 293 | 5872 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | medium | 4 | 817 | 22064 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | large | 4 | 1446 | 37996 | ca21f7a |
I've done a bit of reading and run several more tests.
According to https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu , the recommendation is to use -mcpu=native, and I did indeed get the best performance with it. I will put in a pull request to use -mcpu=native for aarch64.
No significant difference between GCC 11.3 and GCC 12.1 on Ubuntu 22.04.
Performance seems slightly worse compared to yesterday's test in https://github.com/ggerganov/whisper.cpp/issues/89#issuecomment-1443688585. I re-ran all of the following tests one after another to hopefully obtain comparable figures. This is a free instance on Oracle Cloud, and perhaps others are using the other cores on the CPU.
-march=armv8.2-a+fp16, gcc-11.3
make clean
make main bench
./extra/bench-all.sh
Running memcpy benchmark with 1 thread
memcpy: 10.82 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 1.8 GFLOPS (128 runs) / F32 2.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 40.7 GFLOPS (128 runs) / F32 12.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 52.9 GFLOPS (128 runs) / F32 32.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 97.3 GFLOPS (128 runs) / F32 32.1 GFLOPS (120 runs)
ggml_mul_mat: 1024 x 1024: F16 77.0 GFLOPS ( 36 runs) / F32 35.1 GFLOPS ( 17 runs)
ggml_mul_mat: 2048 x 2048: F16 64.0 GFLOPS ( 4 runs) / F32 25.9 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 45.8 GFLOPS ( 3 runs) / F32 21.0 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ampere A1 | Ubuntu 22.04 | NEON | tiny | 4 | 85 | 662 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | base | 4 | 121 | 2039 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | small | 4 | 281 | 6667 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | medium | 4 | 760 | 25355 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | large | 4 | 1456 | 45563 | ca21f7a |
-mcpu=native, gcc-11.3
make clean
make main bench
./extra/bench-all.sh
Running memcpy benchmark with 1 thread
memcpy: 10.85 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 7.9 GFLOPS (128 runs) / F32 1.8 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 7.5 GFLOPS (128 runs) / F32 12.6 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 51.8 GFLOPS (128 runs) / F32 54.4 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 96.3 GFLOPS (128 runs) / F32 31.2 GFLOPS (117 runs)
ggml_mul_mat: 1024 x 1024: F16 74.1 GFLOPS ( 35 runs) / F32 33.5 GFLOPS ( 16 runs)
ggml_mul_mat: 2048 x 2048: F16 67.1 GFLOPS ( 4 runs) / F32 27.0 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 49.3 GFLOPS ( 3 runs) / F32 21.7 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ampere A1 | Ubuntu 22.04 | NEON | tiny | 4 | 85 | 655 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | base | 4 | 121 | 2002 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | small | 4 | 283 | 6923 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | medium | 4 | 762 | 24085 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | large | 4 | 1459 | 43846 | ca21f7a |
-mcpu=native, gcc-12.1
make clean
make CC=gcc-12 CXX=g++-12 main bench
./extra/bench-all.sh
Running memcpy benchmark with 1 thread
memcpy: 11.01 GB/s
sum: error 136902082731.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 8.0 GFLOPS (128 runs) / F32 8.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 12.0 GFLOPS (128 runs) / F32 12.6 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 55.7 GFLOPS (128 runs) / F32 41.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 95.1 GFLOPS (128 runs) / F32 30.2 GFLOPS (113 runs)
ggml_mul_mat: 1024 x 1024: F16 67.1 GFLOPS ( 32 runs) / F32 33.0 GFLOPS ( 16 runs)
ggml_mul_mat: 2048 x 2048: F16 64.2 GFLOPS ( 4 runs) / F32 26.8 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 46.1 GFLOPS ( 3 runs) / F32 21.4 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ampere A1 | Ubuntu 22.04 | NEON | tiny | 4 | 84 | 613 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | base | 4 | 122 | 2086 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | small | 4 | 286 | 6375 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | medium | 4 | 761 | 24667 | ca21f7a |
Ampere A1 | Ubuntu 22.04 | NEON | large | 4 | 1457 | 43826 | ca21f7a |
I confirmed your findings, and interestingly enough, I found the performance worse with OpenBLAS.
>bench.exe
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 109.45 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 919.30 ms / 1 runs ( 919.30 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 1032.75 ms
>bench -w 1 -t 1
memcpy: 24.58 GB/s
sum: error -536870819.000000
>bench -w 2 -t 1
ggml_mul_mat: 64 x 64: F16 22.7 GFLOPS (128 runs) / F32 38.7 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 34.6 GFLOPS (128 runs) / F32 45.6 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 44.2 GFLOPS (128 runs) / F32 54.5 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 50.5 GFLOPS (128 runs) / F32 55.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 53.2 GFLOPS ( 25 runs) / F32 65.7 GFLOPS ( 31 runs)
ggml_mul_mat: 2048 x 2048: F16 54.9 GFLOPS ( 4 runs) / F32 61.8 GFLOPS ( 4 runs)
ggml_mul_mat: 4096 x 4096: F16 50.7 GFLOPS ( 3 runs) / F32 19.9 GFLOPS ( 3 runs)
That last one is slower than the 5950X above, which is odd. OpenBLAS results below:
>bench
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 101.76 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 602.63 ms / 1 runs ( 602.63 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 705.80 ms
>bench -w 1 -t 1
memcpy: 24.30 GB/s
sum: error -536870819.000000
>bench -w 2 -t 1
ggml_mul_mat: 64 x 64: F16 89.4 GFLOPS (128 runs) / F32 119.6 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 27.6 GFLOPS (128 runs) / F32 31.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 172.9 GFLOPS (128 runs) / F32 222.0 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 596.8 GFLOPS (128 runs) / F32 926.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1257.0 GFLOPS (128 runs) / F32 1887.7 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 1726.5 GFLOPS (101 runs) / F32 2193.9 GFLOPS (128 runs)
ggml_mul_mat: 4096 x 4096: F16 2109.8 GFLOPS ( 16 runs) / F32 2237.5 GFLOPS ( 17 runs)
memcpy: 7.20 GB/s
sum: error -536870997.000000
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
AMD Ryzen 3 3200U | Linux Mint 21.1 | AVX2 | tiny | 4 | 109 | 3417 | 09e9068 |
AMD Ryzen 3 3200U | Linux Mint 21.1 | AVX2 | base | 4 | 180 | 7907 | 09e9068 |
AMD Ryzen 3 3200U | Linux Mint 21.1 | AVX2 | small | 4 | 419 | 30899 | 09e9068 |
AMD Ryzen 3 3200U | Linux Mint 21.1 | AVX2 | medium | 4 | 1851 | 106542 | 09e9068 |
AMD Ryzen 3 3200U | Linux Mint 21.1 | AVX2 | large | 4 | 4715 | 203455 | 09e9068 |
memcpy: 15.57 GB/s
Running ggml_mul_mat benchmark with 8 threads
ggml_mul_mat: 64 x 64: F16 6.1 GFLOPS (128 runs) / F32 6.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 40.1 GFLOPS (128 runs) / F32 38.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 147.9 GFLOPS (128 runs) / F32 110.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 264.9 GFLOPS (128 runs) / F32 134.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 289.5 GFLOPS (128 runs) / F32 151.9 GFLOPS ( 71 runs)
ggml_mul_mat: 2048 x 2048: F16 290.6 GFLOPS ( 17 runs) / F32 70.7 GFLOPS ( 5 runs)
ggml_mul_mat: 4096 x 4096: F16 114.0 GFLOPS ( 3 runs) / F32 62.7 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | tiny | 8 | 50 | 361 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | base | 8 | 70 | 1000 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | small | 8 | 185 | 2264 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | medium | 8 | 587 | 8421 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | large | 8 | 2296 | 15759 | 09e9068 |
Running ggml_mul_mat benchmark with 16 threads
ggml_mul_mat: 64 x 64: F16 2.1 GFLOPS (128 runs) / F32 1.9 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 19.6 GFLOPS (128 runs) / F32 14.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 68.1 GFLOPS (128 runs) / F32 84.5 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 200.5 GFLOPS (128 runs) / F32 141.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 271.0 GFLOPS (127 runs) / F32 163.7 GFLOPS ( 77 runs)
ggml_mul_mat: 2048 x 2048: F16 205.5 GFLOPS ( 12 runs) / F32 71.6 GFLOPS ( 5 runs)
ggml_mul_mat: 4096 x 4096: F16 142.3 GFLOPS ( 3 runs) / F32 63.0 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | tiny | 16 | 52 | 329 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | base | 16 | 72 | 723 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | small | 16 | 188 | 2214 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | medium | 16 | 698 | 10889 | 09e9068 |
AMD Ryzen 7 5800HS | Linux RHEL8.7 | AVX2 | large | 16 | 1619 | 16640 | 09e9068 |
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Apple M2 Pro | macOS 13.2 | NEON BLAS | tiny | 8 | 76 | 161 | 09e9068 |
Apple M2 Pro | macOS 13.2 | NEON BLAS | base | 8 | 104 | 318 | 09e9068 |
Apple M2 Pro | macOS 13.2 | NEON BLAS | small | 8 | 221 | 975 | 09e9068 |
Apple M2 Pro | macOS 13.2 | NEON BLAS | medium | 8 | 969 | 2692 | 09e9068 |
Apple M2 Pro | macOS 13.2 | NEON BLAS | large | 8 | 1939 | 4959 | 09e9068 |
NVIDIA Jetson Nano, without GPU optimization: base-en
./bin/main -f samples/jfk.wav
whisper_init_from_file_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB
whisper_init_state: kv self size = 5.25 MB
whisper_init_state: kv cross size = 17.58 MB
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 354.49 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 712.86 ms
whisper_print_timings: sample time = 79.37 ms / 27 runs ( 2.94 ms per run)
whisper_print_timings: encode time = 24406.28 ms / 1 runs (24406.28 ms per run)
whisper_print_timings: decode time = 1284.84 ms / 27 runs ( 47.59 ms per run)
whisper_print_timings: total time = 26908.31 ms
tiny-en
./bin/main -m ./models/ggml-tiny.en.bin -f ./samples/jfk.wav
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem required = 127.00 MB (+ 3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 73.58 MB
whisper_model_load: model size = 73.54 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:07.740] And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740] ask what you can do for your country
whisper_print_timings: load time = 204.60 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 564.90 ms
whisper_print_timings: sample time = 72.13 ms / 26 runs ( 2.77 ms per run)
whisper_print_timings: encode time = 9232.34 ms / 1 runs ( 9232.34 ms per run)
whisper_print_timings: decode time = 616.00 ms / 26 runs ( 23.69 ms per run)
whisper_print_timings: total time = 10745.65 ms
MacBook Pro 14" with M2 Pro 10 cores, 32 GB RAM, macOS Ventura 13.2. Benchmarks running at 8 threads.
memcpy: 40.68 GB/s
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| ------------ | ------ | ---------- | -------- | -- | ---- | ---- | ------- |
| Apple M1 Pro | 13.2.1 | NEON BLAS | tiny | 8 | 45 | 93 | 09e9068 |
| Apple M1 Pro | 13.2.1 | NEON BLAS | base | 8 | 68 | 187 | 09e9068 |
| Apple M1 Pro | 13.2.1 | NEON BLAS | small | 8 | 179 | 702 | 09e9068 |
| Apple M1 Pro | 13.2.1 | NEON BLAS | medium | 8 | 496 | 2227 | 09e9068 |
| Apple M1 Pro | 13.2.1 | NEON BLAS | large | 8 | 1037 | 3796 | 09e9068 |
Running ggml_mul_mat benchmark with 8 threads
ggml_mul_mat: 64 x 64: F16 4.6 GFLOPS (128 runs) / F32 4.1 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 46.6 GFLOPS (128 runs) / F32 36.4 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 294.2 GFLOPS (128 runs) / F32 238.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 611.0 GFLOPS (128 runs) / F32 712.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 770.9 GFLOPS (128 runs) / F32 700.3 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 902.8 GFLOPS ( 53 runs) / F32 906.9 GFLOPS ( 53 runs)
ggml_mul_mat: 4096 x 4096: F16 1521.2 GFLOPS ( 12 runs) / F32 1469.3 GFLOPS ( 11 runs)
MacBook Pro 16" with M2 Max 12 cores, 96 GB RAM, macOS Ventura 13.3. Benchmarks running at 4 threads (4 threads were faster than 8 threads for ggml_mul_mat, but about the same for model load/encode).
memcpy: 49.94 GB/s
sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 11.2 GFLOPS (128 runs) / F32 9.3 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 83.0 GFLOPS (128 runs) / F32 73.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 505.2 GFLOPS (128 runs) / F32 488.2 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1018.0 GFLOPS (128 runs) / F32 1196.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1796.2 GFLOPS (128 runs) / F32 2087.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 1638.8 GFLOPS ( 96 runs) / F32 1673.7 GFLOPS ( 98 runs)
ggml_mul_mat: 4096 x 4096: F16 1995.2 GFLOPS ( 15 runs) / F32 2037.8 GFLOPS ( 15 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Apple M2 Max | 13.3 | NEON BLAS | tiny | 4 | 41 | 118 | 0a2d121 |
Apple M2 Max | 13.3 | NEON BLAS | base | 4 | 61 | 230 | 0a2d121 |
Apple M2 Max | 13.3 | NEON BLAS | small | 4 | 153 | 734 | 0a2d121 |
Apple M2 Max | 13.3 | NEON BLAS | medium | 4 | 448 | 1979 | 0a2d121 |
Apple M2 Max | 13.3 | NEON BLAS | large | 4 | 882 | 3553 | 0a2d121 |
Running memcpy benchmark with 1 thread
memcpy: 7.03 GB/s
sum: error -536870997.000000 (how do I fix this?)
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 8.9 GFLOPS (128 runs) / F32 10.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 53.3 GFLOPS (128 runs) / F32 47.9 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 91.7 GFLOPS (128 runs) / F32 99.4 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 134.2 GFLOPS (128 runs) / F32 94.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 182.9 GFLOPS ( 86 runs) / F32 121.2 GFLOPS ( 57 runs)
ggml_mul_mat: 2048 x 2048: F16 180.0 GFLOPS ( 11 runs) / F32 42.4 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 59.1 GFLOPS ( 3 runs) / F32 31.5 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ryzen 7 PRO 5850U | Ubuntu 22.04.2 | AVX2 | tiny | 4 | 69 | 495 | 0a2d121 |
Ryzen 7 PRO 5850U | Ubuntu 22.04.2 | AVX2 | base | 4 | 111 | 1128 | 0a2d121 |
Ryzen 7 PRO 5850U | Ubuntu 22.04.2 | AVX2 | small | 4 | 264 | 3992 | 0a2d121 |
Ryzen 7 PRO 5850U | Ubuntu 22.04.2 | AVX2 | medium | 4 | 806 | 12230 | 0a2d121 |
Ryzen 7 PRO 5850U | Ubuntu 22.04.2 | AVX2 | large | 4 | 1919 | 25574 | 0a2d121 |
memcpy: 9.49 GB/s
sum: error -536870997.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 8.8 GFLOPS (128 runs) / F32 10.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 35.4 GFLOPS (128 runs) / F32 49.2 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 61.9 GFLOPS (128 runs) / F32 95.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 64.3 GFLOPS (128 runs) / F32 86.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 74.4 GFLOPS ( 35 runs) / F32 39.9 GFLOPS ( 19 runs)
ggml_mul_mat: 2048 x 2048: F16 56.9 GFLOPS ( 4 runs) / F32 31.1 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 56.9 GFLOPS ( 3 runs) / F32 30.1 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ryzen 5 5500U | Ubuntu 22.04.2 | AVX2 | tiny | 4 | 67 | 761 | 0a2d121 |
Ryzen 5 5500U | Ubuntu 22.04.2 | AVX2 | base | 4 | 96 | 2040 | 0a2d121 |
Ryzen 5 5500U | Ubuntu 22.04.2 | AVX2 | small | 4 | 239 | 7639 | 0a2d121 |
Ryzen 5 5500U | Ubuntu 22.04.2 | AVX2 | medium | 4 | 657 | 23735 | 0a2d121 |
Ryzen 5 5500U | Ubuntu 22.04.2 | AVX2 | large | 4 | 1302 | 45006 | 0a2d121 |
HP Z440, Xeon E5-2690v4, 64Gb, Rocky Linux 9.1
memcpy: 10.94 GB/s
sum: error -536870997.000000
./bench -w 2
ggml_mul_mat: 64 x 64: F16 4.8 GFLOPS (128 runs) / F32 4.8 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 23.1 GFLOPS (128 runs) / F32 18.7 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 52.5 GFLOPS (128 runs) / F32 35.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 69.6 GFLOPS (128 runs) / F32 44.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 78.8 GFLOPS ( 37 runs) / F32 49.2 GFLOPS ( 23 runs)
ggml_mul_mat: 2048 x 2048: F16 83.6 GFLOPS ( 5 runs) / F32 50.8 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 64.5 GFLOPS ( 3 runs) / F32 21.8 GFLOPS ( 3 runs)
system_info: n_threads = 28 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
whisper_print_timings: load time = 1031.43 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 13121.63 ms / 1 runs (13121.63 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 14219.33 ms
model: large
Very impressed.
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
MacBook M1 Max | macOS 13.0 beta (22A5321d) | NEON BLAS | medium | 8 | 488 | 2344 | 0a2d121 |
MacBook M1 Max | macOS 13.0 beta (22A5321d) | NEON BLAS | large | 8 | 1070 | 3209 | 0a2d121 |
What am I doing wrong? 17.6 GFlops on a Ryzen 6850H
WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
I whisper.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -lopenblas
I CC: cc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
I CXX: g++ (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
make: 'bench' is up to date.
ggml_mul_mat: 64 x 64: F16 12.6 GFLOPS (128 runs) / F32 9.8 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 19.4 GFLOPS (128 runs) / F32 12.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 27.0 GFLOPS (128 runs) / F32 18.4 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 50.3 GFLOPS (128 runs) / F32 28.1 GFLOPS (105 runs)
ggml_mul_mat: 1024 x 1024: F16 59.0 GFLOPS ( 28 runs) / F32 27.0 GFLOPS ( 13 runs)
ggml_mul_mat: 2048 x 2048: F16 43.0 GFLOPS ( 3 runs) / F32 11.4 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 17.6 GFLOPS ( 3 runs) / F32 6.6 GFLOPS ( 3 runs)
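One way to narrow this down is to time a raw sgemm against the same OpenBLAS install, bypassing ggml entirely; if that is also slow, the problem is the BLAS build (or its thread settings) rather than whisper.cpp. A hedged sketch (assumes cblas.h from OpenBLAS; link with -lopenblas):

```cpp
#include <cblas.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4096;
    std::vector<float> a((size_t) n * n, 1.0f), b((size_t) n * n, 1.0f), c((size_t) n * n, 0.0f);

    const auto t0 = std::chrono::steady_clock::now();
    // C = 1.0 * A * B + 0.0 * C, all n x n, row-major
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, a.data(), n, b.data(), n, 0.0f, c.data(), n);
    const auto t1 = std::chrono::steady_clock::now();

    const double s = std::chrono::duration<double>(t1 - t0).count();
    printf("sgemm %d x %d: %.1f GFLOPS\n", n, n, 2.0 * (double) n * n * n / s / 1e9);
    return 0;
}
```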
I tried running 8 and 12 threads. They were a few ms slower than 4 threads, so the default of 4 threads seems to be the key. I also have not compiled anything Apple-specific, just git clone and make.
> ./extra/bench-all.sh 8
Usage: ./bench.sh [n_threads]
Running memcpy benchmark with 1 thread
memcpy: 50.22 GB/s
sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 8 threads
ggml_mul_mat: 64 x 64: F16 5.0 GFLOPS (128 runs) / F32 4.7 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 46.1 GFLOPS (128 runs) / F32 38.3 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 294.0 GFLOPS (128 runs) / F32 243.7 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 574.5 GFLOPS (128 runs) / F32 272.9 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 736.6 GFLOPS (128 runs) / F32 750.8 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 973.7 GFLOPS ( 57 runs) / F32 993.7 GFLOPS ( 58 runs)
ggml_mul_mat: 4096 x 4096: F16 1554.5 GFLOPS ( 12 runs) / F32 1553.6 GFLOPS ( 12 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
 | | NEON BLAS | tiny | 8 | 40 | 101 | c23588c |
 | | NEON BLAS | base | 8 | 61 | 223 | c23588c |
 | | NEON BLAS | small | 8 | 154 | 961 | c23588c |
 | | NEON BLAS | medium | 8 | 436 | 2534 | c23588c |
 | | NEON BLAS | large | 8 | 867 | 4100 | c23588c |
Same hardware as in the post before. I've just tried converting the models to Core ML, and here are the results. My personal impression of running STT with them was very good: much faster.
./extra/bench-all.sh 4
Usage: ./bench.sh [n_threads]
Running memcpy benchmark with 1 thread
memcpy: 49.33 GB/s
sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 9.1 GFLOPS (128 runs) / F32 8.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 70.7 GFLOPS (128 runs) / F32 77.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 350.7 GFLOPS (128 runs) / F32 435.9 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1060.0 GFLOPS (128 runs) / F32 1254.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1611.0 GFLOPS (128 runs) / F32 1652.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 1887.2 GFLOPS (110 runs) / F32 1900.9 GFLOPS (111 runs)
ggml_mul_mat: 4096 x 4096: F16 1806.0 GFLOPS ( 14 runs) / F32 1849.3 GFLOPS ( 14 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
 | | NEON BLAS COREML | tiny | 4 | 42 | 30 | c23588c |
 | | NEON BLAS COREML | base | 4 | 60 | 49 | c23588c |
 | | NEON BLAS COREML | small | 4 | 151 | 169 | c23588c |
 | | NEON BLAS COREML | medium | 4 | 430 | 737 | c23588c |
 | | NEON BLAS COREML | large | 4 | 885 | 1672 | c23588c |
Dell 3050 Micro
Running memcpy benchmark with 1 thread
memcpy: 11.49 GB/s sum: error -536870997.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 7.7 GFLOPS (128 runs) / F32 3.3 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 27.7 GFLOPS (128 runs) / F32 7.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 50.8 GFLOPS (128 runs) / F32 8.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 59.4 GFLOPS (128 runs) / F32 9.0 GFLOPS ( 34 runs)
ggml_mul_mat: 1024 x 1024: F16 51.5 GFLOPS ( 24 runs) / F32 8.4 GFLOPS ( 4 runs)
ggml_mul_mat: 2048 x 2048: F16 46.3 GFLOPS ( 3 runs) / F32 8.1 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 47.3 GFLOPS ( 3 runs) / F32 8.1 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
i3-7100t | Ubuntu 22.04 | AVX2 | tiny | 4 | 84 | 1125 | c23588c |
i3-7100t | Ubuntu 22.04 | AVX2 | base | 4 | 128 | 2616 | c23588c |
i3-7100t | Ubuntu 22.04 | AVX2 | small | 4 | 339 | 10127 | c23588c |
i3-7100t | Ubuntu 22.04 | AVX2 | medium | 4 | 991 | 39383 | c23588c |
i3-7100t | Ubuntu 22.04 | AVX2 | large | 4 | 2922 | 74488 | c23588c |
Lenovo ThinkCentre M720q
Running memcpy benchmark with 1 thread
memcpy: 6.54 GB/s sum: error -536870997.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 8.6 GFLOPS (128 runs) / F32 4.5 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 38.8 GFLOPS (128 runs) / F32 7.9 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 76.2 GFLOPS (128 runs) / F32 9.6 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 87.4 GFLOPS (128 runs) / F32 10.0 GFLOPS ( 38 runs)
ggml_mul_mat: 1024 x 1024: F16 89.7 GFLOPS ( 42 runs) / F32 10.1 GFLOPS ( 5 runs)
ggml_mul_mat: 2048 x 2048: F16 67.7 GFLOPS ( 4 runs) / F32 9.1 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 54.7 GFLOPS ( 3 runs) / F32 8.6 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
i5-8500T | OpenVoiceOS | AVX2 | tiny.en | 4 | 79 | 686 | 70567ef |
i5-8500T | OpenVoiceOS | AVX2 | base.en | 4 | 121 | 1600 | 70567ef |
i5-8500T | OpenVoiceOS | AVX2 | small.en | 4 | 320 | 6197 | 70567ef |
i5-8500T | OpenVoiceOS | AVX2 | medium.en | 4 | 928 | 20276 | 70567ef |
Running memcpy benchmark with 1 thread
memcpy: 7.16 GB/s sum: error -536870997.000000
Running ggml_mul_mat benchmark with 6 threads
ggml_mul_mat: 64 x 64: F16 1.9 GFLOPS (128 runs) / F32 1.8 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 29.7 GFLOPS (128 runs) / F32 7.3 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 65.5 GFLOPS (128 runs) / F32 14.5 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 123.4 GFLOPS (128 runs) / F32 15.2 GFLOPS ( 57 runs)
ggml_mul_mat: 1024 x 1024: F16 127.5 GFLOPS ( 60 runs) / F32 14.7 GFLOPS ( 7 runs)
ggml_mul_mat: 2048 x 2048: F16 93.3 GFLOPS ( 6 runs) / F32 13.3 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 70.0 GFLOPS ( 3 runs) / F32 12.5 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
i5-8500T | OpenVoiceOS | AVX2 | tiny.en | 6 | 78 | 511 | 70567ef |
i5-8500T | OpenVoiceOS | AVX2 | base.en | 6 | 118 | 1264 | 70567ef |
i5-8500T | OpenVoiceOS | AVX2 | small.en | 6 | 320 | 4587 | 70567ef |
i5-8500T | OpenVoiceOS | AVX2 | medium.en | 6 | 928 | 16303 | 70567ef |
Yet another M1 Ultra, but look at the bottom: a comparison to the Const-Me GPU version.
memcpy: 42.66 GB/s sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 9.1 GFLOPS (128 runs) / F32 7.1 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 68.2 GFLOPS (128 runs) / F32 68.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 465.0 GFLOPS (128 runs) / F32 386.2 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1131.9 GFLOPS (128 runs) / F32 1437.0 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 2188.9 GFLOPS (128 runs) / F32 2519.6 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 2938.8 GFLOPS (128 runs) / F32 2996.5 GFLOPS (128 runs)
ggml_mul_mat: 4096 x 4096: F16 3074.7 GFLOPS ( 23 runs) / F32 3167.2 GFLOPS ( 24 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
M1 Ultra | Ventura 13.3.1 | NEON BLAS | large | 4 | 858 | 3649 | 70567ef |
Much more interesting, I find, is the comparison I did against a Win10 Core i9 9900K with an Nvidia A4000, using the Const-Me version. I used a 10-minute portion of a "real" TV show (-l de, about 56k tokens known in the model). Note that the power consumption was actually measured, not just guessed.
Const-Me Whisper GPU (~450-550W real power consumption at 100% GPU utilisation; the CPU is mostly bored):
A4000 1x parallel 93s
A4000 2x parallel both finish at 180s
A4000 4x parallel 3 finish after 317s, 1 finishes at 453s
macOS, M1 Ultra (70-90W real power consumption at 100% "CPU" utilisation), whisper.cpp with default settings (1 core, 4 threads):
macOS 1x: 155 s
macOS 2x parallel: 196 s - all finish at the same time
macOS 4x parallel: 274 s - all finish at the same time
macOS 6x parallel: 462 s - all finish at the same time
Also some other tests with different command-line params, on the M1 only, with 1 file (a sketch for reproducing the parallel runs follows below):
-p 8 (threads default 4) - system unresponsive while processing: 120.3 s
-p 4 (default threads 4, ~80% CPU utilisation): 79.37545 s
-bs 2 -p 4: 101.01730 s
-t 16 (processors default 1): 148.713 s
-p 8 -t 2: 98.91152 s
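A minimal sketch for how the N-x parallel runs above can be launched and timed (model path, sample.wav and the output names are placeholders):

```bash
# launch N independent whisper.cpp processes on the same file and
# time until the last one finishes; file names are placeholders
N=4
time (
  for i in $(seq "$N"); do
    ./main -m models/ggml-large.bin -f sample.wav -l de > "out-$i.txt" &
  done
  wait
)
```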
We currently use the Const-Me GPU version on an Nvidia A5000 because, on an Intel CPU, it delivers much faster results than this cpp version can. On the other hand, the Const-Me version does not seem to be going anywhere, while this repository is vibrant.
As a conclusion I can say that, even if I hate it, we are buying this Mac because it delivers faster results and more throughput, all while consuming only 20% of the power. Also, it has much better processing-power distribution between multiple parallel processes; I bet I can even use nice to set priorities, while on the GPU no priorities whatsoever are possible.
At our usage level that means we will have saved the full cost of the Mac (~4000 euros) after 2-3 years of operation (due to lower power and A/C costs) compared to running it on the Windows/GPU box, which we bought at about the same initial price. Even if I could now safely say we don't need an A5000 but just some gamer card for 600 euros, looking at the power costs these days I'd still prefer the Mac. (Thank god I don't need to put it into Active Directory or such, so I have an easy time just using it as a slave processing machine.)
It would be great if idle/peak watts could be posted, as I have been posting benches for RK3588 devices that probably give the minimum usable results, and even then a tad slow. In that price range, I just posted an i3-7100T that was picked up for £64 off eBay, which is approx 8 watts idle / 30 peak. I used to be a bit of an Apple hater in terms of bling tech, but bang for buck the M1 Mini is surprisingly good value, and in that race-till-idle mode it likely could process quite a number of zones, especially because of diversification of use.
I am on disability, so even though it's cheap, the £849.00 for the 16GB version could probably be the basis of the ultimate home assistant, in something similar to https://github.com/ggerganov/whisper.cpp/blob/master/examples/talk-llama/talk-llama.cpp. So likely I will continue posting in the £64 range :)
But what Apple/Arm provide per watt currently is pretty special, and for 24/365 operation in an energy-expensive world that is pretty important. I don't know how many people could also post idle & peak wattages, but it would be really interesting, especially CPU vs GPU rather than just outright speed.
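For x86 boxes, one low-effort way to capture idle/peak watts alongside a bench run is RAPL sampling, e.g. with powerstat (a sketch; assumes powerstat is installed and the CPU exposes RAPL, which varies by vendor and kernel):

```bash
# sample CPU package power once per second for 60 samples via RAPL
# (run the whisper.cpp bench in another terminal at the same time)
sudo powerstat -R 1 60
```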
Rock 5b
memcpy: 8.78 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 7.2 GFLOPS (128 runs) | Q4_1 7.6 GFLOPS (128 runs) | Q4_2 6.9 GFLOPS (128 runs)
64 x 64: Q5_0 6.8 GFLOPS (128 runs) | Q5_1 7.0 GFLOPS (128 runs) | Q8_0 7.1 GFLOPS (128 runs)
64 x 64: F16 8.6 GFLOPS (128 runs) | F32 7.5 GFLOPS (128 runs)
128 x 128: Q4_0 22.8 GFLOPS (128 runs) | Q4_1 22.4 GFLOPS (128 runs) | Q4_2 19.6 GFLOPS (128 runs)
128 x 128: Q5_0 19.5 GFLOPS (128 runs) | Q5_1 20.7 GFLOPS (128 runs) | Q8_0 22.7 GFLOPS (128 runs)
128 x 128: F16 28.3 GFLOPS (128 runs) | F32 29.4 GFLOPS (128 runs)
256 x 256: Q4_0 40.6 GFLOPS (128 runs) | Q4_1 37.6 GFLOPS (128 runs) | Q4_2 30.5 GFLOPS (128 runs)
256 x 256: Q5_0 31.2 GFLOPS (128 runs) | Q5_1 31.9 GFLOPS (128 runs) | Q8_0 49.1 GFLOPS (128 runs)
256 x 256: F16 51.8 GFLOPS (128 runs) | F32 36.9 GFLOPS (128 runs)
512 x 512: Q4_0 52.0 GFLOPS (128 runs) | Q4_1 45.4 GFLOPS (128 runs) | Q4_2 35.7 GFLOPS (128 runs)
512 x 512: Q5_0 37.4 GFLOPS (128 runs) | Q5_1 36.9 GFLOPS (128 runs) | Q8_0 64.9 GFLOPS (128 runs)
512 x 512: F16 76.9 GFLOPS (128 runs) | F32 30.7 GFLOPS (115 runs)
1024 x 1024: Q4_0 56.6 GFLOPS ( 27 runs) | Q4_1 47.5 GFLOPS ( 23 runs) | Q4_2 37.5 GFLOPS ( 18 runs)
1024 x 1024: Q5_0 39.5 GFLOPS ( 19 runs) | Q5_1 37.7 GFLOPS ( 18 runs) | Q8_0 71.1 GFLOPS ( 34 runs)
1024 x 1024: F16 49.0 GFLOPS ( 23 runs) | F32 22.4 GFLOPS ( 11 runs)
2048 x 2048: Q4_0 54.2 GFLOPS ( 4 runs) | Q4_1 44.6 GFLOPS ( 3 runs) | Q4_2 38.5 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 37.4 GFLOPS ( 3 runs) | Q5_1 35.5 GFLOPS ( 3 runs) | Q8_0 61.0 GFLOPS ( 4 runs)
2048 x 2048: F16 41.3 GFLOPS ( 3 runs) | F32 19.0 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 56.2 GFLOPS ( 3 runs) | Q4_1 45.4 GFLOPS ( 3 runs) | Q4_2 38.7 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 40.7 GFLOPS ( 3 runs) | Q5_1 37.3 GFLOPS ( 3 runs) | Q8_0 63.2 GFLOPS ( 3 runs)
4096 x 4096: F16 40.0 GFLOPS ( 3 runs) | F32 17.5 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| rk3588 | Ubuntu 20.04.6 LTS | NEON | tiny | 4 | 102 | 1191 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS | NEON | base | 4 | 140 | 2861 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS | NEON | small | 4 | 393 | 10576 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS | NEON | medium | 4 | 10289 | 36042 | be5911a |
| rk3588 | Ubuntu 20.04.6 LTS | NEON | large | 4 | 2099 | 70740 | be5911a |
How do you get these numbers @StuartIanNaylor ? 😲 Isn't the Rock 5b basically the same as the Orange Pi 5?
Orange Pi 5 8GB:
Running memcpy benchmark
memcpy: 10.14 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 4.7 GFLOPS (128 runs) | Q4_1 4.8 GFLOPS (128 runs) | Q4_2 4.6 GFLOPS (128 runs)
64 x 64: Q5_0 4.2 GFLOPS (128 runs) | Q5_1 4.4 GFLOPS (128 runs) | Q8_0 4.4 GFLOPS (128 runs)
64 x 64: F16 4.8 GFLOPS (128 runs) | F32 4.4 GFLOPS (128 runs)
128 x 128: Q4_0 4.2 GFLOPS (128 runs) | Q4_1 9.8 GFLOPS (128 runs) | Q4_2 10.0 GFLOPS (128 runs)
128 x 128: Q5_0 8.4 GFLOPS (128 runs) | Q5_1 8.2 GFLOPS (128 runs) | Q8_0 10.3 GFLOPS (128 runs)
128 x 128: F16 10.3 GFLOPS (128 runs) | F32 10.7 GFLOPS (128 runs)
256 x 256: Q4_0 34.7 GFLOPS (128 runs) | Q4_1 34.9 GFLOPS (128 runs) | Q4_2 33.9 GFLOPS (128 runs)
256 x 256: Q5_0 26.2 GFLOPS (128 runs) | Q5_1 24.9 GFLOPS (128 runs) | Q8_0 36.1 GFLOPS (128 runs)
256 x 256: F16 36.4 GFLOPS (128 runs) | F32 38.4 GFLOPS (128 runs)
512 x 512: Q4_0 22.2 GFLOPS ( 83 runs) | Q4_1 26.1 GFLOPS ( 98 runs) | Q4_2 35.5 GFLOPS (128 runs)
512 x 512: Q5_0 42.4 GFLOPS (128 runs) | Q5_1 26.8 GFLOPS (100 runs) | Q8_0 35.8 GFLOPS (128 runs)
512 x 512: F16 21.6 GFLOPS ( 81 runs) | F32 31.5 GFLOPS (118 runs)
1024 x 1024: Q4_0 32.4 GFLOPS ( 16 runs) | Q4_1 44.1 GFLOPS ( 21 runs) | Q4_2 39.7 GFLOPS ( 19 runs)
1024 x 1024: Q5_0 42.3 GFLOPS ( 20 runs) | Q5_1 40.4 GFLOPS ( 20 runs) | Q8_0 41.2 GFLOPS ( 20 runs)
1024 x 1024: F16 46.8 GFLOPS ( 22 runs) | F32 42.1 GFLOPS ( 20 runs)
2048 x 2048: Q4_0 50.9 GFLOPS ( 4 runs) | Q4_1 48.6 GFLOPS ( 3 runs) | Q4_2 48.0 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 46.7 GFLOPS ( 3 runs) | Q5_1 47.8 GFLOPS ( 3 runs) | Q8_0 46.4 GFLOPS ( 3 runs)
2048 x 2048: F16 46.1 GFLOPS ( 3 runs) | F32 44.8 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 42.2 GFLOPS ( 3 runs) | Q4_1 36.7 GFLOPS ( 3 runs) | Q4_2 33.0 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 38.5 GFLOPS ( 3 runs) | Q5_1 44.7 GFLOPS ( 3 runs) | Q8_0 44.7 GFLOPS ( 3 runs)
4096 x 4096: F16 44.4 GFLOPS ( 3 runs) | F32 44.5 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
RK3588S | Armbian 11 - 5.10.110 | NEON BLAS | tiny | 4 | 193 | 3748 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON BLAS | tiny-q5_0 | 4 | 156 | 3341 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON BLAS | base | 4 | 253 | 7359 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON BLAS | base-q5_0 | 4 | 178 | 7307 | be5911a |
[EDIT: a bit better without OpenBLAS although the GFLOPS are considerably lower O_o]
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
RK3588S | Armbian 11 - 5.10.110 | NEON | tiny | 4 | 111 | 3170 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON | tiny-q5_0 | 4 | 205 | 2817 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON | base | 4 | 248 | 6385 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON | base-q5_0 | 4 | 140 | 6198 | be5911a |
[EDIT2: getting very unstable results right now 🤔 ]
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
RK3588S | Armbian 11 - 5.10.110 | NEON | tiny | 4 | 269 | 1722 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON | tiny-q5_0 | 4 | 104 | 2746 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON | base | 4 | 243 | 7063 | be5911a |
RK3588S | Armbian 11 - 5.10.110 | NEON | base-q5_0 | 4 | 135 | 6516 | be5911a |
Likely it's that I don't use Armbian but the server image supplied by Radxa, and also the OPi version. Generally I stay clear of Armbian due to a pet hate of their epic init script that replaces standard installs and /etc and often blindsides me.
I picked up some tricks and tips when Radxa did the community board bring-up. I have changed my preference for the scheduler and set it to performance, and also, I dunno why, but using taskset to make sure it just uses the big cores gives a slight perf boost.
So running again I get
memcpy: 8.56 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 7.3 GFLOPS (128 runs) | Q4_1 7.8 GFLOPS (128 runs) | Q4_2 6.9 GFLOPS (128 runs)
64 x 64: Q5_0 6.2 GFLOPS (128 runs) | Q5_1 6.7 GFLOPS (128 runs) | Q8_0 7.0 GFLOPS (128 runs)
64 x 64: F16 2.4 GFLOPS (128 runs) | F32 8.5 GFLOPS (128 runs)
128 x 128: Q4_0 23.2 GFLOPS (128 runs) | Q4_1 24.1 GFLOPS (128 runs) | Q4_2 19.9 GFLOPS (128 runs)
128 x 128: Q5_0 15.4 GFLOPS (128 runs) | Q5_1 21.0 GFLOPS (128 runs) | Q8_0 26.6 GFLOPS (128 runs)
128 x 128: F16 35.0 GFLOPS (128 runs) | F32 28.6 GFLOPS (128 runs)
256 x 256: Q4_0 41.2 GFLOPS (128 runs) | Q4_1 38.7 GFLOPS (128 runs) | Q4_2 30.5 GFLOPS (128 runs)
256 x 256: Q5_0 31.2 GFLOPS (128 runs) | Q5_1 31.9 GFLOPS (128 runs) | Q8_0 49.1 GFLOPS (128 runs)
256 x 256: F16 65.0 GFLOPS (128 runs) | F32 43.5 GFLOPS (128 runs)
512 x 512: Q4_0 52.0 GFLOPS (128 runs) | Q4_1 45.4 GFLOPS (128 runs) | Q4_2 35.3 GFLOPS (128 runs)
512 x 512: Q5_0 37.4 GFLOPS (128 runs) | Q5_1 36.8 GFLOPS (128 runs) | Q8_0 64.9 GFLOPS (128 runs)
512 x 512: F16 78.1 GFLOPS (128 runs) | F32 30.6 GFLOPS (114 runs)
1024 x 1024: Q4_0 56.4 GFLOPS ( 27 runs) | Q4_1 47.4 GFLOPS ( 23 runs) | Q4_2 37.5 GFLOPS ( 18 runs)
1024 x 1024: Q5_0 39.5 GFLOPS ( 19 runs) | Q5_1 37.7 GFLOPS ( 18 runs) | Q8_0 70.8 GFLOPS ( 33 runs)
1024 x 1024: F16 47.2 GFLOPS ( 22 runs) | F32 21.8 GFLOPS ( 11 runs)
2048 x 2048: Q4_0 54.4 GFLOPS ( 4 runs) | Q4_1 45.3 GFLOPS ( 3 runs) | Q4_2 38.6 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 37.4 GFLOPS ( 3 runs) | Q5_1 35.6 GFLOPS ( 3 runs) | Q8_0 59.8 GFLOPS ( 4 runs)
2048 x 2048: F16 41.2 GFLOPS ( 3 runs) | F32 20.6 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 56.9 GFLOPS ( 3 runs) | Q4_1 46.6 GFLOPS ( 3 runs) | Q4_2 38.9 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 41.1 GFLOPS ( 3 runs) | Q5_1 37.4 GFLOPS ( 3 runs) | Q8_0 62.9 GFLOPS ( 3 runs)
4096 x 4096: F16 39.8 GFLOPS ( 3 runs) | F32 17.6 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON | tiny | 4 | 96 | 1199 | be5911a |
| <todo> | <todo> | NEON | base | 4 | 137 | 2875 | be5911a |
| <todo> | <todo> | NEON | small | 4 | 343 | 10635 | be5911a |
| <todo> | <todo> | NEON | medium | 4 | 1013 | 35174 | be5911a |
| <todo> | <todo> | NEON | large | 4 | 2019 | 71678 | be5911a |
If I run without previously doing echo performance | tee /sys/bus/cpu/devices/cpu[046]/cpufreq/scaling_governor /sys/class/devfreq/dmc/governor
(the rk3588[x] is a tri-cluster 4-2-2; I don't know about the dmc exactly, but it was something we were using at that time),
and without the taskset -c 4-7 prefix to further enforce not using the efficiency cores:
The ondemand governor seems to load-balance, whilst (at least for whisper.cpp) a race-till-idle setup, more like how Android is configured, does seem to give a perf boost with little loss in efficiency, if any.
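Put together, the pre-bench setup described above looks roughly like this (a sketch; the sysfs paths are for rk3588-family boards and may differ per kernel):

```bash
# force the big-core clusters and the memory controller to performance
echo performance | sudo tee /sys/bus/cpu/devices/cpu[046]/cpufreq/scaling_governor
echo performance | sudo tee /sys/class/devfreq/dmc/governor

# pin the benchmark to the big cores only (4-7 on rk3588)
taskset -c 4-7 ./extra/bench-all.sh 4
```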
Without those settings, bench gives:
memcpy: 7.82 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 3.1 GFLOPS (128 runs) | Q4_1 2.8 GFLOPS (128 runs) | Q4_2 2.4 GFLOPS (128 runs)
64 x 64: Q5_0 2.3 GFLOPS (128 runs) | Q5_1 2.2 GFLOPS (128 runs) | Q8_0 2.7 GFLOPS (128 runs)
64 x 64: F16 3.1 GFLOPS (128 runs) | F32 2.6 GFLOPS (128 runs)
128 x 128: Q4_0 7.1 GFLOPS (128 runs) | Q4_1 7.0 GFLOPS (128 runs) | Q4_2 6.2 GFLOPS (128 runs)
128 x 128: Q5_0 5.4 GFLOPS (128 runs) | Q5_1 5.4 GFLOPS (128 runs) | Q8_0 7.2 GFLOPS (128 runs)
128 x 128: F16 9.3 GFLOPS (128 runs) | F32 5.9 GFLOPS (128 runs)
256 x 256: Q4_0 10.1 GFLOPS (128 runs) | Q4_1 9.5 GFLOPS (128 runs) | Q4_2 8.4 GFLOPS (128 runs)
256 x 256: Q5_0 7.4 GFLOPS (128 runs) | Q5_1 6.9 GFLOPS (128 runs) | Q8_0 10.9 GFLOPS (128 runs)
256 x 256: F16 13.4 GFLOPS (128 runs) | F32 7.9 GFLOPS (128 runs)
512 x 512: Q4_0 10.9 GFLOPS ( 41 runs) | Q4_1 10.4 GFLOPS ( 39 runs) | Q4_2 8.5 GFLOPS ( 32 runs)
512 x 512: Q5_0 8.9 GFLOPS ( 34 runs) | Q5_1 8.2 GFLOPS ( 31 runs) | Q8_0 12.1 GFLOPS ( 46 runs)
512 x 512: F16 14.5 GFLOPS ( 54 runs) | F32 8.7 GFLOPS ( 33 runs)
1024 x 1024: Q4_0 26.9 GFLOPS ( 13 runs) | Q4_1 24.9 GFLOPS ( 12 runs) | Q4_2 21.7 GFLOPS ( 11 runs)
1024 x 1024: Q5_0 23.0 GFLOPS ( 11 runs) | Q5_1 22.0 GFLOPS ( 11 runs) | Q8_0 29.1 GFLOPS ( 14 runs)
1024 x 1024: F16 28.2 GFLOPS ( 14 runs) | F32 17.9 GFLOPS ( 9 runs)
2048 x 2048: Q4_0 50.1 GFLOPS ( 3 runs) | Q4_1 41.3 GFLOPS ( 3 runs) | Q4_2 36.7 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 36.0 GFLOPS ( 3 runs) | Q5_1 33.2 GFLOPS ( 3 runs) | Q8_0 53.7 GFLOPS ( 4 runs)
2048 x 2048: F16 37.5 GFLOPS ( 3 runs) | F32 19.3 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 55.7 GFLOPS ( 3 runs) | Q4_1 43.7 GFLOPS ( 3 runs) | Q4_2 39.4 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 40.5 GFLOPS ( 3 runs) | Q5_1 36.1 GFLOPS ( 3 runs) | Q8_0 65.8 GFLOPS ( 3 runs)
4096 x 4096: F16 36.8 GFLOPS ( 3 runs) | F32 18.5 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON | tiny | 4 | 171 | 1817 | be5911a |
| <todo> | <todo> | NEON | base | 4 | 255 | 3529 | be5911a |
| <todo> | <todo> | NEON | small | 4 | 433 | 11208 | be5911a |
| <todo> | <todo> | NEON | medium | 4 | 1814 | 36829 | be5911a |
| <todo> | <todo> | NEON | large | 4 | 36647 | 71393 | be5911a |
I will tack on the OPi5 next as I think it is a smidge faster. So, again without the governor tweaks:
memcpy: 8.26 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 3.1 GFLOPS (128 runs) | Q4_1 3.3 GFLOPS (128 runs) | Q4_2 3.4 GFLOPS (128 runs)
64 x 64: Q5_0 1.7 GFLOPS (128 runs) | Q5_1 3.1 GFLOPS (128 runs) | Q8_0 2.9 GFLOPS (128 runs)
64 x 64: F16 4.0 GFLOPS (128 runs) | F32 3.5 GFLOPS (128 runs)
128 x 128: Q4_0 7.8 GFLOPS (128 runs) | Q4_1 6.6 GFLOPS (128 runs) | Q4_2 6.7 GFLOPS (128 runs)
128 x 128: Q5_0 5.6 GFLOPS (128 runs) | Q5_1 5.4 GFLOPS (128 runs) | Q8_0 8.7 GFLOPS (128 runs)
128 x 128: F16 10.1 GFLOPS (128 runs) | F32 6.3 GFLOPS (128 runs)
256 x 256: Q4_0 10.5 GFLOPS (128 runs) | Q4_1 9.1 GFLOPS (128 runs) | Q4_2 7.9 GFLOPS (128 runs)
256 x 256: Q5_0 7.0 GFLOPS (128 runs) | Q5_1 6.7 GFLOPS (128 runs) | Q8_0 12.6 GFLOPS (128 runs)
256 x 256: F16 12.6 GFLOPS (128 runs) | F32 7.5 GFLOPS (128 runs)
512 x 512: Q4_0 11.9 GFLOPS ( 45 runs) | Q4_1 10.8 GFLOPS ( 41 runs) | Q4_2 10.0 GFLOPS ( 38 runs)
512 x 512: Q5_0 8.5 GFLOPS ( 32 runs) | Q5_1 7.9 GFLOPS ( 30 runs) | Q8_0 14.5 GFLOPS ( 54 runs)
512 x 512: F16 14.2 GFLOPS ( 53 runs) | F32 8.3 GFLOPS ( 32 runs)
1024 x 1024: Q4_0 30.4 GFLOPS ( 15 runs) | Q4_1 28.9 GFLOPS ( 14 runs) | Q4_2 23.6 GFLOPS ( 11 runs)
1024 x 1024: Q5_0 23.0 GFLOPS ( 11 runs) | Q5_1 23.5 GFLOPS ( 12 runs) | Q8_0 37.4 GFLOPS ( 18 runs)
1024 x 1024: F16 33.9 GFLOPS ( 16 runs) | F32 18.0 GFLOPS ( 9 runs)
2048 x 2048: Q4_0 51.4 GFLOPS ( 4 runs) | Q4_1 42.5 GFLOPS ( 3 runs) | Q4_2 36.5 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 36.0 GFLOPS ( 3 runs) | Q5_1 32.7 GFLOPS ( 3 runs) | Q8_0 59.0 GFLOPS ( 4 runs)
2048 x 2048: F16 39.4 GFLOPS ( 3 runs) | F32 17.5 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 58.8 GFLOPS ( 3 runs) | Q4_1 47.0 GFLOPS ( 3 runs) | Q4_2 39.7 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 40.8 GFLOPS ( 3 runs) | Q5_1 37.3 GFLOPS ( 3 runs) | Q8_0 65.1 GFLOPS ( 3 runs)
4096 x 4096: F16 40.6 GFLOPS ( 3 runs) | F32 18.6 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON | tiny | 4 | 133 | 1235 | be5911a |
| <todo> | <todo> | NEON | base | 4 | 232 | 2941 | be5911a |
| <todo> | <todo> | NEON | small | 4 | 470 | 10870 | be5911a |
| <todo> | <todo> | NEON | medium | 4 | 23195 | 36162 | be5911a |
| <todo> | <todo> | NEON | large | 4 | 46511 | 90187 | be5911a |
Then via sudo orangepi-config
set the performance governor (no dmc)
taskset -c 4-7 ./extra/bench-all.sh
memcpy: 8.22 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 0.7 GFLOPS (128 runs) | Q4_1 1.6 GFLOPS (128 runs) | Q4_2 1.0 GFLOPS (128 runs)
64 x 64: Q5_0 0.6 GFLOPS (128 runs) | Q5_1 0.8 GFLOPS (128 runs) | Q8_0 1.4 GFLOPS (128 runs)
64 x 64: F16 1.9 GFLOPS (128 runs) | F32 0.8 GFLOPS (128 runs)
128 x 128: Q4_0 8.9 GFLOPS (128 runs) | Q4_1 3.8 GFLOPS (128 runs) | Q4_2 3.1 GFLOPS (128 runs)
128 x 128: Q5_0 5.8 GFLOPS (128 runs) | Q5_1 3.8 GFLOPS (128 runs) | Q8_0 7.8 GFLOPS (128 runs)
128 x 128: F16 5.2 GFLOPS (128 runs) | F32 3.6 GFLOPS (128 runs)
256 x 256: Q4_0 13.1 GFLOPS (128 runs) | Q4_1 12.1 GFLOPS (128 runs) | Q4_2 12.1 GFLOPS (128 runs)
256 x 256: Q5_0 12.8 GFLOPS (128 runs) | Q5_1 13.4 GFLOPS (128 runs) | Q8_0 17.9 GFLOPS (128 runs)
256 x 256: F16 17.6 GFLOPS (128 runs) | F32 11.0 GFLOPS (128 runs)
512 x 512: Q4_0 33.3 GFLOPS (125 runs) | Q4_1 34.7 GFLOPS (128 runs) | Q4_2 21.9 GFLOPS ( 82 runs)
512 x 512: Q5_0 21.4 GFLOPS ( 80 runs) | Q5_1 22.4 GFLOPS ( 84 runs) | Q8_0 35.2 GFLOPS (128 runs)
512 x 512: F16 37.1 GFLOPS (128 runs) | F32 23.2 GFLOPS ( 87 runs)
1024 x 1024: Q4_0 54.9 GFLOPS ( 26 runs) | Q4_1 44.3 GFLOPS ( 21 runs) | Q4_2 31.4 GFLOPS ( 15 runs)
1024 x 1024: Q5_0 35.7 GFLOPS ( 17 runs) | Q5_1 32.1 GFLOPS ( 15 runs) | Q8_0 66.5 GFLOPS ( 31 runs)
1024 x 1024: F16 45.0 GFLOPS ( 21 runs) | F32 19.6 GFLOPS ( 10 runs)
2048 x 2048: Q4_0 54.6 GFLOPS ( 4 runs) | Q4_1 45.2 GFLOPS ( 3 runs) | Q4_2 38.4 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 37.9 GFLOPS ( 3 runs) | Q5_1 34.7 GFLOPS ( 3 runs) | Q8_0 59.9 GFLOPS ( 4 runs)
2048 x 2048: F16 40.5 GFLOPS ( 3 runs) | F32 20.0 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 59.5 GFLOPS ( 3 runs) | Q4_1 47.7 GFLOPS ( 3 runs) | Q4_2 40.1 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 42.7 GFLOPS ( 3 runs) | Q5_1 39.6 GFLOPS ( 3 runs) | Q8_0 60.7 GFLOPS ( 3 runs)
4096 x 4096: F16 35.5 GFLOPS ( 3 runs) | F32 20.8 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON | tiny | 4 | 119 | 1178 | be5911a |
| <todo> | <todo> | NEON | base | 4 | 168 | 2910 | be5911a |
| <todo> | <todo> | NEON | small | 4 | 399 | 10784 | be5911a |
| <todo> | <todo> | NEON | medium | 4 | 23469 | 35952 | be5911a |
| <todo> | <todo> | NEON | large | 4 | 47147 | 76405 | be5911a |
I ran that again, as I think transformers do bounce around a bit to end up with the same tokens.
memcpy: 9.46 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 7.1 GFLOPS (128 runs) | Q4_1 7.6 GFLOPS (128 runs) | Q4_2 6.6 GFLOPS (128 runs)
64 x 64: Q5_0 6.3 GFLOPS (128 runs) | Q5_1 6.9 GFLOPS (128 runs) | Q8_0 6.6 GFLOPS (128 runs)
64 x 64: F16 7.8 GFLOPS (128 runs) | F32 7.3 GFLOPS (128 runs)
128 x 128: Q4_0 23.8 GFLOPS (128 runs) | Q4_1 25.0 GFLOPS (128 runs) | Q4_2 8.5 GFLOPS (128 runs)
128 x 128: Q5_0 19.1 GFLOPS (128 runs) | Q5_1 20.8 GFLOPS (128 runs) | Q8_0 26.4 GFLOPS (128 runs)
128 x 128: F16 34.8 GFLOPS (128 runs) | F32 28.6 GFLOPS (128 runs)
256 x 256: Q4_0 43.4 GFLOPS (128 runs) | Q4_1 42.0 GFLOPS (128 runs) | Q4_2 31.3 GFLOPS (128 runs)
256 x 256: Q5_0 30.5 GFLOPS (128 runs) | Q5_1 32.0 GFLOPS (128 runs) | Q8_0 41.7 GFLOPS (128 runs)
256 x 256: F16 60.0 GFLOPS (128 runs) | F32 42.9 GFLOPS (128 runs)
512 x 512: Q4_0 56.5 GFLOPS (128 runs) | Q4_1 49.5 GFLOPS (128 runs) | Q4_2 36.6 GFLOPS (128 runs)
512 x 512: Q5_0 36.7 GFLOPS (128 runs) | Q5_1 36.8 GFLOPS (128 runs) | Q8_0 69.9 GFLOPS (128 runs)
512 x 512: F16 78.5 GFLOPS (128 runs) | F32 30.1 GFLOPS (113 runs)
1024 x 1024: Q4_0 62.7 GFLOPS ( 30 runs) | Q4_1 52.2 GFLOPS ( 25 runs) | Q4_2 38.9 GFLOPS ( 19 runs)
1024 x 1024: Q5_0 39.2 GFLOPS ( 19 runs) | Q5_1 38.2 GFLOPS ( 18 runs) | Q8_0 76.2 GFLOPS ( 36 runs)
1024 x 1024: F16 46.7 GFLOPS ( 22 runs) | F32 21.6 GFLOPS ( 11 runs)
2048 x 2048: Q4_0 60.4 GFLOPS ( 4 runs) | Q4_1 50.3 GFLOPS ( 3 runs) | Q4_2 39.6 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 37.9 GFLOPS ( 3 runs) | Q5_1 35.4 GFLOPS ( 3 runs) | Q8_0 66.5 GFLOPS ( 4 runs)
2048 x 2048: F16 33.8 GFLOPS ( 3 runs) | F32 15.0 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 64.2 GFLOPS ( 3 runs) | Q4_1 51.2 GFLOPS ( 3 runs) | Q4_2 40.2 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 40.7 GFLOPS ( 3 runs) | Q5_1 37.2 GFLOPS ( 3 runs) | Q8_0 71.5 GFLOPS ( 3 runs)
4096 x 4096: F16 38.5 GFLOPS ( 3 runs) | F32 20.3 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON | tiny | 4 | 103 | 1166 | be5911a |
| <todo> | <todo> | NEON | base | 4 | 152 | 2888 | be5911a |
| <todo> | <todo> | NEON | small | 4 | 379 | 10892 | be5911a |
| <todo> | <todo> | NEON | medium | 4 | 22649 | 35767 | be5911a |
| <todo> | <todo> | NEON | large | 4 | 45427 | 73967 | be5911a |
But I don't seem to get that much variance; race-till-idle is just a preference.
> Prefix with taskset -c 4-7 to further enforce not using the efficiency cores.
Tried that, played with the CPU settings (performance mode etc.), even added some better cooling, but it still keeps jumping all over the place, with the tiny model at ~2 s (in the good runs) while htop shows consistent 100% load on the performance cores. Q5 models are sometimes a few ms faster, sometimes slower. When I do the same tests with the CTranslate2 Whisper version, the results are pretty stable and always about twice as fast.
Dunno; just to show, the next run is very consistent and considerably faster...?
memcpy: 10.52 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 2.5 GFLOPS (128 runs) | Q4_1 2.5 GFLOPS (128 runs) | Q4_2 1.3 GFLOPS (128 runs)
64 x 64: Q5_0 1.0 GFLOPS (128 runs) | Q5_1 0.6 GFLOPS (128 runs) | Q8_0 0.8 GFLOPS (128 runs)
64 x 64: F16 1.0 GFLOPS (128 runs) | F32 1.8 GFLOPS (128 runs)
128 x 128: Q4_0 2.8 GFLOPS (128 runs) | Q4_1 2.2 GFLOPS (128 runs) | Q4_2 6.7 GFLOPS (128 runs)
128 x 128: Q5_0 3.2 GFLOPS (128 runs) | Q5_1 5.5 GFLOPS (128 runs) | Q8_0 3.0 GFLOPS (128 runs)
128 x 128: F16 11.2 GFLOPS (128 runs) | F32 8.5 GFLOPS (128 runs)
256 x 256: Q4_0 13.5 GFLOPS (128 runs) | Q4_1 8.8 GFLOPS (128 runs) | Q4_2 9.9 GFLOPS (128 runs)
256 x 256: Q5_0 10.7 GFLOPS (128 runs) | Q5_1 6.7 GFLOPS (128 runs) | Q8_0 7.3 GFLOPS (128 runs)
256 x 256: F16 18.3 GFLOPS (128 runs) | F32 10.1 GFLOPS (128 runs)
512 x 512: Q4_0 36.4 GFLOPS (128 runs) | Q4_1 31.2 GFLOPS (117 runs) | Q4_2 19.0 GFLOPS ( 71 runs)
512 x 512: Q5_0 18.5 GFLOPS ( 69 runs) | Q5_1 20.4 GFLOPS ( 77 runs) | Q8_0 30.7 GFLOPS (115 runs)
512 x 512: F16 33.8 GFLOPS (126 runs) | F32 20.7 GFLOPS ( 79 runs)
1024 x 1024: Q4_0 40.0 GFLOPS ( 19 runs) | Q4_1 36.4 GFLOPS ( 18 runs) | Q4_2 29.6 GFLOPS ( 14 runs)
1024 x 1024: Q5_0 32.9 GFLOPS ( 16 runs) | Q5_1 30.6 GFLOPS ( 15 runs) | Q8_0 54.2 GFLOPS ( 26 runs)
1024 x 1024: F16 44.1 GFLOPS ( 21 runs) | F32 20.0 GFLOPS ( 10 runs)
2048 x 2048: Q4_0 57.7 GFLOPS ( 4 runs) | Q4_1 47.7 GFLOPS ( 3 runs) | Q4_2 38.7 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 37.8 GFLOPS ( 3 runs) | Q5_1 35.1 GFLOPS ( 3 runs) | Q8_0 63.6 GFLOPS ( 4 runs)
2048 x 2048: F16 33.6 GFLOPS ( 3 runs) | F32 14.8 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 61.9 GFLOPS ( 3 runs) | Q4_1 50.2 GFLOPS ( 3 runs) | Q4_2 38.8 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 40.6 GFLOPS ( 3 runs) | Q5_1 37.9 GFLOPS ( 3 runs) | Q8_0 70.4 GFLOPS ( 3 runs)
4096 x 4096: F16 38.0 GFLOPS ( 3 runs) | F32 20.8 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON | tiny | 4 | 134 | 1176 | be5911a |
| <todo> | <todo> | NEON | base | 4 | 179 | 2964 | be5911a |
| <todo> | <todo> | NEON | small | 4 | 416 | 11037 | be5911a |
| <todo> | <todo> | NEON | medium | 4 | 23462 | 36469 | be5911a |
| <todo> | <todo> | NEON | large | 4 | 47286 | 77494 | be5911a |
System76 Pangolin (pang12) w/ Ryzen 7 6800U (8c16t) @ 2.7GHz + 32GB DDR5 at 6400MT/s
Models stored on a Samsung 970 Evo Plus
Running memcpy benchmark with 1 thread
memcpy: 11.18 GB/s
sum: error -536870997.000000
Running ggml_mul_mat benchmark with 16 threads
ggml_mul_mat: 64 x 64: Q4_0 0.9 GFLOPS (128 runs) / Q4_1 0.4 GFLOPS (128 runs) / F16 1.2 GFLOPS (128 runs) / F32 1.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: Q4_0 6.1 GFLOPS (128 runs) / Q4_1 7.5 GFLOPS (128 runs) / F16 4.6 GFLOPS (128 runs) / F32 10.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: Q4_0 26.2 GFLOPS (128 runs) / Q4_1 42.3 GFLOPS (128 runs) / F16 19.9 GFLOPS (128 runs) / F32 47.9 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: Q4_0 66.6 GFLOPS (128 runs) / Q4_1 98.6 GFLOPS (128 runs) / F16 90.1 GFLOPS (128 runs) / F32 110.4 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: Q4_0 97.8 GFLOPS ( 46 runs) / Q4_1 154.3 GFLOPS ( 72 runs) / F16 158.7 GFLOPS ( 74 runs) / F32 132.2 GFLOPS ( 62 runs)
ggml_mul_mat: 2048 x 2048: Q4_0 126.7 GFLOPS ( 8 runs) / Q4_1 164.8 GFLOPS ( 10 runs) / F16 164.1 GFLOPS ( 10 runs) / F32 96.4 GFLOPS ( 6 runs)
ggml_mul_mat: 4096 x 4096: Q4_0 138.6 GFLOPS ( 3 runs) / Q4_1 166.9 GFLOPS ( 3 runs) / F16 136.0 GFLOPS ( 3 runs) / F32 62.8 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ryzen 7 6800U | Arch Linux | AVX2 | tiny | 16 | 37 | 510 | 9c61f5f |
Ryzen 7 6800U | Arch Linux | AVX2 | base | 16 | 51 | 1222 | 9c61f5f |
Ryzen 7 6800U | Arch Linux | AVX2 | small | 16 | 123 | 4283 | 9c61f5f |
Ryzen 7 6800U | Arch Linux | AVX2 | medium | 16 | 341 | 14178 | 9c61f5f |
Ryzen 7 6800U | Arch Linux | AVX2 | large | 16 | 650 | 25801 | 9c61f5f |
It is interesting to note that, when converted to a CoreML model and executed, even a MacBook Air M2 has a processing speed close to that of a high-spec Mac, perhaps because the Neural Engine specifications are the same within the same generation of Apple Silicon.
./extra/bench-all.sh 4
Usage: ./bench.sh [n_threads]
Running memcpy benchmark with 1 thread
memcpy: 34.33 GB/s sum: ok -536870910.000000
Running ggml_mul_mat benchmark with 4 threads
ggml_mul_mat: 64 x 64: F16 11.4 GFLOPS (128 runs) / F32 10.5 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 89.0 GFLOPS (128 runs) / F32 74.8 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 422.6 GFLOPS (128 runs) / F32 419.4 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 793.4 GFLOPS (128 runs) / F32 801.8 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 827.0 GFLOPS (128 runs) / F32 849.3 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 821.8 GFLOPS ( 48 runs) / F32 773.4 GFLOPS ( 46 runs)
ggml_mul_mat: 4096 x 4096: F16 765.2 GFLOPS ( 6 runs) / F32 743.6 GFLOPS ( 6 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
NEON BLAS COREML | tiny | 4 | c23588c | ||||
NEON BLAS COREML | base | 4 | c23588c | ||||
M2 | 13.3.1 (a)(22E772610a) | NEON BLAS COREML | small | 4 | 153 | 199 | c23588c |
M2 | 13.3.1 (a)(22E772610a) | NEON BLAS COREML | medium | 4 | 450 | 746 | c23588c |
M2 | 13.3.1 (a)(22E772610a) | NEON BLAS COREML | large | 4 | 1053 | 1439 | c23588c |
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Raspberry Pi 4 2GB | Bullseye 6.1.21-v8+ | OPENBLAS | tiny.en | 4 | 393 | 7882 | 14bee39 |
Raspberry Pi 4 2GB | Bullseye 6.1.21-v8+ | OPENBLAS | tiny.en-q5 | 4 | 265 | 8564 | 14bee39 |
Raspberry Pi 4 2GB | Bullseye 6.1.21-v8+ | OPENBLAS | base.en | 4 | 571 | 16328 | 14bee39 |
Raspberry Pi 4 2GB | Bullseye 6.1.21-v8+ | OPENBLAS | base.en-q5 | 4 | 306 | 16169 | 14bee39 |
Tests performed using the Raspberry Pi OS libopenblas-dev package (version 0.3.13+ds-3).
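For reference, an OpenBLAS-enabled build on Raspberry Pi OS is roughly as follows (a sketch; the WHISPER_OPENBLAS flag name is as of the Makefile of the time):

```bash
# install the distro OpenBLAS package and rebuild with BLAS enabled
sudo apt install libopenblas-dev
make clean
WHISPER_OPENBLAS=1 make -j
```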
Ryzen 3 2200GE (Lenovo M715q)
Running memcpy benchmark
memcpy: 12.14 GB/s (1 thread)
sum: -536869898.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 5.3 GFLOPS (128 runs) | Q4_1 1.6 GFLOPS (128 runs) | Q4_2 5.2 GFLOPS (128 runs)
64 x 64: Q5_0 5.5 GFLOPS (128 runs) | Q5_1 1.7 GFLOPS (128 runs) | Q8_0 1.7 GFLOPS (128 runs)
64 x 64: F16 1.1 GFLOPS (128 runs) | F32 2.0 GFLOPS (128 runs)
128 x 128: Q4_0 9.9 GFLOPS (128 runs) | Q4_1 10.8 GFLOPS (128 runs) | Q4_2 9.8 GFLOPS (128 runs)
128 x 128: Q5_0 16.7 GFLOPS (128 runs) | Q5_1 19.0 GFLOPS (128 runs) | Q8_0 20.6 GFLOPS (128 runs)
128 x 128: F16 9.4 GFLOPS (128 runs) | F32 29.8 GFLOPS (128 runs)
256 x 256: Q4_0 26.1 GFLOPS (128 runs) | Q4_1 29.4 GFLOPS (128 runs) | Q4_2 31.2 GFLOPS (128 runs)
256 x 256: Q5_0 28.4 GFLOPS (128 runs) | Q5_1 31.0 GFLOPS (128 runs) | Q8_0 32.5 GFLOPS (128 runs)
256 x 256: F16 21.5 GFLOPS (128 runs) | F32 41.6 GFLOPS (128 runs)
512 x 512: Q4_0 41.4 GFLOPS (128 runs) | Q4_1 42.7 GFLOPS (128 runs) | Q4_2 43.2 GFLOPS (128 runs)
512 x 512: Q5_0 39.2 GFLOPS (128 runs) | Q5_1 37.2 GFLOPS (128 runs) | Q8_0 56.7 GFLOPS (128 runs)
512 x 512: F16 29.3 GFLOPS (110 runs) | F32 56.0 GFLOPS (128 runs)
1024 x 1024: Q4_0 52.5 GFLOPS ( 25 runs) | Q4_1 51.6 GFLOPS ( 25 runs) | Q4_2 48.3 GFLOPS ( 23 runs)
1024 x 1024: Q5_0 44.1 GFLOPS ( 21 runs) | Q5_1 41.9 GFLOPS ( 20 runs) | Q8_0 71.4 GFLOPS ( 34 runs)
1024 x 1024: F16 30.4 GFLOPS ( 15 runs) | F32 35.5 GFLOPS ( 17 runs)
2048 x 2048: Q4_0 54.6 GFLOPS ( 4 runs) | Q4_1 50.6 GFLOPS ( 3 runs) | Q4_2 49.8 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 44.8 GFLOPS ( 3 runs) | Q5_1 40.8 GFLOPS ( 3 runs) | Q8_0 67.1 GFLOPS ( 4 runs)
2048 x 2048: F16 29.1 GFLOPS ( 3 runs) | F32 20.0 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 54.3 GFLOPS ( 3 runs) | Q4_1 50.0 GFLOPS ( 3 runs) | Q4_2 49.5 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 44.7 GFLOPS ( 3 runs) | Q5_1 40.2 GFLOPS ( 3 runs) | Q8_0 64.0 GFLOPS ( 3 runs)
4096 x 4096: F16 28.3 GFLOPS ( 3 runs) | F32 19.7 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| Ryzen 3 2200GE | Ubuntu 22.04.2 | AVX2 | tiny | 4 | 68 | 1676 | 2b6a074 |
| Ryzen 3 2200GE | Ubuntu 22.04.2 | AVX2 | base | 4 | 96 | 3850 | 2b6a074 |
| Ryzen 3 2200GE | Ubuntu 22.04.2 | AVX2 | small | 4 | 235 | 14734 | 2b6a074 |
| Ryzen 3 2200GE | Ubuntu 22.04.2 | AVX2 | medium | 4 | 660 | 49288 | 2b6a074 |
| Ryzen 3 2200GE | Ubuntu 22.04.2 | AVX2 | large | 4 | 1302 | 105757 | 2b6a074 |
This is what I get with clblast on an AMD RX6700XT:
Running memcpy benchmark
memcpy: 11.94 GB/s (1 thread)
sum: -536869898.000000
Running ggml_mul_mat benchmark with 16 threads
Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx1031
64 x 64: Q4_0 0.8 GFLOPS (128 runs) | Q4_1 0.8 GFLOPS (128 runs)
64 x 64: Q5_0 0.8 GFLOPS (128 runs) | Q5_1 0.8 GFLOPS (128 runs) | Q8_0 0.8 GFLOPS (128 runs)
64 x 64: F16 0.8 GFLOPS (128 runs) | F32 0.8 GFLOPS (128 runs)
128 x 128: Q4_0 5.6 GFLOPS (128 runs) | Q4_1 5.6 GFLOPS (128 runs)
128 x 128: Q5_0 6.1 GFLOPS (128 runs) | Q5_1 5.7 GFLOPS (128 runs) | Q8_0 6.1 GFLOPS (128 runs)
128 x 128: F16 5.8 GFLOPS (128 runs) | F32 6.0 GFLOPS (128 runs)
256 x 256: Q4_0 43.4 GFLOPS (128 runs) | Q4_1 40.3 GFLOPS (128 runs)
256 x 256: Q5_0 38.2 GFLOPS (128 runs) | Q5_1 39.2 GFLOPS (128 runs) | Q8_0 39.0 GFLOPS (128 runs)
256 x 256: F16 38.3 GFLOPS (128 runs) | F32 38.6 GFLOPS (128 runs)
512 x 512: Q4_0 210.9 GFLOPS (128 runs) | Q4_1 212.8 GFLOPS (128 runs)
512 x 512: Q5_0 212.0 GFLOPS (128 runs) | Q5_1 213.2 GFLOPS (128 runs) | Q8_0 210.2 GFLOPS (128 runs)
512 x 512: F16 195.5 GFLOPS (128 runs) | F32 208.7 GFLOPS (128 runs)
1024 x 1024: Q4_0 1280.6 GFLOPS (128 runs) | Q4_1 1289.0 GFLOPS (128 runs)
1024 x 1024: Q5_0 1292.2 GFLOPS (128 runs) | Q5_1 1287.4 GFLOPS (128 runs) | Q8_0 1271.0 GFLOPS (128 runs)
1024 x 1024: F16 1025.9 GFLOPS (128 runs) | F32 1227.8 GFLOPS (128 runs)
2048 x 2048: Q4_0 3423.2 GFLOPS (128 runs) | Q4_1 3414.1 GFLOPS (128 runs)
2048 x 2048: Q5_0 3393.6 GFLOPS (128 runs) | Q5_1 3385.8 GFLOPS (128 runs) | Q8_0 3385.2 GFLOPS (128 runs)
2048 x 2048: F16 2434.4 GFLOPS (128 runs) | F32 3045.8 GFLOPS (128 runs)
4096 x 4096: Q4_0 4187.6 GFLOPS ( 31 runs) | Q4_1 4193.6 GFLOPS ( 31 runs)
4096 x 4096: Q5_0 4204.3 GFLOPS ( 31 runs) | Q5_1 4187.1 GFLOPS ( 31 runs) | Q8_0 4135.0 GFLOPS ( 31 runs)
4096 x 4096: F16 3491.1 GFLOPS ( 26 runs) | F32 3911.3 GFLOPS ( 29 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | tiny | 16 | 382 | 603 | 95b02d7 |
Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | base | 16 | 371 | 717 | 95b02d7 |
Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | small | 16 | 427 | 1271 | 95b02d7 |
Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | medium | 16 | 636 | 2784 | 95b02d7 |
Ryzen 5950X / RX6700XT | Arch | AVX2 BLAS | large | 16 | 868 | 4308 | 95b02d7 |
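For reference, a CLBlast-enabled build and run like the above is roughly as follows (a sketch; the GGML_OPENCL_* variables select the Platform=0, Device=0 pair printed in the log):

```bash
# rebuild with CLBlast and select the OpenCL platform/device explicitly
make clean
WHISPER_CLBLAST=1 make -j
GGML_OPENCL_PLATFORM=0 GGML_OPENCL_DEVICE=0 ./extra/bench-all.sh 16
```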
Thinkpad T480, Core i7 8550U
Usage: ./bench.sh [n_threads] [encoder-only]
Running memcpy benchmark
memcpy: 12.67 GB/s (1 thread)
sum: -536869898.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 6.1 GFLOPS (128 runs) | Q4_1 6.4 GFLOPS (128 runs)
64 x 64: Q5_0 6.6 GFLOPS (128 runs) | Q5_1 6.7 GFLOPS (128 runs) | Q8_0 6.3 GFLOPS (128 runs)
64 x 64: F16 7.8 GFLOPS (128 runs) | F32 5.4 GFLOPS (128 runs)
128 x 128: Q4_0 25.3 GFLOPS (128 runs) | Q4_1 25.5 GFLOPS (128 runs)
128 x 128: Q5_0 29.6 GFLOPS (128 runs) | Q5_1 26.9 GFLOPS (128 runs) | Q8_0 31.7 GFLOPS (128 runs)
128 x 128: F16 34.8 GFLOPS (128 runs) | F32 13.8 GFLOPS (128 runs)
256 x 256: Q4_0 49.9 GFLOPS (128 runs) | Q4_1 43.3 GFLOPS (128 runs)
256 x 256: Q5_0 46.6 GFLOPS (128 runs) | Q5_1 45.4 GFLOPS (128 runs) | Q8_0 64.0 GFLOPS (128 runs)
256 x 256: F16 61.2 GFLOPS (128 runs) | F32 18.7 GFLOPS (128 runs)
512 x 512: Q4_0 66.7 GFLOPS (128 runs) | Q4_1 54.7 GFLOPS (128 runs)
512 x 512: Q5_0 53.5 GFLOPS (128 runs) | Q5_1 57.9 GFLOPS (128 runs) | Q8_0 80.6 GFLOPS (128 runs)
512 x 512: F16 65.5 GFLOPS (128 runs) | F32 22.2 GFLOPS ( 83 runs)
1024 x 1024: Q4_0 77.7 GFLOPS ( 37 runs) | Q4_1 66.9 GFLOPS ( 32 runs)
1024 x 1024: Q5_0 66.3 GFLOPS ( 31 runs) | Q5_1 60.2 GFLOPS ( 29 runs) | Q8_0 91.6 GFLOPS ( 44 runs)
1024 x 1024: F16 63.8 GFLOPS ( 30 runs) | F32 21.2 GFLOPS ( 10 runs)
2048 x 2048: Q4_0 74.3 GFLOPS ( 5 runs) | Q4_1 71.1 GFLOPS ( 5 runs)
2048 x 2048: Q5_0 59.5 GFLOPS ( 4 runs) | Q5_1 56.4 GFLOPS ( 4 runs) | Q8_0 90.2 GFLOPS ( 6 runs)
2048 x 2048: F16 49.9 GFLOPS ( 3 runs) | F32 15.9 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 61.1 GFLOPS ( 3 runs) | Q4_1 54.7 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 48.4 GFLOPS ( 3 runs) | Q5_1 45.1 GFLOPS ( 3 runs) | Q8_0 62.7 GFLOPS ( 3 runs)
4096 x 4096: F16 38.4 GFLOPS ( 3 runs) | F32 12.9 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
I don't know why it stopped when it wanted to run the benchmark for all models. I have ggml-base.en.bin, and I have for-tests-ggml*.bin.
@randomshinichi That is what it does when the non-.en models are not available.
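So the early stop can be avoided by fetching the multilingual models first, e.g. with the repo's download script (a sketch; bench-all.sh loops over the non-.en models):

```bash
# fetch the multilingual models that extra/bench-all.sh expects
for m in tiny base small medium large; do
  ./models/download-ggml-model.sh "$m"
done
```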
Jetson Orin Nano (Developer Kit) - Unoptimised install (no CLBlast, CUBLAS etc)
Running memcpy benchmark
memcpy: 6.28 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 4.1 GFLOPS (128 runs) | Q4_1 4.2 GFLOPS (128 runs)
64 x 64: Q5_0 4.2 GFLOPS (128 runs) | Q5_1 4.1 GFLOPS (128 runs) | Q8_0 4.6 GFLOPS (128 runs)
64 x 64: F16 4.0 GFLOPS (128 runs) | F32 5.2 GFLOPS (128 runs)
128 x 128: Q4_0 12.9 GFLOPS (128 runs) | Q4_1 13.2 GFLOPS (128 runs)
128 x 128: Q5_0 12.7 GFLOPS (128 runs) | Q5_1 12.5 GFLOPS (128 runs) | Q8_0 14.1 GFLOPS (128 runs)
128 x 128: F16 9.3 GFLOPS (128 runs) | F32 20.9 GFLOPS (128 runs)
256 x 256: Q4_0 17.9 GFLOPS (128 runs) | Q4_1 17.5 GFLOPS (128 runs)
256 x 256: Q5_0 17.8 GFLOPS (128 runs) | Q5_1 16.2 GFLOPS (128 runs) | Q8_0 20.3 GFLOPS (128 runs)
256 x 256: F16 10.4 GFLOPS (128 runs) | F32 28.8 GFLOPS (128 runs)
512 x 512: Q4_0 21.1 GFLOPS ( 79 runs) | Q4_1 20.0 GFLOPS ( 75 runs)
512 x 512: Q5_0 18.6 GFLOPS ( 70 runs) | Q5_1 19.1 GFLOPS ( 72 runs) | Q8_0 22.0 GFLOPS ( 83 runs)
512 x 512: F16 10.5 GFLOPS ( 40 runs) | F32 25.7 GFLOPS ( 97 runs)
1024 x 1024: Q4_0 20.6 GFLOPS ( 10 runs) | Q4_1 20.4 GFLOPS ( 10 runs)
1024 x 1024: Q5_0 20.2 GFLOPS ( 10 runs) | Q5_1 18.7 GFLOPS ( 9 runs) | Q8_0 23.2 GFLOPS ( 11 runs)
1024 x 1024: F16 11.4 GFLOPS ( 6 runs) | F32 16.6 GFLOPS ( 8 runs)
2048 x 2048: Q4_0 22.3 GFLOPS ( 3 runs) | Q4_1 22.4 GFLOPS ( 3 runs)
2048 x 2048: Q5_0 22.0 GFLOPS ( 3 runs) | Q5_1 20.9 GFLOPS ( 3 runs) | Q8_0 25.8 GFLOPS ( 3 runs)
2048 x 2048: F16 11.9 GFLOPS ( 3 runs) | F32 11.5 GFLOPS ( 3 runs)
4096 x 4096: Q4_0 22.7 GFLOPS ( 3 runs) | Q4_1 22.6 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 22.2 GFLOPS ( 3 runs) | Q5_1 21.0 GFLOPS ( 3 runs) | Q8_0 26.2 GFLOPS ( 3 runs)
4096 x 4096: F16 12.0 GFLOPS ( 3 runs) | F32 13.1 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | tiny | 4 | 117 | 3631 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | base | 4 | 153 | 8603 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | small | 4 | 323 | 33605 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | medium | 4 | 1059 | 111404 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON | large | 4 | 3187 | 222130 | 5e2b340 |
@mark-beeby Are you sure everything is correct with your distro? Your results are really bad compared to what I was expecting, as I've been looking forward to seeing what an Orin Nano can do.
Check out an rk3588 https://github.com/ggerganov/whisper.cpp/issues/89#issuecomment-1529989153 as that is an A76x4 with DDR4 not DDR5...
Also interested in what you get with cuBLAS https://github.com/ggerganov/whisper.cpp#opencl-gpu-support-via-clblast
Jetson Orin Nano (Developer Kit) - CUBLAS
Usage: ./bench.sh [n_threads] [encoder-only]
Running memcpy benchmark
memcpy: 6.26 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 1.0 GFLOPS (128 runs) | Q4_1 0.9 GFLOPS (128 runs)
64 x 64: Q5_0 0.7 GFLOPS (128 runs) | Q5_1 0.9 GFLOPS (128 runs) | Q8_0 1.0 GFLOPS (128 runs)
64 x 64: F16 1.0 GFLOPS (128 runs) | F32 0.9 GFLOPS (128 runs)
128 x 128: Q4_0 6.8 GFLOPS (128 runs) | Q4_1 7.3 GFLOPS (128 runs)
128 x 128: Q5_0 7.8 GFLOPS (128 runs) | Q5_1 7.8 GFLOPS (128 runs) | Q8_0 7.8 GFLOPS (128 runs)
128 x 128: F16 8.0 GFLOPS (128 runs) | F32 7.7 GFLOPS (128 runs)
256 x 256: Q4_0 57.1 GFLOPS (128 runs) | Q4_1 62.5 GFLOPS (128 runs)
256 x 256: Q5_0 62.3 GFLOPS (128 runs) | Q5_1 62.8 GFLOPS (128 runs) | Q8_0 64.6 GFLOPS (128 runs)
256 x 256: F16 38.7 GFLOPS (128 runs) | F32 38.6 GFLOPS (128 runs)
512 x 512: Q4_0 248.6 GFLOPS (128 runs) | Q4_1 250.9 GFLOPS (128 runs)
512 x 512: Q5_0 250.2 GFLOPS (128 runs) | Q5_1 248.7 GFLOPS (128 runs) | Q8_0 247.8 GFLOPS (128 runs)
512 x 512: F16 215.2 GFLOPS (128 runs) | F32 210.5 GFLOPS (128 runs)
1024 x 1024: Q4_0 884.6 GFLOPS (128 runs) | Q4_1 882.7 GFLOPS (128 runs)
1024 x 1024: Q5_0 879.2 GFLOPS (128 runs) | Q5_1 872.7 GFLOPS (128 runs) | Q8_0 632.0 GFLOPS (128 runs)
1024 x 1024: F16 651.2 GFLOPS (128 runs) | F32 627.2 GFLOPS (128 runs)
2048 x 2048: Q4_0 1349.9 GFLOPS ( 79 runs) | Q4_1 1337.1 GFLOPS ( 78 runs)
2048 x 2048: Q5_0 1332.3 GFLOPS ( 78 runs) | Q5_1 1327.7 GFLOPS ( 78 runs) | Q8_0 1304.8 GFLOPS ( 76 runs)
2048 x 2048: F16 1401.6 GFLOPS ( 82 runs) | F32 1140.0 GFLOPS ( 67 runs)
4096 x 4096: Q4_0 1967.6 GFLOPS ( 15 runs) | Q4_1 1962.9 GFLOPS ( 15 runs)
4096 x 4096: Q5_0 1956.3 GFLOPS ( 15 runs) | Q5_1 1952.7 GFLOPS ( 15 runs) | Q8_0 1929.9 GFLOPS ( 15 runs)
4096 x 4096: F16 2603.2 GFLOPS ( 19 runs) | F32 1742.4 GFLOPS ( 13 runs)
Running benchmark for all models
This can take a while!
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON BLAS | tiny | 4 | 1296 | 544 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON BLAS | base | 4 | 1350 | 1015 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON BLAS | small | 4 | 1557 | 2901 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON BLAS | medium | 4 | 2303 | 7977 | 5e2b340 |
6-core Arm Cortex-A78AE | Ubuntu 20.04 | NEON BLAS | large | 4 | 6716 | 12913 | 5e2b340 |
@StuartIanNaylor I've struggled to get clblast installed, and moved back to a CUDA install, and after a few hiccups and setting export CUDA_VISIBLE_DEVICES=0
I got the much more favourable results above. Hope that helps!
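For reference, the cuBLAS route is roughly as follows (a sketch; assumes the CUDA toolkit is already set up on the Jetson):

```bash
# rebuild with cuBLAS and make sure the Orin's GPU is visible
make clean
WHISPER_CUBLAS=1 make -j
export CUDA_VISIBLE_DEVICES=0
./extra/bench-all.sh 4
```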
New desktop I built - CPU i7-13700K (turbo overclock +200MHz base), DDR5 @ 5600MT/s, GPU Intel Arc A770 LE
I tried differing thread counts before settling on 20. Anything past 20 resulted in a drop in performance, which is to be expected.
Running memcpy benchmark
memcpy: 23.16 GB/s (1 thread)
sum: -536869898.000000
Running ggml_mul_mat benchmark with 20 threads
Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: Intel(R) OpenCL HD Graphics Device: Intel(R) Arc(TM) A770 Graphics
64 x 64: Q4_0 0.9 GFLOPS (128 runs) | Q4_1 1.0 GFLOPS (128 runs)
64 x 64: Q5_0 1.0 GFLOPS (128 runs) | Q5_1 1.0 GFLOPS (128 runs) | Q8_0 1.0 GFLOPS (128 runs)
64 x 64: F16 1.0 GFLOPS (128 runs) | F32 1.0 GFLOPS (128 runs)
128 x 128: Q4_0 5.6 GFLOPS (128 runs) | Q4_1 5.8 GFLOPS (128 runs)
128 x 128: Q5_0 5.7 GFLOPS (128 runs) | Q5_1 5.4 GFLOPS (128 runs) | Q8_0 5.0 GFLOPS (128 runs)
128 x 128: F16 5.6 GFLOPS (128 runs) | F32 5.5 GFLOPS (128 runs)
256 x 256: Q4_0 40.4 GFLOPS (128 runs) | Q4_1 38.9 GFLOPS (128 runs)
256 x 256: Q5_0 40.7 GFLOPS (128 runs) | Q5_1 40.3 GFLOPS (128 runs) | Q8_0 38.5 GFLOPS (128 runs)
256 x 256: F16 40.8 GFLOPS (128 runs) | F32 40.8 GFLOPS (128 runs)
512 x 512: Q4_0 260.5 GFLOPS (128 runs) | Q4_1 264.6 GFLOPS (128 runs)
512 x 512: Q5_0 234.3 GFLOPS (128 runs) | Q5_1 254.8 GFLOPS (128 runs) | Q8_0 260.2 GFLOPS (128 runs)
512 x 512: F16 223.7 GFLOPS (128 runs) | F32 261.0 GFLOPS (128 runs)
1024 x 1024: Q4_0 1158.0 GFLOPS (128 runs) | Q4_1 1158.2 GFLOPS (128 runs)
1024 x 1024: Q5_0 1119.2 GFLOPS (128 runs) | Q5_1 1157.4 GFLOPS (128 runs) | Q8_0 1125.5 GFLOPS (128 runs)
1024 x 1024: F16 871.3 GFLOPS (128 runs) | F32 1029.7 GFLOPS (128 runs)
2048 x 2048: Q4_0 2847.7 GFLOPS (128 runs) | Q4_1 2749.8 GFLOPS (128 runs)
2048 x 2048: Q5_0 2752.3 GFLOPS (128 runs) | Q5_1 2879.4 GFLOPS (128 runs) | Q8_0 2770.3 GFLOPS (128 runs)
2048 x 2048: F16 2061.0 GFLOPS (120 runs) | F32 2504.5 GFLOPS (128 runs)
4096 x 4096: Q4_0 4681.2 GFLOPS ( 35 runs) | Q4_1 4637.2 GFLOPS ( 34 runs)
4096 x 4096: Q5_0 4646.7 GFLOPS ( 34 runs) | Q5_1 4586.6 GFLOPS ( 34 runs) | Q8_0 4589.7 GFLOPS ( 34 runs)
4096 x 4096: F16 3444.7 GFLOPS ( 26 runs) | F32 4128.2 GFLOPS ( 31 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Intel Core i7-13700K | Arch Linux | AVX2 BLAS | tiny | 20 | 145 | 417 | 5e2b340 |
Intel Core i7-13700K | Arch Linux | AVX2 BLAS | base | 20 | 161 | 560 | 5e2b340 |
Intel Core i7-13700K | Arch Linux | AVX2 BLAS | small | 20 | 281 | 1072 | 5e2b340 |
Intel Core i7-13700K | Arch Linux | AVX2 BLAS | medium | 20 | 606 | 2771 | 5e2b340 |
Intel Core i7-13700K | Arch Linux | AVX2 BLAS | large | 20 | 1116 | 4105 | 5e2b340 |
CPU power draw during these last tests averaged 140 watts, peaking at 141. GPU metrics are currently not exposed in Linux for Arc, so I'm unable to check what that was drawing.
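On the CPU side, package watts like these can be logged alongside a run with turbostat (a sketch; requires root and a CPU with RAPL support):

```bash
# print average CPU package power once per second during the bench
sudo turbostat --quiet --Summary --show PkgWatt --interval 1
```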
Collection of bench results for various platforms and devices. If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh and report the results in the comments below.
Suggestions for a better summary of the results are welcome.
[Charts: Encoder, memcpy and ggml_mul_mat results for MacBook M1 Pro and Ryzen 9 5950X]