Closed: nyadla-sys closed this issue 2 years ago.
@ggerganov
Hi, I did try to run the bench with CLBlast:
memcpy: 9.33 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 8 threads
Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
Using Platform: ARM Platform Device: Mali-LODX r0p0
64 x 64: Q4_0 0.5 GFLOPS (128 runs) | Q4_1 0.4 GFLOPS (128 runs) | Q4_2 0.4 GFLOPS (128 runs)
64 x 64: Q5_0 0.5 GFLOPS (128 runs) | Q5_1 0.5 GFLOPS (128 runs) | Q8_0 0.5 GFLOPS (128 runs)
64 x 64: F16 0.5 GFLOPS (128 runs) | F32 0.5 GFLOPS (128 runs)
128 x 128: Q4_0 1.9 GFLOPS (128 runs) | Q4_1 1.6 GFLOPS (128 runs) | Q4_2 1.6 GFLOPS (128 runs)
128 x 128: Q5_0 1.6 GFLOPS (128 runs) | Q5_1 1.8 GFLOPS (128 runs) | Q8_0 1.7 GFLOPS (128 runs)
128 x 128: F16 1.8 GFLOPS (128 runs) | F32 2.1 GFLOPS (128 runs)
256 x 256: Q4_0 11.7 GFLOPS (128 runs) | Q4_1 14.3 GFLOPS (128 runs) | Q4_2 12.8 GFLOPS (128 runs)
256 x 256: Q5_0 11.9 GFLOPS (128 runs) | Q5_1 11.6 GFLOPS (128 runs) | Q8_0 12.9 GFLOPS (128 runs)
256 x 256: F16 12.7 GFLOPS (128 runs) | F32 13.4 GFLOPS (128 runs)
512 x 512: Q4_0 42.5 GFLOPS (128 runs) | Q4_1 42.0 GFLOPS (128 runs) | Q4_2 40.7 GFLOPS (128 runs)
512 x 512: Q5_0 41.0 GFLOPS (128 runs) | Q5_1 41.7 GFLOPS (128 runs) | Q8_0 42.0 GFLOPS (128 runs)
512 x 512: F16 38.3 GFLOPS (128 runs) | F32 39.0 GFLOPS (128 runs)
1024 x 1024: Q4_0 69.4 GFLOPS ( 33 runs) | Q4_1 71.0 GFLOPS ( 34 runs) | Q4_2 70.6 GFLOPS ( 33 runs)
1024 x 1024: Q5_0 70.4 GFLOPS ( 33 runs) | Q5_1 70.2 GFLOPS ( 33 runs) | Q8_0 69.2 GFLOPS ( 33 runs)
1024 x 1024: F16 66.4 GFLOPS ( 31 runs) | F32 70.0 GFLOPS ( 33 runs)
2048 x 2048: Q4_0 81.3 GFLOPS ( 5 runs) | Q4_1 81.4 GFLOPS ( 5 runs) | Q4_2 81.1 GFLOPS ( 5 runs)
2048 x 2048: Q5_0 80.9 GFLOPS ( 5 runs) | Q5_1 81.4 GFLOPS ( 5 runs) | Q8_0 81.4 GFLOPS ( 5 runs)
2048 x 2048: F16 78.9 GFLOPS ( 5 runs) | F32 80.1 GFLOPS ( 5 runs)
4096 x 4096: Q4_0 87.0 GFLOPS ( 3 runs) | Q4_1 86.9 GFLOPS ( 3 runs) | Q4_2 86.9 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 86.4 GFLOPS ( 3 runs) | Q5_1 86.9 GFLOPS ( 3 runs) | Q8_0 86.7 GFLOPS ( 3 runs)
4096 x 4096: F16 85.6 GFLOPS ( 3 runs) | F32 86.0 GFLOPS ( 3 runs)
./extra/bench-all.sh: line 45: 2051349 Segmentation fault ./bench -w 2 -t $n_threads 2>&1
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
./extra/bench-all.sh: line 50: 2082709 Segmentation fault ./bench -m ./models/ggml-$model.bin -t $n_threads 2> /dev/null > /dev/null
./extra/bench-all.sh: line 50: 2082762 Segmentation fault ./bench -m ./models/ggml-$model.bin -t $n_threads 2> /dev/null > /dev/null
./extra/bench-all.sh: line 50: 2082834 Segmentation fault ./bench -m ./models/ggml-$model.bin -t $n_threads 2> /dev/null > /dev/null
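(Side note: the "Attempting to use: Platform=0, Device=0" line comes from the CLBlast backend, which picks its OpenCL target from environment variables; if I recall correctly these were GGML_CLBLAST_PLATFORM / GGML_CLBLAST_DEVICE in the ggml-opencl of that era, with later versions switching to GGML_OPENCL_PLATFORM / GGML_OPENCL_DEVICE. Something along these lines selects an explicit device, as a sketch:)

# Hedged sketch: pin the CLBlast backend to platform 0, device 0 before benching
GGML_CLBLAST_PLATFORM=0 GGML_CLBLAST_DEVICE=0 ./bench -w 2 -t 8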
orangepi@orangepi5:~$ clinfo
Number of platforms 1
Platform Name ARM Platform
Platform Vendor ARM
Platform Version OpenCL 2.1 v1.g6p0-01eac0.efb75e2978d783a80fe78be1bfb0efc1
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
Platform Extensions function suffix ARM
Platform Host timer resolution 1ns
Platform Name ARM Platform
Number of devices 1
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
Device Name Mali-LODX r0p0
Device Vendor ARM
Device Vendor ID 0xa8670000
Device Version OpenCL 2.1 v1.g6p0-01eac0.efb75e2978d783a80fe78be1bfb0efc1
Device UUID 000067a8-0100-0000-0000-000000000000
Driver UUID d9495bef-ea91-7c52-8a43-8a3c2f7b49cc
Valid Device LUID No
Device LUID 0000-000000000000
Device Node Mask 0
Device Numeric Version 0x801000 (2.1.0)
Driver Version 2.1
Device OpenCL C Version OpenCL C 2.0 v1.g6p0-01eac0.efb75e2978d783a80fe78be1bfb0efc1
Device C++ for OpenCL Numeric Version 0x400000 (1.0.0)
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 4
Available core IDs 0, 2, 16, 18
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 1024
Preferred work group size multiple (kernel) 16
Max sub-groups per work group 64
Preferred / native vector sizes
char 16 / 4
short 8 / 2
int 4 / 1
long 2 / 1
half 8 / 2 (cl_khr_fp16)
float 4 / 1
double 0 / 0 (n/a)
Half-precision Floating-point support (cl_khr_fp16)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 64, Little-Endian
Global memory size 3910270976 (3.642GiB)
Error Correction support No
Max memory allocation 3910270976 (3.642GiB)
Unified memory for Host and Device Yes
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing No
Fine-grained system sharing No
Atomics No
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Preferred alignment for atomics
SVM 0 bytes
Global 0 bytes
Local 0 bytes
Max size for global variable 65536 (64KiB)
Preferred total size of global vars 0
Global Memory cache type Read/Write
Global Memory cache size 1048576 (1024KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 32 bytes
Pitch alignment for 2D image buffers 64 pixels
Max 2D image size 65536x65536 pixels
Max 3D image size 65536x65536x65536 pixels
Max number of read image args 128
Max number of write image args 64
Max number of read/write image args 64
Max number of pipe args 16
Max active pipe reservations 1
Max pipe packet size 1024
Local memory type Global
Local memory size 32768 (32KiB)
Max number of constant args 128
Max constant buffer size 3910270976 (3.642GiB)
Max size of kernel argument 1024
Queue properties (on host)
Out-of-order execution Yes
Profiling Yes
Queue properties (on device)
Out-of-order execution Yes
Profiling Yes
Preferred size 2097152 (2MiB)
Max size 16777216 (16MiB)
Max queues on device 1
Max events on device 1024
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Sub-group independent forward progress Yes
IL version SPIR-V_1.0
ILs with version SPIR-V 0x400000 (1.0.0)
printf() buffer size 1048576 (1024KiB)
Built-in kernels (n/a)
Built-in kernels with version (n/a)
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
Device Extensions with Version cl_khr_global_int32_base_atomics 0x400000 (1.0.0)
cl_khr_global_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_local_int32_base_atomics 0x400000 (1.0.0)
cl_khr_local_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_byte_addressable_store 0x400000 (1.0.0)
cl_khr_3d_image_writes 0x400000 (1.0.0)
cl_khr_int64_base_atomics 0x400000 (1.0.0)
cl_khr_int64_extended_atomics 0x400000 (1.0.0)
cl_khr_fp16 0x400000 (1.0.0)
cl_khr_icd 0x400000 (1.0.0)
cl_khr_egl_image 0x400000 (1.0.0)
cl_khr_image2d_from_buffer 0x400000 (1.0.0)
cl_khr_depth_images 0x400000 (1.0.0)
cl_khr_subgroups 0x400000 (1.0.0)
cl_khr_subgroup_extended_types 0x400000 (1.0.0)
cl_khr_subgroup_non_uniform_vote 0x400000 (1.0.0)
cl_khr_subgroup_ballot 0x400000 (1.0.0)
cl_khr_il_program 0x400000 (1.0.0)
cl_khr_priority_hints 0x400000 (1.0.0)
cl_khr_create_command_queue 0x400000 (1.0.0)
cl_khr_spirv_no_integer_wrap_decoration 0x400000 (1.0.0)
cl_khr_extended_versioning 0x400000 (1.0.0)
cl_khr_device_uuid 0x400000 (1.0.0)
cl_arm_core_id 0x400000 (1.0.0)
cl_arm_printf 0x400000 (1.0.0)
cl_arm_non_uniform_work_group_size 0x400000 (1.0.0)
cl_arm_import_memory 0x400000 (1.0.0)
cl_arm_import_memory_dma_buf 0x400000 (1.0.0)
cl_arm_import_memory_host 0x400000 (1.0.0)
cl_arm_integer_dot_product_int8 0x400000 (1.0.0)
cl_arm_integer_dot_product_accumulate_int8 0x400000 (1.0.0)
cl_arm_integer_dot_product_accumulate_saturate_int8 0x400000 (1.0.0)
cl_arm_scheduling_controls 0x3000 (0.3.0)
cl_arm_controlled_kernel_termination 0x400000 (1.0.0)
cl_ext_cxx_for_opencl 0x400000 (1.0.0)
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) ARM Platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [ARM]
clCreateContext(NULL, ...) [default] Success [ARM]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name ARM Platform
Device Name Mali-LODX r0p0
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name ARM Platform
Device Name Mali-LODX r0p0
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name ARM Platform
Device Name Mali-LODX r0p0
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.14
ICD loader Profile OpenCL 3.0
Have you ever just opened another CLI window and monitored the temps vs. clock speed?
On Armbian I used armbianmonitor -m while running the benchmark and saw temperatures above 80°C pretty quickly. CPU clock speeds seemed to stay above 2 GHz though. A 20% drop in clock speed could probably have an even larger effect on the inference time, I guess. If you say it starts throttling at 75°C then I'm definitely in that range.
watch -n 1 cat /sys/class/thermal/thermal_zone*/temp
watch -n 1 cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
I'll test that again on the Ubuntu system 👍
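(For watching throttling live in one window, a minimal sketch combining the two commands above; it assumes the usual sysfs layout, and zone indices and readability of the frequency nodes without root vary per board:)

watch -n 1 "cat /sys/class/thermal/thermal_zone*/temp; cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq"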
If you want to go all out on a cooling solution, then https://www.amazon.com/dp/B0C2T9N9L2 looks pretty fancy 😎
Yeah, I got one of the 'armor' cases, and it's a strange implementation: the fan sits flat on the metal base and has no space to push air through. So I gave it a little space by adding some M3 nuts as spacers. Even then it was terrible; I think the thermal pads are bad, and it also puts a lot of pressure on the board and warps it slightly. So it got a full makeover with 2x Gelid thermal pads on both sides of the CPU to provide pressure from both sides. It also got a 30mm PWM fan, so it turns off when idle.
The above is probably the most OP cooler you can get for the Opi5, and after all the additions I probably spent about the same. /sys/devices/virtual/thermal/thermal_zone0/trip_point_0_temp is 75000, and I think that is the throttle temp, which is pretty low, but that is what it's set to.
But as said, a 30/40mm heatsink with a fan will suffice; the SoC is capable of a 12 W TDP if I remember correctly.
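(To check what the kernel considers the throttle point, per the trip-point path above; the matching trip_point_0_type node says whether that trip is passive, i.e. throttling, or critical:)

cat /sys/devices/virtual/thermal/thermal_zone0/trip_point_0_temp   # e.g. 75000 = 75°C
cat /sys/devices/virtual/thermal/thermal_zone0/trip_point_0_type   # e.g. passive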
orangepi@orangepi5:~/whisper.cpp$ taskset -c 4-7 ./main -m models/ggml-tiny-q8_0.bin -f ./samples/jfk.wav
whisper_init_from_file_no_state: loading model from 'models/ggml-tiny-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 7
whisper_model_load: type = 1
whisper_model_load: mem required = 172.00 MB (+ 3.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 43.18 MB
whisper_model_load: model size = 43.14 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:10.500] And so my fellow Americans ask not what your country can do for you ask what you can do for your country.
whisper_print_timings: load time = 112.34 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 290.00 ms
whisper_print_timings: sample time = 24.98 ms / 25 runs ( 1.00 ms per run)
whisper_print_timings: encode time = 1050.98 ms / 1 runs ( 1050.98 ms per run)
whisper_print_timings: decode time = 129.20 ms / 25 runs ( 5.17 ms per run)
whisper_print_timings: total time = 1667.61 ms
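(taskset -c 4-7 pins the run to the RK3588's Cortex-A76 big cores; a quick way to confirm which cores are the big ones, assuming the standard cpufreq sysfs layout, is that they report a higher max frequency:)

# Big A76 cores report a higher max frequency than the A55 cores
for c in /sys/devices/system/cpu/cpu[0-9]*; do
  echo "$c: $(cat $c/cpufreq/cpuinfo_max_freq) kHz"
done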
@StuartIanNaylor Hi, have you tried running whisper.cpp with the GPU on the RK3588? Right now I'm having a lot of trouble trying to cross-compile CLBlast for my RK3588 dev board, and I'm wondering how much of a speed boost GPU acceleration with CLBlast can bring, and whether it's worth doing.
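(For reference: CLBlast is a plain CMake project, so rather than cross-compiling it may be easier to build natively on the board. If you do cross-build, something along these lines should work, assuming an aarch64 GCC toolchain and OpenCL headers/libs staged somewhere the build can see; the OPENCL_ROOT hint and the paths are assumptions to adapt:)

git clone https://github.com/CNugteren/CLBlast
cmake -S CLBlast -B CLBlast/build \
  -DCMAKE_SYSTEM_NAME=Linux -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
  -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
  -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
  -DOPENCL_ROOT=/path/to/aarch64/opencl    # dir containing libOpenCL.so and the CL headers
cmake --build CLBlast/build -j4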
Yeah, it's not all roses. I get:
WHISPER_CLBLAST=1 make -j4
I whisper.cpp build info:
I UNAME_S: Linux
I UNAME_P: unknown
I UNAME_M: aarch64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_CLBLAST
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -lclblast -lOpenCL
I CC: cc (Debian 10.2.1-6) 10.2.1 20210110
I CXX: g++ (Debian 10.2.1-6) 10.2.1 20210110
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_CLBLAST -c ggml.c -o ggml.o
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_CLBLAST -c ggml-opencl.c -o ggml-opencl.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c whisper.cpp -o whisper.o
CFLAGS += -mcpu=native
make: CFLAGS: No such file or directory
make: *** [Makefile:181: ggml-opencl.o] Error 127
make: *** Waiting for unfinished jobs....
I just ignore it, run the build again, and it works, even though it's slow.
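(A guess at the cause: that "make: CFLAGS: No such file or directory" at Makefile:181 looks like a tab-indented "CFLAGS += -mcpu=native" being handed to the shell as a recipe command instead of being parsed as a variable assignment. If so, de-tabbing those lines should give a clean build; a sketch, assuming the offending lines are tab-indented in the Makefile:)

sed -i 's/^\t\(C\(XX\)\?FLAGS += -mcpu=native\)/\1/' Makefile
make clean && WHISPER_CLBLAST=1 make -j4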
I think there are problems with the driver and the Mali G610, as when you run some of the CLBlast tuners you get errors (I forget which). They are located in /usr/bin (e.g. /usr/bin/clblast_tuner_copy_fast); each tuner has its own name rather than taking a parameter. But the methods that are needed do work.
Out of curiosity I tried something non-AMD, as I don't have a Vega board; an Intel HD 630 installs the same way and seems to behave similarly.
Both run, but they are slow, and it's a question of how much is actually running on the GPU; I wonder if it tries and then returns a failure. There is somewhere under /proc or /sys where you can show GPU load, I just forget where it is located.
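(On RK3588 vendor kernels the Mali GPU usually shows up under devfreq, with a non-standard load node; a sketch below, where the fb000000.gpu name is an assumption, so check ls /sys/class/devfreq/ for the actual node on your board:)

watch -n 1 "cat /sys/class/devfreq/fb000000.gpu/load /sys/class/devfreq/fb000000.gpu/cur_freq"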
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON BLAS | tiny | 4 | 219 | 2857 | 14bee39 |
| <todo> | <todo> | NEON BLAS | tiny-q4_0 | 4 | 156 | 3330 | 14bee39 |
| <todo> | <todo> | NEON BLAS | tiny-q4_1 | 4 | 167 | 3022 | 14bee39 |
| <todo> | <todo> | NEON BLAS | tiny-q5_0 | 4 | 163 | 3650 | 14bee39 |
| <todo> | <todo> | NEON BLAS | tiny-q5_1 | 4 | 169 | 3300 | 14bee39 |
| <todo> | <todo> | NEON BLAS | tiny-q8_0 | 4 | 183 | 3259 | 14bee39 |
| <todo> | <todo> | NEON BLAS | base | 4 | 291 | 4680 | 14bee39 |
| <todo> | <todo> | NEON BLAS | base-q4_0 | 4 | 188 | 5261 | 14bee39 |
| <todo> | <todo> | NEON BLAS | base-q4_1 | 4 | 198 | 4861 | 14bee39 |
| <todo> | <todo> | NEON BLAS | base-q5_0 | 4 | 198 | 5161 | 14bee39 |
| <todo> | <todo> | NEON BLAS | base-q5_1 | 4 | 196 | 5025 | 14bee39 |
| <todo> | <todo> | NEON BLAS | base-q8_0 | 4 | 232 | 5340 | 14bee39 |
Also, things sporadically go awry:
rock@rock-5b:~/whisper.cpp$ ./main -m models/ggml-tiny.bin -f samples/jfk.wav
whisper_init_from_file_no_state: loading model from 'models/ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: type = 1
whisper_model_load: mem required = 201.00 MB (+ 3.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 73.58 MB
Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
Using Platform: ARM Platform Device: Mali-LODX r0p0
whisper_model_load: model size = 73.54 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:03.000] And so my fellow Americans?
[00:00:03.000 --> 00:00:04.000] Are you not?
[00:00:04.000 --> 00:00:05.000] Not.
[00:00:05.000 --> 00:00:08.000] What your country can do for you?
[00:00:08.000 --> 00:00:11.000] Ask what you can do for your country.
whisper_print_timings: load time = 224.94 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 306.92 ms
whisper_print_timings: sample time = 41.27 ms / 39 runs ( 1.06 ms per run)
whisper_print_timings: encode time = 3658.08 ms / 1 runs ( 3658.08 ms per run)
whisper_print_timings: decode time = 397.50 ms / 39 runs ( 10.19 ms per run)
whisper_print_timings: total time = 4725.34 ms
It's probably still a bit fresh for the Mali G610, but I did successfully run the ArmNN OpenCL-based Wav2Vec example they provide. It's very different to whisper, as nearly all the load is on the GPU, and it looks fantastic as it's only slightly slower than the CPU. The CPU does get tickled, but there is almost no load compared to running ArmNN on the CPU, whilst above you're left wondering if it actually used the GPU at all, apart from posturing that it did.
I think the Intel is similar, though far less fresh than the Mali G610, which really is still waiting for kernel changes. https://www.collabora.com/news-and-blog/news-and-events/pancsf-a-new-drm-driver-for-mali-csf-based-gpus.html
To be honest I have a gut feeling OpenCL might be similar to OpenGL: not granular enough, and it tends to limit performance. I was sort of hoping that the GPU implementation would be CoreML/Metal & GGML/Vulkan, as maybe that would be more performant and available. I am not really sure what is happening with NPUs, as they don't seem to have any compliance and each has its own specific framework.
Maybe what Nvidia says is the way to go: create open-source versions of https://developer.nvidia.com/blog/machine-learning-acceleration-vulkan-cooperative-matrices/
https://gist.github.com/itzmeanjan/84613bc7595372c5e6b6c22481d42f9a
https://github.com/bartwojcik/vulkano-matmul
https://www.khronos.org/assets/uploads/developers/presentations/Cooperative_Matrix_May22.pdf
FYI: You can run Whisper models with onnxruntime in C++ using sherpa-onnx on Raspberry Pi.
You can find the documentation at https://k2-fsa.github.io/sherpa/onnx/pretrained_models/whisper/index.html
The following is the RTF running tiny.en on a Raspberry Pi Model 4B:
Is it possible to run this ggml model on Raspberry Pi hardware?
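(The whisper.cpp README lists Raspberry Pi among the supported platforms, so a plain CPU build should be all that's needed, roughly as below; expect only the tiny/base models to be practical on a Pi 4:)

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
bash ./models/download-ggml-model.sh tiny.en
./main -m models/ggml-tiny.en.bin -f samples/jfk.wav -t 4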