Closed: nyadla-sys closed this issue 2 years ago.
@ggerganov
Hi, I did try to run the bench with CLBlast:
memcpy: 9.33 GB/s (1 thread)
sum: 136902081526.000000
Running ggml_mul_mat benchmark with 8 threads
Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
Using Platform: ARM Platform Device: Mali-LODX r0p0
64 x 64: Q4_0 0.5 GFLOPS (128 runs) | Q4_1 0.4 GFLOPS (128 runs) | Q4_2 0.4 GFLOPS (128 runs)
64 x 64: Q5_0 0.5 GFLOPS (128 runs) | Q5_1 0.5 GFLOPS (128 runs) | Q8_0 0.5 GFLOPS (128 runs)
64 x 64: F16 0.5 GFLOPS (128 runs) | F32 0.5 GFLOPS (128 runs)
128 x 128: Q4_0 1.9 GFLOPS (128 runs) | Q4_1 1.6 GFLOPS (128 runs) | Q4_2 1.6 GFLOPS (128 runs)
128 x 128: Q5_0 1.6 GFLOPS (128 runs) | Q5_1 1.8 GFLOPS (128 runs) | Q8_0 1.7 GFLOPS (128 runs)
128 x 128: F16 1.8 GFLOPS (128 runs) | F32 2.1 GFLOPS (128 runs)
256 x 256: Q4_0 11.7 GFLOPS (128 runs) | Q4_1 14.3 GFLOPS (128 runs) | Q4_2 12.8 GFLOPS (128 runs)
256 x 256: Q5_0 11.9 GFLOPS (128 runs) | Q5_1 11.6 GFLOPS (128 runs) | Q8_0 12.9 GFLOPS (128 runs)
256 x 256: F16 12.7 GFLOPS (128 runs) | F32 13.4 GFLOPS (128 runs)
512 x 512: Q4_0 42.5 GFLOPS (128 runs) | Q4_1 42.0 GFLOPS (128 runs) | Q4_2 40.7 GFLOPS (128 runs)
512 x 512: Q5_0 41.0 GFLOPS (128 runs) | Q5_1 41.7 GFLOPS (128 runs) | Q8_0 42.0 GFLOPS (128 runs)
512 x 512: F16 38.3 GFLOPS (128 runs) | F32 39.0 GFLOPS (128 runs)
1024 x 1024: Q4_0 69.4 GFLOPS ( 33 runs) | Q4_1 71.0 GFLOPS ( 34 runs) | Q4_2 70.6 GFLOPS ( 33 runs)
1024 x 1024: Q5_0 70.4 GFLOPS ( 33 runs) | Q5_1 70.2 GFLOPS ( 33 runs) | Q8_0 69.2 GFLOPS ( 33 runs)
1024 x 1024: F16 66.4 GFLOPS ( 31 runs) | F32 70.0 GFLOPS ( 33 runs)
2048 x 2048: Q4_0 81.3 GFLOPS ( 5 runs) | Q4_1 81.4 GFLOPS ( 5 runs) | Q4_2 81.1 GFLOPS ( 5 runs)
2048 x 2048: Q5_0 80.9 GFLOPS ( 5 runs) | Q5_1 81.4 GFLOPS ( 5 runs) | Q8_0 81.4 GFLOPS ( 5 runs)
2048 x 2048: F16 78.9 GFLOPS ( 5 runs) | F32 80.1 GFLOPS ( 5 runs)
4096 x 4096: Q4_0 87.0 GFLOPS ( 3 runs) | Q4_1 86.9 GFLOPS ( 3 runs) | Q4_2 86.9 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 86.4 GFLOPS ( 3 runs) | Q5_1 86.9 GFLOPS ( 3 runs) | Q8_0 86.7 GFLOPS ( 3 runs)
4096 x 4096: F16 85.6 GFLOPS ( 3 runs) | F32 86.0 GFLOPS ( 3 runs)
./extra/bench-all.sh: line 45: 2051349 Segmentation fault ./bench -w 2 -t $n_threads 2>&1
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
./extra/bench-all.sh: line 50: 2082709 Segmentation fault ./bench -m ./models/ggml-$model.bin -t $n_threads 2> /dev/null > /dev/null
./extra/bench-all.sh: line 50: 2082762 Segmentation fault ./bench -m ./models/ggml-$model.bin -t $n_threads 2> /dev/null > /dev/null
./extra/bench-all.sh: line 50: 2082834 Segmentation fault ./bench -m ./models/ggml-$model.bin -t $n_threads 2> /dev/null > /dev/null
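(Side note: the "Attempting to use: Platform=0, Device=0" line comes from the CLBlast backend, which picks its OpenCL target from environment variables; if I recall correctly these were GGML_CLBLAST_PLATFORM / GGML_CLBLAST_DEVICE in the ggml-opencl of that era, with later versions switching to GGML_OPENCL_PLATFORM / GGML_OPENCL_DEVICE. Something along these lines selects an explicit device, as a sketch:)

# Hedged sketch: pin the CLBlast backend to platform 0, device 0 before benching
GGML_CLBLAST_PLATFORM=0 GGML_CLBLAST_DEVICE=0 ./bench -w 2 -t 8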
orangepi@orangepi5:~$ clinfo
Number of platforms 1
Platform Name ARM Platform
Platform Vendor ARM
Platform Version OpenCL 2.1 v1.g6p0-01eac0.efb75e2978d783a80fe78be1bfb0efc1
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
Platform Extensions function suffix ARM
Platform Host timer resolution 1ns
Platform Name ARM Platform
Number of devices 1
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
Device Name Mali-LODX r0p0
Device Vendor ARM
Device Vendor ID 0xa8670000
Device Version OpenCL 2.1 v1.g6p0-01eac0.efb75e2978d783a80fe78be1bfb0efc1
Device UUID 000067a8-0100-0000-0000-000000000000
Driver UUID d9495bef-ea91-7c52-8a43-8a3c2f7b49cc
Valid Device LUID No
Device LUID 0000-000000000000
Device Node Mask 0
Device Numeric Version 0x801000 (2.1.0)
Driver Version 2.1
Device OpenCL C Version OpenCL C 2.0 v1.g6p0-01eac0.efb75e2978d783a80fe78be1bfb0efc1
Device C++ for OpenCL Numeric Version 0x400000 (1.0.0)
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 4
Available core IDs 0, 2, 16, 18
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 1024
Preferred work group size multiple (kernel) 16
Max sub-groups per work group 64
Preferred / native vector sizes
char 16 / 4
short 8 / 2
int 4 / 1
long 2 / 1
half 8 / 2 (cl_khr_fp16)
float 4 / 1
double 0 / 0 (n/a)
Half-precision Floating-point support (cl_khr_fp16)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 64, Little-Endian
Global memory size 3910270976 (3.642GiB)
Error Correction support No
Max memory allocation 3910270976 (3.642GiB)
Unified memory for Host and Device Yes
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing No
Fine-grained system sharing No
Atomics No
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Preferred alignment for atomics
SVM 0 bytes
Global 0 bytes
Local 0 bytes
Max size for global variable 65536 (64KiB)
Preferred total size of global vars 0
Global Memory cache type Read/Write
Global Memory cache size 1048576 (1024KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 32 bytes
Pitch alignment for 2D image buffers 64 pixels
Max 2D image size 65536x65536 pixels
Max 3D image size 65536x65536x65536 pixels
Max number of read image args 128
Max number of write image args 64
Max number of read/write image args 64
Max number of pipe args 16
Max active pipe reservations 1
Max pipe packet size 1024
Local memory type Global
Local memory size 32768 (32KiB)
Max number of constant args 128
Max constant buffer size 3910270976 (3.642GiB)
Max size of kernel argument 1024
Queue properties (on host)
Out-of-order execution Yes
Profiling Yes
Queue properties (on device)
Out-of-order execution Yes
Profiling Yes
Preferred size 2097152 (2MiB)
Max size 16777216 (16MiB)
Max queues on device 1
Max events on device 1024
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Sub-group independent forward progress Yes
IL version SPIR-V_1.0
ILs with version SPIR-V 0x400000 (1.0.0)
printf() buffer size 1048576 (1024KiB)
Built-in kernels (n/a)
Built-in kernels with version (n/a)
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
Device Extensions with Version cl_khr_global_int32_base_atomics 0x400000 (1.0.0)
cl_khr_global_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_local_int32_base_atomics 0x400000 (1.0.0)
cl_khr_local_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_byte_addressable_store 0x400000 (1.0.0)
cl_khr_3d_image_writes 0x400000 (1.0.0)
cl_khr_int64_base_atomics 0x400000 (1.0.0)
cl_khr_int64_extended_atomics 0x400000 (1.0.0)
cl_khr_fp16 0x400000 (1.0.0)
cl_khr_icd 0x400000 (1.0.0)
cl_khr_egl_image 0x400000 (1.0.0)
cl_khr_image2d_from_buffer 0x400000 (1.0.0)
cl_khr_depth_images 0x400000 (1.0.0)
cl_khr_subgroups 0x400000 (1.0.0)
cl_khr_subgroup_extended_types 0x400000 (1.0.0)
cl_khr_subgroup_non_uniform_vote 0x400000 (1.0.0)
cl_khr_subgroup_ballot 0x400000 (1.0.0)
cl_khr_il_program 0x400000 (1.0.0)
cl_khr_priority_hints 0x400000 (1.0.0)
cl_khr_create_command_queue 0x400000 (1.0.0)
cl_khr_spirv_no_integer_wrap_decoration 0x400000 (1.0.0)
cl_khr_extended_versioning 0x400000 (1.0.0)
cl_khr_device_uuid 0x400000 (1.0.0)
cl_arm_core_id 0x400000 (1.0.0)
cl_arm_printf 0x400000 (1.0.0)
cl_arm_non_uniform_work_group_size 0x400000 (1.0.0)
cl_arm_import_memory 0x400000 (1.0.0)
cl_arm_import_memory_dma_buf 0x400000 (1.0.0)
cl_arm_import_memory_host 0x400000 (1.0.0)
cl_arm_integer_dot_product_int8 0x400000 (1.0.0)
cl_arm_integer_dot_product_accumulate_int8 0x400000 (1.0.0)
cl_arm_integer_dot_product_accumulate_saturate_int8 0x400000 (1.0.0)
cl_arm_scheduling_controls 0x3000 (0.3.0)
cl_arm_controlled_kernel_termination 0x400000 (1.0.0)
cl_ext_cxx_for_opencl 0x400000 (1.0.0)
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) ARM Platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [ARM]
clCreateContext(NULL, ...) [default] Success [ARM]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name ARM Platform
Device Name Mali-LODX r0p0
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name ARM Platform
Device Name Mali-LODX r0p0
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name ARM Platform
Device Name Mali-LODX r0p0
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.14
ICD loader Profile OpenCL 3.0
Have you ever just opened another CLI window and monitored the temps vs. clock speed?
On Armbian I used armbianmonitor -m while running the benchmark and saw temperatures above 80°C pretty quickly. CPU clock speeds seemed to stay above 2 GHz though. A 20% drop in clock speed could probably have an even larger effect on the inference time, I guess. If you say it starts throttling at 75°C then I'm definitely in that range.
watch -n 1 cat /sys/class/thermal/thermal_zone*/temp
watch -n 1 cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
I'll test that again on the Ubuntu system 👍
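(For watching throttling live in one window, a minimal sketch combining the two commands above; it assumes the usual sysfs layout, and zone indices and readability of the frequency nodes without root vary per board:)

watch -n 1 "cat /sys/class/thermal/thermal_zone*/temp; cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq"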
If you want to go all out on a cooling solution, then https://www.amazon.com/dp/B0C2T9N9L2 looks pretty fancy 😎
Yeah, I got one of the 'armor' cases, and it's a strange implementation: the fan sits flat on the metal base and has no space to push air through. So I gave it a little space by adding some M3 nuts as spacers. Even then it was terrible; I think the thermal pads are bad, and it also puts a lot of pressure on the board and warps it slightly. So it got a full makeover with 2x Gelid thermal pads on both sides of the CPU to provide pressure from both sides. It also got a 30mm PWM fan, so it turns off when idle.
The above is probably the most OP cooler you can get for the Opi5, and after all the additions I probably spent about the same. /sys/devices/virtual/thermal/thermal_zone0/trip_point_0_temp is 75000, and I think that is the throttle temp, which is pretty low, but that is what it's set to.
But as said, a 30/40mm heatsink with a fan will suffice; the SoC is capable of a 12 W TDP if I remember correctly.
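(To check what the kernel considers the throttle point, per the trip-point path above; the matching trip_point_0_type node says whether that trip is passive, i.e. throttling, or critical:)

cat /sys/devices/virtual/thermal/thermal_zone0/trip_point_0_temp   # e.g. 75000 = 75°C
cat /sys/devices/virtual/thermal/thermal_zone0/trip_point_0_type   # e.g. passive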
orangepi@orangepi5:~/whisper.cpp$ taskset -c 4-7 ./main -m models/ggml-tiny-q8_0.bin -f ./samples/jfk.wav
whisper_init_from_file_no_state: loading model from 'models/ggml-tiny-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 7
whisper_model_load: type = 1
whisper_model_load: mem required = 172.00 MB (+ 3.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 43.18 MB
whisper_model_load: model size = 43.14 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:10.500] And so my fellow Americans ask not what your country can do for you ask what you can do for your country.
whisper_print_timings: load time = 112.34 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 290.00 ms
whisper_print_timings: sample time = 24.98 ms / 25 runs ( 1.00 ms per run)
whisper_print_timings: encode time = 1050.98 ms / 1 runs ( 1050.98 ms per run)
whisper_print_timings: decode time = 129.20 ms / 25 runs ( 5.17 ms per run)
whisper_print_timings: total time = 1667.61 ms
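(taskset -c 4-7 pins the run to the RK3588's Cortex-A76 big cores; a quick way to confirm which cores are the big ones, assuming the standard cpufreq sysfs layout, is that they report a higher max frequency:)

# Big A76 cores report a higher max frequency than the A55 cores
for c in /sys/devices/system/cpu/cpu[0-9]*; do
  echo "$c: $(cat $c/cpufreq/cpuinfo_max_freq) kHz"
done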
@StuartIanNaylor Hi, have you tried running whisper.cpp with the GPU on the RK3588? Right now I'm having a lot of trouble trying to cross-compile CLBlast for my RK3588 dev board, and I'm wondering how much of a speed boost GPU acceleration with CLBlast can bring, and whether it's worth doing.
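(For reference: CLBlast is a plain CMake project, so rather than cross-compiling it may be easier to build natively on the board. If you do cross-build, something along these lines should work, assuming an aarch64 GCC toolchain and OpenCL headers/libs staged somewhere the build can see; the OPENCL_ROOT hint and the paths are assumptions to adapt:)

git clone https://github.com/CNugteren/CLBlast
cmake -S CLBlast -B CLBlast/build \
  -DCMAKE_SYSTEM_NAME=Linux -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
  -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
  -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
  -DOPENCL_ROOT=/path/to/aarch64/opencl    # dir containing libOpenCL.so and the CL headers
cmake --build CLBlast/build -j4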
Yeah, it's not all roses. I get:
WHISPER_CLBLAST=1 make -j4
I whisper.cpp build info:
I UNAME_S: Linux
I UNAME_P: unknown
I UNAME_M: aarch64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_CLBLAST
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -lclblast -lOpenCL
I CC: cc (Debian 10.2.1-6) 10.2.1 20210110
I CXX: g++ (Debian 10.2.1-6) 10.2.1 20210110
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_CLBLAST -c ggml.c -o ggml.o
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_CLBLAST -c ggml-opencl.c -o ggml-opencl.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c whisper.cpp -o whisper.o
CFLAGS += -mcpu=native
make: CFLAGS: No such file or directory
make: *** [Makefile:181: ggml-opencl.o] Error 127
make: *** Waiting for unfinished jobs....
I just ignore it, run the build again, and it works, even though it's slow.
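(A guess at the cause: that "make: CFLAGS: No such file or directory" at Makefile:181 looks like a tab-indented "CFLAGS += -mcpu=native" being handed to the shell as a recipe command instead of being parsed as a variable assignment. If so, de-tabbing those lines should give a clean build; a sketch, assuming the offending lines are tab-indented in the Makefile:)

sed -i 's/^\t\(C\(XX\)\?FLAGS += -mcpu=native\)/\1/' Makefile
make clean && WHISPER_CLBLAST=1 make -j4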
I think there are problems with the driver and the Mali G610, as when you run some of the CLBlast tuners you get errors (I forget which). They are located in /usr/bin (e.g. /usr/bin/clblast_tuner_copy_fast); each tuner has its own name rather than taking a parameter. But the methods that are needed do work.
Out of curiosity I tried something non-AMD, as I don't have a Vega board; an Intel HD 630 installs the same way and seems to behave similarly.
Both run, but they are slow, and it's a question of how much is actually running on the GPU; I wonder if it tries and then returns a failure. There is somewhere under /proc or /sys where you can show GPU load, I just forget where it is located.
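(On RK3588 vendor kernels the Mali GPU usually shows up under devfreq, with a non-standard load node; a sketch below, where the fb000000.gpu name is an assumption, so check ls /sys/class/devfreq/ for the actual node on your board:)

watch -n 1 "cat /sys/class/devfreq/fb000000.gpu/load /sys/class/devfreq/fb000000.gpu/cur_freq"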
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON BLAS | tiny | 4 | 219 | 2857 | 14bee39 |
| <todo> | <todo> | NEON BLAS | tiny-q4_0 | 4 | 156 | 3330 | 14bee39 |
| <todo> | <todo> | NEON BLAS | tiny-q4_1 | 4 | 167 | 3022 | 14bee39 |
| <todo> | <todo> | NEON BLAS | tiny-q5_0 | 4 | 163 | 3650 | 14bee39 |
| <todo> | <todo> | NEON BLAS | tiny-q5_1 | 4 | 169 | 3300 | 14bee39 |
| <todo> | <todo> | NEON BLAS | tiny-q8_0 | 4 | 183 | 3259 | 14bee39 |
| <todo> | <todo> | NEON BLAS | base | 4 | 291 | 4680 | 14bee39 |
| <todo> | <todo> | NEON BLAS | base-q4_0 | 4 | 188 | 5261 | 14bee39 |
| <todo> | <todo> | NEON BLAS | base-q4_1 | 4 | 198 | 4861 | 14bee39 |
| <todo> | <todo> | NEON BLAS | base-q5_0 | 4 | 198 | 5161 | 14bee39 |
| <todo> | <todo> | NEON BLAS | base-q5_1 | 4 | 196 | 5025 | 14bee39 |
| <todo> | <todo> | NEON BLAS | base-q8_0 | 4 | 232 | 5340 | 14bee39 |
Also, things sporadically go awry:
rock@rock-5b:~/whisper.cpp$ ./main -m models/ggml-tiny.bin -f samples/jfk.wav
whisper_init_from_file_no_state: loading model from 'models/ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: type = 1
whisper_model_load: mem required = 201.00 MB (+ 3.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 73.58 MB
Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
Using Platform: ARM Platform Device: Mali-LODX r0p0
whisper_model_load: model size = 73.54 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:03.000] And so my fellow Americans?
[00:00:03.000 --> 00:00:04.000] Are you not?
[00:00:04.000 --> 00:00:05.000] Not.
[00:00:05.000 --> 00:00:08.000] What your country can do for you?
[00:00:08.000 --> 00:00:11.000] Ask what you can do for your country.
whisper_print_timings: load time = 224.94 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 306.92 ms
whisper_print_timings: sample time = 41.27 ms / 39 runs ( 1.06 ms per run)
whisper_print_timings: encode time = 3658.08 ms / 1 runs ( 3658.08 ms per run)
whisper_print_timings: decode time = 397.50 ms / 39 runs ( 10.19 ms per run)
whisper_print_timings: total time = 4725.34 ms
It's probably still a bit fresh for the Mali G610, but I did successfully run the ArmNN OpenCL-based Wav2Vec example they provide. It's very different to whisper, as nearly all the load is on the GPU, and it looks fantastic as it's only slightly slower than the CPU. The CPU does get tickled, but there is almost no load compared to running ArmNN on the CPU, whilst above you're left wondering if it actually used the GPU at all, apart from posturing that it did.
I think the Intel is similar, though far less fresh than the Mali G610, which really is still waiting for kernel changes. https://www.collabora.com/news-and-blog/news-and-events/pancsf-a-new-drm-driver-for-mali-csf-based-gpus.html
To be honest I have a gut feeling OpenCL might be similar to OpenGL: not granular enough, and it tends to limit performance. I was sort of hoping that the GPU implementation would be CoreML/Metal & GGML/Vulkan, as maybe that would be more performant and available. I am not really sure what is happening with NPUs, as they don't seem to have any compliance and each has its own specific framework.
Maybe what Nvidia says is the way to go: create open-source versions of https://developer.nvidia.com/blog/machine-learning-acceleration-vulkan-cooperative-matrices/
https://gist.github.com/itzmeanjan/84613bc7595372c5e6b6c22481d42f9a
https://github.com/bartwojcik/vulkano-matmul
https://www.khronos.org/assets/uploads/developers/presentations/Cooperative_Matrix_May22.pdf
FYI: You can run Whisper models with onnxruntime in C++ using sherpa-onnx on Raspberry Pi.
You can find the documentation at https://k2-fsa.github.io/sherpa/onnx/pretrained_models/whisper/index.html
The following is the RTF running tiny.en on a Raspberry Pi Model 4B:
Is it possible to run this ggml model on Raspberry Pi hardware?
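(The whisper.cpp README lists Raspberry Pi among the supported platforms, so a plain CPU build should be all that's needed, roughly as below; expect only the tiny/base models to be practical on a Pi 4:)

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
bash ./models/download-ggml-model.sh tiny.en
./main -m models/ggml-tiny.en.bin -f samples/jfk.wav -t 4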