usefulsensors / openai-whisper

Robust Speech Recognition via Large-Scale Weak Supervision
MIT License
58 stars 22 forks

Whisper Python performance - Benchmarking #15

Open j1nx opened 1 year ago

j1nx commented 1 year ago

Running on OpenVoiceOS, RaspberryPi 4 - 2GB model. Using Python 3.10 and Tensorflow-lite 2.11

With the tiny model;

mycroft@OpenVoiceOS-e3830c:~/whisper $ python3 test.py -f samples/ -m models/whisper.tflite -t 4
Importing tensorflow, numpy and torch
Importing whisper
Loading tflite model models/whisper.tflite ...
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.

Loading audio file: samples/test.wav
Samplerate: 16000, length: 30.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 Bili always listens to his mother. He always does what she says. If his mother says,

Inference took 4.74s for 30.0s audio file.

Loading audio file: samples/test_1.wav
Samplerate: 16000, length: 30.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 David lost his yellow pencil. He could not find it. Where is my yellow pencil? He asked his sister. His sister did not know. I don't know where your pencil is. She said David thought about it. He thought and thought. He used his yellow pencil for before lunch. He used it to write a note to his teacher. The notes said, dear teacher, thank you for helping me, David. He put the note in the envelope where was the envelope?

Inference took 8.57s for 30.0s audio file.

Loading audio file: samples/jfk.wav
Samplerate: 16000, length: 11.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Inference took 4.28s for 11.0s audio file.
j1nx commented 1 year ago

And with the two files from: https://github.com/fquirin/speech-recognition-experiments/tree/main/test-files

Loading audio file: samples/en_sh_lights_70pct_4s.wav
Samplerate: 16000, length: 3.575875s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 Set the lights in the living room to 70%.

Inference took 3.5s for 3.58s audio file.

Loading audio file: samples/en_speech_jfk_11s.wav
Samplerate: 16000, length: 11.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Inference took 4.26s for 11.0s audio file.
StuartIanNaylor commented 1 year ago

https://github.com/ggerganov/whisper.cpp/issues/7#issuecomment-1397467197

Thinking a thread-count arg might be better, as with big.LITTLE it's often better to use just the big cores; I have found that's sometimes faster than all cores.

Also been trying to work out how to get floating-point times so we get fractions of a second. We have end.tv_sec - start.tv_sec; what's the best way of adding tv_usec, and is there a float time alternative?

orangepi@orangepi5:~/openai-whisper/minimal_build$ ./minimal ../models/whisper.tflite ../samples/jfk.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 1 seconds

[_SOT_][_NOT_] And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

ps

orangepi@orangepi5:~/openai-whisper/minimal_build$ ./minimal ../models/whisper-small.tflite ../samples/jfk.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 12 seconds
terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at
Aborted

Guess that is something to do with the tokeniser, which is outside the timings.

j1nx commented 1 year ago

Am I correct that it only decodes one pass of 30 seconds? I can't seem to get the full transcript of >30-second wav files; test.py just continues to the next wav file.

StuartIanNaylor commented 1 year ago
orangepi@orangepi5:~/openai-whisper/minimal_build$ ./minimal ../models/whisper.tflite ../samples/test.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 2 seconds

[_SOT_][_NOT_] Bili always listens to his mother. He always does what she says. If his mother says, brush your teeth, Bili brushes his teeth. If his mother says, go to bed, Bili goes to bed. Bili is a very good boy, a good boy listens to his mother. His mother does not have to ask him again. She asks him to do something one time and she does not ask again. Bili is a good boy. He does what his mother asks the first time. She does not have to ask again.

Seems to be something to do with the small model, as it is OK with tiny.

j1nx commented 1 year ago
Loading audio file: samples/A_J_Cook_Speech_from_Lansbury's_Labour_Weekly.wav
Samplerate: 16000, length: 188.231125s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 The field of the workers have put a million miners with their wives and children something like one tends to the whole population of this country have long called a loud progester. If you were to believe all the things that capitalist press pay about us, you would think that we were the most terrible people on earth. They tell you that we are never satisfied. That we are always psychic, that we are never content for our wages, with our hours, or with the hoses we live in. And yet,

Inference took 9.12s for 1.88e+02s audio file.

Looks like indeed only one pass of 30 seconds is transcribed before it loads the next wav file.
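Presumably the fix is to slide a 30-second window over the audio and invoke the interpreter once per window; a minimal sketch of that loop (invoke_interpreter is a hypothetical helper standing in for the mel + invoke steps already in test.py):

import numpy as np

CHUNK = 30 * 16000  # 30 seconds at 16 kHz

def transcribe_long(audio: np.ndarray) -> str:
    # Run one 30-second window at a time and concatenate the text.
    text = ""
    for offset in range(0, len(audio), CHUNK):
        window = audio[offset:offset + CHUNK]
        if len(window) < CHUNK:
            # Whisper expects exactly 30s of input, so zero-pad the last window.
            window = np.pad(window, (0, CHUNK - len(window)))
        text += invoke_interpreter(window)  # hypothetical helper
    return text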

StuartIanNaylor commented 1 year ago

Yeah I am talking about the above

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 12 seconds
terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at
Aborted

Happens with the small model but it's fine with tiny; and yeah, we only get the first 30-second beam search.

orangepi@orangepi5:~/openai-whisper/minimal_build$ ./minimal ../models/whisper.tflite ../samples/gb0.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 2 seconds

[_SOT_][_NOT_] Good morning. This Tuesday is Election Day. After months of spirited debate in vigorous campaigning, the time has come for Americans to make important decisions about our nation's future and encourage all Americans to go to the polls and vote. Election season brings out the spirit of competition between our political parties. And that competition is an essential part of a healthy democracy. But as the campaigns come to a close, Republicans, Democrats, and independents can find common ground on at least one point. Our system of

wget --quiet --show-progress -O samples/gb0.ogg https://upload.wikimedia.org/wikipedia/commons/2/22/George_W._Bush%27s_weekly_radio_address_%28November_1%2C_2008%29.oga
ffmpeg -loglevel -0 -y -i samples/gb0.ogg -ar 16000 -ac 1 -c:a pcm_s16le samples/gb0.wav

j1nx commented 1 year ago

Yeah, I posted that error over at the now-closed issue at whisper.cpp.

The small model is not yet correct.

j1nx commented 1 year ago
mycroft@OpenVoiceOS-e3830c:~/whisper $ minimal models/whisper.tflite samples/gb0.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 9 seconds

[_SOT_][_NOT_] Good morning. This Tuesday is Election Day. After months of spirited debate in vigorous campaigning, the time has come for Americans to make important decisions about our nation's future and encourage all Americans to go to the polls and vote. Election season brings out the spirit of competition between our political parties. And that competition is an essential part of a healthy democracy. But as the campaigns come to a close, Republicans, Democrats, and independents can find common ground on at least one point. Our system of

And with Python

Loading audio file: samples/gb0.wav
Samplerate: 16000, length: 127.36s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 Good morning. This Tuesday is Election Day. After months of spirited debate in vigorous campaigning, the time has come for Americans to make important decisions about our nation's future and encourage all Americans to go to the polls and vote. Election season brings out the spirit of competition between our political parties. And that competition is an essential part of a healthy democracy. But as the campaigns come to a close, Republicans, Democrats, and independents can find common ground on at least one point. Our system of

Inference took 8.75s for 1.27e+02s audio file.
nyadla-sys commented 1 year ago

I ran test.py and it worked fine with whisper-small.tflite; maybe the tflite version on the Raspberry Pi is a bit older.

$ python test.py
Importing tensorflow and numpy
2023-01-19 12:00:17.805994: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
Importing whisper
Loading tflite model models/whisper.tflite ...
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.

Loading audio file: ../test-files/en_sh_lights_70pct_4s.wav
Samplerate: 16000, length: 3.575875s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 !!!

Inference took 9.42s for 3.58s audio file.

Loading audio file: ../test-files/en_speech_jfk_11s.wav
Samplerate: 16000, length: 11.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 !! And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

nyadla-sys commented 1 year ago

@j1nx please download the latest whisper-small.tflite; I will also run it using the minimal C++ example.

j1nx commented 1 year ago

@nyadla-sys Will do in a bit.

That test.py you linked to uses the full TensorFlow; however, we use tensorflow-lite. https://github.com/fquirin/speech-recognition-experiments/blob/main/whisper-tflite/test.py#L9

Could you flip the # at lines 8 and 9 and try again?

(PS: I run TFLite 2.11, but without any custom ops. Perhaps that is what we need.)
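For reference, the flip presumably amounts to something like this (a sketch; the real test.py aliases and line numbers may differ):

# Line 8: full TensorFlow build
import tensorflow as tf
interpreter = tf.lite.Interpreter("models/whisper.tflite", num_threads=4)

# Line 9: standalone runtime wheel, aliased as tf so the rest of the
# script can call tf.Interpreter(...) instead of tf.lite.Interpreter(...)
import tflite_runtime.interpreter as tf
interpreter = tf.Interpreter("models/whisper.tflite", num_threads=4)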

StuartIanNaylor commented 1 year ago

PS the guys at tensorflow took pity on me :)

https://github.com/tensorflow/tensorflow/issues/59273#issuecomment-1384441333

j1nx commented 1 year ago

@nyadla-sys Ran with the latest small model

mycroft@OpenVoiceOS-e3830c:~/whisper $ minimal models/whisper-small.tflite samples/jfk.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
ERROR: gather index out of bounds
ERROR: Node number 35 (GATHER) failed to invoke.
ERROR: Node number 3435 (WHILE) failed to invoke.
Error at ../minimal.cc:211

Same error

I think it is because the custom ops (oneDNN custom operations) are used, as I saw in your output.

fquirin commented 1 year ago

Could you flip the # at line 8 and 9 and try again?

If you do that don't forget to change line 24 to interpreter = tf.Interpreter(model_path, num_threads=int(args.threads)) as well

PS the guys at tensorflow took pity on me :)

Did you notice any difference? I need to check what they've actually changed, since they rewrote a lot without comments.

j1nx commented 1 year ago

Yeah, indeed. Grabbed the snippet here, which has that also flipped: https://github.com/ggerganov/whisper.cpp/issues/7#issuecomment-1384419135

Anyhow, could you or @nyadla-sys check out both? As we do not train, all we need is the tflite runtime for inference.

nyadla-sys commented 1 year ago

I just ran whisper-small.tflite on my Linux Ubuntu machine; please refer to the latest README.md.

$ ./minimal ../models/whisper-small.tflite ../samples/jfk.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 26 seconds

[_extra_token_50258][_extra_token_50259]!! And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

StuartIanNaylor commented 1 year ago

[_extra_token_50258][_extra_token_50259] are tokens; the transcription output stays the same.

nyadla-sys commented 1 year ago

are tokens; the transcription output stays the same

The token output changes depending on whether the model is English-only or multilingual.

fquirin commented 1 year ago

To use a multilingual model in Python, you can simply change the line "wtokenizer = whisper.tokenizer.get_tokenizer(False, language="en")" to "wtokenizer = whisper.tokenizer.get_tokenizer(True, language="en")"

Continuation from here

Does this require the whisper-small.tflite model? Because I've tried that with whisper.tflite (en only?) but the output is completely scrambled and still English when I set for example "de".

nyadla-sys commented 1 year ago

To use a multilingual model in Python, you can simply change the line "wtokenizer = whisper.tokenizer.get_tokenizer(False, language="en")" to "wtokenizer = whisper.tokenizer.get_tokenizer(True, language="en")"

Continuation from here

Does this require the whisper-small.tflite model? Because I've tried that with whisper.tflite (en only?) but the output is completely scrambled and still English when I set for example "de".

Please use whisper-small.tflite or whisper-medium.tflite
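In code, the choice presumably comes down to this (a sketch using the openai-whisper package, as quoted above):

import whisper

# English-only models, e.g. whisper-tiny.en.tflite:
wtokenizer = whisper.tokenizer.get_tokenizer(False, language="en")

# Multilingual models (whisper-small.tflite / whisper-medium.tflite),
# e.g. for German output:
wtokenizer = whisper.tokenizer.get_tokenizer(True, language="de")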

nyadla-sys commented 1 year ago

The whisper-small.tflite and whisper-medium.tflite models are multilingual; "whisper.tflite" is identical to "whisper-tiny.en.tflite". I intended to change the name, but I refrained from doing so because many people are using it in their examples.

j1nx commented 1 year ago

Strange, as I ran those tests with the right multilingual vocab bin.

Will double-check in the morning.

fquirin commented 1 year ago

I've updated test.py with new parameters for "--lang" and "--runtime" and the tweaks mentioned in the tensorflow issue:

$ python3 test.py -h
usage: test.py [-h] [-f FOLDER] [-m MODEL] [-t THREADS] [-l LANG] [-r RUNTIME]

Running Whisper TFlite test inference.

optional arguments:
  -h, --help            show this help message and exit
  -f FOLDER, --folder FOLDER
                        Folder with WAV input files
  -m MODEL, --model MODEL
                        Path to model
  -t THREADS, --threads THREADS
                        Threads used
  -l LANG, --lang LANG  Language used
  -r RUNTIME, --runtime RUNTIME
                        Tensorflow runtime, use '1' for tf.lite or '2' for tflite_runtime

On my Rpi400, tflite_runtime is still about 1.5s slower than tf.lite. I could not test whisper-small.tflite because it keeps crashing (will post the error in a minute).
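For reference, the parser behind that help output is presumably something like this (a sketch reconstructed from the usage text above):

import argparse

parser = argparse.ArgumentParser(description="Running Whisper TFlite test inference.")
parser.add_argument("-f", "--folder", help="Folder with WAV input files")
parser.add_argument("-m", "--model", help="Path to model")
parser.add_argument("-t", "--threads", help="Threads used")
parser.add_argument("-l", "--lang", help="Language used")
parser.add_argument("-r", "--runtime",
                    help="Tensorflow runtime, use '1' for tf.lite or '2' for tflite_runtime")
args = parser.parse_args()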

StuartIanNaylor commented 1 year ago

Change the name, @nyadla-sys; we can change things, but I think we are all used to the original naming convention.

StuartIanNaylor commented 1 year ago

@fquirin dunno why, but the TensorFlow guys rewrote the test script for me. Maybe it's because the full TF has some special-sauce optimisation with cmake for Arm that we are missing with the TensorFlow Lite Python wheel package build?

https://github.com/tensorflow/tensorflow/issues/59273#issuecomment-1384441333

nyadla-sys commented 1 year ago

Change the name @nyadla-sys as we can change things but think we are all used the the original naming convention

I have updated the model to "whisper-tiny.en.tflite" and I need to update the README. For backward compatibility, I will keep "whisper.tflite" for a while.

fquirin commented 1 year ago

Error with whisper-small.tflite (I think we had this somewhere a few hours ago already?):

Traceback (most recent call last):
  File "/home/pi/whisper-tflite/openai-whisper/test.py", line 93, in <module>
    transcribe(args.folder + file)
  File "/home/pi/whisper-tflite/openai-whisper/test.py", line 68, in transcribe
    interpreter.invoke()
  File "/home/pi/whisper-tflite/venv/lib/python3.9/site-packages/tensorflow/lite/python/interpreter.py", line 917, in invoke
    self._interpreter.Invoke()
RuntimeError: gather index out of boundsNode number 35 (GATHER) failed to invoke.Node number 3435 (WHILE) failed to invoke.

Dunno why, but the TensorFlow guys rewrote the test script for me. Maybe it's because the full TF has some special-sauce optimisation with cmake for Arm that we are missing with the TensorFlow Lite Python wheel package build?

It must be something like that, yes 🤔

nyadla-sys commented 1 year ago

Error with whisper-small.tflite (I think we had this somewhere a few hours ago already?):

Traceback (most recent call last):
  File "/home/pi/whisper-tflite/openai-whisper/test.py", line 93, in <module>
    transcribe(args.folder + file)
  File "/home/pi/whisper-tflite/openai-whisper/test.py", line 68, in transcribe
    interpreter.invoke()
  File "/home/pi/whisper-tflite/venv/lib/python3.9/site-packages/tensorflow/lite/python/interpreter.py", line 917, in invoke
    self._interpreter.Invoke()
RuntimeError: gather index out of boundsNode number 35 (GATHER) failed to invoke.Node number 3435 (WHILE) failed to invoke.

Dunno why, but the TensorFlow guys rewrote the test script for me. Maybe it's because the full TF has some special-sauce optimisation with cmake for Arm that we are missing with the TensorFlow Lite Python wheel package build?

It must be something like that, yes 🤔

Can you provide me with additional information such as the operating system, machine, and whether you are using a minimal C++ build or a Python script?

fquirin commented 1 year ago

Can you provide me with additional information such as the operating system, machine, and whether you are using a minimal C++ build or a Python script?

Sure: Aarch64, Raspberry Pi 400, 4GB RAM, Debian Bullseye (11), Python script.

Maybe my Pi is actually out of memory when using the small model but according to OpenAI 2GB should be fine 🤔

j1nx commented 1 year ago

Can you provide me with additional information such as the operating system, machine, and whether you are using a minimal C++ build or a Python script?

Sure: Aarch64, Raspberry Pi 400, 4GB RAM, Debian Bullseye (11), Python script.

Maybe my Pi is actually out of memory when using the small model but according to OpenAI 2GB should be fine 🤔

Run 'htop' in another shell and you'll see what is going on.

fquirin commented 1 year ago

Some recent benchmark results with my Rpi400:

Whisper TFlite - tiny-en - tensorflow.lite - 4 threads:
-------------------------------------------------------

Loading audio file: samples/jfk.wav
Samplerate: 16000, length: 11.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Inference took 4.083s for 11.000s audio file.

Whisper TFlite - tiny-en - tflite_runtime - 4 threads:
------------------------------------------------------

Loading audio file: samples/jfk.wav
Samplerate: 16000, length: 11.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Inference took 5.668s for 11.000s audio file.

Run 'htop' in another shell and you'll see what is going on.

Runs at 780MB (used) for a while, then quickly maxes out at 1.8GB (~50%) and crashes.

What I did notice during the test: tensorflow.lite seems to use the 4 cores more efficiently than tflite_runtime 🤔

[EDIT] Screenshot (tflite_runtime: TOP, tensorflow.lite: BOTTOM):

[screenshot]

nyadla-sys commented 1 year ago

Can you provide me with additional information such as the operating system, machine, and whether you are using a minimal C++ build or a Python script?

I ran the C++ minimal and Python builds on the below, and both work fine:

Linux pop-os 5.19.0-76051900-generic #202207312230~1663791054~22.04~28340d4 SMP PREEMPT_DYNAMIC Wed S x86_64 x86_64 x86_64 GNU/Linux

fquirin commented 1 year ago

I run C++ minimal and python build on below and both works fine

I've tested it on my x86 Debian 11 laptop and it worked as well (Python test.py). So it seems to be an ARM or Raspberry Pi issue 🤔.

Language selection still doesn't work though. The small model with the "de" setting adds "!!!" to the beginning of a line and removes some words or entire texts, but never gives any results in German.

j1nx commented 1 year ago

What does "-funsafe-math-optimizations" do, exactly?

Because all the TensorFlow Lite documentation shows it should be used, and so I did: https://www.tensorflow.org/lite/guide/build_cmake_arm

However, looking at the build script used within the repo, I can't find it (anymore): https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/pip_package/build_pip_package_with_cmake.sh

It is also not present using Bazel: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/pip_package/build_pip_package_with_bazel.sh

However, the Bazel-based build script does explicitly set "-O3", which is not done with cmake. (I also do not set it explicitly, and compile OpenVoiceOS-buildroot with "-O2".)

Perhaps the ~25% comes from there?

j1nx commented 1 year ago

Just double-checked whether I had used the right multilingual vocab filter bin, and indeed I did. It has something to do with the model and the gather function being different between the tflite_runtime lib and the tensorflow lite lib.

mycroft@OpenVoiceOS-e3830c:~/whisper $ python3 test.py -f samples/ -m models/whisper-small.tflite -t 4 -l en -r 2
Importing tflite_runtime
Importing numpy
Importing whisper
Loading tflite model models/whisper-small.tflite ...
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.

Loading audio file: samples/jfk.wav
Samplerate: 16000, length: 11.0s
Calculating mel spectrogram...
Invoking interpreter ...
Traceback (most recent call last):
  File "/home/mycroft/whisper/test.py", line 93, in <module>
    transcribe(args.folder + file)
  File "/home/mycroft/whisper/test.py", line 68, in transcribe
    interpreter.invoke()
  File "/usr/lib/python3.10/site-packages/tflite_runtime/interpreter.py", line 917, in invoke
    self._interpreter.Invoke()
RuntimeError: gather index out of boundsNode number 35 (GATHER) failed to invoke.Node number 3435 (WHILE) failed to invoke.

And with the C++ version

mycroft@OpenVoiceOS-e3830c:~/whisper $ minimal models/whisper-small.tflite samples/jfk.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
ERROR: gather index out of bounds
ERROR: Node number 35 (GATHER) failed to invoke.
ERROR: Node number 3435 (WHILE) failed to invoke.
Error at ../minimal.cc:211

https://www.tensorflow.org/api_docs/python/tf/gather

StuartIanNaylor commented 1 year ago

funsafe-math-optimizations

A quirk of Neon in Armv7 devices is that it flushes all subnormal numbers to zero, and as a result the GCC compiler will not use it unless -funsafe-math-optimizations, which allows losing denormals, is turned on. "Enhanced" Neon, defined since Armv8, does not have this quirk, but as of GCC 8.2 the same flag is still required to enable Neon instructions. On the other hand, GCC does consider Neon safe on AArch64 for Armv8.

nyadla-sys commented 1 year ago

I managed to generate encoder and decoder tflite models; completing the decoder post-processing to generate tokens with text is still pending. https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_encoder_decoder_tflite.ipynb

StuartIanNaylor commented 1 year ago

https://shop.allnetchina.cn/products/rock5-model-a-pre-order-discount-redeem-code?variant=44314435813692

@fquirin

j1nx commented 1 year ago

I rebuilt the tflite_runtime with GPU support.

Running the Python-based inference now takes longer, while it still says it is using XNNPACK:

mycroft@OpenVoiceOS-e3830c:~/whisper $ python3 test.py -f samples/ -m models/whisper-tiny.en.tflite -t 4 -l en -r 2
Importing tflite_runtime
Importing numpy
Importing whisper
Loading tflite model models/whisper-tiny.en.tflite ...
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.

Loading audio file: samples/jfk.wav
Samplerate: 16000, length: 11.0s
Calculating mel spectrogram...
Invoking interpreter ...
Preparing output data ...
Converting tokens ...
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

Inference took 6.56s for 11.00s audio file.

There is no GPU GL support on Linux, only for Android. It is just in preparation for that Vulkan clvk thingy I would like to test.

j1nx commented 1 year ago

BTW @nyadla-sys, you are converting the small model with SELECT_OPS, which is not available in the tflite_runtime interpreter. Perhaps that is the reason why we can't run it while you can.
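If so, the conversion would have enabled TF select ops roughly like this (a sketch; the saved-model path is hypothetical). Models converted this way need the Flex delegate at runtime, which a plain tflite_runtime build does not include:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("whisper-small-saved-model")  # hypothetical path
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # regular TFLite kernels
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TensorFlow ops (Flex delegate)
]
tflite_model = converter.convert()
open("whisper-small.tflite", "wb").write(tflite_model)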

j1nx commented 1 year ago

@StuartIanNaylor I'm in the process of cross-compiling ComputeLibrary and ArmNN for the RPi4 (armv8a). Interested to see if ArmNN outperforms XNNPACK. They claim it does, so interested to see...

nyadla-sys commented 1 year ago

I successfully executed the TFLite encoder and decoder models; this opens the door to running them on two different processors, and it supports multilingual use along with the translate feature.

https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_encoder_decoder_tflite.ipynb
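A rough sketch of what driving the two models separately could look like (file names, tensor shapes, and the placeholder input are assumptions; see the notebook for the real flow):

import numpy as np
from tflite_runtime.interpreter import Interpreter

# Encoder: mel spectrogram in, audio embedding out.
encoder = Interpreter("whisper-encoder.tflite")  # hypothetical file name
encoder.allocate_tensors()
enc_in = encoder.get_input_details()[0]
enc_out = encoder.get_output_details()[0]

mel = np.zeros((1, 80, 3000), dtype=np.float32)  # placeholder; the real mel is computed as in test.py
encoder.set_tensor(enc_in["index"], mel)
encoder.invoke()
audio_embedding = encoder.get_tensor(enc_out["index"])

# The decoder is then invoked autoregressively: feed it audio_embedding plus
# the tokens generated so far, append the argmax token each step, and stop at
# end-of-text. Since the two interpreters are independent, the encoder and
# decoder can run on different processors.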

nyadla-sys commented 1 year ago

This will open up multilingual support for whisper tflite models.

StuartIanNaylor commented 1 year ago

@StuartIanNaylor I'm in the process of cross-compiling ComputeLibrary and ArmNN for the RPi4 (armv8a). Interested to see if ArmNN outperforms XNNPACK. They claim it does, so interested to see...

I have had some problems with ArmNN in that it seems very platform-dependent; you might find that if you use Ubuntu 22.04 for the Pi it works, whilst you could have problems otherwise, but see how you go. It's whatever version they use in the Wav2Letter example.

@nyadla-sys This is interesting, as for those who can run on GPU/CPU/NPU, what benchmarks can be provided? I haven't looked or tried yet, but will.

j1nx commented 1 year ago

@StuartIanNaylor I'm in the process of cross-compiling ComputeLibrary and ArmNN for the RPi4 (armv8a). Interested to see if ArmNN outperforms XNNPACK. They claim it does, so interested to see...

I have had some problems with ArmNN in that it seems very platform-dependent; you might find that if you use Ubuntu 22.04 for the Pi it works, whilst you could have problems otherwise, but see how you go. It's whatever version they use in the Wav2Letter example.

That is most likely because of this: https://github.com/ARM-software/ComputeLibrary/blob/main/SConstruct#L93

For the OpenVoiceOS project, everything gets compiled from source, optimized for the specific board (for now RPi only, but others might follow).

It defaults to armv7a, while for your board you'd be better off using one of the arm64-v8* architectures.

StuartIanNaylor commented 1 year ago

@j1nx Have you managed to build it? https://review.mlplatform.org/plugins/gitiles/ml/armnn/+/747b9c6748802f862a86c85e43ba028b64ac809a/delegate/BuildGuideNative.md

I am still playing with a delegate build and have this in minimal.cc:

/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#include "tensorflow/lite/core/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"
#include "tensorflow/lite/optional_debug_tools.h"
#include "whisper.h"
#include "input_features.h"

// This is an example that is minimal to read a model
// from disk and perform inference. There is no data being loaded
// that is up to you to add as a user.
//
// NOTE: Do not add any dependencies to this that cannot be built with
// the minimal makefile. This example must remain trivial to build with
// the minimal build tool.
//
// Usage: minimal <tflite model>

#define TFLITE_MINIMAL_CHECK(x)                              \
  if (!(x)) {                                                \
    fprintf(stderr, "Error at %s:%d\n", __FILE__, __LINE__); \
    exit(1);                                                 \
  }

int main(int argc, char* argv[]) {
  if ((argc != 2) && (argc != 3)) {
    fprintf(stderr, "'minimal <tflite model>' or 'minimal <tflite model> <pcm_file name>'\n");
    return 1;
  }
  const char* filename = argv[1];
  whisper_filters filters;
  whisper_mel mel;
  struct timeval start_time,end_time;
  std::string word;
  int32_t n_vocab = 0;
  std::string fname = "./filters_vocab_gen.bin";
  auto fin = std::ifstream(fname, std::ios::binary);
  {
    uint32_t magic=0;
    fin.read((char *) &magic, sizeof(magic));
    //@magic:USEN
    if (magic != 0x5553454e) {
        printf("%s: invalid vocab file '%s' (bad magic)\n", __func__, fname.c_str());
        return 0;
    }
  }

  // load mel filters
  {
      fin.read((char *) &filters.n_mel, sizeof(filters.n_mel));
      fin.read((char *) &filters.n_fft, sizeof(filters.n_fft));

      filters.data.resize(filters.n_mel * filters.n_fft);
      fin.read((char *) filters.data.data(), filters.data.size() * sizeof(float));
  }

  // load vocab
  {
    fin.read((char *) &n_vocab, sizeof(n_vocab));
    g_vocab.n_vocab = n_vocab;
    printf("\nn_vocab:%d\n",(int)n_vocab);

    for (int i = 0; i < n_vocab; i++) {
      uint32_t len;
      fin.read((char *) &len, sizeof(len));

      word.resize(len);
      fin.read((char *) word.data(), len);
      g_vocab.id_to_token[i] = word;
      //printf("len:%d",(int)len);
      //printf("'%s'\n", g_vocab.id_to_token[i].c_str());
    }

    g_vocab.n_vocab = 51864;//add additional vocab ids
    if (g_vocab.is_multilingual()) {
        g_vocab.token_eot++;
        g_vocab.token_sot++;
        g_vocab.token_prev++;
        g_vocab.token_solm++;
        g_vocab.token_not++;
        g_vocab.token_beg++;
    }
    for (int i = n_vocab; i < g_vocab.n_vocab; i++) {
        if (i > g_vocab.token_beg) {
            word = "[_TT_" + std::to_string(i - g_vocab.token_beg) + "]";
        } else if (i == g_vocab.token_eot) {
            word = "[_EOT_]";
        } else if (i == g_vocab.token_sot) {
            word = "[_SOT_]";
        } else if (i == g_vocab.token_prev) {
            word = "[_PREV_]";
        } else if (i == g_vocab.token_not) {
            word = "[_NOT_]";
        } else if (i == g_vocab.token_beg) {
            word = "[_BEG_]";
        } else {
            word = "[_extra_token_" + std::to_string(i) + "]";
        }
        g_vocab.id_to_token[i] = word;
        // printf("%s: g_vocab[%d] = '%s'\n", __func__, i, word.c_str());
    }
  }

  //Generate input_features for Audio file
  if (argc == 3) {
    const char* pcmfilename = argv[2];
    // WAV input
    std::vector<float> pcmf32;
    {
      drwav wav;
      if (!drwav_init_file(&wav, pcmfilename, NULL)) {
          fprintf(stderr, "%s: failed to open WAV file '%s' - check your input\n", argv[0],pcmfilename);
        //  whisper_print_usage(argc, argv, {});
          return 3;
      }

      if (wav.channels != 1 && wav.channels != 2) {
          fprintf(stderr, "%s: WAV file '%s' must be mono or stereo\n", argv[0], pcmfilename);
          return 4;
      }

      if (wav.sampleRate != WHISPER_SAMPLE_RATE) {
          fprintf(stderr, "%s: WAV file '%s' must be 16 kHz\n", argv[0], pcmfilename);
          return 5;
      }

      if (wav.bitsPerSample != 16) {
          fprintf(stderr, "%s: WAV file '%s' must be 16-bit\n", argv[0], pcmfilename);
          return 6;
      }

      int n = wav.totalPCMFrameCount;

      std::vector<int16_t> pcm16;
      pcm16.resize(n*wav.channels);
      drwav_read_pcm_frames_s16(&wav, n, pcm16.data());
      drwav_uninit(&wav);
      // convert to mono, float
      pcmf32.resize(n);
      if (wav.channels == 1) {
          for (int i = 0; i < n; i++) {
              pcmf32[i] = float(pcm16[i])/32768.0f;
          }
      } else {
          for (int i = 0; i < n; i++) {
              pcmf32[i] = float(pcm16[2*i] + pcm16[2*i + 1])/65536.0f;
          }
      }
    }

    //Hack: if the audio file is shorter than 30s, pad the rest with 0's
    pcmf32.resize((WHISPER_SAMPLE_RATE*WHISPER_CHUNK_SIZE),0);
    if (!log_mel_spectrogram(pcmf32.data(), pcmf32.size(), WHISPER_SAMPLE_RATE, WHISPER_N_FFT, WHISPER_HOP_LENGTH, WHISPER_N_MEL, 1,filters, mel)) {
      fprintf(stderr, "%s: failed to compute mel spectrogram\n", __func__);
      return -1;
    }

    printf("\nmel.n_len%d\n",mel.n_len);
    printf("\nmel.n_mel:%d\n",mel.n_mel);
  }//end of audio file processing

  // Load tflite model
  std::unique_ptr<tflite::FlatBufferModel> model =
      tflite::FlatBufferModel::BuildFromFile(filename);
  TFLITE_MINIMAL_CHECK(model != nullptr);

  // Build the interpreter with the InterpreterBuilder.
  // Note: all Interpreters should be built with the InterpreterBuilder,
  // which allocates memory for the Interpreter and does various set up
  // tasks so that the Interpreter can read the provided model.

  tflite::ops::builtin::BuiltinOpResolver resolver;
  tflite::InterpreterBuilder builder(*model, resolver);
  std::unique_ptr<tflite::Interpreter> interpreter;
  builder(&interpreter);
  TFLITE_MINIMAL_CHECK(interpreter != nullptr);

  // Allocate tensor buffers.
  TFLITE_MINIMAL_CHECK(interpreter->SetNumThreads(4) == kTfLiteOk);
  TFLITE_MINIMAL_CHECK(interpreter->AllocateTensors() == kTfLiteOk);
  //printf("=== Pre-invoke Interpreter State ===\n");
  // tflite::PrintInterpreterState(interpreter.get());
  // Get information about the memory area to use for the model's input.
  float* input = interpreter->typed_input_tensor<float>(0);
  if (argc == 2) {
    memcpy(input, _content_input_features_bin, WHISPER_N_MEL*WHISPER_MEL_LEN*sizeof(float)); //to load pre generated input_features
  }
  else if (argc == 3) {
    memcpy(input, mel.data.data(), mel.n_mel*mel.n_len*sizeof(float));
  }
  // Fill input buffers
  // TODO(user): Insert code to fill input tensors.
  // Note: The buffer of the input tensor with index `i` of type T can
  // be accessed with `T* input = interpreter->typed_input_tensor<T>(i);`
  gettimeofday(&start_time, NULL);
  // Run inference
  TFLITE_MINIMAL_CHECK(interpreter->Invoke() == kTfLiteOk);
  gettimeofday(&end_time, NULL);
  // Elapsed time with microsecond precision: borrow one second when the
  // usec difference is negative, and zero-pad the fraction with %06ld.
  long elapsed_sec = end_time.tv_sec - start_time.tv_sec;
  long elapsed_usec = end_time.tv_usec - start_time.tv_usec;
  if (elapsed_usec < 0) {
      elapsed_sec -= 1;
      elapsed_usec += 1000000;
  }
  printf("Inference time %ld.%06ld seconds \n", elapsed_sec, elapsed_usec);
  int output = interpreter->outputs()[0];
  TfLiteTensor *output_tensor = interpreter->tensor(output);
  TfLiteIntArray *output_dims = output_tensor->dims;
  // assume output dims to be something like (1, 1, ... ,size)
  auto output_size = output_dims->data[output_dims->size - 1];
  //printf("output size:%d\n",output_size );
  int *output_int = interpreter->typed_output_tensor<int>(0);
  std::string text = "";
  std::string word_add;
  for (int i = 0; i < output_size; i++) {
    //printf("%d\t",output_int[i]);
    if(output_int[i] == g_vocab.token_eot){
      break;
    }
    text += whisper_token_to_str(output_int[i]);
  }
  printf("\n%s\n", text.c_str());
  printf("\n");

  //printf("\n\n=== Post-invoke Interpreter State ===\n");
  ////  tflite::PrintInterpreterState(interpreter.get());
  // Read output buffers
  // TODO(user): Insert getting data out code.
  // Note: The buffer of the output tensor with index `i` of type T can
  // be accessed with `T* output = interpreter->typed_output_tensor<T>(i);`
  return 0;
}

If I configure and build with cmake -DTFLITE_ENABLE_XNNPACK=OFF ../tensorflow_src/tensorflow/lite/examples/minimal:

orangepi@orangepi5:~/openai-whisper/minimal_build$ ./minimal ../models/whisper.tflite ../samples/test_1.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
Inference time 2.469990 seconds

[_SOT_][_NOT_] David lost his yellow pencil. He could not find it. Where is my yellow pencil? Yes his sister. His sister did not know. I don't know where your pencil is. She said David thought about it. He thought and thought. He used his yellow pencil before lunch. He used it to write a note to his teacher. The notes said, dear teacher, thank you for helping me, David. He put the note in the envelope where was the envelope?

Then build with cmake ../tensorflow_src/tensorflow/lite/examples/minimal

orangepi@orangepi5:~/openai-whisper/minimal_build$ ./minimal ../models/whisper.tflite ../samples/test_1.wav

n_vocab:50257

mel.n_len3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 2.490417 seconds

[_SOT_][_NOT_] David lost his yellow pencil. He could not find it. Where is my yellow pencil? Yes his sister. His sister did not know. I don't know where your pencil is. She said David thought about it. He thought and thought. He used his yellow pencil before lunch. He used it to write a note to his teacher. The notes said, dear teacher, thank you for helping me, David. He put the note in the envelope where was the envelope?

Is that the speedup XNNPACK gives!? (2.469990s without vs. 2.490417s with.)

j1nx commented 1 year ago

Still working on it, as I am cross-compiling it into buildroot as a package.

Solving error after error, as usual, so I can't really comment on your performance numbers.

Have you tried the Python way with both the normal and the external delegates? Interested to see those numbers.

nyadla-sys commented 1 year ago

An Apple iPhone 11 takes only 0.7 seconds to run inference with whisper-tiny.en.tflite.

StuartIanNaylor commented 1 year ago

Prob the AMX blocks are the same sort of secret sauce that the M1 has:

The Apple A13 Bionic features an Apple-designed 64-bit six-core CPU implementing the ARMv8.4-A ISA, with two high-performance cores running at 2.65 GHz called Lightning and four energy-efficient cores called Thunder. The Lightning cores feature machine learning accelerators called AMX blocks.

https://medium.com/swlh/apples-m1-secret-coprocessor-6599492fc1e1 Apple sauce.

j1nx commented 1 year ago

Prob the AMX blocks are the same sort of secret sauce that the M1 has:

The Apple A13 Bionic features an Apple-designed 64-bit six-core CPU implementing the ARMv8.4-A ISA, with two high-performance cores running at 2.65 GHz called Lightning and four energy-efficient cores called Thunder. The Lightning cores feature machine learning accelerators called AMX blocks.

https://medium.com/swlh/apples-m1-secret-coprocessor-6599492fc1e1 Apple sauce.

Yup, now I know for sure: I am a software guy. 🤣