usefulsensors / openai-whisper

Robust Speech Recognition via Large-Scale Weak Supervision
MIT License

Error running minimal against your models. #2

Open sriram-srinivasan opened 1 year ago

sriram-srinivasan commented 1 year ago

Thank you for your work. I get this error when running against any of your tflite models.

This is on a mac, Apple clang version 13.1.6.

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
ERROR: Select TensorFlow op(s), included in the given model, is(are) not supported by this interpreter. Make sure you apply/link the Flex delegate before inference. For the Android, it can be resolved by adding "org.tensorflow:tensorflow-lite-select-tf-ops" dependency. See instructions: https://www.tensorflow.org/lite/guide/ops_select
ERROR: Node number 8 (FlexErf) failed to prepare.
ERROR: Select TensorFlow op(s), included in the given model, is(are) not supported by this interpreter. Make sure you apply/link the Flex delegate before inference. For the Android, it can be resolved by adding "org.tensorflow:tensorflow-lite-select-tf-ops" dependency. See instructions: https://www.tensorflow.org/lite/guide/ops_select
ERROR: Node number 8 (FlexErf) failed to prepare.
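
For context, Flex ops such as FlexErf end up in a .tflite file when the converter was allowed to fall back to full TensorFlow kernels; running such a model then requires an interpreter built with the Flex delegate (or the select-tf-ops dependency on Android). A rough conversion sketch that produces this kind of model (the SavedModel path here is hypothetical):

```python
import tensorflow as tf

# Hypothetical path to a SavedModel export of the Whisper model.
converter = tf.lite.TFLiteConverter.from_saved_model("whisper_saved_model")
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # regular TFLite kernels
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TF kernels (e.g. Erf)
]
tflite_model = converter.convert()

with open("whisper_flex.tflite", "wb") as f:
    f.write(tflite_model)
```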

sriram-srinivasan commented 1 year ago

I just saw the other issue and perhaps I should delete this one. None of the models in the GitHub repo worked, but the whisper.tflite model seems to at least not crash.

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time in seconds : 5
output size:21
50257 50362 1770 13 2264 346 353 318 262 46329 286 262 3504 6097 11 290 356 389 9675 284 7062

nyadla-sys commented 1 year ago

@sriram-srinivasan As you said earlier, whisper.tflite does the inference using a simple minimal example, and it produces the expected outcome.

For this I fed pre-generated input_features -> inference -> generated_ids.

We still need to add front-end processing and post-processing, which are pending for the final application, in order to work with real-time audio.
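
As a rough illustration of that post-processing step, the generated IDs printed above can be mapped back to text with a Whisper tokenizer. This is only a sketch and assumes the Hugging Face openai/whisper-tiny.en vocabulary matches the one the TFLite model was exported with:

```python
from transformers import WhisperTokenizer

# Token IDs as printed by the minimal example above (special tokens included).
generated_ids = [50257, 50362, 1770, 13, 2264, 346, 353, 318, 262, 46329,
                 286, 262, 3504, 6097, 11, 290, 356, 389, 9675, 284, 7062]

# Assumption: the model uses the tiny English checkpoint's vocabulary.
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en")
print(tokenizer.decode(generated_ids, skip_special_tokens=True))
```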

sriram-srinivasan commented 1 year ago

For front-end processing, am I correct in assuming that the log-mel spectrograms have to be quantized? If so, what should the quantization parameters be?

nyadla-sys commented 1 year ago

With the current model, quantization of the log-mel spectrograms is not necessary. The whisper.tflite hybrid model accepts float input features.

Open the whisper.tflite file in the app below to see more details about the accepted input type: https://netron.app/

sriram-srinivasan commented 1 year ago

Hmm. I did look at it in netron, but it doesn't give the dtype for the inputs. So I did this, and I am now confused. At first I only saw the dtype=int32 for the shape, but later on, there's a separate dtype of float32. I'm not sure what the latter means.

>>> import tensorflow.lite as tfl
>>> i = tfl.Interpreter("whisper.tflite")
>>> i.get_input_details()
[{'name': 'serving_default_input_features:0', 'index': 0, 
'shape': array([   1,   80, 3000], dtype=int32), 
'shape_signature': array([   1,   80, 3000], dtype=int32), 
'dtype': <class 'numpy.float32'>, 
'quantization': (0.0, 0), 
'quantization_parameters': {'scales': array([], dtype=float32), 
'zero_points': array([], dtype=int32), 
'quantized_dimension': 0}, 'sparsity_parameters': {}}]
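
For what it's worth, the int32 there is only the dtype of the shape arrays; the input tensor itself is float32, as the 'dtype' field shows. A minimal Python sketch that feeds a placeholder float32 array of the expected [1, 80, 3000] shape through the model (real log-mel features would come from the audio front end):

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="whisper.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Placeholder log-mel features; real ones come from the audio front end.
features = np.zeros((1, 80, 3000), dtype=np.float32)

interpreter.set_tensor(input_details[0]["index"], features)
interpreter.invoke()
generated_ids = interpreter.get_tensor(output_details[0]["index"])
print(generated_ids)
```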
nyadla-sys commented 1 year ago

@sriram-srinivasan I have just added front-end processing. Please refer to the minimal.cc file; the command to run is:

./minimal ~/openai-whisper/models/whisper.tflite ~/openai-whisper/test.wav

sriram-srinivasan commented 1 year ago

Thanks. I think you have forgotten to check in dr_wave.h.

There are a bunch of audio libraries on both iOS and Android that I lean on for converting audio to log-mel, so that's somewhat of a solved problem for me. I'm particularly interested in the post-processing decoding phase. If there's some way I can contribute my effort and time to this project, please let me know.
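
For anyone following along in Python rather than on iOS/Android, one way to produce those float input features is the Hugging Face feature extractor; this sketch assumes its default mel front end (80 bins, 30-second window) matches what the TFLite model expects:

```python
import numpy as np
from transformers import WhisperFeatureExtractor

# Placeholder: 30 seconds of 16 kHz mono audio (use real decoded WAV samples).
audio = np.zeros(16000 * 30, dtype=np.float32)

extractor = WhisperFeatureExtractor()  # defaults: 80 mel bins, 30 s window
features = extractor(audio, sampling_rate=16000, return_tensors="np").input_features
print(features.shape)  # expected: (1, 80, 3000)
```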

nyadla-sys commented 1 year ago

@sriram-srinivasan Sorry, I missed uploading the file; let me upload it. Additionally, I'm attempting to learn and implement the post-processing decoding step. Please feel free to submit a patch if you have a solution for this. My ultimate goal is to create a functioning Android app that can translate and/or transcribe audio while it is being live streamed. As soon as the entire program is functioning, I'll do some code refactoring and improve the README.

nyadla-sys commented 1 year ago

@sriram-srinivasan The end-to-end application is now running; please follow the steps mentioned in the README: https://github.com/usefulsensors/openai-whisper

sriram-srinivasan commented 1 year ago

Excellent. It works fine on macOS Monterey. Thanks.

There are two warnings that are easily fixed.

In input_features.h:

#ifndef INPUT_FEATURESH
#define INPUTFEATURESH  // should match the #ifndef macro above

In minimal.cc:59, it should be return 0, not return false.

sriram-srinivasan commented 1 year ago

minimal works on test.wav, but not on any WAV files I can generate, long or short. The WAV file format seems to be the same for both.

I'm converting the file to 16 kHz mono. Here's the ffmpeg information on your test.wav:

Input #0, wav, from 'test.wav':
  Duration: 00:00:30.00, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s

And this is from my test file.

Input #0, wav, from '/tmp/mytest.wav':
  Duration: 00:00:03.32, bitrate: 265 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s

nyadla-sys commented 1 year ago

At this time it works on 30-second audio chunks. Could you please share your audio file?
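
Since the model expects fixed 30-second windows (16000 Hz x 30 s = 480,000 samples, i.e. 3000 mel frames), a 3.32-second clip like the one above would presumably need to be zero-padded first. A minimal sketch, assuming 16 kHz mono float samples:

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_SAMPLES = SAMPLE_RATE * 30  # 480000 samples per 30-second window

def pad_or_trim(samples: np.ndarray) -> np.ndarray:
    """Zero-pad (or truncate) mono samples to exactly one 30-second chunk."""
    if len(samples) >= CHUNK_SAMPLES:
        return samples[:CHUNK_SAMPLES]
    return np.pad(samples, (0, CHUNK_SAMPLES - len(samples)))
```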

nyadla-sys commented 1 year ago

@sriram-srinivasan Could you please try with the latest change?

nyadla-sys commented 1 year ago

This is only a proof-of-concept project to create an Android app based on Whisper TFLite, which leverages the stock Android UI to show off its features. See Whisper-TFLIte-Android-Example.