usefulsensors / openai-whisper

Robust Speech Recognition via Large-Scale Weak Supervision

Python example? #11

[Open] fquirin opened this issue 1 year ago

fquirin commented 1 year ago

Hi @nyadla-sys ,

this is a very interesting work on OpenAI's Whisper 🙂👍.

I've built a multi-engine, streaming server for STT (SEPIA STT-Server) that runs on a Raspberry Pi and was thinking about Whisper integration a while ago, but didn't really follow up on it since Whisper is a non-streaming system by design. Then I saw your TFLite port and was wondering if it might be fast enough to get something like a pseudo-real-time experience ^^.

Since the SEPIA STT-Server is built on Python, I was wondering if you have a simple Python demo available? 🙂

Ty, Florian

nyadla-sys commented 1 year ago

Please refer to the attached Colab, which uses Python code to run inference on a TFLite model. If I have the time, I will write a simple Python script to demonstrate this. In general, you can run inference on a TFLite model from Python using the TensorFlow Lite Interpreter. Here is a sample snippet:

# Import necessary packages
import tensorflow as tf
import numpy as np

# Load the TFLite model
interpreter = tf.lite.Interpreter(model_path="model.tflite")

# Preprocess the input data (write the actual spectrogram data here)
input_data = np.random.randn(1, 256, 256, 3)
input_details = interpreter.get_input_details()
interpreter.resize_tensor_input(input_details[0]['index'], input_data.shape)
interpreter.allocate_tensors()
interpreter.set_tensor(input_details[0]['index'], input_data)

# Run inference
interpreter.invoke()

# Obtain and postprocess the output
output_details = interpreter.get_output_details()
output_data = interpreter.get_tensor(output_details[0]['index'])
output_data = output_data.squeeze()

# Save and/or visualize the results
np.savetxt("output.txt", output_data)

fquirin commented 1 year ago

Thanks, I'll check that out! 👍

fquirin commented 1 year ago

Using your instructions, I built a very naive example but haven't had much luck so far. Here is the code:

import wave

import tensorflow as tf
import numpy as np

audio_file="test_wavs/1089-134686-0001.wav"
print(f'Loading audio file: {audio_file}')

wf = wave.open(audio_file, "rb")
sample_rate_orig = wf.getframerate()
audio_length = wf.getnframes() * (1 / sample_rate_orig)
if (wf.getnchannels() != 1 or wf.getsampwidth() != 2
    or wf.getcomptype() != "NONE" or sample_rate_orig != 16000):
    print("Audio file must be WAV format mono PCM.")
    exit (1)

input_data = np.frombuffer(wf.readframes(wf.getnframes()), np.int16)
#input_data = np.random.randn(1, 256, 256, 3)
print(f'Samplerate: {sample_rate_orig}, length: {audio_length}s')

print(f'Loading tflite model ...')
interpreter = tf.lite.Interpreter(model_path="models/whisper.tflite")

input_details = interpreter.get_input_details()
interpreter.resize_tensor_input(input_details[0]['index'], input_data.shape)
interpreter.allocate_tensors()
interpreter.set_tensor(input_details[0]['index'], input_data)

interpreter.invoke()

output_details = interpreter.get_output_details()
output_data = interpreter.get_tensor(output_details[0]['index'])
output_data = output_data.squeeze()

np.savetxt("output.txt", output_data)

The error I get is (for both real and random data):

Loading tflite model ...
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Traceback (most recent call last):
  File "/home/pi/whisper-tflite/openai-whisper/test.py", line 29, in <module>
    interpreter.allocate_tensors()
  File "/home/pi/whisper-tflite/venv/lib/python3.9/site-packages/tensorflow/lite/python/interpreter.py", line 513, in allocate_tensors
    return self._interpreter.AllocateTensors()
RuntimeError: tensorflow/lite/kernels/transpose.cc:55 op_context->perm->dims->data[0] != dims (3 != 4)Node number 0 (TRANSPOSE) failed to prepare.Failed to apply the default TensorFlow Lite delegate indexed at 0.

I'm using Python 3.9 on aarch64, with tensorflow 2.11.0, tflite 2.10.0, and numpy 1.24.0. Any ideas?
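For what it's worth, a generic TFLite check (nothing specific to this model; the path is taken from the snippet above) is to print the interpreter's input details before resizing anything. The rank mismatch in the TRANSPOSE error suggests the model expects a differently shaped tensor (a batched mel spectrogram rather than a flat int16 sample array):

import tensorflow as tf

# Diagnostic sketch: dump what the TFLite model actually expects
# before resizing or setting any tensors (model path assumed from above)
interpreter = tf.lite.Interpreter(model_path="models/whisper.tflite")
for detail in interpreter.get_input_details():
    print(detail['name'], detail['shape'], detail['dtype'])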

nyadla-sys commented 1 year ago

I used Google Colab to test the code below.

Please install the required tools/repos below:

!git lfs install
!git clone https://github.com/usefulsensors/openai-whisper.git
!pip install git+https://github.com/openai/whisper.git 

and then run the following code to generate tokens:

import whisper
from whisper.audio import load_audio, log_mel_spectrogram, pad_or_trim, N_FRAMES, SAMPLE_RATE

import tensorflow as tf
import numpy as np

audio_file="/content/openai-whisper/samples/jfk.wav"
print(f'Loading audio file: {audio_file}')

mel_from_file = log_mel_spectrogram(audio_file)
input_data = pad_or_trim(mel_from_file, N_FRAMES)
input_data = tf.expand_dims(input_data, 0)

print(f'Loading tflite model ...')
# Load the TFLite model
interpreter = tf.lite.Interpreter(model_path="/content/openai-whisper/models/whisper.tflite")

# Resize the input to match the mel spectrogram, then allocate tensors
input_details = interpreter.get_input_details()
print(input_data.shape)
interpreter.resize_tensor_input(input_details[0]['index'], input_data.shape)
interpreter.allocate_tensors()

interpreter.set_tensor(input_details[0]['index'], input_data)

interpreter.invoke()

output_details = interpreter.get_output_details()
output_data = interpreter.get_tensor(output_details[0]['index'])
print(output_data)

Convert tokens into text

import torch
wtokenizer = whisper.tokenizer.get_tokenizer(False, language="en")
for token in output_data:
    token[token == -100] = wtokenizer.eot
    text = wtokenizer.decode(token, skip_special_tokens=True)
    print(text)

fquirin commented 1 year ago

Thanks! Almost there, I think. Output data is generated, but the token conversion throws the following error 🤔:

Traceback (most recent call last):
  File "/home/pi/whisper-tflite/openai-whisper/test.py", line 55, in <module>
    token[token == -100] = wtokenizer.eot
TypeError: 'numpy.int32' object does not support item assignment

nyadla-sys commented 1 year ago

Please run the Colab notebook at this link: https://colab.research.google.com/drive/1HFhOIdi-cO3FBpaOWOOh6Ey9kxBbj4Hq?usp=sharing

fquirin commented 1 year ago

Ah sorry, I didn't see that you had removed the output_data = output_data.squeeze() from the initial example! Now it's working! 🤩 Thanks a lot! 😃
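For anyone hitting the same TypeError: with squeeze() the output collapses to a 1-D array, so iterating over it yields scalar numpy.int32 values that don't support item assignment; without squeeze(), output_data stays 2-D and each token is a full sequence. A minimal sketch of the working decode step (same loop as in the Colab above):

# Decode, assuming output_data kept its batch dimension (no squeeze())
wtokenizer = whisper.tokenizer.get_tokenizer(False, language="en")
for token in output_data:
    token[token == -100] = wtokenizer.eot
    print(wtokenizer.decode(token, skip_special_tokens=True))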

I saw you've worked on a streaming version as well. I read that this will likely increase the WER due to the missing context (Whisper is trained on 30s windows, if I remember correctly), but I'd like to try it and see what happens ^^. Could you maybe give me a hint on how to adapt the basic demo to handle chunks instead of all the data at once? Is it even possible with this configuration?

fquirin commented 1 year ago

Hey @nyadla-sys, I was doing some tests today and wondered if it is possible to use the whisper-int8.tflite model (on aarch64)? With the current setup it's giving me this error: Cannot set tensor: Got value of type FLOAT32 but expected type INT64 for input 0

I'm assuming this is because the input has the wrong data type:

mel_from_file = whisper.audio.log_mel_spectrogram(audio_file)
input_data = whisper.audio.pad_or_trim(mel_from_file, whisper.audio.N_FRAMES)
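A generic TFLite check that might help narrow this down (a sketch only; the int8 model path is assumed and nothing here is specific to this repo's export) is to inspect the quantized model's input details and compare them with the float model's:

import tensorflow as tf

# Hypothetical diagnostic: see what the int8 model expects
# (dtype, shape and quantization scale/zero-point); model path assumed
interpreter_int8 = tf.lite.Interpreter(model_path="models/whisper-int8.tflite")
inp = interpreter_int8.get_input_details()[0]
print(inp['shape'], inp['dtype'], inp['quantization'])

The INT64 in the error message hints that the converted model's input signature differs from the float model's mel-spectrogram input, rather than a simple cast being missing.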

j1nx commented 1 year ago

@fquirin Interesting!

Did you get the Python to work with the streaming version?

Btw, I have run inference on aarch64 on both an RPi3 and an RPi4 without problems, as seen in my benchmark issue: https://github.com/usefulsensors/openai-whisper/issues/8

As inference on the standard 11-second jfk.wav takes only about 5 seconds to transcribe, using this from Python together with streaming mode could make for a very nice STT engine on embedded devices.

Is your Python work available somewhere?

fquirin commented 1 year ago

Did you get the Python to work with the streaming version?

I kind of gave up on streaming for the current version of Whisper. The model is just not built for it, and the workarounds all mess with the context and cut audio at specific time intervals, which leads to all sorts of artifacts and WER problems, at least in my experience. I hope OpenAI will make a proper streaming model soon (same as Nvidia NeMo ^^).

As inference on the standard 11-second jfk.wav takes only about 5 seconds to transcribe, using this from Python together with streaming mode could make for a very nice STT engine on embedded devices.

For SEPIA I rarely need anything longer than 4s; unfortunately, that means the user experience is rather bad when you have to wait another 3-4s for the result (on an RPi4) AFTER you've finished speaking.

Is your Python work available somewhere?

I quickly put together a new repository for ASR experiments with the test code 🙂. Btw, I posted some performance tests over at the whisper.cpp repo; maybe you've seen them already.

nyadla-sys commented 1 year ago

Hey @nyadla-sys, I was doing some tests today and wondered if it is possible to use the whisper-int8.tflite model (on aarch64)? With the current setup it's giving me this error: Cannot set tensor: Got value of type FLOAT32 but expected type INT64 for input 0

I'm assuming this is because the input has the wrong data type:

mel_from_file = whisper.audio.log_mel_spectrogram(audio_file)
input_data = whisper.audio.pad_or_trim(mel_from_file, whisper.audio.N_FRAMES)

Currently I am seeing an issue with the TFLite converter for int8 Whisper model conversion and am working with Google to resolve it.

fquirin commented 1 year ago

Thanks for the info 👍

nyadla-sys commented 1 year ago

I am investigating the option of converting the encoder and decoder into separate TFLite models and will provide additional information later.

j1nx commented 1 year ago

@fquirin I have seen your performance reports at whisper.cpp. They are in line with what I did and reported for the RPi4 and RPi3.

Also, I fully agree with you: streaming really has a PPP (Piss Poor Performance) WER. Just feeding the few-second WAV to the normal binary, which pads the audio with silence up to 30 seconds, works surprisingly well, so no clue what is happening there. I guess the streaming mode throws away too much at the beginning.

It really needs to keep all the audio from beginning to end and just continue inference on the growing audio until VAD detection shuts it down. From that point, one last inference on the final WAV needs to happen. I just don't know if that is possible or if it already happens.

@nyadla-sys I tried testing your latest medium model; however, as expected, it does not fit into the 2 GB of memory on the RPi4. Otherwise it appears to work, as it does not segfault like reported in that other issue. Could you also upload TFLite-converted base and small models? I am able to load them into memory running whisper.cpp, so I guess the TFLite ones should fit as well.

Anyhow, great work!

fquirin commented 1 year ago

streaming really has a PPP (Piss Poor Performance) WER

The way I understand it is that non-streaming Transformer models need to see the whole input at once to reach their high accuracy, because they are trained on long context. In Whisper's case this seems to be a 30s window, meaning the model looks at all 30s at once and the first second will influence the last. That is also the reason for hallucinations: in a way, the model recognizes a certain part and makes up the most probable rest, similar to LLMs that finish a story. If you chop your input into small parts, it is currently not able to remember the previous part and works out of context.

It really needs to keep all the audio from beginning to end and just continue inference on the growing audio until VAD detection shuts it down. From that point, one last inference on the final WAV needs to happen. I just don't know if that is possible or if it already happens.

I've seen systems doing that, but for obvious reasons on high-performance machines like a Mac, where you can afford to transcribe the same data over and over again until your audio is complete or has reached the VAD stop signal.
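For completeness, the growing-buffer idea described above would look roughly like this (a rough sketch only; read_chunk, vad_says_stop and transcribe are hypothetical stand-ins for an audio source, a VAD and the TFLite pipeline from earlier in this thread, and every partial pass re-runs full inference on the padded buffer, which is exactly the cost problem on a Pi):

import numpy as np

# Rough sketch: re-transcribe the growing utterance until VAD stops it,
# then run one final pass on the complete audio
audio = np.zeros(0, dtype=np.int16)
while True:
    chunk = read_chunk()                  # hypothetical: ~0.5-1.0 s of 16 kHz mono PCM
    audio = np.concatenate([audio, chunk])
    print("partial:", transcribe(audio))  # hypothetical: pad/trim to 30 s, run the TFLite model
    if vad_says_stop(chunk):              # hypothetical end-of-speech signal
        break
print("final:", transcribe(audio))        # one last inference on the full utterance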

nyadla-sys commented 1 year ago

Could you also upload TFLite-converted base and small models?

I will update base and small models ASAP