usefulsensors / openai-whisper

Robust Speech Recognition via Large-Scale Weak Supervision
MIT License

Only recognises 1st few seconds #6

Open StuartIanNaylor opened 1 year ago

StuartIanNaylor commented 1 year ago

Is it due to mel.n_len:3000 being the max for a single inference? If you feed some of the longer samples that whisper.cpp uses, I presume it's the mel.n_len:3000 max, as I know they are much longer.

rock@rock-5b:~/openai-whisper/minimal_build$ ./minimal ../models/whisper.tflite ../samples/hp0.wav

n_vocab:50257

mel.n_len:3000

mel.n_mel:80
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 2 seconds

[_SOT_][_NOT_] Henry F. Phillips from Wikipedia, the free encyclopedia at en.wicopedia.org.

Anyway, brilliant to see it running on tflite. I'm running on Debian with an RK3588; I only have the 8GB model, so zram & a 32GB swap on an NVMe finally got me there :)

I was wondering, as you have split Whisper up into several models, do any of the models quantise to full int8? I have also been playing with ArmNN, which could also use the Mali G610. There is also a 3-core 6 Tops NPU (3x 2 Tops). ArmNN I think needs full int8 quant; not really sure about RKNPU2 https://github.com/rockchip-linux/rknn-toolkit2 as I haven't really played with it.

But I'm wondering if some of them could be, and since the models are already partitioned, maybe the ones that don't quantise could run on the CPU and the rest use either the GPU or NPU?
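For what it's worth, something like this is roughly what I had in mind for full int8 post-training quantisation. Just a sketch, assuming a TensorFlow SavedModel of the encoder has been exported somewhere (the saved_model_encoder/ path here is hypothetical), and a real representative dataset of log-mel spectrograms would replace the random placeholder:

import numpy as np
import tensorflow as tf

# Hypothetical path to a SavedModel export of the Whisper encoder.
SAVED_MODEL_DIR = "saved_model_encoder"

def representative_dataset():
    # Ideally ~100 real log-mel spectrograms; random data is only a placeholder.
    for _ in range(100):
        mel = np.random.rand(1, 80, 3000).astype(np.float32)
        yield [mel]

converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force int8-only ops so NPU backends (ArmNN, RKNN) can take the whole graph.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("whisper-encoder-int8.tflite", "wb") as f:
    f.write(tflite_model)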

nyadla-sys commented 1 year ago

Each inference run takes 30 seconds of audio; to handle more than 30 seconds, we must modify the application code. I could produce a hybrid whisper.tflite (activations in float32, weights in int8); generating a fully int8-quantised tflite model is work in progress.
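Roughly, the application-side change would look something like the sketch below: slice the audio into 30-second windows and run the existing inference on each window. run_whisper_tflite here is a placeholder for whatever the app already does with a (1, 80, 3000) mel tensor, and chunk boundaries can still split words, so a production version would want some overlap/merging.

import numpy as np
import whisper  # openai-whisper: load_audio / pad_or_trim / log_mel_spectrogram

CHUNK_SECONDS = 30
SAMPLE_RATE = 16000
CHUNK_SAMPLES = CHUNK_SECONDS * SAMPLE_RATE

def transcribe_long_audio(path, run_whisper_tflite):
    # Split the waveform into 30 s windows, pad the last one, and run the
    # existing TFLite inference per window; run_whisper_tflite(mel) -> text.
    audio = whisper.load_audio(path)  # 16 kHz mono float32
    texts = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = whisper.pad_or_trim(audio[start:start + CHUNK_SAMPLES])
        mel = whisper.log_mel_spectrogram(chunk)
        mel = np.expand_dims(mel, 0).astype(np.float32)  # (1, 80, 3000)
        texts.append(run_whisper_tflite(mel))
    return " ".join(texts)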

StuartIanNaylor commented 1 year ago

Yeah, it's no problem, I was just running longer samples to get a better benchmark, since the reported inference time doesn't have much resolution. I was mainly interested in how tflite benches against the ggml of whisper.cpp. To be honest I wish Whisper's window was 10 sec, as it would likely reduce load a lot and be much better aligned for command sentences. Transcription ASR and command ASR probably have slightly different profiles and requirements; Linux could do with a specification so that a wav or stream can feed any plugged-in model and just pass on any associated metadata with its output. It would be really good to see Whisper as a standalone Linux service, alongside the great work you and bjnortier seem to be doing on iOS/Android.

I was also looking at the models directory, where the encoder and decoder seem to be 2 different models, which could be great for running across multiple devices such as CPU/GPU/NPU. The RK3588 NPU is 3 cores of 2 Tops each, which got me thinking of the Coral Edge, which also has a 2-core 2-Tops version, or simply adding another m.2 or USB NPU. Tops is not the greatest metric, but 1 watt for 2 Tops seems common and is far more efficient than CPU or GPU. So the more the model can be partitioned to be faster and more efficient the better; efficiency-wise NPUs seem best, hence asking about int8 quant models. Also, faster than realtime seems a better option than streaming, as the model wants that 10/30-sec context for accuracy, so I'm trying to get the small model faster than realtime; it would likely still fit in 4-6 Tops, maybe?

nyadla-sys commented 1 year ago

@StuartIanNaylor I plan to create encoder and decoder TFLite models. I will keep this issue open until I have successfully generated the models.

StuartIanNaylor commented 1 year ago

@nyadla-sys That would open up the possibility of the larger models on relatively modest hardware. If they can then be imported into frameworks such as ArmNN/RKNN-Toolkit2, speech processing could likely reside on the CPU whilst the encoder & decoder are split across mobile GPU & NPU, or combinations of them. The WER actually drops off a bit of a cliff for the Tiny & Base models; only the Small model starts to approach the levels that have won Whisper so much acclaim, and it's actually the Medium model where Whisper ASR reaches SOTA levels. It could be interesting especially with an RK3588, as the Mali G610 is roughly equivalent to the CPU in terms of ML but doesn't have the threading limitations of TFLite, whilst the NPU is 2 Tops x 3 cores, and I think 2 Tops is roughly what either the CPU or GPU could achieve alone. Probably for the first time I might pay the Apple premium and try to pick up a second-hand M1 Mini 16GB after Xmas, as the ML performance per watt on those things is just insane.

nyadla-sys commented 1 year ago

@StuartIanNaylor I managed to generate encoder/decoder tflite models. Please refer to the notebook below for more details: https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_encoder_decoder_tflite.ipynb

StuartIanNaylor commented 1 year ago

That is really cool. I have got side-tracked on another project but will definitely give that a go. I need to have 2 separate threads for encoder/decoder and split CPU/GPU/NPU, probably CPU/GPU with TF, as RKNN-Toolkit2 looks great and I think the NPU might beat the GPU. I am suffering from "oh no, not another ML toolkit!" :)
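Something along these lines is what I have in mind for the split. Only a sketch: two interpreters handing the encoder output over a queue; the commented-out delegate line and the mels/greedy_decode names are placeholders, the model filenames are whatever the notebook actually writes, and the NPU would need RKNN's own runtime rather than a TFLite delegate:

import queue
import threading
import tensorflow as tf

# Illustrative only: on Linux a GPU/NPU external delegate .so could go here.
# delegate = tf.lite.experimental.load_delegate("libexternal_delegate.so")

enc = tf.lite.Interpreter(model_path="whisper-encoder-hybrid.tflite", num_threads=4)
dec = tf.lite.Interpreter(model_path="whisper-decoder-language-hybrid.tflite", num_threads=2)
enc.allocate_tensors()
dec.allocate_tensors()

handoff = queue.Queue(maxsize=2)

def encoder_worker(mels):
    # Run the encoder on each (1, 80, 3000) mel and hand the result to the decoder thread.
    inp = enc.get_input_details()[0]['index']
    out = enc.get_output_details()[0]['index']
    for mel in mels:
        enc.set_tensor(inp, mel)
        enc.invoke()
        handoff.put(enc.get_tensor(out))
    handoff.put(None)  # sentinel: no more work

def decoder_worker(decode_fn):
    # decode_fn(interpreter, encoder_output) -> text: a greedy token loop over the decoder.
    while True:
        enc_out = handoff.get()
        if enc_out is None:
            break
        print(decode_fn(dec, enc_out))

# mels: an iterable of (1, 80, 3000) float32 arrays; greedy_decode: a greedy decode function.
# threading.Thread(target=encoder_worker, args=(mels,)).start()
# threading.Thread(target=decoder_worker, args=(greedy_decode,)).start()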

You and Georgi have done a great job in dissecting Whisper and exporting in working formats, many thanks.

StuartIanNaylor commented 1 year ago

That's weird, as the tokeniser is a verbatim copy & paste but is giving no output.

python encoder-decoder.py

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
== Input details ==
name: serving_default_x.1:0
shape: [   1   80 3000]
type: <class 'numpy.float32'>

DUMP INPUT
{'name': 'serving_default_x.1:0', 'index': 0, 'shape': array([   1,   80, 3000], dtype=int32), 'shape_signature': array([   1,   80, 3000], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}

== Output details ==
name: PartitionedCall:0
shape: [   1 1500  384]
type: <class 'numpy.float32'>

DUMP OUTPUT
{'name': 'PartitionedCall:0', 'index': 557, 'shape': array([   1, 1500,  384], dtype=int32), 'shape_signature': array([   1, 1500,  384], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}

Inference time= 1.456505537033081
(1, 1500, 384)
[[[ 0.1696541   0.07308465  0.04009005 ...  0.05437365  0.14940399
    0.13367367]
  [-0.3666988   1.0912716   0.04315966 ...  0.24082077  0.16366142
    0.7891257 ]
  [ 0.07376133  1.5414851   0.41237825 ... -0.60459065 -0.8289902
    0.71447384]
  ...
  [ 0.7533766  -1.8113835   0.34083468 ... -0.24959154 -0.39260525
    0.47391105]
  [-0.03803208 -0.5879506   0.14105844 ...  0.60848457  0.19444972
    0.35870117]
  [ 0.1888677  -0.16534436 -1.418708   ...  0.077178   -0.5052388
   -0.16890407]]]
whisper-decoder-language-hybrid.tflite

(1, 4)
(1, 1500, 384)
400
370
452
7177
6280
1029
406
437
428
1941
393
360
337
291
11
1029
437
291
393
360
337
428
1941
13
50257
Inference time= 5.223773241043091

I just downloaded the models locally and would say they're working, but the tokeniser is not, even though the code is identical:

import whisper
import torch
import tensorflow as tf
import numpy as np
import argparse
import os
import warnings
import tqdm
from whisper.audio import load_audio, log_mel_spectrogram,pad_or_trim,N_FRAMES, SAMPLE_RATE
from transformers import AutoTokenizer
import time

def representative_dataset_random():
    for _ in range(100):
      data = np.random.rand(1, 80, 3000)
      yield [data.astype(np.float32)]

def representative_dataset():
    for _ in range(1):#Change this to 100 and provide 100 different audio files from known dataset 
      mel_from_file = log_mel_spectrogram('jfk.flac')
      segment = pad_or_trim(mel_from_file, N_FRAMES)
      segment = tf.expand_dims(segment, 0)
      print(segment.shape)
      yield [segment]

tflite_model_path = 'whisper-encoder-hybrid.tflite'

# Load the TFLite model and allocate tensors
interpreter_enc = tf.lite.Interpreter(model_path=tflite_model_path)
interpreter_enc.allocate_tensors()

print("== Input details ==")
print("name:", interpreter_enc.get_input_details()[0]['name'])
print("shape:", interpreter_enc.get_input_details()[0]['shape'])
print("type:", interpreter_enc.get_input_details()[0]['dtype'])

print("\nDUMP INPUT")
print(interpreter_enc.get_input_details()[0])

print("\n== Output details ==")
print("name:", interpreter_enc.get_output_details()[0]['name'])
print("shape:", interpreter_enc.get_output_details()[0]['shape'])
print("type:", interpreter_enc.get_output_details()[0]['dtype'])

print("\nDUMP OUTPUT")
print(interpreter_enc.get_output_details()[0])

# Get input and output tensors
input_details = interpreter_enc.get_input_details()
output_details = interpreter_enc.get_output_details()
output_tensor = interpreter_enc.get_output_details()[0]['index']

# Test the model with random data
input_shape = input_details[0]['shape']
mel_from_file = log_mel_spectrogram('jfk.flac')
input_tensor = pad_or_trim(mel_from_file, N_FRAMES)
input_tensor = tf.expand_dims(input_tensor, 0)

audio = whisper.load_audio('jfk.flac')
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio)
mel = np.expand_dims(mel,0)
start_time = time.time()
#input_tensor = np.array(input_tensor-128, dtype=np.int8)
interpreter_enc.set_tensor(input_details[0]['index'], mel)

interpreter_enc.invoke()
print("Whisper Encoder Inference executed successfully\n")
print("Inference time=", time.time() - start_time)
encoder_output_data = interpreter_enc.get_tensor(output_tensor)
print(encoder_output_data.shape)
print(encoder_output_data)
#np.savetxt("encoder_output.txt", encoder_output_data.reshape((3,-1)), fmt="%s", header=str(encoder_output_data.shape))

tflite_model_path='whisper-decoder-language-hybrid.tflite'
print(tflite_model_path)

# Load the TFLite model and allocate tensors
interpreter = tf.lite.Interpreter(model_path=tflite_model_path)
interpreter.allocate_tensors()

decoder_input_ids = torch.tensor([50258, 50266, 50358, 50363])
decoder_input_ids = tf.expand_dims(decoder_input_ids, 0)
print(decoder_input_ids.shape)
print(encoder_output_data.shape)

input_tensor_1 = interpreter.get_input_details()[0]['index']
interpreter.set_tensor(input_tensor_1, encoder_output_data)

input_tensor_2 = interpreter.get_input_details()[1]['index']
interpreter.resize_tensor_input(input_tensor_2, decoder_input_ids.shape)
# Allocate memory for input and output tensors
interpreter.allocate_tensors()
interpreter.set_tensor(input_tensor_2, decoder_input_ids)
output_tensor = interpreter.get_output_details()[0]['index']
start_tokens = [50258, 50266, 50358, 50363] #<|startoftranscript|><|ja|><|translate|><|notimestamps|>
tokens = start_tokens
start_time = time.time()
while(True):
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_tensor)    
    cleaned = np.argmax(output_data, axis=-1)
    last_token = cleaned[0,-1]
    print(last_token)
    tokens.append(last_token)
    new_value = tf.constant([last_token], dtype=tf.int64)
    new_value = tf.reshape(new_value, (1,1))
    decoder_input_ids = tf.concat([decoder_input_ids, new_value], axis=1)
    input_tensor_2 = interpreter.get_input_details()[1]['index']
    interpreter.resize_tensor_input(input_tensor_2, decoder_input_ids.shape)
    # Allocate memory for input and output tensors
    interpreter.allocate_tensors()
    interpreter.set_tensor(input_tensor_2, decoder_input_ids)
    if last_token == 50257:
      break

print("Inference time=", time.time() - start_time)
model_id = "openai/whisper-tiny"
tokenizer = AutoTokenizer.from_pretrained(model_id)
skip_special_tokens=True
tokenizer.batch_decode(np.expand_dims(tokens, axis=0), skip_special_tokens=skip_special_tokens)[0]

The inference time on those models now seems much slower though, so maybe it's not worth it anyway.

nyadla-sys commented 1 year ago

Could you please change

decoder_input_ids = torch.tensor([50258, 50266, 50358, 50363])  # <|startoftranscript|><|ja|><|translate|><|notimestamps|>

to

decoder_input_ids = torch.tensor([50258, 50259, 50359, 50363])  # <|startoftranscript|><|en|><|transcribe|><|notimestamps|>

StuartIanNaylor commented 1 year ago

No, the same: no output from the tokeniser and no error message. Just running locally with the same code from Colab. Much slower as well.

nyadla-sys commented 1 year ago

Make sure you have done the two steps below:

pip install optimum[onnxruntime] transformers git+https://github.com/openai/whisper.git

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import (
    set_seed,
    AutoProcessor
)
from pathlib import Path
import os

SEED = 42

def export_vanilla_optimized_onnx(model_checkpoint):
    set_seed(SEED)
    processor = AutoProcessor.from_pretrained(model_checkpoint)

    model = ORTModelForSpeechSeq2Seq.from_pretrained(model_checkpoint, from_transformers=True, use_cache=True)
    onnx_path = Path(os.path.join("exported_onnx_models/", model_checkpoint))
    model.save_pretrained(onnx_path)
    processor.save_pretrained(onnx_path)

export_vanilla_optimized_onnx('openai/whisper-tiny')

nyadla-sys commented 1 year ago

And also provide the full local path in model_id = "openai/whisper-tiny".
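For example, pointing the tokenizer at the folder written by the export snippet above rather than the hub id (the exact prefix depends on where the export was run; under Colab it sits beneath /content/):

from transformers import AutoTokenizer

# Local folder produced by export_vanilla_optimized_onnx('openai/whisper-tiny') above.
model_id = "exported_onnx_models/openai/whisper-tiny"
tokenizer = AutoTokenizer.from_pretrained(model_id)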

nyadla-sys commented 1 year ago

Please share your Python script and I can try it on my Linux machine and update you.

StuartIanNaylor commented 1 year ago

It's the above, but I am missing the full path, as I just have 'openai/whisper-tiny' and there is no model in that path. The code is what I pasted above.

nyadla-sys commented 1 year ago

[image: screenshot of a local openai/whisper-tiny model folder]

nyadla-sys commented 1 year ago

Please make sure you have downloaded the openai folder, something like the above, onto your machine.

StuartIanNaylor commented 1 year ago

Colab is very slow at times :) I will let you know. Nope, but isn't 'openai/whisper-tiny' a Hugging Face id rather than a local file path? In your example wouldn't the local path be /content/exported_onnx_models/...?