usefulsensors / openai-whisper

Robust Speech Recognition via Large-Scale Weak Supervision
MIT License

Multilingual model with spanish #19

Open AlejandroLanaspa opened 1 year ago

AlejandroLanaspa commented 1 year ago

I have been trying to follow https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/generate_tflite_from_whisper.ipynb to generate a multilingual model that I can use for the android app with spanish detection.

However, when doing so I was consistently getting the error 'TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'', which I could solve by adding forced_decoder_ids to the model. This works in the notebook; however, when I try to use the model in the Android app, I constantly get the following error message:

    E/tflite: gather index out of bounds
    E/tflite: Node number 35 (GATHER) failed to invoke.
    E/tflite: Node number 694 (WHILE) failed to invoke.
    E/ANR_LOG: >>> msg's executing time is too long
    E/ANR_LOG: Blocked msg = { when=-2s941ms what=0 target=android.view.ViewRootImpl$ViewRootHandler callback=android.view.View$PerformClick } , cost = 2832 ms
    E/ANR_LOG: >>>Current msg List is:
    E/ANR_LOG: Current msg <1> = { when=-2s940ms what=0 target=android.view.ViewRootImpl$ViewRootHandler callback=android.view.View$UnsetPressedState }
    E/ANR_LOG: Current msg <2> = { when=-2s830ms what=3 target=android.media.AudioRecord$NativeEventHandler }
    E/ANR_LOG: Current msg <3> = { when=-2s727ms barrier=9 }
    E/ANR_LOG: Current msg <4> = { when=-2s645ms what=3 target=android.view.GestureDetector$GestureHandler }
    E/ANR_LOG: >>>CURRENT MSG DUMP OVER<<<
    I/Quality: Blocked msg = Package name: com.whisper.android.tflitecpp [ schedGroup: 5 schedPolicy: 0 ] process the message: { when=-2s942ms what=0 target=android.view.ViewRootImpl$ViewRootHandler callback=android.view.View$PerformClick } took 2833 ms
    E/com.whisper.android.tflitecpp.MainActivity$WavAudioRecorder: Error occured in updateListener, recording is aborted
    W/System.err: java.io.IOException: write failed: EBADF (Bad file descriptor)
    W/System.err:     at libcore.io.IoBridge.write(IoBridge.java:654)
    W/System.err:     at java.io.RandomAccessFile.writeBytes(RandomAccessFile.java:546)
    W/System.err:     at java.io.RandomAccessFile.write(RandomAccessFile.java:559)
    W/System.err:     at com.whisper.android.tflitecpp.MainActivity$WavAudioRecorder$1.onPeriodicNotification(MainActivity.java:250)
    W/System.err:     at android.media.AudioRecord$NativeEventHandler.handleMessage(AudioRecord.java:2216)
    W/System.err:     at android.os.Handler.dispatchMessage(Handler.java:106)
    W/System.err:     at android.os.Looper.loopOnce(Looper.java:233)
    W/System.err:     at android.os.Looper.loop(Looper.java:344)
    W/System.err:     at android.app.ActivityThread.main(ActivityThread.java:8205)
    W/System.err:     at java.lang.reflect.Method.invoke(Native Method)
    W/System.err:     at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:589)
    W/System.err:     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1071)
    W/System.err: Caused by: android.system.ErrnoException: write failed: EBADF (Bad file descriptor)
    W/System.err:     at libcore.io.Linux.writeBytes(Native Method)
    W/System.err:     at libcore.io.Linux.write(Linux.java:296)
    W/System.err:     at libcore.io.ForwardingOs.write(ForwardingOs.java:951)
    W/System.err:     at libcore.io.BlockGuardOs.write(BlockGuardOs.java:447)
    W/System.err:     at libcore.io.ForwardingOs.write(ForwardingOs.java:951)
    W/System.err:     at libcore.io.IoBridge.write(IoBridge.java:649)
    W/System.err:     ... 11 more
    I/Choreographer: Skipped 163 frames! The application may be doing too much work on its main thread.

I generated the tflite by changing the following code in the notebook:

class GenerateModel(tf.Module):
  def __init__(self, model):
    super(GenerateModel, self).__init__()
    self.model = model

  @tf.function(
    # shouldn't need static batch size, but throws exception without it (needs to be fixed)
    input_signature=[
      tf.TensorSpec((1, 80, 3000), tf.float32, name="input_features"), 
    ],
  )
  def serving(self, input_features):
    outputs = self.model.generate(
      input_features,
      max_new_tokens = 223,
      return_dict_in_generate=True,
      forced_decoder_ids = [(1, 50262), (2, 50359), (3, 50363)] # ids resulting from processor.get_decoder_prompt_ids(language="spanish", task="transcribe")
    )
    return {"sequences": outputs["sequences"]}
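As a side note (not part of the original notebook), the hardcoded forced_decoder_ids above can be derived from the layout of Whisper's multilingual special tokens; the constants and helper below are an illustrative sketch of that layout, which should be cross-checked against processor.get_decoder_prompt_ids(language="spanish", task="transcribe"):

```python
# Sketch of the multilingual Whisper special-token layout (assumed, verify
# against the tokenizer): language tokens follow <|startoftranscript|> in the
# tokenizer's fixed language order, and task tokens sit at fixed ids.
SOT = 50258                     # <|startoftranscript|>
LANGUAGES = ["en", "zh", "de", "es", "ru", "ko", "fr"]  # first few, in Whisper's order
TRANSLATE = 50358               # <|translate|>
TRANSCRIBE = 50359              # <|transcribe|>
NO_TIMESTAMPS = 50363           # <|notimestamps|>

def forced_decoder_ids(lang: str, task: str = "transcribe"):
    """Return the (position, token_id) pairs that force language and task."""
    lang_token = SOT + 1 + LANGUAGES.index(lang)  # e.g. <|es|> = 50262
    task_token = TRANSCRIBE if task == "transcribe" else TRANSLATE
    return [(1, lang_token), (2, task_token), (3, NO_TIMESTAMPS)]

print(forced_decoder_ids("es"))  # [(1, 50262), (2, 50359), (3, 50363)]
```

The German runs later in this thread print _extra_token_50261, which matches <|de|> = 50258 + 1 + 2 in this layout.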

What am I doing wrong?

nyadla-sys commented 1 year ago

@AlejandroLanaspa Could you please try the two models below in the Android app? These are multilingual models:
https://github.com/usefulsensors/openai-whisper/blob/main/models/whisper-tiny.tflite
https://github.com/usefulsensors/openai-whisper/blob/main/models/whisper-small.tflite

nyadla-sys commented 1 year ago

Below are the results with the above models using the minimal example:

    mycroft@OpenVoiceOS-e3830c:~/whisper $ minimal models/whisper-tiny.tflite de_speech_thorsten_sample03_8s.wav
    n_vocab:50257
    mel.n_len:3000
    mel.n_mel:80
    INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
    Inference time 7 seconds
    [_extra_token_50258][_extra_token_50261][_extra_token_50359][BEG] Für mich sind alle Menschen gleich unabhängig von Geschlecht, sexuelle Orientierung, Religion, Hautfarbe oder Geo-Kordinaten der Geburt.[SOT]

    mycroft@OpenVoiceOS-e3830c:~/whisper $ minimal models/whisper-base.tflite de_speech_thorsten_sample03_8s.wav
    n_vocab:50257
    mel.n_len:3000
    mel.n_mel:80
    INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
    Inference time 12 seconds
    [_extra_token_50258][_extra_token_50261][_extra_token_50358][BEG] For me, all people are equally independent of gender, sex, orientation, religion, hate, or gender coordinates of birth.[SOT]

    mycroft@OpenVoiceOS-e3830c:~/whisper $ minimal models/whisper-small.tflite de_speech_thorsten_sample03_8s.wav
    n_vocab:50257
    mel.n_len:3000
    mel.n_mel:80
    INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
    Inference time 43 seconds
    [_extra_token_50258][_extra_token_50261][_extra_token_50359][BEG] Für mich sind alle Menschen gleich, unabhängig von Geschlecht, sexueller Orientierung, Religion, Hautfarbe oder Geo-Koordinaten der Geburt.[SOT]

(Note that the whisper-base run emitted task token _extra_token_50358, i.e. translate rather than transcribe, which is why its output is in English.)

nyadla-sys commented 1 year ago

Please make sure to use https://github.com/usefulsensors/openai-whisper/blob/main/models/filters_vocab_multilingual.bin instead of the English vocab binary.

AlejandroLanaspa commented 1 year ago

Thanks for the quick response. When using https://github.com/usefulsensors/openai-whisper/blob/main/models/whisper-tiny.tflite and https://github.com/usefulsensors/openai-whisper/blob/main/models/filters_vocab_multilingual.bin in the Android app it does not crash, but the transcription does not work properly for Spanish. That was the reason I tried to "enforce" it (and also to get rid of the printed [_extra_token_50258][_extra_token_50261][_extra_token_50359][BEG] prefix).

Any ideas?

nyadla-sys commented 1 year ago

Add something like the following to native_lib.cpp of the Android app as well:

    if ((output_int[i] != 50258) && (output_int[i] != 50261) && (output_int[i] != 50359))
        text += whisper_token_to_str(output_int[i]);

Also, please replace filters_vocab_gen.bin with https://github.com/usefulsensors/openai-whisper/blob/main/models/filters_vocab_multilingual.bin. For reference, this is where minimal loads the vocab file:
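A language-agnostic variant of the same filtering idea, sketched here in Python (the helper and toy mapping are illustrative, not from the app's source): in the multilingual models every token id at or above <|endoftext|> (50257, the n_vocab printed by minimal) is a special token, so dropping that whole range avoids hardcoding individual ids like 50258/50261/50359.

```python
# Illustrative sketch: skip all Whisper special tokens when building the
# transcript. In the multilingual vocab, ids >= 50257 (<|endoftext|>) are
# special tokens: start-of-transcript, language, task, and timestamp markers.
EOT = 50257  # <|endoftext|> in the multilingual vocab

def tokens_to_text(token_ids, token_to_str):
    """Join only the plain-text tokens, skipping every special token."""
    return "".join(token_to_str(t) for t in token_ids if t < EOT)

# Example with a toy id-to-string mapping (hypothetical ids):
toy_vocab = {15: "hola", 16: " mundo"}
print(tokens_to_text([50258, 50262, 50359, 15, 16, 50257], toy_vocab.get))  # hola mundo
```

The equivalent C++ change in native_lib.cpp would be a single `if (output_int[i] < 50257)` guard instead of the three hardcoded comparisons.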

int main(int argc, char* argv[]) {
  if ((argc != 2) && (argc != 3)) {
    fprintf(stderr, "'minimal <tflite model>' or 'minimal <tflite model> <pcm_file name>'\n");
    return 1;
  }
  const char* filename = argv[1];
  whisper_filters filters;
  whisper_mel mel;
  struct timeval start_time,end_time;
  std::string word;
  int32_t n_vocab = 0;
  std::string fname = "./filters_vocab_gen.bin";

nyadla-sys commented 1 year ago

We tested with German and it is working. I will try other languages and let you know.

AlejandroLanaspa commented 1 year ago

Thanks for the trick:

    if ((output_int[i] != 50258) && (output_int[i] != 50261) && (output_int[i] != 50359))
        text += whisper_token_to_str(output_int[i]);

As for filters_vocab_gen.bin, I was already replacing it with filters_vocab_multilingual.bin (renaming it to filters_vocab_gen.bin).

It still seems not to recognize me speaking in Spanish :/

nyadla-sys commented 1 year ago

@AlejandroLanaspa Could you please share a Spanish sample? I will test it and upload a new tflite model that supports Spanish.

AlejandroLanaspa commented 1 year ago

Here is a sample https://datasets-server.huggingface.co/assets/common_voice/--/es/train/99/audio/audio.mp3

Others accessible here https://huggingface.co/datasets/common_voice/viewer/es/train

Thank you very much, and also for the rest of your work. Awesome materials!

nyadla-sys commented 1 year ago

I created two tflite models, one for the encoder and one for the decoder, with multilanguage support. You may have to extend the Android app to run the two tflite models to perform ASR. https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_encoder_decoder_tflite.ipynb
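With separate encoder and decoder models, the app has to drive the autoregressive loop itself. The sketch below shows only that control flow under stated assumptions: decoder_fn stands in for an invocation of the decoder tflite interpreter (taking the tokens so far plus the encoder's audio features and returning next-token logits), and the token ids follow the multilingual vocab. It is not the notebook's actual code.

```python
import numpy as np

# Assumed multilingual special-token ids (see earlier in the thread).
SOT, EOT = 50258, 50257

def greedy_decode(decoder_fn, features, forced_ids, max_new_tokens=224):
    """Greedy decoding with language/task tokens forced at fixed positions.

    forced_ids is a list like [(1, 50262), (2, 50359), (3, 50363)] for
    Spanish transcription without timestamps.
    """
    tokens = [SOT]
    forced = dict(forced_ids)
    for pos in range(1, max_new_tokens + 1):
        if pos in forced:
            tokens.append(forced[pos])  # force language/task/no-timestamps
            continue
        # One decoder step: logits over the vocabulary for the next token.
        next_tok = int(np.argmax(decoder_fn(tokens, features)))
        tokens.append(next_tok)
        if next_tok == EOT:
            break
    return tokens
```

With the real models, decoder_fn would call the decoder interpreter after the encoder interpreter has produced the audio features; it is left abstract here so the loop structure stays visible.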