k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

How to use Whisper for multi-language STT/ASR on Android and iOS #715

Open KhoaNgo18 opened 5 months ago

KhoaNgo18 commented 5 months ago

I wanted to use the Whisper model for STT, but as I looked into the code written for Android and iOS, I couldn't find the function needed to initialize the Whisper model. I already see that Whisper is supported as an OfflineModel. By the way, I don't understand the concept of 2Pass; it would be great to get to know it better.

csukuangfj commented 5 months ago

please see https://github.com/k2-fsa/sherpa-onnx/blob/de655e838e7e1cc073275be119e7cdf0bd5d4108/android/SherpaOnnx2Pass/app/src/main/java/com/k2fsa/sherpa/onnx/SherpaOnnx.kt#L351-L373


https://github.com/k2-fsa/sherpa-onnx/blob/de655e838e7e1cc073275be119e7cdf0bd5d4108/android/SherpaOnnx2Pass/app/src/main/java/com/k2fsa/sherpa/onnx/MainActivity.kt#L209-L212

You can set secondType to either 2 or 3.

Remember to place the corresponding model files in the assets folder.


You can find pre-built ASR APKs with Whisper at https://github.com/k2-fsa/sherpa-onnx/releases/tag/v1.9.14

(Screenshot: 2024-03-28 14:21:07)

csukuangfj commented 5 months ago

Similarly, for iOS, please see

https://github.com/k2-fsa/sherpa-onnx/blob/de655e838e7e1cc073275be119e7cdf0bd5d4108/ios-swiftui/SherpaOnnx2Pass/SherpaOnnx2Pass/Model.swift#L96

https://github.com/k2-fsa/sherpa-onnx/blob/de655e838e7e1cc073275be119e7cdf0bd5d4108/ios-swiftui/SherpaOnnx2Pass/SherpaOnnx2Pass/SherpaOnnxViewModel.swift#L94

You need to place the corresponding model files in your project.

KhoaNgo18 commented 5 months ago

Can I use only Whisper? When I test with 2Pass, it can only detect English. As I understand the 2Pass code, I have to have two models, sherpa-onnx-streaming-zipformer-en-20M-2023-02-17 and sherpa-onnx-whisper-base.en; I cannot use only Whisper.

csukuangfj commented 5 months ago

Whisper is a non-streaming ASR model; you cannot use it for real-time streaming ASR.

We don't provide an APK or an example that uses Whisper alone for non-streaming ASR on Android/iOS, but we do provide the APIs.

So the answer is yes; you can use Whisper alone on Android/iOS.
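For illustration, a minimal Swift sketch of using Whisper alone (no streaming first pass) might look like the following. It assumes the helper functions from the SherpaOnnx2Pass example (`getResource`, `sherpaOnnxOfflineModelConfig`, `sherpaOnnxOfflineWhisperModelConfig`, `sherpaOnnxFeatureConfig`, `sherpaOnnxOfflineRecognizerConfig`) and the `SherpaOnnxOfflineRecognizer` wrapper are in the project; exact signatures should be checked against SherpaOnnx.swift in the repo:

```swift
// Sketch: running Whisper alone as a non-streaming recognizer.
// Assumes the Swift wrapper from the SherpaOnnx2Pass example is in the
// project and the model files have been added as bundle resources.
func createWhisperRecognizer() -> SherpaOnnxOfflineRecognizer {
  let encoder = getResource("tiny.en-encoder.int8", "onnx")
  let decoder = getResource("tiny.en-decoder.int8", "onnx")
  let tokens = getResource("tiny.en-tokens", "txt")

  let modelConfig = sherpaOnnxOfflineModelConfig(
    tokens: tokens,
    whisper: sherpaOnnxOfflineWhisperModelConfig(
      encoder: encoder,
      decoder: decoder
    ),
    numThreads: 1,
    modelType: "whisper"
  )

  // 16 kHz mono with 80-dim features is what the examples use.
  let featConfig = sherpaOnnxFeatureConfig(sampleRate: 16000, featureDim: 80)
  var config = sherpaOnnxOfflineRecognizerConfig(
    featConfig: featConfig,
    modelConfig: modelConfig
  )
  return SherpaOnnxOfflineRecognizer(config: &config)
}

// Usage: feed a complete utterance (16 kHz mono Float samples):
// let text = recognizer.decode(samples: samples).text
```

Since Whisper is non-streaming, you must pass it a complete utterance (for example, a recording segmented by VAD), not a live audio stream.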

KhoaNgo18 commented 5 months ago

Can you guide me on how to use the APIs, or at least tell me where the APIs are? I'm new to mobile and AI, so I appreciate your help a lot.

csukuangfj commented 5 months ago

You can find all the required APIs in our two-pass example, which I have already posted in the first comment.

If you are new to Android and iOS and are also new to Kotlin and Swift, then it may be difficult for you.

iprovalo commented 2 months ago

I was able to get the Whisper small multilingual model to work with this 2Pass code for the Romanian language:

func getNonStreamingWhisperSmall() -> SherpaOnnxOfflineModelConfig {
  let encoder = getResource("small-encoder.int8", "onnx")
  let decoder = getResource("small-decoder.int8", "onnx")
  let tokens = getResource("small-tokens", "txt")

  return sherpaOnnxOfflineModelConfig(
    tokens: tokens,
    whisper: sherpaOnnxOfflineWhisperModelConfig(
      encoder: encoder,
      decoder: decoder,
      language: "ro"
    ),
    numThreads: 1,
    modelType: "whisper"
  )
}

Then, in SherpaOnnxViewModel.initOfflineRecognizer(), change

let modelConfig = getNonStreamingWhisperTinyEn()

to

let modelConfig = getNonStreamingWhisperSmall()

I think it would make more sense for my use case to use VAD instead (as in SherpaOnnxVadAsr). I will try that next.

@csukuangfj how hard would it be to change the code so that the Whisper model takes the language parameter at runtime, rather than when the model is loaded? Could you please point me to the general code area?

Thank you!

iprovalo commented 2 months ago

@csukuangfj I noticed that if I just set the language to en, Whisper will switch from transcribe to translate mode.

iprovalo commented 2 months ago

@csukuangfj I think this is what I am looking for if I want to pass the language to decoder at runtime:

https://github.com/k2-fsa/sherpa-onnx/blob/master/sherpa-onnx/csrc/offline-whisper-greedy-search-decoder.cc#L30

csukuangfj commented 2 months ago

how hard is it to make the code changes for the whisper model to take the language param at runtime

I'm sorry; unfortunately, we don't provide an API for users to do that.

iprovalo commented 2 months ago

@csukuangfj my bad, I misread the code. The Whisper multilingual model config with an empty language works perfectly. I tested it with VAD in the iOS SherpaOnnx2Pass. After initializing the VAD, I just do something similar to the Android version:

// Inside the audio tap: feed converted 16 kHz samples to the VAD and
// decode each detected speech segment with the offline recognizer.
let array = convertedBuffer.array()
if !array.isEmpty {
    self.vad.acceptWaveform(samples: [Float](array))
    while !self.vad.isEmpty() {
        let s = self.vad.front()
        self.vad.pop()
        let lastSentence = self.offlineRecognizer.decode(samples: s.samples).text
        self.sentences.append(lastSentence)
        self.updateLabel()
    }
}
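For reference, the VAD setup this snippet assumes could be sketched as below, following the SherpaOnnxVadAsr example. The helper names (`sherpaOnnxSileroVadModelConfig`, `sherpaOnnxVadModelConfig`, `SherpaOnnxVoiceActivityDetectorWrapper`) and the parameter values are taken from that example and may differ between versions; check SherpaOnnx.swift in the repo:

```swift
// Sketch: creating the silero-vad based detector used above.
// Assumes silero_vad.onnx has been added as a bundle resource and
// getResource is the helper from the example project.
let sileroVad = sherpaOnnxSileroVadModelConfig(
  model: getResource("silero_vad", "onnx"),
  threshold: 0.5,           // speech probability threshold (illustrative)
  minSilenceDuration: 0.25, // seconds of silence that end a segment
  minSpeechDuration: 0.5,   // minimum segment length in seconds
  windowSize: 512           // samples per VAD window at 16 kHz
)
var vadConfig = sherpaOnnxVadModelConfig(sileroVad: sileroVad)
let vad = SherpaOnnxVoiceActivityDetectorWrapper(
  config: &vadConfig, buffer_size_in_seconds: 30
)
```

Each segment the VAD emits is then a complete utterance, which is exactly what a non-streaming model such as Whisper expects.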