k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0
3.34k stars 391 forks

iOS Inference for Nemo Model #1068

Closed: iprovalo closed this issue 3 months ago

iprovalo commented 3 months ago

I have built and run one example of the ASR successfully for iOS - getBilingualStreamZhEnZipformer20230220.

I am trying to run this model now: sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k

I added a method in swift:

func getBilingualStreamingNemoFastConformerTransducerBeDeEnEsFrHrItPlRuUk() -> SherpaOnnxOnlineModelConfig {
  let encoder = getResource("encoder", "onnx")
  let decoder = getResource("decoder", "onnx")
  let joiner = getResource("joiner", "onnx")
  let tokens = getResource("tokens", "txt")

  return sherpaOnnxOnlineModelConfig(
    tokens: tokens,
    transducer: sherpaOnnxOnlineTransducerModelConfig(
      encoder: encoder,
      decoder: decoder,
      joiner: joiner),
    numThreads: 1,
    modelType: "nemo"
  )
}

I am getting this error:

sherpa-onnx/csrc/online-transducer-nemo-model.cc:InitEncoder:313 window_size does not exist in the metadata

Could you please point me to an example of how to run this model?

Thank you!

csukuangfj commented 3 months ago

sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k is not a streaming model.

You can tell because the word `streaming` does not appear in the model filename.

Please use a streaming model from https://github.com/k2-fsa/sherpa-onnx/releases/tag/asr-models

iprovalo commented 3 months ago

@csukuangfj thank you very much!

Are there streaming multi-lingual models that cover the same languages as the one I mentioned, sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k?

If I want to try this multi-lingual nemo model, I would need to run the SherpaOnnx2Pass example, correct?

What is the difference between offline and streaming? Is streaming for an active microphone, and offline for wav files only?

csukuangfj commented 3 months ago

We have an Android APK for this model.

Please download it from

https://k2-fsa.github.io/sherpa/onnx/vad/apk-asr.html

iprovalo commented 3 months ago

@csukuangfj thank you!

I was able to build the Android APK locally and run this model:

        7 -> {
            val modelDir = "sherpa-onnx-nemo-fast-conformer-ctc-be-de-en-es-fr-hr-it-pl-ru-uk-20k"
            return OfflineModelConfig(
                nemo = OfflineNemoEncDecCtcModelConfig(
                    model = "$modelDir/model.onnx",
                ),
                tokens = "$modelDir/tokens.txt",
            )
        }

I could not find this model: sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k in the code. What is the correct configuration for this model?

I tried setting it up as a transducer:

        14 -> {
            val modelDir = "sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k"
            return OfflineModelConfig(
                transducer = OfflineTransducerModelConfig(
                    encoder = "$modelDir/encoder.onnx",
                    decoder = "$modelDir/decoder.onnx",
                    joiner = "$modelDir/joiner.onnx",
                ),
                tokens = "$modelDir/tokens.txt",
                modelType = "transducer",
            )
        }

And it crashes with these messages:

2024-01-31 14:19:49.380 24163-24163 sherpa-onnx             com.k2fsa.sherpa.onnx                I  Select model type 14 for ASR
2024-01-31 14:19:49.381 24163-24163 sherpa-onnx             com.k2fsa.sherpa.onnx                W  config:
                                                                                                    OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k/encoder.onnx", decoder_filename="sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k/decoder.onnx", joiner_filename="sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k/joiner.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="en", task="transcribe", tail_paddings=1000), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), telespeech_ctc="", tokens="sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k/tokens.txt", nu
2024-01-31 14:19:52.369 24163-24163 sherpa-onnx             com.k2fsa.sherpa.onnx                W  vocab_size does not exist in the metadata
csukuangfj commented 3 months ago

Please change

modelType = "transducer",

to

modelType = "nemo_transducer",

and everything should work as expected.
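Putting the two together, the corrected entry from the earlier snippet would look like this (paths and structure are taken from the configuration posted above; only `modelType` changes):

```kotlin
14 -> {
    val modelDir = "sherpa-onnx-nemo-fast-conformer-transducer-be-de-en-es-fr-hr-it-pl-ru-uk-20k"
    return OfflineModelConfig(
        transducer = OfflineTransducerModelConfig(
            encoder = "$modelDir/encoder.onnx",
            decoder = "$modelDir/decoder.onnx",
            joiner = "$modelDir/joiner.onnx",
        ),
        tokens = "$modelDir/tokens.txt",
        // "nemo_transducer" tells sherpa-onnx to read NeMo-style model
        // metadata; the plain "transducer" type expects different metadata
        // keys (hence the "vocab_size does not exist" error above).
        modelType = "nemo_transducer",
    )
}
```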

iprovalo commented 3 months ago

@csukuangfj Thank you! Worked like a charm!

csukuangfj commented 3 months ago

By the way, regarding your earlier question:

> What is the difference between offline and streaming? Is streaming for an active microphone, and offline for wav files only?

In this context, streaming == online and non-streaming == offline.

Generally speaking, streaming ASR gives you the recognition result as you speak. Non-streaming ASR must wait until you have finished speaking before it can start recognition.

Microphones and wave files are just ways to get audio samples; they are not tied to any algorithms or models. You can consider the two as just different input devices. You can have many other input devices.
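To make the API difference concrete, here is a rough sketch of how the two recognizer types are driven. Method names follow the sherpa-onnx Kotlin bindings as of this writing, but treat this as a sketch and check the current API; `onlineConfig`, `offlineConfig`, `audioChunks`, and `allSamples` are assumed to be defined elsewhere:

```kotlin
// Streaming (online): feed audio chunk by chunk and poll for partial results.
// The chunks can come from a microphone, a wav file, a network socket, etc.
val online = OnlineRecognizer(config = onlineConfig)
val onlineStream = online.createStream()
for (chunk in audioChunks) { // e.g. 100 ms of 16 kHz mono float samples
    onlineStream.acceptWaveform(chunk, sampleRate = 16000)
    while (online.isReady(onlineStream)) {
        online.decode(onlineStream)
    }
    val partial = online.getResult(onlineStream).text // grows as you speak
}

// Non-streaming (offline): hand over the whole utterance, then decode once.
val offline = OfflineRecognizer(config = offlineConfig)
val offlineStream = offline.createStream()
offlineStream.acceptWaveform(allSamples, sampleRate = 16000)
offline.decode(offlineStream)
val finalText = offline.getResult(offlineStream).text
```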

iprovalo commented 3 months ago

@csukuangfj online vs. offline makes sense at runtime, but I want to clarify: when training a model, how do streaming and non-streaming models differ? Is it a different architecture?

Thank you!

iprovalo commented 3 months ago

@csukuangfj I think this is what I was looking for: https://arxiv.org/pdf/2010.14099