thewh1teagle / sherpa-rs

Rust bindings to https://github.com/k2-fsa/sherpa-onnx
MIT License

[Request] Add support for ReazonSpeech's reazonspeech-k2-v2 model #12

Closed: solaoi closed this issue 1 month ago

solaoi commented 1 month ago

Request for Adding ReazonSpeech's reazonspeech-k2-v2 Model

Hi, first of all, thank you for your excellent work on sherpa-rs! I would like to request the addition of support for the reazonspeech-k2-v2 model from ReazonSpeech in this project.

Model Information:

This model is built for automatic speech recognition (ASR) and has been fine-tuned for Japanese speech.

Reasons for Addition:

Integrating this model into sherpa-rs would be incredibly helpful for expanding its ASR capabilities, especially for handling Japanese-language tasks more effectively.

Model Integration:

If possible, I would greatly appreciate any guidance on how I could assist with this integration, or whether it could be considered for a future release.

Thank you again for your time and efforts on this project!

csukuangfj commented 1 month ago

Please have a look at https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-transducer/zipformer-transducer-models.html#sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01-japanese

I think sherpa-rs should already support it.
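For anyone following along, the linked documentation page distributes the model as a release asset. The commands below are a sketch based on sherpa-onnx's usual "asr-models" release naming; the exact asset name should be verified against the documentation page above.

```shell
# Assumed asset name, following the sherpa-onnx "asr-models" release
# convention; confirm against the documentation page linked above.
MODEL=sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01
curl -SL -O "https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/${MODEL}.tar.bz2"
tar xvf "${MODEL}.tar.bz2"
# The extracted directory should contain encoder/decoder/joiner .onnx
# files plus tokens.txt, which the examples below reference.
ls "${MODEL}"
```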

solaoi commented 1 month ago

@csukuangfj Hi, thank you for your quick response.

I modified examples/transcribe.rs to use the reazonspeech-k2-v2 model, but I encountered an error during execution.

Steps to Reproduce:

  1. I modified the transcribe.rs example with the following code:
use eyre::{bail, Result};
use sherpa_rs::{read_audio_file, transcribe::whisper::WhisperRecognizer};
use std::time::Instant;

fn main() -> Result<()> {
    let path = std::env::args().nth(1).expect("Missing file path argument");
    let provider = std::env::args().nth(2).unwrap_or("cpu".into());
    let (sample_rate, samples) = read_audio_file(&path)?;

    // Check if the sample rate is 16000
    if sample_rate != 16000 {
        bail!("The sample rate must be 16000.");
    }

    let mut recognizer = WhisperRecognizer::new(
        "reazonspeech-k2-v2/decoder-epoch-99-avg-1.onnx".into(),
        "reazonspeech-k2-v2/encoder-epoch-99-avg-1.onnx".into(),
        "reazonspeech-k2-v2/tokens.txt".into(),
        "ja".into(),
        Some(true),
        Some(provider),
        None,
        None,
    );

    let start_t = Instant::now();
    let result = recognizer.transcribe(sample_rate, samples);
    println!("{:?}", result);
    println!("Time taken for transcription: {:?}", start_t.elapsed());
    Ok(())
}
  2. I then ran the code with the following command:
    cargo run --example transcribe speech-001.wav

    speech-001.wav is here.

Error Message:

/Users/solaoi/Projects/solaoi/sherpa-rs/target/debug/build/sherpa-rs-sys-e83e885fd8f7116f/out/sherpa-onnx/sherpa-onnx/c-api/c-api.cc:convertConfig:434 OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=512, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="reazonspeech-k2-v2/encoder-epoch-99-avg-1.onnx", decoder="reazonspeech-k2-v2/decoder-epoch-99-avg-1.onnx", language="ja", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="", use_itn=False), telespeech_ctc="", tokens="reazonspeech-k2-v2/tokens.txt", num_threads=2, debug=True, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=1), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
/Users/solaoi/Projects/solaoi/sherpa-rs/target/debug/build/sherpa-rs-sys-e83e885fd8f7116f/out/sherpa-onnx/sherpa-onnx/csrc/offline-whisper-model.cc:InitEncoder:243 ---encoder---
model_author=k2-fsa
model_type=zipformer2
version=1
comment=non-streaming zipformer2

/Users/solaoi/Projects/solaoi/sherpa-rs/target/debug/build/sherpa-rs-sys-e83e885fd8f7116f/out/sherpa-onnx/sherpa-onnx/csrc/offline-whisper-model.cc:InitEncoder:247 n_mels does not exist in the metadata

thewh1teagle commented 1 month ago

Hey, currently sherpa-rs only supports the Whisper model, which is multilingual.

@csukuangfj Does sherpa-onnx already support that Japanese model? If there's some example, I can add it to sherpa-rs as well.

csukuangfj commented 1 month ago

Yes, it is just an offline transducer model.

There is a C API for it.

csukuangfj commented 1 month ago

https://github.com/k2-fsa/sherpa-onnx/blob/master/c-api-examples/zipformer-c-api.c

Here is the C API example.

thewh1teagle commented 1 month ago

@solaoi

Added in the latest version. See examples/zipformer.rs.

solaoi commented 1 month ago

@thewh1teagle Thank you for the update! It's working perfectly now.