k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

How to train or optimize the sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01 model for my own voice? #1371

diyism opened this issue 1 week ago

diyism commented 1 week ago

I've tested the latest KWS model (sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01.tar.bz2 from https://github.com/k2-fsa/sherpa-onnx/releases/tag/kws-models) against my own voice, but both models in it (encoder-epoch-99-avg-1 and encoder-epoch-12-avg-2) wrongly recognized my "bo2" as "guo2":

$ cat ../keywords.txt
j iǎng @jiang3
y ǒu @you3
b ó @bo2
b èi @bei4
p āi @pai1
d ào @dao4
g uó @guo2

$ sherpa-onnx-keyword-spotter     --tokens=tokens.txt     --encoder=encoder-epoch-99-avg-1-chunk-16-left-64.onnx     --decoder=decoder-epoch-99-avg-1-chunk-16-left-64.onnx     --joiner=joiner-epoch-99-avg-1-chunk-16-left-64.onnx     --provider=cpu     --num-threads=8  --keywords-threshold=0.02  --max-active-paths=2 --keywords-file=../keywords.txt ./4_me.wav 2>&1 | grep start_time
{"start_time":0.00, "keyword": "jiang3", "timestamps": [1.36, 1.40], "tokens":["j", "iǎng"]}
{"start_time":0.00, "keyword": "you4", "timestamps": [1.68, 1.76], "tokens":["y", "òu"]}
{"start_time":0.00, "keyword": "guo2", "timestamps": [1.96, 2.04], "tokens":["g", "uó"]}
{"start_time":0.00, "keyword": "bei4", "timestamps": [2.36, 2.40], "tokens":["b", "èi"]}
{"start_time":0.00, "keyword": "pai1", "timestamps": [2.64, 2.68], "tokens":["p", "āi"]}
{"start_time":0.00, "keyword": "dao4", "timestamps": [2.88, 2.96], "tokens":["d", "ào"]}

$ sherpa-onnx-keyword-spotter     --tokens=tokens.txt     --encoder=encoder-epoch-12-avg-2-chunk-16-left-64.onnx     --decoder=decoder-epoch-12-avg-2-chunk-16-left-64.onnx     --joiner=joiner-epoch-12-avg-2-chunk-16-left-64.onnx     --provider=cpu     --num-threads=8  --keywords-threshold=0.03  --max-active-paths=2 --keywords-file=../keywords.txt ./4_me.wav 2>&1 | grep start_time
{"start_time":0.00, "keyword": "jiang3", "timestamps": [1.36, 1.40], "tokens":["j", "iǎng"]}
{"start_time":0.00, "keyword": "you4", "timestamps": [1.68, 1.76], "tokens":["y", "òu"]}
{"start_time":0.00, "keyword": "guo2", "timestamps": [1.96, 2.04], "tokens":["g", "uó"]}
{"start_time":0.00, "keyword": "bei4", "timestamps": [2.36, 2.40], "tokens":["b", "èi"]}
{"start_time":0.00, "keyword": "pai1", "timestamps": [2.64, 2.68], "tokens":["p", "āi"]}
{"start_time":0.00, "keyword": "dao4", "timestamps": [2.88, 2.96], "tokens":["d", "ào"]}

Is there any way to train the sherpa-onnx KWS model on my own voice? For example, something as easy as the YonaVox project:

  1. Record each of my monosyllables (pinyin) 50 times in my phone's Chrome browser, with a 1-second silence automatically inserted between syllables (https://github.com/diyism/YonaVox/blob/master/training/recorder.ipynb); a rough local-recording sketch follows this list.

  2. Train or fine-tune the model on a Google Colab GPU (https://github.com/diyism/YonaVox/blob/master/training/Hebrew_AC_voice_activation_(public_version).ipynb).
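
For what it's worth, step 1 could also be approximated locally with a small Python script instead of the browser recorder. The sketch below is only an illustration and is not taken from the YonaVox notebook: it assumes the sounddevice and soundfile packages, a fixed 1-second take length, and a keypress before each take.

# Hypothetical local stand-in for the YonaVox browser recorder:
# record each pinyin syllable N times, one wav file per take,
# with a 1-second pause between takes.
import time
from pathlib import Path

import sounddevice as sd   # assumption: pip install sounddevice
import soundfile as sf     # assumption: pip install soundfile

SYLLABLES = ["jiang3", "you3", "bo2", "bei4", "pai1", "dao4", "guo2"]
TAKES = 50           # repetitions per syllable
DURATION = 1.0       # seconds of audio per take
SAMPLE_RATE = 16000  # 16 kHz, matching the models used above

out_dir = Path("recordings")
out_dir.mkdir(exist_ok=True)

for syllable in SYLLABLES:
    for take in range(TAKES):
        input(f"Press Enter and say '{syllable}' (take {take + 1}/{TAKES})")
        audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
        sd.wait()  # block until the recording finishes
        sf.write(str(out_dir / f"{syllable}_{take:03d}.wav"), audio, SAMPLE_RATE)
        time.sleep(1.0)  # 1-second gap between takes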

ref: https://github.com/k2-fsa/sherpa-onnx/issues/920

csukuangfj commented 1 week ago

Please see our icefall doc.

diyism commented 1 week ago

I found some Jupyter notebooks here: https://github.com/k2-fsa/colab/tree/master/sherpa-onnx , but they are not specifically about fine-tuning the keyword-spotting model.

I also found https://k2-fsa.github.io/icefall/recipes/Finetune/from_supervised/finetune_zipformer.html , but I can't figure out how to feed the wav files of my own voice into it.

csukuangfj commented 1 week ago

Please see our icefall doc.

Please see this comment.

You need to spend some time reading our doc https://k2-fsa.github.io/icefall/

diyism commented 1 week ago

After reading https://k2-fsa.github.io/icefall/recipes/Non-streaming-ASR/yesno/tdnn.html#colab-notebook I can successfully run my modified my-icefall-yes-no-dataset-recipe.ipynb (https://github.com/diyism/colab_kaldi2/blob/main/my_icefall_yes_no_dataset_recipe.ipynb) in Colab.

But I can't find an ipynb file for the wenetspeech-kws recipe, so I tried to modify my-icefall-yes-no-dataset-recipe.ipynb for wenetspeech-kws (https://github.com/diyism/colab_kaldi2/blob/main/my_icefall_wenetspeech_kws_dataset_recipe.ipynb), but it downloads about 500 GB of dataset files, so I don't think it will work in Colab.

I want to build a web UI to record Mandarin syllables in my own voice to replace the wenetspeech-kws dataset (without downloading the 500 GB of files), just like the YonaVox project (https://github.com/diyism/YonaVox/blob/master/training/recorder.ipynb), and then train the KWS model only on these recordings. Is that feasible?

I found another 2 ipynb files about creating recipes, but they don't seem to be specifically about creating voice dataset files for wenetspeech-kws: 00-basic-workflow.ipynb (https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/00-basic-workflow.ipynb) and espnet-and-lhotse-min-example.ipynb (https://colab.research.google.com/drive/1HKSYPsWx_HoCdrnLpaPdYj5zwlPsM3NH).

csukuangfj commented 1 week ago

but they don't seem to be specifically about creating voice dataset files for wenetspeech-kws

Are there any differences between the dataset you want to build and the other dataset examples in icefall, e.g., the yesno dataset? The principle is the same.

diyism commented 6 days ago

but they don't seem to be specifically about creating voice dataset files for wenetspeech-kws

Are there any differences between the dataset you want to build and the other dataset examples in icefall, e.g., the yesno dataset? The principle is the same.

I have some wav files of my own voice and corresponding transcription txt files. I wrote create_dataset.py (https://github.com/diyism/colab_kaldi2/blob/main/create_dataset.py), which successfully generates a my_dataset.jsonl file.
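
For readers following along, such a script can be fairly short with lhotse's Python API. The following is only a guess at what a create_dataset.py-style script might do; it assumes one wav file per utterance and a tab-separated transcripts.txt, and the actual script linked above may differ:

# Hypothetical sketch: build lhotse manifests (and a single cut manifest
# similar to my_dataset.jsonl) from per-utterance wav files plus transcripts.
from pathlib import Path

from lhotse import (
    CutSet,
    Recording,
    RecordingSet,
    SupervisionSegment,
    SupervisionSet,
)

wav_dir = Path("recordings")  # assumption: one wav file per utterance

# assumption: transcripts.txt has lines of the form "<utt_id>\t<transcript>"
transcripts = dict(
    line.split("\t", maxsplit=1)
    for line in Path("transcripts.txt").read_text().splitlines()
)

recordings = []
supervisions = []
for wav in sorted(wav_dir.glob("*.wav")):
    utt_id = wav.stem
    rec = Recording.from_file(wav, recording_id=utt_id)
    recordings.append(rec)
    supervisions.append(
        SupervisionSegment(
            id=utt_id,
            recording_id=utt_id,
            start=0.0,
            duration=rec.duration,
            text=transcripts[utt_id],
            language="Chinese",
        )
    )

# Combine everything into one cut manifest, analogous to my_dataset.jsonl
cuts = CutSet.from_manifests(
    recordings=RecordingSet.from_recordings(recordings),
    supervisions=SupervisionSet.from_segments(supervisions),
)
cuts.to_file("my_dataset.jsonl")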

Now I want to plug my_dataset.jsonl into egs/wenetspeech/KWS/prepare.sh, but this prepare.sh is much more complex than the one for yesno. It also calls egs/wenetspeech/ASR/prepare.sh, which is itself very complex (23 stages) and requires much more than just the my_dataset.jsonl file. I'm completely lost and don't know where to start.

Is it feasible to train a KWS model using only my voice files and my_dataset.jsonl?

diyism commented 4 days ago

I'm trying to use claude.ai to understand the raw files needed by icefall/egs/wenetspeech/ASR/prepare.sh. It seems only the MUSAN dataset needs to be downloaded; the other files are all generated from the voice wav files and transcription files.

Is it feasible to create a streamlined prepare.sh that uses only local voice wav files and their transcriptions, automatically downloads the MUSAN dataset, and generates all the other dependent files needed to train a KWS model?

csukuangfj commented 4 days ago
  1. Please create 3 text files (a minimal sketch of this step is given after the list below).

(a) wav.scp

It should contain something like below

unique_id_1 /path/to/foo.wav
unique_id_2 /path/to/bar.wav
unique_id_3 /path/to/foobar.wav

(b) text

It should contain something like below

unique_id_1 transcript for /path/to/foo.wav
unique_id_2 transcript for /path/to/bar.wav
unique_id_3 transcript for /path/to/foobar.wav

(c) utt2spk

unique_id_1 unique_id_1
unique_id_2 unique_id_2
unique_id_3 unique_id_3
  2. Follow https://lhotse.readthedocs.io/en/latest/kaldi.html#example . Note that you don't have feats.scp, so after following the doc you will only get two files:

    recordings.jsonl.gz  supervisions.jsonl.gz

  3. Please follow our yesno recipe or any other recipe in icefall to compute features (a rough sketch is appended at the end of this comment).
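
For illustration, a minimal sketch of step 1, assuming one wav file per utterance and a tab-separated transcripts.txt; all file names and paths below are placeholders:

# Hypothetical helper that writes the three Kaldi-style files
# (wav.scp, text, utt2spk) for a folder of per-utterance wav files.
from pathlib import Path

wav_dir = Path("recordings")   # assumption: one wav file per utterance
data_dir = Path("data/kaldi")  # where the three files will be written
data_dir.mkdir(parents=True, exist_ok=True)

# assumption: transcripts.txt has lines of the form "<utt_id>\t<transcript>"
transcripts = dict(
    line.split("\t", maxsplit=1)
    for line in Path("transcripts.txt").read_text().splitlines()
)

with open(data_dir / "wav.scp", "w") as wav_scp, \
     open(data_dir / "text", "w") as text, \
     open(data_dir / "utt2spk", "w") as utt2spk:
    for wav in sorted(wav_dir.glob("*.wav")):
        utt_id = wav.stem
        wav_scp.write(f"{utt_id} {wav.resolve()}\n")
        text.write(f"{utt_id} {transcripts[utt_id]}\n")
        utt2spk.write(f"{utt_id} {utt_id}\n")  # one "speaker" per utterance, as above

Step 2 then reduces to something like "lhotse kaldi import data/kaldi 16000 data/manifests", as described in the linked lhotse doc.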


Again, I suggest that you spend time, maybe several days, reading our existing examples. All you need can be found in our examples.
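
For completeness, a rough sketch of the feature-computation step (3), assuming the recordings.jsonl.gz and supervisions.jsonl.gz manifests produced by the lhotse import above and the 80-bin fbank setup used by icefall recipes; paths are placeholders, and the file names the wenetspeech KWS training scripts expect should be checked against the recipe:

# Hypothetical feature-extraction step: load the imported manifests,
# build cuts, and compute 80-dim fbank features.
from lhotse import CutSet, Fbank, FbankConfig, RecordingSet, SupervisionSet

recordings = RecordingSet.from_file("data/manifests/recordings.jsonl.gz")
supervisions = SupervisionSet.from_file("data/manifests/supervisions.jsonl.gz")

cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)
cuts = cuts.compute_and_store_features(
    extractor=Fbank(FbankConfig(num_mel_bins=80)),  # assumption: 80 mel bins, as in icefall recipes
    storage_path="data/fbank/my_dataset",
    num_jobs=1,
)
cuts.to_file("data/fbank/my_dataset_cuts.jsonl.gz")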