Open diyism opened 1 week ago
> Please see our icefall doc.
I found some Jupyter notebooks here: https://github.com/k2-fsa/colab/tree/master/sherpa-onnx , but they are not specific to optimizing a keyword-spotting model.
I also found https://k2-fsa.github.io/icefall/recipes/Finetune/from_supervised/finetune_zipformer.html , but I can't figure out how to integrate the wav files of my own voice into it.
Please see our icefall doc.
Please see this comment.
You need to spend some time reading our doc https://k2-fsa.github.io/icefall/
After reading https://k2-fsa.github.io/icefall/recipes/Non-streaming-ASR/yesno/tdnn.html#colab-notebook I can successfully run my modified my_icefall_yes_no_dataset_recipe.ipynb (https://github.com/diyism/colab_kaldi2/blob/main/my_icefall_yes_no_dataset_recipe.ipynb) in Colab:
But I couldn't find an ipynb file for the wenetspeech-kws recipe, so I tried to modify my_icefall_yes_no_dataset_recipe.ipynb for wenetspeech-kws (https://github.com/diyism/colab_kaldi2/blob/main/my_icefall_wenetspeech_kws_dataset_recipe.ipynb). However, it downloads about 500 GB of dataset files, so I don't think it will work in Colab:
I want to build a web UI to record my own voice speaking Mandarin syllables, replacing the wenetspeech-kws dataset (without downloading the 500 GB of files), just like the YonaVox project (https://github.com/diyism/YonaVox/blob/master/training/recorder.ipynb), and then train the KWS model only on these recordings. Is that feasible?
I found another two ipynb files for creating recipes: 00-basic-workflow.ipynb (https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/00-basic-workflow.ipynb) and espnet-and-lhotse-min-example.ipynb (https://colab.research.google.com/drive/1HKSYPsWx_HoCdrnLpaPdYj5zwlPsM3NH), but it seems they are not specifically about creating voice dataset files for wenetspeech-kws.

> but it seems that it's not specifically about creating voice dataset files for wenetspeech-kws
Are there any differences between the dataset you want to build and the other dataset examples in icefall, e.g., the yesno dataset? The principle is the same.
I have some wav files of my own voice and the corresponding transcription txt files. I wrote create_dataset.py (https://github.com/diyism/colab_kaldi2/blob/main/create_dataset.py), which successfully generates a my_dataset.jsonl file.
Now I want to plug my_dataset.jsonl into egs/wenetspeech/KWS/prepare.sh, but this prepare.sh is much more complex than the yesno one. It also calls egs/wenetspeech/ASR/prepare.sh, which is itself very complex (23 stages) and requires more than just the my_dataset.jsonl file. I'm completely lost and don't know where to start.
Is it feasible to train a KWS model using only my voice files and my_dataset.jsonl?
I'm trying to use claude.ai to understand the raw files needed by icefall/egs/wenetspeech/ASR/prepare.sh. It seems only the MUSAN dataset needs to be downloaded; the other files are all generated from the voice wav files and transcription files.
Is it feasible to create a streamlined prepare.sh that uses only local voice wav files and their transcriptions, automatically downloads the MUSAN dataset, and generates all the other dependent files needed to train a KWS model?
(a) wav.scp

It should contain something like below:

```
unique_id_1 /path/to/foo.wav
unique_id_2 /path/to/bar.wav
unique_id_3 /path/to/foobar.wav
```
(b) text

It should contain something like below:

```
unique_id_1 transcript for /path/to/foo.wav
unique_id_2 transcript for /path/to/bar.wav
unique_id_3 transcript for /path/to/foobar.wav
```
(c) utt2spk

It should contain something like below:

```
unique_id_1 unique_id_1
unique_id_2 unique_id_2
unique_id_3 unique_id_3
```
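The three files above can be generated mechanically. A minimal sketch, using only the Python standard library and assuming each `foo.wav` has a matching `foo.txt` transcript next to it (the helper name and directory layout are my own illustration, not part of any recipe):

```python
#!/usr/bin/env python3
"""Generate Kaldi-style wav.scp, text, and utt2spk from a directory of
wav files. Assumes each foo.wav has a sibling foo.txt transcript --
adjust to your own file naming."""
import sys
from pathlib import Path


def make_kaldi_files(wav_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "wav.scp", "w") as wav_scp, \
         open(out / "text", "w") as text, \
         open(out / "utt2spk", "w") as utt2spk:
        for wav in sorted(Path(wav_dir).glob("*.wav")):
            utt_id = wav.stem  # use the file name as the unique id
            transcript = wav.with_suffix(".txt").read_text().strip()
            wav_scp.write(f"{utt_id} {wav.resolve()}\n")
            text.write(f"{utt_id} {transcript}\n")
            # One utterance per "speaker": speaker id == utterance id,
            # matching the pattern shown in (c) above.
            utt2spk.write(f"{utt_id} {utt_id}\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    make_kaldi_files(sys.argv[1], sys.argv[2])
```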
Follow https://lhotse.readthedocs.io/en/latest/kaldi.html#example
Note that you don't have feats.scp, so you will only get two files, recordings.jsonl.gz and supervisions.jsonl.gz, after following the doc.
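For reference, the conversion step from that doc boils down to one CLI call (the paths and the 16000 Hz sampling rate are assumptions; check the linked doc for the exact options):

```shell
# Convert the Kaldi-style data dir (wav.scp, text, utt2spk) into
# Lhotse manifests. 16000 is the assumed sampling rate of the wavs.
lhotse kaldi import data/kaldi 16000 data/manifests

# Without feats.scp, only these two manifests are produced:
ls data/manifests
# recordings.jsonl.gz  supervisions.jsonl.gz
```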
Please follow our yesno recipe or any other recipes in icefall to compute features.
Again, I suggest that you spend time, maybe several days, reading our existing examples. All you need can be found in our examples.
I've tested the latest KWS model (sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01.tar.bz2 from https://github.com/k2-fsa/sherpa-onnx/releases/tag/kws-models) against my own voice, but both models in it (encoder-epoch-99-avg-1, encoder-epoch-12-avg-2) wrongly recognized my "bo2" as "guo2":
Is there any way to train the sherpa-onnx KWS model on my own voice? For example, as easily as in the YonaVox project:
- record every monosyllable (pinyin) 50 times in my phone's Chrome browser, with a 1-second silence automatically inserted after each syllable (https://github.com/diyism/YonaVox/blob/master/training/recorder.ipynb);
- train or optimize the model on a Google Colab GPU (https://github.com/diyism/YonaVox/blob/master/training/Hebrew_AC_voice_activation_(public_version).ipynb).
ref: https://github.com/k2-fsa/sherpa-onnx/issues/920