diyism opened this issue 2 months ago
Please see our icefall doc.
I found some Jupyter notebooks here: https://github.com/k2-fsa/colab/tree/master/sherpa-onnx , but they are not specifically about optimizing a keyword spotter model.
And this https://k2-fsa.github.io/icefall/recipes/Finetune/from_supervised/finetune_zipformer.html , but I can't figure out how to integrate the wav files of my own voice into it.
Please see our icefall doc.
Please see this comment.
You need to spend some time reading our doc https://k2-fsa.github.io/icefall/
After reading https://k2-fsa.github.io/icefall/recipes/Non-streaming-ASR/yesno/tdnn.html#colab-notebook I can successfully run my modified my_icefall_yes_no_dataset_recipe.ipynb (https://github.com/diyism/colab_kaldi2/blob/main/my_icefall_yes_no_dataset_recipe.ipynb) in Colab:
But I couldn't find an ipynb file for the wenetspeech-kws recipe, so I tried to modify my_icefall_yes_no_dataset_recipe.ipynb for wenetspeech-kws (https://github.com/diyism/colab_kaldi2/blob/main/my_icefall_wenetspeech_kws_dataset_recipe.ipynb), but I found it downloads 500 GB of dataset files, so I don't think it will work in Colab:
I want to build a web UI to record my own voice speaking Mandarin syllables to replace the wenetspeech-kws dataset (without downloading the 500 GB of files), just like the YonaVox project (https://github.com/diyism/YonaVox/blob/master/training/recorder.ipynb), and then train the KWS model only with these recordings. Is that feasible?
I found another 2 ipynb files about creating recipes: 00-basic-workflow.ipynb (https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/00-basic-workflow.ipynb) and espnet-and-lhotse-min-example.ipynb (https://colab.research.google.com/drive/1HKSYPsWx_HoCdrnLpaPdYj5zwlPsM3NH), but it seems that neither is specifically about creating voice dataset files for wenetspeech-kws.
but it seems that it's not specifically about creating voice dataset files for wenetspeech-kws
Are there any differences between the dataset you want to build and the other dataset examples in icefall, e.g., the yesno dataset? The principle is the same.
I have some wav files of my own voice and the corresponding transcription txt files. I wrote create_dataset.py (https://github.com/diyism/colab_kaldi2/blob/main/create_dataset.py), which successfully generates a my_dataset.jsonl file.
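For reference, here is a minimal sketch of what such a script might look like with lhotse's Python API; the one-transcript-.txt-per-.wav layout, the paths, and the file names are assumptions for illustration, not necessarily what create_dataset.py actually does:

from pathlib import Path
from lhotse import CutSet, Recording, RecordingSet, SupervisionSegment, SupervisionSet

# Assumed layout: wavs/foo.wav with a matching transcript in wavs/foo.txt
data_dir = Path("wavs")
recordings, supervisions = [], []
for wav in sorted(data_dir.glob("*.wav")):
    rec = Recording.from_file(wav, recording_id=wav.stem)
    recordings.append(rec)
    supervisions.append(
        SupervisionSegment(
            id=wav.stem,
            recording_id=wav.stem,
            start=0.0,
            duration=rec.duration,
            text=wav.with_suffix(".txt").read_text(encoding="utf-8").strip(),
        )
    )

cuts = CutSet.from_manifests(
    recordings=RecordingSet.from_recordings(recordings),
    supervisions=SupervisionSet.from_segments(supervisions),
)
cuts.to_file("my_dataset.jsonl.gz")  # cuts only; no features are computed here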
Now I want to plug my_dataset.jsonl into egs/wenetspeech/KWS/prepare.sh, but I found that this prepare.sh is much more complex than the one for yesno. It also calls egs/wenetspeech/ASR/prepare.sh, which is itself very complex, containing 23 stages, and it requires more than just the my_dataset.jsonl file. I'm completely lost and don't know where to start.
Is it feasible to train a KWS model using only my voice files and my_dataset.jsonl?
I'm using claude.ai to help me understand the raw files needed by icefall/egs/wenetspeech/ASR/prepare.sh. It seems only the Musan dataset needs to be downloaded; the other files are all generated from the voice wav files and transcription files.
Is it feasible to create a streamlined prepare.sh that uses only local voice wav files and their transcriptions, automatically downloads the Musan dataset, and generates all other dependent files to train a KWS model?
(a) wav.scp
It should contain something like below:
unique_id_1 /path/to/foo.wav
unique_id_2 /path/to/bar.wav
unique_id_3 /path/to/foobar.wav
(b) text
It should contain something like below:
unique_id_1 transcript for /path/to/foo.wav
unique_id_2 transcript for /path/to/bar.wav
unique_id_3 transcript for /path/to/foobar.wav
(c) utt2spk
unique_id_1 unique_id_1
unique_id_2 unique_id_2
unique_id_3 unique_id_3
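For illustration, a minimal sketch of a script that could write these three files from a directory of wav files with matching .txt transcripts; the directory layout and output paths are assumptions:

from pathlib import Path

wav_dir = Path("wavs")   # assumed: foo.wav next to its transcript foo.txt
out_dir = Path("train")
out_dir.mkdir(parents=True, exist_ok=True)

with (out_dir / "wav.scp").open("w", encoding="utf-8") as wav_scp, \
     (out_dir / "text").open("w", encoding="utf-8") as text, \
     (out_dir / "utt2spk").open("w", encoding="utf-8") as utt2spk:
    for wav in sorted(wav_dir.glob("*.wav")):
        utt_id = wav.stem  # must be unique
        transcript = wav.with_suffix(".txt").read_text(encoding="utf-8").strip()
        wav_scp.write(f"{utt_id} {wav.resolve()}\n")
        text.write(f"{utt_id} {transcript}\n")
        utt2spk.write(f"{utt_id} {utt_id}\n")  # one "speaker" per utterance, as above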
Follow https://lhotse.readthedocs.io/en/latest/kaldi.html#example
Note that you don't have feats.scp, so you will only get two files, recordings.jsonl.gz and supervisions.jsonl.gz, after following the doc.
Please follow our yesno recipe or any other recipes in icefall to compute features.
Again, I suggest that you spend time, maybe several days, reading our existing examples. All you need can be found in our examples.
I've modified my_icefall_wenetspeech_asr_dataset_recipe.ipynb (https://github.com/diyism/colab_kaldi2/blob/main/my_icefall_wenetspeech_asr_dataset_recipe.ipynb); now it can generate recordings.jsonl.gz and supervisions.jsonl.gz, and then generate myvoice_train.jsonl.gz:
!ls -al /content/drive/MyDrive/ColabData/KWS/kws_create_dataset/train/
-rw------- 1 root root 139 Sep 28 15:42 text
-rw------- 1 root root 11 Sep 28 15:43 utt2spk
-rw------- 1 root root 39 Sep 28 15:54 wav.scp
#1. install lhotse, then run lhotse kaldi import and lhotse cut
#Normally, we would use pip install lhotse. However, the yesno recipe was added recently and has not been released to PyPI yet,
#so we install the latest unreleased version here.
!pip install -q git+https://github.com/lhotse-speech/lhotse
from google.colab import drive
!umount /content/drive
drive.mount('/content/drive', force_remount=True)
#!ls -al /content/drive/MyDrive/ColabData/KWS/kws_create_dataset/train
#!lhotse kaldi import --help
#!cd /content/drive/MyDrive/ColabData/KWS/kws_create_dataset && lhotse kaldi import ./train/ 16000 ./train_manifests/
!ls -al /content/drive/MyDrive/ColabData/KWS/kws_create_dataset/train_manifests
#-f ./train_manifests/features.jsonl.gz \
!cd /content/drive/MyDrive/ColabData/KWS/kws_create_dataset && lhotse cut simple \
-r ./train_manifests/recordings.jsonl.gz \
-s ./train_manifests/supervisions.jsonl.gz \
./myvoice_train.jsonl.gz
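Note that myvoice_train.jsonl.gz produced this way has no features yet (there is no feats.scp, hence the commented-out -f ./train_manifests/features.jsonl.gz line above). As a rough sketch, fbank features could be attached with lhotse's Python API like below; the 80-bin config, paths, and output names are assumptions, and in the flow described next the modified preprocess_wenetspeech.py computes the fbank features instead:

from lhotse import CutSet, Fbank, FbankConfig

# Assumed paths; adjust to the Colab directories used above.
cuts = CutSet.from_file("myvoice_train.jsonl.gz")
extractor = Fbank(FbankConfig(num_mel_bins=80))  # icefall zipformer recipes use 80-dim fbank
cuts = cuts.compute_and_store_features(
    extractor=extractor,
    storage_path="fbank/myvoice_train",
    num_jobs=1,
)
cuts.to_file("myvoice_train_with_feats.jsonl.gz")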
I've also modified icefall_egs_wenetspeech_ASR_prepare.sh (https://github.com/diyism/colab_kaldi2/blob/main/icefall_egs_wenetspeech_ASR_prepare.sh) and icefall_egs_wenetspeech_ASR_local_preprocess_wenetspeech.py (https://github.com/diyism/colab_kaldi2/blob/main/icefall_egs_wenetspeech_ASR_local_preprocess_wenetspeech.py):
#!/usr/bin/env bash
# fix segmentation fault reported in https://github.com/k2-fsa/icefall/issues/674
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
set -eou pipefail
nj=15
stage=0
stop_stage=100
# Path to your local myvoice_train.jsonl.gz file
local_data_path="/content/drive/MyDrive/ColabData/KWS/kws_create_dataset/myvoice_train.jsonl.gz"
dl_dir=$PWD/download
lang_char_dir=data/lang_char
. shared/parse_options.sh || exit 1
# All files generated by this script are saved in "data".
# You can safely remove "data" and rerun this script to regenerate it.
mkdir -p data
log() {
  local fname=${BASH_SOURCE[1]##*/}
  echo -e "$(date '+%Y-%m-%d %H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}

log "dl_dir: $dl_dir"

if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
  log "Stage 0: Copy local data and download musan if needed"
  mkdir -p data/manifests
  cp $local_data_path data/manifests/cuts_train.jsonl.gz
  cp /content/drive/MyDrive/ColabData/KWS/kws_create_dataset/*.wav /content/icefall/egs/wenetspeech/ASR/
  if [ ! -d $dl_dir/musan ]; then
    log "Downloading musan dataset"
    lhotse download musan $dl_dir
  fi
fi

if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
  log "Stage 1: Prepare musan manifest"
  mkdir -p data/manifests
  lhotse prepare musan $dl_dir/musan data/manifests
fi

if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
  log "Stage 2: Preprocess local manifest"
  if [ ! -f data/fbank/.preprocess_complete ]; then
    python3 ./local/preprocess_wenetspeech.py --perturb-speed True
    touch data/fbank/.preprocess_complete
  fi
fi

if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
  log "Stage 3: Combine features"
  if [ ! -f data/fbank/cuts_train.jsonl.gz ]; then
    cp data/manifests/cuts_train.jsonl.gz data/fbank/cuts_train.jsonl.gz
  fi
fi
# Add any additional stages you need for your specific use case
log "Data preparation completed."
It seems to work:
2024-10-04 18:56:19 (prepare.sh:29:main) dl_dir: /content/icefall/egs/wenetspeech/ASR/download
2024-10-04 18:56:19 (prepare.sh:32:main) Stage 0: Copy local data and download musan if needed
2024-10-04 18:56:19 (prepare.sh:44:main) Stage 1: Prepare musan manifest
2024-10-04 18:56:21,963 WARNING [qa.py:120] There are 15 recordings that do not have any corresponding supervisions in the SupervisionSet.
2024-10-04 18:56:23 (prepare.sh:50:main) Stage 2: Preprocess local manifest
2024-10-04 18:56:25,754 INFO [preprocess_wenetspeech.py:29] Loading manifest
2024-10-04 18:56:25,754 INFO [preprocess_wenetspeech.py:38] Compute fbank features
2024-10-04 18:56:25,965 INFO [preprocess_wenetspeech.py:46] Applying speed perturbation
Extracting and storing features (chunks progress): 100% 2/2 [00:03<00:00, 1.86s/it]
2024-10-04 18:56:29,711 INFO [preprocess_wenetspeech.py:56] Saving cuts with features
2024-10-04 18:56:30,355 INFO [preprocess_wenetspeech.py:62] Done
2024-10-04 18:56:31 (prepare.sh:58:main) Stage 3: Combine features
2024-10-04 18:56:31 (prepare.sh:66:main) Data preparation completed.
But when I run the 5th step:
#5.training
! export PYTHONPATH=/content/icefall:$PYTHONPATH && \
cd /content/icefall/egs/wenetspeech/ASR && \
./zipformer/train.py
It shows errors:
2024-10-04 18:40:24,047 INFO [train.py:1064] Training started
2024-10-04 18:40:24,049 INFO [train.py:1074] Device: cuda:0
Traceback (most recent call last):
File "/content/icefall/egs/wenetspeech/ASR/./zipformer/train.py", line 1350, in <module>
main()
File "/content/icefall/egs/wenetspeech/ASR/./zipformer/train.py", line 1343, in main
run(rank=0, world_size=1, args=args)
File "/content/icefall/egs/wenetspeech/ASR/./zipformer/train.py", line 1076, in run
lexicon = Lexicon(params.lang_dir)
File "/content/icefall/icefall/lexicon.py", line 164, in __init__
self.token_table = k2.SymbolTable.from_file(lang_dir / "tokens.txt")
File "/usr/local/lib/python3.10/dist-packages/k2/symbol_table.py", line 130, in from_file
with open(filename, 'r', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/lang_char/tokens.txt'
I guess I'm missing something.
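The missing data/lang_char/tokens.txt is normally produced by the lang-preparation stages of egs/wenetspeech/ASR/prepare.sh, which the streamlined script above skips. Below is a rough sketch of building a char-level tokens.txt from the training transcripts; the special symbols and their ids are assumptions, so check the lang_char preparation scripts under the recipe's local/ directory for the exact convention that zipformer/train.py expects:

from pathlib import Path

# Hypothetical paths for illustration.
text_file = Path("/content/drive/MyDrive/ColabData/KWS/kws_create_dataset/train/text")
tokens_file = Path("data/lang_char/tokens.txt")
tokens_file.parent.mkdir(parents=True, exist_ok=True)

chars = set()
for line in text_file.read_text(encoding="utf-8").splitlines():
    parts = line.split(maxsplit=1)
    if len(parts) == 2:
        chars.update(parts[1].replace(" ", ""))

# Assumed special symbols and ordering; the real recipe's lang_char stage defines
# the exact symbol/id mapping the training script expects, so verify against it.
symbols = ["<blk>", "<sos/eos>", "<unk>"] + sorted(chars)
tokens_file.write_text(
    "".join(f"{sym} {i}\n" for i, sym in enumerate(symbols)), encoding="utf-8"
)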
I've tested the latest KWS model (sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01.tar.bz2 from https://github.com/k2-fsa/sherpa-onnx/releases/tag/kws-models) against my own voice, but both models in it (encoder-epoch-99-avg-1 and encoder-epoch-12-avg-2) wrongly recognized my "bo2" as "guo2":
Is there any way to train the sherpa-onnx KWS model on my own voice? For example, as easily as in the YonaVox project:
record each of my mono-syllables (pinyin) 50 times in my phone's Chrome browser, with 1 second of silence automatically inserted between syllables (https://github.com/diyism/YonaVox/blob/master/training/recorder.ipynb);
then train or fine-tune the model with a Google Colab GPU (https://github.com/diyism/YonaVox/blob/master/training/Hebrew_AC_voice_activation_(public_version).ipynb).
ref: https://github.com/k2-fsa/sherpa-onnx/issues/920