k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker diarization, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

How to train or optimize the sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01 model for my own voice? #1371

Open diyism opened 2 months ago

diyism commented 2 months ago

I've tested the latest KWS model (sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01.tar.bz2 from https://github.com/k2-fsa/sherpa-onnx/releases/tag/kws-models) against my own voice, but both of the models in it (encoder-epoch-99-avg-1 and encoder-epoch-12-avg-2) wrongly recognized my "bo2" as "guo2":

$ cat ../keywords.txt
j iǎng @jiang3
y ǒu @you3
b ó @bo2
b èi @bei4
p āi @pai1
d ào @dao4
g uó @guo2

$ sherpa-onnx-keyword-spotter     --tokens=tokens.txt     --encoder=encoder-epoch-99-avg-1-chunk-16-left-64.onnx     --decoder=decoder-epoch-99-avg-1-chunk-16-left-64.onnx     --joiner=joiner-epoch-99-avg-1-chunk-16-left-64.onnx     --provider=cpu     --num-threads=8  --keywords-threshold=0.02  --max-active-paths=2 --keywords-file=../keywords.txt ./4_me.wav 2>&1 | grep start_time
{"start_time":0.00, "keyword": "jiang3", "timestamps": [1.36, 1.40], "tokens":["j", "iǎng"]}
{"start_time":0.00, "keyword": "you4", "timestamps": [1.68, 1.76], "tokens":["y", "òu"]}
{"start_time":0.00, "keyword": "guo2", "timestamps": [1.96, 2.04], "tokens":["g", "uó"]}
{"start_time":0.00, "keyword": "bei4", "timestamps": [2.36, 2.40], "tokens":["b", "èi"]}
{"start_time":0.00, "keyword": "pai1", "timestamps": [2.64, 2.68], "tokens":["p", "āi"]}
{"start_time":0.00, "keyword": "dao4", "timestamps": [2.88, 2.96], "tokens":["d", "ào"]}

$ sherpa-onnx-keyword-spotter     --tokens=tokens.txt     --encoder=encoder-epoch-12-avg-2-chunk-16-left-64.onnx     --decoder=decoder-epoch-12-avg-2-chunk-16-left-64.onnx     --joiner=joiner-epoch-12-avg-2-chunk-16-left-64.onnx     --provider=cpu     --num-threads=8  --keywords-threshold=0.03  --max-active-paths=2 --keywords-file=../keywords.txt ./4_me.wav 2>&1 | grep start_time
{"start_time":0.00, "keyword": "jiang3", "timestamps": [1.36, 1.40], "tokens":["j", "iǎng"]}
{"start_time":0.00, "keyword": "you4", "timestamps": [1.68, 1.76], "tokens":["y", "òu"]}
{"start_time":0.00, "keyword": "guo2", "timestamps": [1.96, 2.04], "tokens":["g", "uó"]}
{"start_time":0.00, "keyword": "bei4", "timestamps": [2.36, 2.40], "tokens":["b", "èi"]}
{"start_time":0.00, "keyword": "pai1", "timestamps": [2.64, 2.68], "tokens":["p", "āi"]}
{"start_time":0.00, "keyword": "dao4", "timestamps": [2.88, 2.96], "tokens":["d", "ào"]}

Is there any way to train the sherpa-onnx KWS model on my own voice? For example, as easily as in the YonaVox project:

  1. record each mono-syllable (pinyin) 50 times in my phone's Chrome browser, with a 1-second silence automatically inserted between syllables (https://github.com/diyism/YonaVox/blob/master/training/recorder.ipynb)

  2. train or optimize the model with a Google Colab GPU (https://github.com/diyism/YonaVox/blob/master/training/Hebrew_AC_voice_activation_(public_version).ipynb)

ref: https://github.com/k2-fsa/sherpa-onnx/issues/920

csukuangfj commented 2 months ago

Please see our icefall doc.

diyism commented 2 months ago

I found some Jupyter notebooks in https://github.com/k2-fsa/colab/tree/master/sherpa-onnx , but they are not specific to optimizing a keyword-spotter model.

And this https://k2-fsa.github.io/icefall/recipes/Finetune/from_supervised/finetune_zipformer.html , but I can't figure out how to integrate the wav files of my own voice into it.

csukuangfj commented 2 months ago

Please see our icefall doc.

Please see this comment.

You need to spend some time reading our doc https://k2-fsa.github.io/icefall/

diyism commented 2 months ago

After reading https://k2-fsa.github.io/icefall/recipes/Non-streaming-ASR/yesno/tdnn.html#colab-notebook I can successfully run the modified my-icefall-yes-no-dataset-recipe.ipynb (https://github.com/diyism/colab_kaldi2/blob/main/my_icefall_yes_no_dataset_recipe.ipynb) in Colab.

But I can't find an ipynb file for the wenetspeech-kws recipe, so I tried to modify my-icefall-yes-no-dataset-recipe.ipynb for wenetspeech-kws (https://github.com/diyism/colab_kaldi2/blob/main/my_icefall_wenetspeech_kws_dataset_recipe.ipynb), but it downloads 500 GB of dataset files, so I don't think it will work in Colab.

I want to build a web UI to record my own voice speaking Mandarin syllables, to replace the wenetspeech-kws dataset (without downloading the 500 GB of files) just like the YonaVox project (https://github.com/diyism/YonaVox/blob/master/training/recorder.ipynb), and to train the KWS model only with these recordings. Is that feasible?

I found another two ipynb files for creating recipes, but it seems they are not specifically about creating voice dataset files for wenetspeech-kws: 00-basic-workflow.ipynb (https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/00-basic-workflow.ipynb) and espnet-and-lhotse-min-example.ipynb (https://colab.research.google.com/drive/1HKSYPsWx_HoCdrnLpaPdYj5zwlPsM3NH).

csukuangfj commented 2 months ago

but it seems they are not specifically about creating voice dataset files for wenetspeech-kws

Are there any differences between the dataset you want to build and the other dataset examples in icefall, e.g., the yesno dataset? The principle is the same.

diyism commented 2 months ago

but it seems they are not specifically about creating voice dataset files for wenetspeech-kws

Are there any differences between the dataset you want to build and the other dataset examples in icefall, e.g., the yesno dataset? The principle is the same.

I have some wav files of my own voice and corresponding transcription txt files. Then I wrote create_dataset.py (https://github.com/diyism/colab_kaldi2/blob/main/create_dataset.py) which can successfully generate a my_dataset.jsonl file.
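
Roughly, create_dataset.py does something like the simplified sketch below (the actual script is at the link above; the directory layout and file names here are made up):

# Simplified sketch (hypothetical layout): build lhotse manifests from a folder
# of wav files plus matching .txt transcripts, then export the cuts to JSONL.
from pathlib import Path

from lhotse import CutSet, Recording, RecordingSet, SupervisionSegment, SupervisionSet

wav_dir = Path("my_voice_wavs")
recordings, supervisions = [], []
for wav in sorted(wav_dir.glob("*.wav")):
    rec = Recording.from_file(wav, recording_id=wav.stem)
    text = (wav_dir / f"{wav.stem}.txt").read_text(encoding="utf-8").strip()
    recordings.append(rec)
    supervisions.append(
        SupervisionSegment(
            id=wav.stem,
            recording_id=wav.stem,
            start=0.0,
            duration=rec.duration,
            text=text,
            language="Chinese",
        )
    )

cuts = CutSet.from_manifests(
    recordings=RecordingSet.from_recordings(recordings),
    supervisions=SupervisionSet.from_segments(supervisions),
)
cuts.to_file("my_dataset.jsonl")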

Now I want to plug my_dataset.jsonl into egs/wenetspeech/KWS/prepare.sh, but this prepare.sh is much more complex than the one for yesno. It also calls egs/wenetspeech/ASR/prepare.sh, which is itself very complex, containing 23 stages, and it requires more than just the my_dataset.jsonl file. I'm completely lost and don't know where to start.

Is it feasible to train a KWS model using only my voice files and my_dataset.jsonl?

diyism commented 2 months ago

I'm trying to use claude.ai to understand the raw files needed by icefall/egs/wenetspeech/ASR/prepare.sh. It seems only the Musan dataset needs to be downloaded; the other files are all generated from the voice wav files and transcription files.

Is it feasible to create a streamlined prepare.sh that uses only local voice wav files and their transcriptions, automatically downloads the Musan dataset, and generates all other dependent files to train a KWS model?

csukuangfj commented 2 months ago
  1. Please create 3 text files.

(a) wav.scp

It should contain something like below

unique_id_1 /path/to/foo.wav
unique_id_2 /path/to/bar.wav
unique_id_3 /path/to/foobar.wav

(b) text

It should contain something like below

unique_id_1 transcript for /path/to/foo.wav
unique_id_2 transcript for /path/to/bar.wav
unique_id_3 transcript for /path/to/foobar.wav

(c) utt2spk

unique_id_1 unique_id_1
unique_id_2 unique_id_2
unique_id_3 unique_id_3
  2. Follow https://lhotse.readthedocs.io/en/latest/kaldi.html#example. Note that you don't have feats.scp, so after following the doc you will only get two files:

    recordings.jsonl.gz  supervisions.jsonl.gz

  3. Please follow our yesno recipe or any other recipe in icefall to compute features (see the sketch below).
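
For step 3, something along these lines should work (a minimal sketch using lhotse's Python API, assuming the manifests from step 2 are in ./train_manifests; adjust paths, mel bins, and num_jobs for your setup):

# Minimal sketch: compute fbank features for the manifests from step 2 and save
# a cuts file that the icefall data modules can consume. Paths are examples only.
from lhotse import CutSet, Fbank, FbankConfig, RecordingSet, SupervisionSet

recordings = RecordingSet.from_file("train_manifests/recordings.jsonl.gz")
supervisions = SupervisionSet.from_file("train_manifests/supervisions.jsonl.gz")

cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)
cuts = cuts.compute_and_store_features(
    extractor=Fbank(FbankConfig(num_mel_bins=80)),  # 80-dim fbank, as in the zipformer recipes
    storage_path="data/fbank/myvoice_feats",
    num_jobs=1,
)
cuts.to_file("data/fbank/myvoice_cuts_train.jsonl.gz")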


Again, I suggest that you spend time, maybe several days, reading our existing examples. All you need can be found in our examples.

diyism commented 1 month ago

I've modified my_icefall_wenetspeech_asr_dataset_recipe.ipynb (https://github.com/diyism/colab_kaldi2/blob/main/my_icefall_wenetspeech_asr_dataset_recipe.ipynb); now it can generate recordings.jsonl.gz and supervisions.jsonl.gz, and then myvoice_train.jsonl.gz:

!ls -al /content/drive/MyDrive/ColabData/KWS/kws_create_dataset/train/
-rw------- 1 root root 139 Sep 28 15:42 text
-rw------- 1 root root  11 Sep 28 15:43 utt2spk
-rw------- 1 root root  39 Sep 28 15:54 wav.scp
#1. install lhotse and lhotse import and cut
#Normally, we would use pip install lhotse. However, the yesno recipe is added recently and has not been released to PyPI yet,
#so we install the latest unreleased version here.
!pip install -q git+https://github.com/lhotse-speech/lhotse

from google.colab import drive
!umount /content/drive
drive.mount('/content/drive', force_remount=True)

#!ls -al /content/drive/MyDrive/ColabData/KWS/kws_create_dataset/train
#!lhotse kaldi import --help
#!cd /content/drive/MyDrive/ColabData/KWS/kws_create_dataset && lhotse kaldi import ./train/ 16000 ./train_manifests/
!ls -al /content/drive/MyDrive/ColabData/KWS/kws_create_dataset/train_manifests

#-f ./train_manifests/features.jsonl.gz \
!cd /content/drive/MyDrive/ColabData/KWS/kws_create_dataset && lhotse cut simple \
  -r ./train_manifests/recordings.jsonl.gz \
  -s ./train_manifests/supervisions.jsonl.gz \
  ./myvoice_train.jsonl.gz

And I've modified icefall_egs_wenetspeech_ASR_prepare.sh(https://github.com/diyism/colab_kaldi2/blob/main/icefall_egs_wenetspeech_ASR_prepare.sh) and icefall_egs_wenetspeech_ASR_local_preprocess_wenetspeech.py(https://github.com/diyism/colab_kaldi2/blob/main/icefall_egs_wenetspeech_ASR_local_preprocess_wenetspeech.py):

#!/usr/bin/env bash

# fix segmentation fault reported in https://github.com/k2-fsa/icefall/issues/674
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

set -eou pipefail

nj=15
stage=0
stop_stage=100

# Path to your local myvoice_train.jsonl.gz file
local_data_path="/content/drive/MyDrive/ColabData/KWS/kws_create_dataset/myvoice_train.jsonl.gz"

dl_dir=$PWD/download
lang_char_dir=data/lang_char

. shared/parse_options.sh || exit 1

# All files generated by this script are saved in "data".
# You can safely remove "data" and rerun this script to regenerate it.
mkdir -p data

log() {
  local fname=${BASH_SOURCE[1]##*/}
  echo -e "$(date '+%Y-%m-%d %H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}

log "dl_dir: $dl_dir"

if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
  log "Stage 0: Copy local data and download musan if needed"
  mkdir -p data/manifests
  cp $local_data_path data/manifests/cuts_train.jsonl.gz
  cp /content/drive/MyDrive/ColabData/KWS/kws_create_dataset/*.wav /content/icefall/egs/wenetspeech/ASR/

  if [ ! -d $dl_dir/musan ]; then
    log "Downloading musan dataset"
    lhotse download musan $dl_dir
  fi
fi

if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
  log "Stage 1: Prepare musan manifest"
  mkdir -p data/manifests
  lhotse prepare musan $dl_dir/musan data/manifests
fi

if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
  log "Stage 2: Preprocess local manifest"
  if [ ! -f data/fbank/.preprocess_complete ]; then
    python3 ./local/preprocess_wenetspeech.py --perturb-speed True
    touch data/fbank/.preprocess_complete
  fi
fi

if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
  log "Stage 3: Combine features"
  if [ ! -f data/fbank/cuts_train.jsonl.gz ]; then
    cp data/manifests/cuts_train.jsonl.gz data/fbank/cuts_train.jsonl.gz
  fi
fi

# Add any additional stages you need for your specific use case

log "Data preparation completed."

It seems to work:

2024-10-04 18:56:19 (prepare.sh:29:main) dl_dir: /content/icefall/egs/wenetspeech/ASR/download
2024-10-04 18:56:19 (prepare.sh:32:main) Stage 0: Copy local data and download musan if needed
2024-10-04 18:56:19 (prepare.sh:44:main) Stage 1: Prepare musan manifest
2024-10-04 18:56:21,963 WARNING [qa.py:120] There are 15 recordings that do not have any corresponding supervisions in the SupervisionSet.
2024-10-04 18:56:23 (prepare.sh:50:main) Stage 2: Preprocess local manifest
2024-10-04 18:56:25,754 INFO [preprocess_wenetspeech.py:29] Loading manifest
2024-10-04 18:56:25,754 INFO [preprocess_wenetspeech.py:38] Compute fbank features
2024-10-04 18:56:25,965 INFO [preprocess_wenetspeech.py:46] Applying speed perturbation
Extracting and storing features (chunks progress): 100% 2/2 [00:03<00:00,  1.86s/it]
2024-10-04 18:56:29,711 INFO [preprocess_wenetspeech.py:56] Saving cuts with features
2024-10-04 18:56:30,355 INFO [preprocess_wenetspeech.py:62] Done
2024-10-04 18:56:31 (prepare.sh:58:main) Stage 3: Combine features
2024-10-04 18:56:31 (prepare.sh:66:main) Data preparation completed.

But when I run the 5th step:

#5.training

! export PYTHONPATH=/content/icefall:$PYTHONPATH && \
  cd /content/icefall/egs/wenetspeech/ASR && \
  ./zipformer/train.py

It shows errors:

2024-10-04 18:40:24,047 INFO [train.py:1064] Training started
2024-10-04 18:40:24,049 INFO [train.py:1074] Device: cuda:0
Traceback (most recent call last):
  File "/content/icefall/egs/wenetspeech/ASR/./zipformer/train.py", line 1350, in <module>
    main()
  File "/content/icefall/egs/wenetspeech/ASR/./zipformer/train.py", line 1343, in main
    run(rank=0, world_size=1, args=args)
  File "/content/icefall/egs/wenetspeech/ASR/./zipformer/train.py", line 1076, in run
    lexicon = Lexicon(params.lang_dir)
  File "/content/icefall/icefall/lexicon.py", line 164, in __init__
    self.token_table = k2.SymbolTable.from_file(lang_dir / "tokens.txt")
  File "/usr/local/lib/python3.10/dist-packages/k2/symbol_table.py", line 130, in from_file
    with open(filename, 'r', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/lang_char/tokens.txt'

I guess I'm missing something.
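
Probably the problem is that my trimmed prepare.sh never creates data/lang_char, while the original wenetspeech prepare.sh apparently has stages that generate tokens.txt there. As a rough, untested guess (not the recipe's actual script), something like this could rebuild a char-level tokens.txt from my own transcripts:

# Rough, untested guess: rebuild data/lang_char/tokens.txt from my Kaldi-style
# "text" file (hypothetical path). The real recipe generates this in the
# lang_char stages I dropped, so the exact special symbols and their ordering
# below may not match what train.py expects.
from pathlib import Path

lang_dir = Path("data/lang_char")
lang_dir.mkdir(parents=True, exist_ok=True)

# Collect every character that appears in the transcripts ("utt_id transcript" per line).
chars = set()
for line in Path("train/text").read_text(encoding="utf-8").splitlines():
    parts = line.split(maxsplit=1)
    if len(parts) == 2:
        chars.update(parts[1].replace(" ", ""))

# k2.SymbolTable.from_file() reads "symbol id" pairs, one per line, with <blk> as id 0.
tokens = ["<blk>", "<sos/eos>", "<unk>"] + sorted(chars)
(lang_dir / "tokens.txt").write_text(
    "".join(f"{tok} {i}\n" for i, tok in enumerate(tokens)), encoding="utf-8"
)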