gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

Acoustic model training - how to fine tune/train from scratch with custom wav file #63

Closed rockiram closed 5 years ago

rockiram commented 5 years ago

I prepared a dataset similar to the zamia-en dataset, with 3000 wav files and prompts.

1. How do I fine-tune the existing acoustic model?
2. How do I train an acoustic model from scratch?

gooofy commented 5 years ago

I have no experience with acoustic model adaptation, so I cannot provide any instructions for this task.

As for training a new model from scratch, you should be able to find hints in the README. The first step would be to import your dataset (speech_audio_scan.py). Once that is done, you can review your corpus either manually or automatically, and then check for missing lexicon entries. At that point you could also generate noise-augmented corpora from your dataset, should you choose to. With all that in place, you can export all the datasets you want to train your model on to create a Kaldi case:

https://github.com/gooofy/zamia-speech#english-nnet3-chain-models

rockiram commented 5 years ago

@gooofy Thanks for the suggestion.

I've prepared the zamia-en data as per the instructions in README.

I executed

./speech_kaldi_export.py generic-en-small dict-en.ipa generic_en_lang_model_small voxforge_en librispeech zamia_en
cd data/dst/asr-models/kaldi/generic-en-small
./run-chain.sh

I'm currently facing 2 issues:

1. utils/validate_data_dir.sh: empty file spk2utt

make mfcc

fix_data_dir.sh: kept all 144 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
steps/make_mfcc.sh --cmd utils/run.pl --nj 12 data/train exp/make_mfcc_chain/train mfcc_chain
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea. Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/train
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
steps/make_mfcc.sh: Succeeded creating MFCC features for train
fix_data_dir.sh: kept all 144 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
steps/compute_cmvn_stats.sh data/train exp/make_mfcc_chain/train mfcc_chain
Succeeded creating CMVN stats for train
fix_data_dir.sh: kept all 144 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
fix_data_dir.sh: no utterances remained: not proceeding further.
steps/make_mfcc.sh --cmd utils/run.pl --nj 12 data/test exp/make_mfcc_chain/test mfcc_chain
utils/validate_data_dir.sh: empty file spk2utt

As a hack, I copied the "text", "wav.scp", and "utt2spk" files from the train dir to the test dir. This temporarily suppressed the error; then the 2nd error (mentioned below) popped up.
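Rather than copying the train files into the test dir (which would make the model evaluate on its own training data), a cleaner fix is to hold out whole speakers for the test set. Below is a minimal sketch of such a split; the utterance and speaker IDs are made up for illustration, not taken from the actual dataset:

```python
# Sketch: build separate Kaldi train/test data dirs by holding out whole
# speakers, instead of copying the train files into data/test.
# Utterance and speaker IDs here are hypothetical.

utt2spk = {
    "spk1-utt001": "spk1", "spk1-utt002": "spk1",
    "spk2-utt001": "spk2", "spk2-utt002": "spk2",
    "spk3-utt001": "spk3", "spk3-utt002": "spk3",
}

test_speakers = {"spk3"}  # hold out at least one full speaker for data/test

train_set = {u: s for u, s in utt2spk.items() if s not in test_speakers}
test_set = {u: s for u, s in utt2spk.items() if s in test_speakers}

def write_utt2spk(mapping):
    # Kaldi expects utt2spk sorted by utterance ID, one "utt spk" pair per line
    return "\n".join(f"{u} {s}" for u, s in sorted(mapping.items()))

print(write_utt2spk(train_set))
print("---")
print(write_utt2spk(test_set))
```

With a split like this, the test dir has its own non-empty spk2utt and no overlap with the training utterances.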

2. utils/split_scp.pl: Refusing to split data because number of speakers 1 is less than the number of output .scp files 12

make mfcc

fix_data_dir.sh: kept all 144 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
steps/make_mfcc.sh --cmd utils/run.pl --nj 12 data/train exp/make_mfcc_chain/train mfcc_chain
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea. Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/train
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
steps/make_mfcc.sh: Succeeded creating MFCC features for train
fix_data_dir.sh: kept all 144 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
steps/compute_cmvn_stats.sh data/train exp/make_mfcc_chain/train mfcc_chain
Succeeded creating CMVN stats for train
fix_data_dir.sh: kept all 144 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
fix_data_dir.sh: kept all 144 utterances.
fix_data_dir.sh: old files are kept in data/test/.backup
steps/make_mfcc.sh --cmd utils/run.pl --nj 12 data/test exp/make_mfcc_chain/test mfcc_chain
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea. Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/test
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
steps/make_mfcc.sh: Succeeded creating MFCC features for test
fix_data_dir.sh: kept all 144 utterances.
fix_data_dir.sh: old files are kept in data/test/.backup
steps/compute_cmvn_stats.sh data/test exp/make_mfcc_chain/test mfcc_chain
Succeeded creating CMVN stats for test
fix_data_dir.sh: kept all 144 utterances.
fix_data_dir.sh: old files are kept in data/test/.backup

mono0a_chain

steps/train_mono.sh --nj 12 --cmd utils/run.pl data/train data/lang exp/mono0a_chain
utils/split_scp.pl: Refusing to split data because number of speakers 1 is less than the number of output .scp files 12

Could you please help me in understanding & solving this error?

Thank you!

rockiram commented 5 years ago

Hi @gooofy, the first issue is fixed:

1. utils/validate_data_dir.sh: empty file spk2utt - I fixed it by adding some of the wav file names to the spk_test.txt file.

Could you help me understand the second issue mentioned earlier?

Thank you!

gooofy commented 5 years ago

I think this is the issue with your dataset:

utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea.

If you're really trying to build a single-speaker model, I am not sure how to set up Kaldi for that - maybe the kaldi user mailing list can help you here.
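Both errors in the thread come down to speaker bookkeeping: spk2utt is just the inversion of utt2spk, and split_scp.pl refuses to split the data when there are fewer speakers than parallel jobs (--nj). A minimal sketch of both checks, assuming the standard Kaldi data-dir text formats (the utterance IDs here are hypothetical):

```python
# Sketch: derive spk2utt from utt2spk and sanity-check the speaker count
# against the number of parallel jobs (--nj), which split_scp.pl enforces.
from collections import defaultdict

utt2spk_lines = [   # hypothetical "utt spk" lines from data/train/utt2spk
    "utt001 spk1",
    "utt002 spk1",
    "utt003 spk2",
]
nj = 12  # --nj as used by run-chain.sh

spk2utt = defaultdict(list)
for line in utt2spk_lines:
    utt, spk = line.split()
    spk2utt[spk].append(utt)

# spk2utt format: one "spk utt1 utt2 ..." line per speaker
for spk in sorted(spk2utt):
    print(spk, " ".join(spk2utt[spk]))

if len(spk2utt) < nj:
    print(f"WARNING: only {len(spk2utt)} speaker(s); "
          f"reduce --nj to at most {len(spk2utt)} or add more speakers")
```

So with a single speaker, either the recipe's --nj must be reduced to 1, or the corpus needs utterances attributed to more speakers.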

rockiram commented 5 years ago

Thanks @gooofy

I trained the sequitur model with the dict-en.ipa file. After training, I ran inference on a word that is already in dict-en.ipa, and the output is entirely different.

Actual entry in the ipa file: abandonment ʌb'ændʌnmʌnt
Output predicted by the sequitur model-6: abandonment V b ' { n d V n m V n t
Online ipa converter output: abandonment əˈbændənmənt

Online ipa converter link: https://easypronunciation.com/en/english-phonetic-transcription-converter

I tried this because I need to add new words to the ipa file. Kindly help me understand this.

Thank you!

gooofy commented 5 years ago

The output of the sequitur model uses the X-SAMPA encoding, not IPA:

https://en.wikipedia.org/wiki/X-SAMPA
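In other words, the prediction `V b ' { n d V n m V n t` is the same pronunciation in a different alphabet. A sketch of a partial X-SAMPA-to-IPA decoder covering only the symbols in this example (the stress-mark handling mirrors the dict-en.ipa entry above and is an assumption, not the project's official converter; see the Wikipedia page for the full table):

```python
# Partial X-SAMPA -> IPA mapping, just enough for the example word.
# This is an illustrative subset, not a complete or official table.
XSAMPA_TO_IPA = {
    "V": "ʌ",   # open-mid back unrounded vowel
    "{": "æ",   # near-open front unrounded vowel
    "'": "'",   # stress mark, kept as-is to match the dict-en.ipa convention
    "b": "b", "n": "n", "d": "d", "m": "m", "t": "t",
}

def xsampa_to_ipa(tokens):
    """Convert a space-separated X-SAMPA token string to an IPA string."""
    return "".join(XSAMPA_TO_IPA.get(tok, tok) for tok in tokens.split())

print(xsampa_to_ipa("V b ' { n d V n m V n t"))  # -> ʌb'ændʌnmʌnt
```

Decoded this way, the sequitur prediction matches the dictionary entry exactly.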

rockiram commented 5 years ago

@gooofy Thanks a lot for your suggestions.

As mentioned earlier, I'm trying to set up the acoustic model training pipeline using Zamia. With your help, I've progressed to stage 11 in run-chain.sh.

I'm facing an issue (mentioned below) in stage 11 while training.

It would help me debug this if you could provide a few pointers.

steps/nnet3/chain/get_egs.sh: feature type is raw
tree-info exp/nnet3_chain/tdnn_250/tree
feat-to-dim scp:exp/nnet3_chain/ivectors_train_sp_hires_comb/ivector_online.scp -
steps/nnet3/chain/get_egs.sh: working out number of frames of training data
steps/nnet3/chain/get_egs.sh: working out feature dim
Command failed (getting feature dim): feat-to-dim "ark,s,cs:utils/filter_scp.pl --exclude exp/nnet3_chain/tdnn_250/egs/valid_uttlist data/train_sp_hires_comb/split12/1/feats.scp | /opt/kaldi/src/featbin/apply-cmvn --norm-means=false --norm-vars=false --utt2spk=ark:data/train_sp_hires_comb/split12/1/utt2spk scp:data/train_sp_hires_comb/split12/1/cmvn.scp scp:- ark:- |"
Traceback (most recent call last):
  File "steps/nnet3/chain/train.py", line 634, in main
    train(args, run_opts)
  File "steps/nnet3/chain/train.py", line 395, in train
    stage=args.egs_stage)
  File "steps/libs/nnet3/train/chain_objf/acoustic_model.py", line 118, in generate_chain_egs
    egs_opts=egs_opts if egs_opts is not None else ''))
  File "steps/libs/common.py", line 158, in execute_command
    p.returncode, command))
Exception: Command exited with status 1: steps/nnet3/chain/get_egs.sh --frames-overlap-per-eg 0 --cmd "utils/run.pl" --cmvn-opts "--norm-means=false --norm-vars=false" --online-ivector-dir "exp/nnet3_chain/ivectors_train_sp_hires_comb" --left-context 16 --right-context 11 --left-context-initial -1 --right-context-final -1 --left-tolerance '5' --right-tolerance '5' --frame-subsampling-factor 3 --alignment-subsampling-factor 3 --stage 0 --frames-per-iter 1500000 --frames-per-eg 150 --srand 0 data/train_sp_hires_comb exp/nnet3_chain/tdnn_250 exp/nnet3_chain/tri2b_chain_train_sp_comb_lats exp/nnet3_chain/tdnn_250/egs

gooofy commented 5 years ago

I am a bit suspicious that there might have been issues in earlier steps. Did the training of the tri2b_chain model work? Did the ivector extraction work?

Other than that, you could either try to run the failing steps/nnet3/chain/get_egs.sh manually to get to the bottom of this, or contact the kaldi mailing list - maybe someone there knows what would likely cause this command to fail.

rockiram commented 5 years ago

This issue is solved. It was a path problem for apply-cmvn: I fixed the path (and updated path.sh as well), and now it trains properly.
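For anyone hitting the same failure: a quick sanity check that the featbin binaries named in the failing command actually resolve can narrow this down before digging into the recipe. A small sketch (the tool names are taken from the log above; the PATH advice is a general assumption, not specific to zamia-speech):

```python
# Sketch: verify that the Kaldi featbin tools which get_egs.sh shells out
# to can actually be resolved - the kind of path problem that broke
# training in this thread.
import shutil

def check_tools(tools):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

for tool, path in check_tools(["apply-cmvn", "feat-to-dim"]).items():
    print(f"{tool}: {path or 'NOT FOUND - add Kaldi src/featbin to PATH (e.g. in path.sh)'}")
```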

Thanks for your help @gooofy

Closing this issue.