laurensw75 / kaldi_egs_CGN

Kaldi recipe for creating Dutch ASR from CGN

Mal-formed spk2gender #5

Open JeromeNi opened 5 years ago

JeromeNi commented 5 years ago

Hi, I'm a beginner in Kaldi, and I ran into the above issue when executing make_mfcc.sh for the train_s folder.

I checked the file using head and tail, and it looked fine to me, with sorted speaker IDs on the left and f/m on the right:

```
head spk2gender
N00002 m
N00003 m
N00005 m
N00006 f
N00008 m
N00011 m
N00013 m
N00014 m
N00015 m
N00019 m
```

```
tail spk2gender
N09481 f
N09482 m
N09483 m
N09484 m
N09485 f
N09486 f
N09487 f
N09488 m
N09831 m
UNKNOWN m
```
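(If I'm reading utils/validate_data_dir.sh correctly, it reports "Mal-formed spk2gender" when any line is not exactly a speaker ID followed by m or f, so the offending line can sit anywhere in the middle of the file where head and tail won't show it. A quick scan along these lines, a sketch only, assuming the standard two-column format and the data/train_s path from above, would locate it:)

```python
# Sketch: find spk2gender lines that validate_data_dir.sh would reject
# (anything that is not exactly "<speaker-id> m|f"), plus sorting problems.
prev = None
with open("data/train_s/spk2gender", "rb") as f:  # bytes, so order matches LC_ALL=C
    for lineno, raw in enumerate(f, 1):
        fields = raw.split()
        if len(fields) != 2 or fields[1] not in (b"m", b"f"):
            print("bad line %d: %r" % (lineno, raw))
            continue
        if prev is not None and fields[0] <= prev:
            print("not sorted/unique at line %d: %r" % (lineno, raw))
        prev = fields[0]
```

This runs under both Python 2 and 3.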

Below is the console output for everything up to that step:

```
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/prepare_lang.sh /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp data/local/lang_tmp_nosp data/lang_nosp
Checking /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/silence_phones.txt ...
--> reading /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/silence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/silence_phones.txt is OK

Checking /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/optional_silence.txt ...
--> reading /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/optional_silence.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/optional_silence.txt is OK

Checking /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/nonsilence_phones.txt ...
--> reading /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/nonsilence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/nonsilence_phones.txt is OK

Checking disjoint: silence_phones.txt, nonsilence_phones.txt
--> disjoint property is OK.

Checking /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/lexicon.txt
--> reading /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/lexicon.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/lexicon.txt is OK

Checking /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/extra_questions.txt ...
--> /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/extra_questions.txt is empty (this is OK)
--> SUCCESS [validating dictionary directory /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp]

Creating /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/lexiconp.txt from /home/jerome/kaldi/egs/kaldi_egs_CGN/s5/data/local/dict_nosp/lexicon.txt
fstaddselfloops data/lang_nosp/phones/wdisambig_phones.int data/lang_nosp/phones/wdisambig_words.int
prepare_lang.sh: validating output directory
utils/validate_lang.pl data/lang_nosp
Checking data/lang_nosp/phones.txt ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang_nosp/phones.txt is OK

Checking words.txt: #0 ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> data/lang_nosp/words.txt is OK

Checking disjoint: silence.txt, nonsilence.txt, disambig.txt ...
--> silence.txt and nonsilence.txt are disjoint
--> silence.txt and disambig.txt are disjoint
--> disambig.txt and nonsilence.txt are disjoint
--> disjoint property is OK

Checking sumation: silence.txt, nonsilence.txt, disambig.txt ...
--> found no unexplainable phones in phones.txt

Checking data/lang_nosp/phones/context_indep.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 5 entry/entries in data/lang_nosp/phones/context_indep.txt
--> data/lang_nosp/phones/context_indep.int corresponds to data/lang_nosp/phones/context_indep.txt
--> data/lang_nosp/phones/context_indep.csl corresponds to data/lang_nosp/phones/context_indep.txt
--> data/lang_nosp/phones/context_indep.{txt, int, csl} are OK

Checking data/lang_nosp/phones/nonsilence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 320 entry/entries in data/lang_nosp/phones/nonsilence.txt
--> data/lang_nosp/phones/nonsilence.int corresponds to data/lang_nosp/phones/nonsilence.txt
--> data/lang_nosp/phones/nonsilence.csl corresponds to data/lang_nosp/phones/nonsilence.txt
--> data/lang_nosp/phones/nonsilence.{txt, int, csl} are OK

Checking data/lang_nosp/phones/silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 5 entry/entries in data/lang_nosp/phones/silence.txt
--> data/lang_nosp/phones/silence.int corresponds to data/lang_nosp/phones/silence.txt
--> data/lang_nosp/phones/silence.csl corresponds to data/lang_nosp/phones/silence.txt
--> data/lang_nosp/phones/silence.{txt, int, csl} are OK

Checking data/lang_nosp/phones/optional_silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang_nosp/phones/optional_silence.txt
--> data/lang_nosp/phones/optional_silence.int corresponds to data/lang_nosp/phones/optional_silence.txt
--> data/lang_nosp/phones/optional_silence.csl corresponds to data/lang_nosp/phones/optional_silence.txt
--> data/lang_nosp/phones/optional_silence.{txt, int, csl} are OK

Checking data/lang_nosp/phones/disambig.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 11 entry/entries in data/lang_nosp/phones/disambig.txt
--> data/lang_nosp/phones/disambig.int corresponds to data/lang_nosp/phones/disambig.txt
--> data/lang_nosp/phones/disambig.csl corresponds to data/lang_nosp/phones/disambig.txt
--> data/lang_nosp/phones/disambig.{txt, int, csl} are OK

Checking data/lang_nosp/phones/roots.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 81 entry/entries in data/lang_nosp/phones/roots.txt
--> data/lang_nosp/phones/roots.int corresponds to data/lang_nosp/phones/roots.txt
--> data/lang_nosp/phones/roots.{txt, int} are OK

Checking data/lang_nosp/phones/sets.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 81 entry/entries in data/lang_nosp/phones/sets.txt
--> data/lang_nosp/phones/sets.int corresponds to data/lang_nosp/phones/sets.txt
--> data/lang_nosp/phones/sets.{txt, int} are OK

Checking data/lang_nosp/phones/extra_questions.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 9 entry/entries in data/lang_nosp/phones/extra_questions.txt
--> data/lang_nosp/phones/extra_questions.int corresponds to data/lang_nosp/phones/extra_questions.txt
--> data/lang_nosp/phones/extra_questions.{txt, int} are OK

Checking data/lang_nosp/phones/word_boundary.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 325 entry/entries in data/lang_nosp/phones/word_boundary.txt
--> data/lang_nosp/phones/word_boundary.int corresponds to data/lang_nosp/phones/word_boundary.txt
--> data/lang_nosp/phones/word_boundary.{txt, int} are OK

Checking optional_silence.txt ...
--> reading data/lang_nosp/phones/optional_silence.txt
--> data/lang_nosp/phones/optional_silence.txt is OK

Checking disambiguation symbols: #0 and #1
--> data/lang_nosp/phones/disambig.txt has "#0" and "#1"
--> data/lang_nosp/phones/disambig.txt is OK

Checking topo ...

Checking word_boundary.txt: silence.txt, nonsilence.txt, disambig.txt ...
--> data/lang_nosp/phones/word_boundary.txt doesn't include disambiguation symbols
--> data/lang_nosp/phones/word_boundary.txt is the union of nonsilence.txt and silence.txt
--> data/lang_nosp/phones/word_boundary.txt is OK

Checking word-level disambiguation symbols...
--> data/lang_nosp/phones/wdisambig.txt exists (newer prepare_lang.sh)
Checking word_boundary.int and disambig.int
--> generating a 28 word sequence
--> resulting phone sequence from L.fst corresponds to the word sequence
--> L.fst is OK
--> generating a 47 word sequence
--> resulting phone sequence from L_disambig.fst corresponds to the word sequence
--> L_disambig.fst is OK

Checking data/lang_nosp/oov.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in data/lang_nosp/oov.txt
--> data/lang_nosp/oov.int corresponds to data/lang_nosp/oov.txt
--> data/lang_nosp/oov.{txt, int} are OK

--> data/lang_nosp/L.fst is olabel sorted
--> data/lang_nosp/L_disambig.fst is olabel sorted
--> SUCCESS [validating lang directory data/lang_nosp]
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/train_t/utt2spk is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/train_t/spk2utt is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/train_t/text is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/train_t/segments is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/train_t/wav.scp is not in sorted order or not unique, sorting it
fix_data_dir.sh: kept all 196879 utterances.
fix_data_dir.sh: old files are kept in data/train_t/.backup
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/train_s/utt2spk is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/train_s/spk2utt is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/train_s/text is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/train_s/segments is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/train_s/wav.scp is not in sorted order or not unique, sorting it
fix_data_dir.sh: kept all 506074 utterances.
fix_data_dir.sh: old files are kept in data/train_s/.backup
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/dev_t/utt2spk is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/dev_t/spk2utt is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/dev_t/text is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/dev_t/segments is not in sorted order or not unique, sorting it
fix_data_dir.sh: kept all 3996 utterances.
fix_data_dir.sh: old files are kept in data/dev_t/.backup
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/dev_s/utt2spk is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/dev_s/spk2utt is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/dev_s/text is not in sorted order or not unique, sorting it
/home/jerome/kaldi/egs/kaldi_egs_CGN/s5/utils/fix_data_dir.sh: file data/dev_s/segments is not in sorted order or not unique, sorting it
fix_data_dir.sh: kept all 409 utterances.
fix_data_dir.sh: old files are kept in data/dev_s/.backup
Data preparation succeeded
local/cgn_train_lms.sh --dict-suffix _nosp
Not installing the kaldi_lm toolkit since it is already there.
Getting training data with OOV words replaced with <UNK> (train_nounk.gz)
Getting raw N-gram counts
discount_ngrams: for n-gram order 1, D=0.000000, tau=0.000000 phi=1.000000
discount_ngrams: for n-gram order 2, D=0.000000, tau=0.000000 phi=1.000000
discount_ngrams: for n-gram order 3, D=1.000000, tau=0.000000 phi=1.000000
Iteration 1/6 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=0.675000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=0.675000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=0.825000 phi=2.000000
discount_ngrams: for n-gram order 1, D=0.600000, tau=0.900000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=0.900000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.100000 phi=2.000000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.215000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.215000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.485000 phi=2.000000
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 159.128529
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 165.122108

real 0m7.187s
user 0m9.882s
sys 0m0.253s
Perplexity over 115036.000000 words is 159.878877
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 165.807496

real 0m7.273s
user 0m9.991s
sys 0m0.241s
Perplexity over 115036.000000 words is 159.522903
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 165.480886

real 0m7.308s
user 0m9.962s
sys 0m0.270s
optimize_alpha.pl: alpha=1.31273108080119 is too positive, limiting it to 0.7
Projected perplexity change from setting alpha=0.7 is 159.522903->158.8554762, reduction of 0.667426800000158
Alpha value on iter 1 is 0.7
Iteration 2/6 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.402500 phi=2.000000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.870000 phi=2.000000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.524500 phi=2.000000
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 158.756764
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.782568

real 0m7.246s
user 0m9.955s
sys 0m0.248s
Perplexity over 115036.000000 words is 159.077362
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 165.113473

real 0m7.264s
user 0m10.041s
sys 0m0.160s
Perplexity over 115036.000000 words is 158.906842
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.936610

real 0m7.284s
user 0m9.839s
sys 0m0.227s
Projected perplexity change from setting alpha=0.682878172589042 is 158.906842->158.709987245877, reduction of 0.196854754122597
Alpha value on iter 2 is 0.682878172589042
Iteration 3/6 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=1.750000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=2.000000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=2.350000
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 158.714994
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.738626

real 0m7.112s
user 0m9.921s
sys 0m0.187s
Perplexity over 115036.000000 words is 158.674197
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.699548

real 0m7.241s
user 0m9.854s
sys 0m0.193s
Perplexity over 115036.000000 words is 158.671736
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.695308

real 0m7.329s
user 0m10.099s
sys 0m0.174s
Projected perplexity change from setting alpha=-0.102868420714761 is 158.671736->158.66938261301, reduction of 0.00235338699039289
Alpha value on iter 3 is -0.102868420714761
Iteration 4/6 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=1.897132
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=1.897132
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=1.080000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=1.897132
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 162.538130
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 168.825655

real 0m5.095s
user 0m6.722s
sys 0m0.156s
Perplexity over 115036.000000 words is 158.669074
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.693197

real 0m7.167s
user 0m9.843s
sys 0m0.200s
Perplexity over 115036.000000 words is 158.653242
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.641991

real 0m7.294s
user 0m9.895s
sys 0m0.198s
Projected perplexity change from setting alpha=-0.126728523021393 is 158.669074->158.374876244238, reduction of 0.294197755762468
Alpha value on iter 4 is -0.126728523021393
Iteration 5/6 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.698617, tau=1.147500 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=1.897132
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.698617, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=1.897132
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.698617, tau=2.065500 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=1.897132
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 158.421698
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.422926

real 0m7.218s
user 0m9.978s
sys 0m0.233s
Perplexity over 115036.000000 words is 158.593769
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.588728

real 0m7.231s
user 0m9.860s
sys 0m0.158s
Perplexity over 115036.000000 words is 158.506542
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.505799

real 0m7.358s
user 0m10.119s
sys 0m0.216s
optimize_alpha.pl: alpha=0.857871078344494 is too positive, limiting it to 0.7
Projected perplexity change from setting alpha=0.7 is 158.506542->158.3803401, reduction of 0.126201900000041
Alpha value on iter 5 is 0.7
Iteration 6/6 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.698617, tau=2.601000 phi=1.750000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=1.897132
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.698617, tau=2.601000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=1.897132
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.698617, tau=2.601000 phi=2.350000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=1.897132
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 158.354767
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.349239

real 0m7.044s
user 0m9.608s
sys 0m0.282s
Perplexity over 115036.000000 words is 158.420657
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.435436

real 0m7.235s
user 0m9.920s
sys 0m0.180s
Perplexity over 115036.000000 words is 158.392099
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.397430

real 0m7.393s
user 0m10.088s
sys 0m0.215s
optimize_alpha.pl: alpha=4.40254038952905 is too positive, limiting it to 0.7
Projected perplexity change from setting alpha=0.7 is 158.392099->158.320525733333, reduction of 0.0715732666666327
Alpha value on iter 6 is 0.7
Final config is:
D=0.6 tau=1.53 phi=2
D=0.698617181582886 tau=2.601 phi=2.7
D=0 tau=3.14698218274151 phi=1.89713157928524
Discounting N-grams.
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.698617, tau=2.601000 phi=2.700000
discount_ngrams: for n-gram order 3, D=0.000000, tau=3.146982 phi=1.897132
Computing final perplexity
Building ARPA LM (perplexity computation is in background)
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 158.367475
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 164.357110
158.367475
Done training LM of type 3gram-mincount
Pruning N-grams
Removed 1063563 parameters, total divergence 494359.437025
Average divergence per parameter is 0.464814, versus threshold 1.500000
Computing pruned perplexity
interpolate_ngrams: 133557 words in wordslist
After pruning, number of N-grams is 474712
Building ARPA LM (perplexity computation is in background)
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 167.556966
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 174.078534
167.556966
ARPA output is in data/local/local_lm/3gram-mincount//lm_pr1.5.gz
Done pruning LM with threshold 1.5
Getting raw N-gram counts
discount_ngrams: for n-gram order 1, D=0.000000, tau=0.000000 phi=1.000000
discount_ngrams: for n-gram order 2, D=0.000000, tau=0.000000 phi=1.000000
discount_ngrams: for n-gram order 3, D=1.000000, tau=0.000000 phi=1.000000
discount_ngrams: for n-gram order 4, D=1.000000, tau=0.000000 phi=1.000000
Iteration 1/8 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=0.675000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=0.675000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=0.675000 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=0.675000 phi=2.000000
discount_ngrams: for n-gram order 1, D=0.600000, tau=0.900000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=0.900000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=0.900000 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=0.900000 phi=2.000000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.215000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.215000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.215000 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=1.215000 phi=2.000000
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 157.505601
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 163.290314

real 0m9.434s
user 0m12.872s
sys 0m0.296s
Perplexity over 115036.000000 words is 157.059713
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 162.871141

real 0m9.543s
user 0m12.930s
sys 0m0.367s
Perplexity over 115036.000000 words is 156.566585
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 162.410693

real 0m9.605s
user 0m13.061s
sys 0m0.293s
optimize_alpha.pl: alpha=1.30330854088574 is too positive, limiting it to 0.7
Projected perplexity change from setting alpha=0.7 is 157.059713->156.226424733333, reduction of 0.833288266666585
Alpha value on iter 1 is 0.7
Iteration 2/8 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=1.147500 phi=2.000000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.065500 phi=2.000000
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 156.380350
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 162.262134

real 0m9.275s
user 0m12.898s
sys 0m0.316s
Perplexity over 115036.000000 words is 156.149828
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 162.024502

real 0m9.524s
user 0m13.065s
sys 0m0.268s
Perplexity over 115036.000000 words is 156.258032
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 162.135801

real 0m9.559s
user 0m12.979s
sys 0m0.300s
Projected perplexity change from setting alpha=0.689920401260892 is 156.258032->156.115141567241, reduction of 0.14289043275906
Alpha value on iter 2 is 0.689920401260892
Iteration 3/8 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.750000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=2.000000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=2.350000
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 156.072369
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.945069

real 0m9.262s
user 0m12.739s
sys 0m0.301s
Perplexity over 115036.000000 words is 156.018155
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.889029

real 0m9.446s
user 0m12.864s
sys 0m0.335s
Perplexity over 115036.000000 words is 156.149035
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 162.024396

real 0m9.596s
user 0m13.032s
sys 0m0.327s
optimize_alpha.pl: alpha=-29.8351774535627 is too negative, limiting it to -0.5
Projected perplexity change from setting alpha=-0.5 is 156.072369->155.964397190476, reduction of 0.107971809523832
Alpha value on iter 3 is -0.5
Iteration 4/8 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.147500 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.065500 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 155.825596
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.690066

real 0m9.393s
user 0m12.843s
sys 0m0.252s
Perplexity over 115036.000000 words is 156.144478
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 162.020621

real 0m9.517s
user 0m13.042s
sys 0m0.309s
Perplexity over 115036.000000 words is 155.974526
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.844339

real 0m9.537s
user 0m13.121s
sys 0m0.309s
Projected perplexity change from setting alpha=0.67699544284017 is 155.974526->155.780278308854, reduction of 0.194247691146018
Alpha value on iter 4 is 0.67699544284017
Iteration 5/8 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=1.750000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=2.000000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=2.350000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 155.798216
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.661575

real 0m9.425s
user 0m12.836s
sys 0m0.312s
Perplexity over 115036.000000 words is 155.705262
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.567419

real 0m9.480s
user 0m12.809s
sys 0m0.347s
Perplexity over 115036.000000 words is 155.731891
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.593887

real 0m9.507s
user 0m12.961s
sys 0m0.316s
optimize_alpha.pl: alpha=-0.51007182107332 is too negative, limiting it to -0.5
Projected perplexity change from setting alpha=-0.5 is 155.731891->155.695921333333, reduction of 0.0359696666666025
Alpha value on iter 5 is -0.5
Iteration 6/8 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=1.500000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.800000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=1.500000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=1.080000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=1.500000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 159.509748
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 165.631957

real 0m7.239s
user 0m9.561s
sys 0m0.208s
Perplexity over 115036.000000 words is 155.707118
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.537072

real 0m9.495s
user 0m12.992s
sys 0m0.300s
Perplexity over 115036.000000 words is 155.717955
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.582697

real 0m9.525s
user 0m13.012s
sys 0m0.313s
Projected perplexity change from setting alpha=-0.126205188383732 is 155.717955->155.431511777556, reduction of 0.286443222443495
Alpha value on iter 6 is -0.126205188383732
Iteration 7/8 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.699036, tau=1.147500 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=1.500000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.699036, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=1.500000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.699036, tau=2.065500 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=1.500000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 155.480147
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.322731

real 0m9.365s
user 0m12.851s
sys 0m0.272s
Perplexity over 115036.000000 words is 155.560051
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.400276

real 0m9.409s
user 0m12.867s
sys 0m0.286s
Perplexity over 115036.000000 words is 155.647336
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.483250

real 0m9.562s
user 0m12.971s
sys 0m0.310s
optimize_alpha.pl: alpha=0.741762028608297 is too positive, limiting it to 0.7
Projected perplexity change from setting alpha=0.7 is 155.560051->155.449587166667, reduction of 0.110463833333284
Alpha value on iter 7 is 0.7
Iteration 8/8 of optimizing discounting parameters
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.699036, tau=2.601000 phi=1.750000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=1.500000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.699036, tau=2.601000 phi=2.000000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=1.500000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.699036, tau=2.601000 phi=2.350000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=1.500000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 155.414176
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.250228

real 0m9.334s
user 0m12.742s
sys 0m0.306s
Perplexity over 115036.000000 words is 155.478636
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.334794
Perplexity over 115036.000000 words is 155.451820
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.298601

real 0m9.479s
user 0m12.941s
sys 0m0.282s

real 0m9.476s
user 0m12.860s
sys 0m0.331s
optimize_alpha.pl: objective function is not convex; returning alpha=0.7
Projected perplexity change from setting alpha=0.7 is 155.451820->155.376413466667, reduction of 0.0754065333333358
Alpha value on iter 8 is 0.7
Final config is:
D=0.6 tau=1.53 phi=2.0
D=0.699035849293014 tau=2.601 phi=2.7
D=0.0 tau=2.56580302754546 phi=1.5
D=0.0 tau=2.58557821392916 phi=1.5
Discounting N-grams.
discount_ngrams: for n-gram order 1, D=0.600000, tau=1.530000 phi=2.000000
discount_ngrams: for n-gram order 2, D=0.699036, tau=2.601000 phi=2.700000
discount_ngrams: for n-gram order 3, D=0.000000, tau=2.565803 phi=1.500000
discount_ngrams: for n-gram order 4, D=0.000000, tau=2.585578 phi=1.500000
Computing final perplexity
Building ARPA LM (perplexity computation is in background)
interpolate_ngrams: 133557 words in wordslist
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 155.427936
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 161.259152
155.427936
Done training LM of type 4gram-mincount
Pruning N-grams
Removed 1320191 parameters, total divergence 580420.241259
Average divergence per parameter is 0.439649, versus threshold 1.500000
Computing pruned perplexity
interpolate_ngrams: 133557 words in wordslist
After pruning, number of N-grams is 517418
Building ARPA LM (perplexity computation is in background)
interpolate_ngrams: 133557 words in wordslist
Perplexity over 115036.000000 words is 165.347818
Perplexity over 112774.000000 words (excluding 2262.000000 OOVs) is 171.745769
165.347818
ARPA output is in data/local/local_lm/4gram-mincount/lm_pr1.5.gz
Done pruning LM with threshold 1.5
local/cgn_format_local_lms.sh --lang-suffix _nosp
arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang_nosp/words.txt - data/lang_nosp_test_tgpr/G.fst
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 364067 to 75783
fstisstochastic data/lang_nosp_test_tgpr/G.fst
1.0609e-05 -0.337552
arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang_nosp/words.txt - data/lang_nosp_test_tg/G.fst
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 1081217 to 214817
fstisstochastic data/lang_nosp_test_tg/G.fst
9.22247e-06 -0.539687
arpa-to-const-arpa --bos-symbol=133559 --eos-symbol=133560 --unk-symbol=140 - data/lang_nosp_test_tgconst/G.carpa
LOG (arpa-to-const-arpa[5.5.194~1-1dcd]:BuildConstArpaLm():const-arpa-lm.cc:1078) Reading -
utils/map_arpa_lm.pl: Processing "\data\"
utils/map_arpa_lm.pl: Processing "\1-grams:\"
LOG (arpa-to-const-arpa[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa-to-const-arpa[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
utils/map_arpa_lm.pl: Processing "\2-grams:\"
LOG (arpa-to-const-arpa[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
utils/map_arpa_lm.pl: Processing "\3-grams:\"
LOG (arpa-to-const-arpa[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang_nosp/words.txt - data/lang_nosp_test_fg/G.fst
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \4-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 1486197 to 361272
fstisstochastic data/lang_nosp_test_fg/G.fst
9.09031e-06 -0.612351
arpa-to-const-arpa --bos-symbol=133559 --eos-symbol=133560 --unk-symbol=140 - data/lang_nosp_test_fgconst/G.carpa
LOG (arpa-to-const-arpa[5.5.194~1-1dcd]:BuildConstArpaLm():const-arpa-lm.cc:1078) Reading -
utils/map_arpa_lm.pl: Processing "\data\"
utils/map_arpa_lm.pl: Processing "\1-grams:\"
LOG (arpa-to-const-arpa[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa-to-const-arpa[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
utils/map_arpa_lm.pl: Processing "\2-grams:\"
LOG (arpa-to-const-arpa[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
utils/map_arpa_lm.pl: Processing "\3-grams:\"
LOG (arpa-to-const-arpa[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
utils/map_arpa_lm.pl: Processing "\4-grams:\"
LOG (arpa-to-const-arpa[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \4-grams: section.
arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang_nosp/words.txt - data/lang_nosp_test_fgpr/G.fst
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:Read():arpa-file-parser.cc:149) Reading \4-grams: section.
LOG (arpa2fst[5.5.194~1-1dcd]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 498792 to 101342
fstisstochastic data/lang_nosp_test_fgpr/G.fst
1.02835e-05 -0.690422
steps/make_mfcc.sh --cmd run.pl --nj 30 data/train_s
utils/validate_data_dir.sh: Mal-formed spk2gender file
```

I ran into some other issues before that, but cleared my data folder. I'm not sure if I should have cleared anything else, though.

Thanks for the help!

laurensw75 commented 5 years ago

Thanks for the catch. I hadn't run into this issue before, so maybe something was different in my copy of CGN. Apparently some (named) speakers are not in the speakers.txt of CGN, so we don't know their gender. I've updated the script to assume their gender is male ;-)
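For anyone curious, the shape of that workaround is roughly the following (a sketch only; gender_map stands in for whatever the script actually derives from speakers.txt, and the data paths are the standard Kaldi ones):

```python
# Sketch: write spk2gender with a fallback of "m" for any speaker whose
# gender is unknown (e.g. a speaker missing from CGN's speakers.txt).
gender_map = {}  # hypothetical: speaker-id -> "m"/"f", parsed from speakers.txt
with open("data/train_s/spk2gender", "w") as out:
    for line in open("data/train_s/spk2utt"):
        spk = line.split()[0]
        out.write("%s %s\n" % (spk, gender_map.get(spk, "m")))
```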

It's probably a good idea to delete your old data folder.

JeromeNi commented 5 years ago

Thanks for the fix! I am now able to get to stage 7, the cleaning up portion of the script.

However, upon running stage 8, I'm seeing some errors there:

```
steps/cleanup/clean_and_segment_data.sh: Building biased-language-model decoding graphs...
steps/cleanup/make_biased_lm_graphs.sh --nj 30 --cmd run.pl data/train_s data/lang_s exp/train_s/tri3_cleaned_work exp/train_s/tri3_cleaned_work/graphs
sym2int.pl: replacing 3d-dingetjes with 140
sym2int.pl: replacing verkooppunt-xxx with 140
sym2int.pl: replacing r&d-poot with 140
sym2int.pl: replacing r&d with 140
sym2int.pl: replacing r&d-subsidies with 140
sym2int.pl: replacing r&d-instituten with 140
sym2int.pl: replacing r&d-activiteiten with 140
sym2int.pl: replacing r&d-faciliteiten with 140
sym2int.pl: replacing r&d-researchlabs with 140
sym2int.pl: replacing 3d with 140
sym2int.pl: replacing agt-den with 140
sym2int.pl: replacing Één with 140
sym2int.pl: replacing Één with 140
sym2int.pl: replacing Één with 140
sym2int.pl: replacing onroerendgoedbelastingen with 140
sym2int.pl: replacing onroerendgoedbelastingen with 140
sym2int.pl: replacing woz-uh-waarde with 140
sym2int.pl: replacing redengeving with 140
sym2int.pl: replacing verantwoordelijkheidsrelatie with 140
sym2int.pl: replacing verantwoordelijkheidsrelatie with 140
sym2int.pl: not warning for OOVs any more times
** Replaced 6843 instances of OOVs with 140
steps/cleanup/make_biased_lm_graphs.sh: creating utterance-group-specific decoding graphs with biased LMs
run.pl: 30 / 30 failed, log is in exp/train_s/tri3_cleaned_work/graphs/log/compile_decoding_graphs.*.log
steps/cleanup/clean_and_segment_data.sh: Building biased-language-model decoding graphs...
steps/cleanup/make_biased_lm_graphs.sh --nj 30 --cmd run.pl data/train_t data/lang_t exp/train_t/tri3_cleaned_work exp/train_t/tri3_cleaned_work/graphs
sym2int.pl: replacing mnemosyne with 140
sym2int.pl: replacing mnemosyne with 140
sym2int.pl: replacing eh with 140
sym2int.pl: replacing xxx-kabel with 140
sym2int.pl: replacing lisenka with 140
sym2int.pl: replacing lisenka with 140
sym2int.pl: replacing voorplezier with 140
sym2int.pl: replacing rüdesheim with 140
sym2int.pl: replacing tussenpaneel with 140
sym2int.pl: replacing kutbaantjes with 140
sym2int.pl: replacing kutbaantjes with 140
sym2int.pl: replacing energiezuiniger with 140
sym2int.pl: replacing tantaal with 140
sym2int.pl: replacing klein-walcheren with 140
sym2int.pl: replacing stedentrips with 140
sym2int.pl: replacing boekenmondeling with 140
sym2int.pl: replacing ramptoerist with 140
sym2int.pl: replacing fortuyn-broek with 140
sym2int.pl: replacing clausoleum with 140
sym2int.pl: replacing 3d-voorstelling with 140
sym2int.pl: not warning for OOVs any more times
** Replaced 1987 instances of OOVs with 140
steps/cleanup/make_biased_lm_graphs.sh: creating utterance-group-specific decoding graphs with biased LMs
run.pl: 30 / 30 failed, log is in exp/train_t/tri3_cleaned_work/graphs/log/compile_decoding_graphs.*.log
copy_data_dir.sh: no such file data/train_t_cleaned/utt2spk
rm: cannot remove 'data/train_t_cleaned_16khz/feats.scp': No such file or directory
./run.sh: line 232: data/train_t_cleaned_16khz/wav.scp: No such file or directory
cat: data/train_t_cleaned/wav.scp: No such file or directory
steps/make_mfcc.sh --cmd run.pl --nj 30 --mfcc-config conf/mfcc.conf data/train_t_cleaned_16khz
make_mfcc.sh: no such file data/train_t_cleaned_16khz/wav.scp
```

It seems that those folders are empty, so I looked at the console output again and found these errors in step 7:

```
run.pl: 30 / 30 failed, log is in exp/train_s/tri3_cleaned_work/graphs/log/compile_decoding_graphs.*.log
run.pl: 30 / 30 failed, log is in exp/train_t/tri3_cleaned_work/graphs/log/compile_decoding_graphs.*.log
```

I'm wondering if that's because I did something wrong in some earlier steps, so I checked the WER files in the tri3 folders, and they seem to be fine:

```
cat exp/train_s/tri3/decode_tgpr/wer* | grep WER
%WER 15.05 [ 818 / 5434, 169 ins, 106 del, 543 sub ]
%WER 14.92 [ 811 / 5434, 160 ins, 108 del, 543 sub ]
%WER 14.89 [ 809 / 5434, 156 ins, 109 del, 544 sub ]
%WER 14.85 [ 807 / 5434, 150 ins, 120 del, 537 sub ]
%WER 15.27 [ 830 / 5434, 143 ins, 135 del, 552 sub ]
%WER 15.62 [ 849 / 5434, 137 ins, 142 del, 570 sub ]
%WER 15.75 [ 856 / 5434, 132 ins, 145 del, 579 sub ]
%WER 16.29 [ 885 / 5434, 134 ins, 151 del, 600 sub ]
%WER 20.08 [ 1091 / 5434, 288 ins, 74 del, 729 sub ]
%WER 18.61 [ 1011 / 5434, 257 ins, 79 del, 675 sub ]
%WER 17.24 [ 937 / 5434, 228 ins, 80 del, 629 sub ]
%WER 16.49 [ 896 / 5434, 205 ins, 92 del, 599 sub ]
%WER 15.84 [ 861 / 5434, 192 ins, 96 del, 573 sub ]
%WER 15.26 [ 829 / 5434, 179 ins, 100 del, 550 sub ]

cat exp/train_s/tri3/decode_tgpr_fg/wer* | grep WER
%WER 14.69 [ 798 / 5434, 170 ins, 99 del, 529 sub ]
%WER 14.37 [ 781 / 5434, 158 ins, 99 del, 524 sub ]
%WER 14.15 [ 769 / 5434, 153 ins, 105 del, 511 sub ]
%WER 14.28 [ 776 / 5434, 151 ins, 111 del, 514 sub ]
%WER 14.37 [ 781 / 5434, 146 ins, 114 del, 521 sub ]
%WER 14.69 [ 798 / 5434, 145 ins, 122 del, 531 sub ]
%WER 14.72 [ 800 / 5434, 139 ins, 130 del, 531 sub ]
%WER 14.87 [ 808 / 5434, 136 ins, 135 del, 537 sub ]
%WER 19.58 [ 1064 / 5434, 292 ins, 73 del, 699 sub ]
%WER 18.18 [ 988 / 5434, 266 ins, 78 del, 644 sub ]
%WER 16.86 [ 916 / 5434, 234 ins, 79 del, 603 sub ]
%WER 15.83 [ 860 / 5434, 197 ins, 86 del, 577 sub ]
%WER 15.26 [ 829 / 5434, 188 ins, 88 del, 553 sub ]
%WER 15.02 [ 816 / 5434, 182 ins, 95 del, 539 sub ]

cat exp/train_t/tri3/decode_tgpr_fg/wer* | grep WER
%WER 44.07 [ 10721 / 24328, 1975 ins, 2224 del, 6522 sub ] [PARTIAL]
%WER 43.48 [ 10577 / 24328, 1822 ins, 2378 del, 6377 sub ] [PARTIAL]
%WER 43.16 [ 10500 / 24328, 1694 ins, 2526 del, 6280 sub ] [PARTIAL]
%WER 42.88 [ 10433 / 24328, 1555 ins, 2708 del, 6170 sub ] [PARTIAL]
%WER 43.22 [ 10515 / 24328, 1451 ins, 2923 del, 6141 sub ] [PARTIAL]
%WER 43.44 [ 10568 / 24328, 1334 ins, 3107 del, 6127 sub ] [PARTIAL]
%WER 43.87 [ 10672 / 24328, 1267 ins, 3278 del, 6127 sub ] [PARTIAL]
%WER 44.41 [ 10805 / 24328, 1242 ins, 3474 del, 6089 sub ] [PARTIAL]
%WER 53.52 [ 13020 / 24328, 3512 ins, 1458 del, 8050 sub ] [PARTIAL]
%WER 51.26 [ 12471 / 24328, 3221 ins, 1555 del, 7695 sub ] [PARTIAL]
%WER 49.22 [ 11974 / 24328, 2899 ins, 1688 del, 7387 sub ] [PARTIAL]
%WER 47.43 [ 11539 / 24328, 2625 ins, 1799 del, 7115 sub ] [PARTIAL]
%WER 46.21 [ 11242 / 24328, 2386 ins, 1939 del, 6917 sub ] [PARTIAL]
%WER 45.09 [ 10969 / 24328, 2169 ins, 2114 del, 6686 sub ] [PARTIAL]

cat exp/train_t/tri3/decode_tgpr/wer* | grep WER
%WER 44.89 [ 10920 / 24328, 1978 ins, 2267 del, 6675 sub ] [PARTIAL]
%WER 44.31 [ 10780 / 24328, 1827 ins, 2432 del, 6521 sub ] [PARTIAL]
%WER 44.10 [ 10728 / 24328, 1712 ins, 2587 del, 6429 sub ] [PARTIAL]
%WER 43.95 [ 10692 / 24328, 1566 ins, 2789 del, 6337 sub ] [PARTIAL]
%WER 43.92 [ 10685 / 24328, 1449 ins, 2985 del, 6251 sub ] [PARTIAL]
%WER 44.30 [ 10778 / 24328, 1365 ins, 3194 del, 6219 sub ] [PARTIAL]
%WER 44.60 [ 10851 / 24328, 1269 ins, 3401 del, 6181 sub ] [PARTIAL]
%WER 45.04 [ 10958 / 24328, 1206 ins, 3607 del, 6145 sub ] [PARTIAL]
%WER 53.99 [ 13135 / 24328, 3553 ins, 1443 del, 8139 sub ] [PARTIAL]
%WER 51.83 [ 12610 / 24328, 3245 ins, 1561 del, 7804 sub ] [PARTIAL]
%WER 49.91 [ 12141 / 24328, 2915 ins, 1696 del, 7530 sub ] [PARTIAL]
%WER 48.19 [ 11724 / 24328, 2644 ins, 1843 del, 7237 sub ] [PARTIAL]
%WER 46.74 [ 11371 / 24328, 2378 ins, 1978 del, 7015 sub ] [PARTIAL]
%WER 45.74 [ 11128 / 24328, 2168 ins, 2112 del, 6848 sub ] [PARTIAL]
```

Thanks for the help again!

laurensw75 commented 5 years ago

I haven't executed the whole script in a while, especially the part that generates the basic models. When I find the time, I will try to reproduce your error and see if (and how) it needs fixing.

laurensw75 commented 5 years ago

I have just tried it again and did not get your errors in step 7.

I did, however, use the lexicon.lex file from https://github.com/opensource-spraakherkenning-nl/Kaldi_NL, as this results in slightly better performance (tri3_s_fg = 13.0% WER, tri3_t_fg = 43.4% WER) and uses a phone set that is more convenient for me.

Because step 7 did not complete properly, step 8 failed as well. Is there anything useful in the error logs you mention (exp/train_s/tri3_cleaned_work/graphs/log/compile_decoding_graphs.*.log)?
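(If it helps while digging through those, something like this prints the first error-looking line of each per-job log; the glob pattern is copied from the run.pl message, and the keywords are just a guess at what to look for:)

```python
import glob

# Print the first suspicious line from each failed job's log.
pattern = "exp/train_s/tri3_cleaned_work/graphs/log/compile_decoding_graphs.*.log"
for path in sorted(glob.glob(pattern)):
    with open(path) as f:
        for line in f:
            if "Traceback" in line or "ERROR" in line or "ASSERTION_FAILED" in line:
                print("%s: %s" % (path, line.strip()))
                break
```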

JeromeNi commented 5 years ago

Yes, actually there is. There seems to be an error when running make_biased_lms.py.

```
make_biased_lms.py: error calling subprocess, command was: steps/cleanup/internal/make_one_biased_lm.py --word-disambig-symbol=133558 --ngram-order=4 --min-lm-state-count=10 --discounting-constant=0.3 --top-words=exp/train_s/tri3_cleaned_work/graphs/top_words.int, error was : a bytes-like object is required, not 'str'
make_one_biased_lm.py: processed 0 lines of input
Traceback (most recent call last):
  File "steps/cleanup/internal/make_one_biased_lm.py", line 310, in <module>
    ngram_counts.PrintAsFst(args.word_disambig_symbol)
  File "steps/cleanup/internal/make_one_biased_lm.py", line 276, in PrintAsFst
    this_cost = -math.log(self.GetProb(hist, word, total_count_map))
  File "steps/cleanup/internal/make_one_biased_lm.py", line 246, in GetProb
    prob = float(word_to_count[word]) / total_count
ZeroDivisionError: float division by zero
ASSERTION_FAILED (compile-train-graphs-fsts[5.5.194~1-1dcd]:CompileGraphs():training-graph-compiler.cc:186) : 'phone2word_fst.Start() != kNoStateId && "Perhaps you have words missing in your lexicon?"'
```
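That "a bytes-like object is required, not 'str'" message looks like the classic Python 2-to-3 incompatibility: str gets written to a subprocess pipe that Python 3 treats as binary, so the child receives no input ("processed 0 lines of input") and the empty counts then cause the division by zero. A minimal illustration of the pattern (not Kaldi's actual code):

```python
import subprocess

# A pipe created this way expects bytes under Python 3; Python 2 accepted str.
p = subprocess.Popen(["cat"], stdin=subprocess.PIPE)
# p.stdin.write("hallo wereld\n")                  # TypeError on Python 3
p.stdin.write("hallo wereld\n".encode("utf-8"))    # works on both 2 and 3
p.stdin.close()
p.wait()
```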

I'm attaching one of the logs here: https://drive.google.com/file/d/10Y3MGTRuzFnN9e8Mjh-RQAVmotzLdFOp/view?usp=sharing

laurensw75 commented 5 years ago

Which version of Python are you running?

JeromeNi commented 5 years ago

I'm running Python 3.7

laurensw75 commented 5 years ago

These cleanup scripts are not mine; they are originally part of another example. I think you should use Python 2.7 for most Kaldi-related stuff, though whether it is absolutely required depends on the exact script. My advice is to try again using Python 2.7. Step 7 typically takes a really long time, so hopefully you'll find out fairly quickly whether this solves the problem.
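A cheap way to catch this early, rather than deep inside make_one_biased_lm.py, would be a version guard at the top of whatever driver script you use for this stage (purely illustrative, not part of the recipe):

```python
import sys

# The cleanup scripts here predate Python 3 support, so fail fast if the
# wrong interpreter would be picked up.
if sys.version_info[0] != 2:
    sys.exit("This stage expects Python 2.7, found %d.%d" % sys.version_info[:2])
```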

mstopa commented 4 years ago

Using Python 2.7 solved my problem with

```
ASSERTION_FAILED (compile-train-graphs-fsts[5.5.194~1-1dcd]:CompileGraphs():training-graph-compiler.cc:186) : 'phone2word_fst.Start() != kNoStateId && "Perhaps you have words missing in your lexicon?"'
```

too. Thanks!