twgo / siann1-hak8_boo5-hing5

Acoustic model training
MIT License
Comparing the two kinds of dict #25

Closed. leo424y closed this issue 6 years ago

leo424y commented 6 years ago

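Findings

Besides local/dict there is also a free-syllable/dict. The script that touches both is ./產生free-syllable的graph.sh, at line 19:
cp ${data}/local/dict/[^l]* ${data}/local/free-syllable/dict
So free-syllable just has two more files than dict (presumably the lexicon.txt and uniform.fst shown in the tree output further down). In twsas, this is reached by 產生free-syllable的graph, the script that runs just before 走評估.

Questions

Based on: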

root@1a41e361410d:/usr/local/kaldi/egs/taiwanese/s5c/data/local# cat dict/lexicon.txt | wc
   7008   33719  163462
root@1a41e361410d:/usr/local/kaldi/egs/taiwanese/s5c/data/local# cat free-syllable/dict/lexicon.txt | wc
    725    2175    8974
root@1a41e361410d:/usr/local/kaldi/egs/taiwanese/s5c/data/local# diff free-syllable/dict/lexicon.txt dict/lexicon.txt | wc
   7735   43581  187883
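
A note on reading these numbers: 7735 is consistent with diff treating the two files as one fully replaced hunk (1 hunk header + 725 removed lines + 1 separator + 7008 added lines = 7735), so the line-level diff says little about which entries the two lexicons share. A minimal sketch of comparing only the first column (the word or syllable) is given here. The function name compare_lexicon_words is hypothetical and not part of the repo, and the sketch assumes whitespace-separated lexicon lines like the ones in the head output in this issue.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repo): compare the entry column of two lexicons.
compare_lexicon_words() {
  local first=$1 second=$2 tmp
  tmp=$(mktemp -d)
  # Keep only the first field of non-empty lines and de-duplicate.
  # LC_ALL=C keeps the ordering used by sort and comm consistent.
  awk 'NF {print $1}' "$first"  | LC_ALL=C sort -u > "$tmp/first.words"
  awk 'NF {print $1}' "$second" | LC_ALL=C sort -u > "$tmp/second.words"
  # comm -23: entries only in the first file; -13: only in the second; -12: in both.
  echo "only_in_first=$(LC_ALL=C comm -23 "$tmp/first.words" "$tmp/second.words" | wc -l | tr -d ' ')"
  echo "only_in_second=$(LC_ALL=C comm -13 "$tmp/first.words" "$tmp/second.words" | wc -l | tr -d ' ')"
  echo "shared=$(LC_ALL=C comm -12 "$tmp/first.words" "$tmp/second.words" | wc -l | tr -d ' ')"
  rm -rf "$tmp"
}

# Possible usage from data/local (paths as in the session above):
# compare_lexicon_words free-syllable/dict/lexicon.txt dict/lexicon.txt
```

This counts distinct entries rather than lines, so words with several pronunciations are counted once.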

fst

The file L.fst is the Finite State Transducer form of the lexicon (L, see "Speech Recognition with Weighted Finite-State Transducers" by Mohri, Pereira and Riley, in Springer Handbook on Speech Processing and Speech Communication, 2008), with phone symbols on the input and word symbols on the output. The file L_disambig.fst is the lexicon, as above, but including the disambiguation symbols #1, #2, and so on, as well as the self-loop with #0 on it to "pass through" the disambiguation symbol from the grammar. See Disambiguation symbols for more explanation. Anyway, you won't have to deal with this directly.

Our tutorial above on how to create the lang/ directory did not address how to create the file G.fst, which is the finite state transducer form of the language model or grammar that we'll decode with. 

lexicon, uniform

root@fab417c4eccd:/usr/local/kaldi/egs/taiwanese/s5c# grep -rnw . -e  'lexicon.txt'
./local/swbd1_prepare_dict.sh:92:ln -sf lexicon5.txt lexicon.txt # This is the final lexicon.
./local/swbd1_train_lms.sh:43:lexicon=$2  # data/local/dict/lexicon.txt
./外語準備.sh:23:cat ${gua7gi2_data}/lexicon.txt | \
./外語準備.sh:72:  cat > ${data}/local/gua7gi2/dict/lexicon.txt
./外語準備.sh:83:utils/format_lm.sh data/lang_dict_gua7gi2 $LM_GZ data/local/gua7gi2/dict/lexicon.txt data/lang_gua7gi2
./處理nnet3濫.sh:17:  cat $tai5_data/dict/lexicon.txt $hua5_data/dict/lexicon.txt | \
./處理nnet3濫.sh:19:    cat > $data/dict/lexicon.txt
./處理nnet3濫.sh:32:  utils/format_lm.sh $data/lang_dict $LM_GZ $data/dict/lexicon.txt $data/lang
./試format_lm.sh:24:utils/format_lm.sh data/lang_sp $LM_GZ data/local/dict/lexicon.txt $LANG_DIR
./產生free-syllable的graph.sh:20:cp ${data}/local/free-syllable/lexicon.txt ${data}/local/free-syllable/dict
./處理nnet3華.sh:24:  utils/format_lm.sh $data/lang_dict $LM_GZ $data/dict/lexicon.txt $data/lang
./處理nnet3台詞台音.sh:15:  bash $data/dict/處理lexicon.sh $data/dict/lexicon.txt.tiau3 $data/dict/lexicon.txt
./處理nnet3台詞台音.sh:54:  utils/format_lm.sh $data/lang_dict $LM_GZ $data/dict/lexicon.txt $data/lang

root@fab417c4eccd:/usr/local/kaldi/egs/taiwanese/s5c# grep -rnw . -e  'uniform.fst'
./產生free-syllable的graph.sh:25:cat data/local/free-syllable/uniform.fst | \
root@fab417c4eccd:/usr/local/kaldi/egs/taiwanese/s5c/data/local# head free-syllable/lexicon.txt
a   ʔ- a
ah  ʔ- aʔ
ai  ʔ- ai
aih ʔ- aiʔ
ak  ʔ- ak
am  ʔ- am
an  ʔ- an
ang ʔ- aŋ
ann ʔ- aⁿ
annh    ʔ- aⁿʔ
root@fab417c4eccd:/usr/local/kaldi/egs/taiwanese/s5c/data/local# head free-syllable/uniform.fst
0   0   a   a
0   0   ah  ah
0   0   ai  ai
0   0   aih aih
0   0   ak  ak
0   0   am  am
0   0   an  an
0   0   ang ang
0   0   ann ann
0   0   annh    annh
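
Each line of uniform.fst above looks like OpenFst text format: source state, destination state, input label, output label. Every arc loops on state 0 and carries no weight, so each syllable would be equally likely at every step, and the file appears to play the role of the grammar G in place of a language model. A minimal sketch of how such a loop could be derived from the first column of free-syllable/lexicon.txt follows. The function name lexicon_to_uniform_loop is hypothetical, and the closing final-state line is an assumption, since only the first ten lines of the real file are shown.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repo): print a one-state loop in OpenFst
# text format, with one unweighted arc per distinct lexicon entry.
lexicon_to_uniform_loop() {
  # Columns: source state, destination state, input label, output label.
  # seen[] drops repeated entries (words with several pronunciations).
  # The final "0" line marks state 0 as final (assumed; not visible in the head output).
  awk 'NF && !seen[$1]++ { printf "0\t0\t%s\t%s\n", $1, $1 }
       END { print 0 }' "$1"
}

# Possible usage from data/local:
# lexicon_to_uniform_loop free-syllable/lexicon.txt | head
```

Turning the text form into a binary FST would normally go through OpenFst's fstcompile with a symbol table. The actual pipeline in 產生free-syllable的graph.sh is cut off in the grep output, so it is not reproduced here.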

free-syllable vs normal dict

root@fab417c4eccd:/usr/local/kaldi/egs/taiwanese/s5c/data/local# tree free-syllable/
free-syllable/
├── dict
│   ├── extra_questions.txt
│   ├── lexiconp.txt
│   ├── lexicon.txt
│   ├── nonsilence_phones.txt
│   ├── optional_silence.txt
│   └── silence_phones.txt
├── lexicon.txt
└── uniform.fst

1 directory, 8 files
root@fab417c4eccd:/usr/local/kaldi/egs/taiwanese/s5c/data/local# tree dict/
dict/
├── extra_questions.txt
├── lexiconp.txt
├── lexicon.txt
├── nonsilence_phones.txt
├── optional_silence.txt
└── silence_phones.txt
root@fab417c4eccd:/usr/local/kaldi/egs/taiwanese/s5c#  grep -rnw . -e  'local/dict'
./local/swbd1_prepare_dict.sh:22:patch <local/dict.patch $dir/lexicon0.txt || exit 1;
./local/swbd1_train_lms.sh:43:lexicon=$2  # data/local/dict/lexicon.txt
./試format_sri.sh:15:utils/prepare_lang.sh tshi3/local/dict "<UNK>"  tshi3/local/lang tshi3/lang
./外語準備.sh:22:cp ${data}/local/dict/[^l]* ${data}/local/gua7gi2/dict
./試format_lm.sh:24:utils/format_lm.sh data/lang_sp $LM_GZ data/local/dict/lexicon.txt $LANG_DIR
./試rescore_arpa.sh:15:utils/prepare_lang.sh tshi3/local/dict "<UNK>"  tshi3/local/lang tshi3/lang
./產生free-syllable的graph.sh:19:cp ${data}/local/dict/[^l]* ${data}/local/free-syllable/dict
./試rescore.sh:15:utils/prepare_lang.sh tshi3/local/dict "<UNK>"  tshi3/local/lang tshi3/lang
./對齊音檔.sh:29:cp ${data}/[^l]* "${giap8}/local/dict"
./對齊音檔.sh:30:utils/prepare_lang.sh "${giap8}/local/dict" "<UNK>"  "${giap8}/local/lang" $lang
./走訓練.sh:32:  utils/prepare_lang.sh data/local/dict "<UNK>"  data/tmp/lang_train data/lang_train

root@fab417c4eccd:/usr/local/kaldi/egs/taiwanese/s5c# grep -rnw . -e  'free-syllable'
./產生free-syllable的graph.sh:17:mkdir -p ${data}/local/free-syllable/dict
./產生free-syllable的graph.sh:18:rm -rf ${data}/local/free-syllable/dict/*
./產生free-syllable的graph.sh:19:cp ${data}/local/dict/[^l]* ${data}/local/free-syllable/dict
./產生free-syllable的graph.sh:20:cp ${data}/local/free-syllable/lexicon.txt ${data}/local/free-syllable/dict
./產生free-syllable的graph.sh:23:utils/prepare_lang.sh ${data}/local/free-syllable/dict "" $lang_log $lang
./產生free-syllable的graph.sh:25:cat data/local/free-syllable/uniform.fst | \
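
The key step in the lines above is the glob on line 19: `[^l]*` matches only names whose first character is not "l", so everything in local/dict is copied except lexicon.txt and lexiconp.txt, and line 20 then drops in the syllable lexicon. A small sketch that reproduces this in temporary directories is given here. The function name copy_non_lexicon_files is hypothetical, and the file names are taken from the tree output in this issue.

```shell
#!/usr/bin/env bash
# Hypothetical demo (temporary directories only): what "[^l]*" selects in a dict directory.
copy_non_lexicon_files() {
  local src=$1 dst=$2
  mkdir -p "$dst"
  # "[^l]*" is the bash spelling used by the script; "[!l]*" is the POSIX spelling.
  # Names starting with "l" (lexicon.txt, lexiconp.txt) are skipped.
  cp "$src"/[^l]* "$dst"
}

src=$(mktemp -d)
dst=$(mktemp -d)
for f in extra_questions.txt lexiconp.txt lexicon.txt \
         nonsilence_phones.txt optional_silence.txt silence_phones.txt; do
  : > "$src/$f"   # create empty placeholder files
done
copy_non_lexicon_files "$src" "$dst"
ls "$dst"
```

As far as I know, utils/prepare_lang.sh writes a lexiconp.txt next to lexicon.txt when none exists, which would explain why lexiconp.txt still shows up under free-syllable/dict in the tree output even though the glob does not copy it.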

root@fab417c4eccd:/usr/local/kaldi/egs/taiwanese/s5c# grep -rnw . -e  'dict'
./local/swbd1_prepare_dict.sh:16:srcdict=$srcdir/swb_ms98_transcriptions/sw-ms98-dict.text
./local/swbd1_prepare_dict.sh:22:patch <local/dict.patch $dir/lexicon0.txt || exit 1;
./local/swbd1_data_prep.sh:45:[ ! -f $dir/swb_ms98_transcriptions/sw-ms98-dict.text ] && \
./local/format_acronyms_dict.py:2:# convert acronyms in swbd dict to fisher convention
./local/swbd1_train_lms.sh:27:help_message="Usage: $0 [options] <train-txt> <dict> <out-dir> [fisher-dirs]
./local/swbd1_train_lms.sh:43:lexicon=$2  # data/local/dict/lexicon.txt
./試format_sri.sh:15:utils/prepare_lang.sh tshi3/local/dict "<UNK>"  tshi3/local/lang tshi3/lang
./外語準備.sh:20:mkdir -p ${data}/local/gua7gi2/dict
./外語準備.sh:21:rm -rf ${data}/local/gua7gi2/dict/*
./外語準備.sh:22:cp ${data}/local/dict/[^l]* ${data}/local/gua7gi2/dict
./外語準備.sh:72:  cat > ${data}/local/gua7gi2/dict/lexicon.txt
./外語準備.sh:76:utils/prepare_lang.sh data/local/gua7gi2/dict/ "<UNK>"  data/local/gua7gi2/lang data/lang_dict_gua7gi2
./外語準備.sh:83:utils/format_lm.sh data/lang_dict_gua7gi2 $LM_GZ data/local/gua7gi2/dict/lexicon.txt data/lang_gua7gi2
./處理nnet3濫.sh:14:  mkdir -p $tmp_dir $data/dict
./處理nnet3濫.sh:16:  cp $tai5_data/dict/* $data/dict/
./處理nnet3濫.sh:17:  cat $tai5_data/dict/lexicon.txt $hua5_data/dict/lexicon.txt | \
./處理nnet3濫.sh:19:    cat > $data/dict/lexicon.txt
./處理nnet3濫.sh:20:  rm -f $data/dict/lexiconp.txt
./處理nnet3濫.sh:21:  utils/prepare_lang.sh $data/dict "<unk>"  $data/local/lang $data/lang_dict
./處理nnet3濫.sh:32:  utils/format_lm.sh $data/lang_dict $LM_GZ $data/dict/lexicon.txt $data/lang
./試format_lm.sh:24:utils/format_lm.sh data/lang_sp $LM_GZ data/local/dict/lexicon.txt $LANG_DIR
./試rescore_arpa.sh:15:utils/prepare_lang.sh tshi3/local/dict "<UNK>"  tshi3/local/lang tshi3/lang
./產生free-syllable的graph.sh:17:mkdir -p ${data}/local/free-syllable/dict
./產生free-syllable的graph.sh:18:rm -rf ${data}/local/free-syllable/dict/*
./產生free-syllable的graph.sh:19:cp ${data}/local/dict/[^l]* ${data}/local/free-syllable/dict
./產生free-syllable的graph.sh:20:cp ${data}/local/free-syllable/lexicon.txt ${data}/local/free-syllable/dict
./產生free-syllable的graph.sh:23:utils/prepare_lang.sh ${data}/local/free-syllable/dict "" $lang_log $lang
./處理nnet3華.sh:13:  rm -f $data/dict/lexiconp.txt
./處理nnet3華.sh:14:  utils/prepare_lang.sh $data/dict "<unk>"  $data/local/lang $data/lang_dict
./處理nnet3華.sh:24:  utils/format_lm.sh $data/lang_dict $LM_GZ $data/dict/lexicon.txt $data/lang
./試rescore.sh:15:utils/prepare_lang.sh tshi3/local/dict "<UNK>"  tshi3/local/lang tshi3/lang
./對齊音檔.sh:29:cp ${data}/[^l]* "${giap8}/local/dict"
./對齊音檔.sh:30:utils/prepare_lang.sh "${giap8}/local/dict" "<UNK>"  "${giap8}/local/lang" $lang
./處理nnet3台詞台音.sh:14:  rm -f $data/dict/lexiconp.txt
./處理nnet3台詞台音.sh:15:  bash $data/dict/處理lexicon.sh $data/dict/lexicon.txt.tiau3 $data/dict/lexicon.txt
./處理nnet3台詞台音.sh:16:  utils/prepare_lang.sh $data/dict "<unk>"  $data/local/lang $data/lang_dict
./處理nnet3台詞台音.sh:54:  utils/format_lm.sh $data/lang_dict $LM_GZ $data/dict/lexicon.txt $data/lang
./走訓練.sh:32:  utils/prepare_lang.sh data/local/dict "<UNK>"  data/tmp/lang_train data/lang_train

sih4sing5hong5 commented 6 years ago

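Isn't free-syllable the speaker-independent one? (That is my recollection.)

Have you compared free-syllable/dict/lexicon.txt with dict/lexicon.txt? lexicon.txt is the most important file.

Purpose of the fst format

Reading the documentation will probably be quicker: http://www.openfst.org/twiki/bin/view/FST/WebHome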