srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

Problems using lexiconp.txt (lexicon with pronunciation probabilities) #103

Open bmilde opened 7 years ago

bmilde commented 7 years ago

In my setup, I've modified wsj/utils/ctc_compile_dict_token.sh to accept a lexiconp.txt file directly. In the WSJ example, this script takes a lexicon.txt file, converts it internally to the lexiconp.txt format, and then adds the disambiguation symbols.

However, in wsj/utils/ctc_compile_dict_token.sh, line 51: ndisambig=`utils/add_lex_disambig.pl $tmpdir/lexiconp.txt $tmpdir/lexiconp_disambig.txt`

The flag "--pron-probs" needs to be added to this call, otherwise add_lex_disambig.pl treats the pronunciation probability as a phone. This is not a problem if all entries start with 1.0, but determinization can fail later on (with an obscure "FST is not functional" error message) if the file contains alternative pronunciation probabilities.
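To illustrate what I think goes wrong, here is a minimal sketch. It does not call add_lex_disambig.pl itself; it only mimics the two ways of reading a lexiconp.txt line, and the three lexicon entries are made up for the example. My assumption is that disambiguation symbols are only added when two entries share the same phone sequence, so a homophone pair with different probabilities goes undetected when the probability is read as the first phone.

```shell
# Hypothetical lexiconp.txt: word, pronunciation probability, phones.
# "red" and the first "read" entry are homophones with different probabilities.
lexiconp='red 1.0 r eh d
read 0.4 r eh d
read 0.6 r iy d'

# Without --pron-probs: everything after the word counts as the phone sequence,
# so "1.0 r eh d" and "0.4 r eh d" look like different pronunciations.
dups_without_flag=$(printf '%s\n' "$lexiconp" | cut -d' ' -f2- | sort | uniq -d | wc -l | tr -d ' ')

# With --pron-probs: the probability field is skipped,
# so the shared sequence "r eh d" is found once as a duplicate.
dups_with_flag=$(printf '%s\n' "$lexiconp" | cut -d' ' -f3- | sort | uniq -d | wc -l | tr -d ' ')

echo "duplicate phone sequences, probability read as phone: $dups_without_flag"
echo "duplicate phone sequences, probability skipped:       $dups_with_flag"
```

If every entry has probability 1.0, both readings find the same duplicates, which would explain why the unmodified WSJ recipe is not affected.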

Other than that, I noticed that utils/ctc_compile_dict_token.sh mixes tabs and spaces as field separators, which could lead to problems further down the line for scripts that don't split on \t when they read in the lexicon.

E.g. the lexicon conversion in line 47: perl -ape 's/(\S+\s+)(.+)/${1}1.0\t$2/;' < $srcdir/lexicon.txt > $tmpdir/lexiconp.txt || exit 1;

introduces a tab character after the 1.0 pronunciation probability.

fmetze commented 7 years ago

Thanks, will see - I think Kaldi has also modified those scripts, and it may be useful to pull in their changes as well. Will have to check ...