Problems using lexiconp.txt (lexicon with pronunciation probabilities)

In my setup, I've modified wsj/utils/ctc_compile_dict_token.sh to accept a lexiconp.txt file directly. In the WSJ example, this script takes a lexicon.txt file, converts it internally to the lexiconp.txt format, and then adds the disambiguation symbols.

However, line 51 of wsj/utils/ctc_compile_dict_token.sh calls add_lex_disambig.pl without the "--pron-probs" flag:

ndisambig=`utils/add_lex_disambig.pl $tmpdir/lexiconp.txt $tmpdir/lexiconp_disambig.txt`

The flag needs to be added, otherwise add_lex_disambig.pl treats the pronunciation probability as a phone. This is not a problem if every entry has a probability of 1.0, but determinization can fail (with an obscure "FST is not functional" error message later on) if the file contains alternative pronunciations with different probabilities.
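A minimal sketch of one plausible way this goes wrong, using a made-up three-entry lexicon and a hypothetical helper, count_homophones, that is not part of Eesen or Kaldi. It assumes the trouble comes from homophones: when the probability is read as the first phone, two entries with the same phones but different probabilities no longer look identical, so they would not be recognised as needing a disambiguation symbol.

```shell
#!/bin/sh
# Toy lexiconp.txt (made up for illustration): "red" and "read" share the
# pronunciation "r eh d" but carry different probabilities.
lexiconp=$(mktemp)
printf 'red 1.0 r eh d\nread 0.3 r eh d\nread 0.7 r iy d\n' > "$lexiconp"

# count_homophones FIRST_PHONE_FIELD FILE  (hypothetical helper)
# Prints how many phone sequences are shared by more than one entry,
# taking the phone sequence to start at the given field number.
count_homophones() {
  awk -v first="$1" '{
    phones = ""
    for (i = first; i <= NF; i++) phones = phones " " $i
    seen[phones]++
  }
  END {
    n = 0
    for (p in seen) if (seen[p] > 1) n++
    print n
  }' "$2"
}

# Field 2 as first phone: the probability is mistaken for a phone.
without_flag=$(count_homophones 2 "$lexiconp")
# Field 3 as first phone: the probability column is skipped.
with_flag=$(count_homophones 3 "$lexiconp")

echo "homophone groups, probability read as a phone: $without_flag"
echo "homophone groups, probability skipped:         $with_flag"
rm -f "$lexiconp"
```

When every probability is 1.0, the extra "phone" is the same on every line, so homophones still compare equal, which matches the observation that such files cause no trouble.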

Apart from that, I noticed inconsistent use of tabs and spaces in utils/ctc_compile_dict_token.sh, which could cause problems further down the line for scripts that don't split on \t when they read in the lexicon.

E.g. the lexicon conversion in line 47: perl -ape 's/(\S+\s+)(.+)/${1}1.0\t$2/;' < $srcdir/lexicon.txt > $tmpdir/lexiconp.txt || exit 1;

introduces a tab character after the 1.0 pronunciation probability.
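As a rough sketch of a conversion that keeps the separators consistent, the following uses awk rather than the script's perl one-liner and joins every field with a single space. The toy lexicon entries are made up, and the choice of spaces over tabs is an assumption about what downstream scripts expect, not something the Eesen scripts prescribe.

```shell
#!/bin/sh
# Toy lexicon.txt (made up): one entry separated by a space, one by a tab.
lexicon=$(mktemp)
printf 'hello hh ah l ow\nworld\tw er l d\n' > "$lexicon"

# Insert a probability of 1.0 after the word and rebuild each line with
# single spaces, so the output does not depend on the input separators.
lexiconp=$(awk '{
  out = $1 " 1.0"
  for (i = 2; i <= NF; i++) out = out " " $i
  print out
}' "$lexicon")

printf '%s\n' "$lexiconp"
rm -f "$lexicon"
```

Because awk splits on any run of blanks by default, the same command works whether lexicon.txt uses tabs, spaces, or a mix of both.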

srvk / eesen, issue #103