Closed: sanakhamekhem closed this issue 6 years ago
I think you're right. You compose the FST from a token list and grammar and lexicon; these should use the same units (whether they are characters, phonemes, or tokens).
When decoding, each frame of audio features is fed into the acoustic model, whose outputs, taken together, form the matrix of log probabilities. The Nth log probability in each row corresponds to the Nth entry in units.txt (or chars.txt). Looking at a single row of the matrix, there will at times be a 'spike': one log probability much higher than all the others, indicating that the corresponding phone (or char) occurs at that frame.
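To make the row/spike picture concrete, here is a minimal, hypothetical sketch of greedy CTC decoding: take the argmax unit per frame, then collapse repeats and blanks. It assumes column n of the matrix lines up with line n of units.txt and that index 0 is the CTC blank (as in EESEN); the toy units and probabilities are made up.

```python
import numpy as np

def greedy_ctc_decode(logprobs, units):
    """Pick the argmax unit per frame, then collapse repeats and drop blanks."""
    best = np.argmax(logprobs, axis=1)      # one unit index per frame
    out, prev = [], -1
    for idx in best:
        if idx != prev and idx != 0:        # assumption: index 0 is <blk>
            out.append(units[idx])
        prev = idx
    return out

# toy example: 3 units, 5 frames; each row has one clear 'spike'
units = ["<blk>", "a", "b"]
logprobs = np.log(np.array([
    [0.9, 0.05, 0.05],   # blank spike
    [0.1, 0.8,  0.1 ],   # 'a' spike
    [0.1, 0.8,  0.1 ],   # repeated 'a' collapses
    [0.8, 0.1,  0.1 ],   # blank
    [0.1, 0.1,  0.8 ],   # 'b' spike
]))
print(greedy_ctc_decode(logprobs, units))   # ['a', 'b']
```

This is exactly why the unit lists must match: if units.txt and chars.txt disagree, the same column index names a different symbol on each side.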
Your observations make perfect sense: if the units.txt used to create TLG.fst differs from the chars.txt that indexes the matrix of log probabilities, the resulting lattice will look very strange. If they were the same, the results should conform to the lexicon. A solution would be to create TLG.fst from a lexicon and token set based on (and corresponding to) chars.txt, and use that TLG.fst.
But units.txt and chars.txt are ALMOST exactly the same! They seem to differ by only one line. Could the strange results be because most of the eesen-decoded chars are "off by one"?
As long as the acoustic model training and the decoding graph generation use the same units, things should work. I was looking for a code example that shows where the decoding graph gets created, but in general TLG.fst is built by composing the token FST T.fst, the lexicon L.fst, and the grammar G.fst, each of which is created from a common token set and dictionary, and at the lowest level from a common units.txt. So you could build a TLG.fst decoding graph from language model sources that share your chars.txt; that decoding graph would then be compatible with, and make sense of, your log probability matrix in the code example above.
I managed to get compatible logs with the code used in eesen. The matrix of logs looks like:

```
utt1 [ logprob1 logprob2 logprob3 ...etc ]
```

logprob1 (the first) must be for blank, logprob2 for UNK, and logprob3 for SPACE. In the framework used to generate this matrix, I adapted the symbols to match `tokens.txt` (sp --> SPACE).
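A hypothetical sketch of emitting log probabilities in the text layout described above, one row of per-unit log probs per frame, with column 0 = blank, column 1 = UNK, column 2 = SPACE. The utterance id, file name, and values are illustrative, and the exact whitespace conventions of the real archive format may differ.

```python
import numpy as np

def write_text_ark(path, utt_id, logprobs):
    """Write one utterance's log-prob matrix as 'utt_id [ rows ]' text."""
    with open(path, "w") as f:
        f.write(f"{utt_id} [\n")
        for row in logprobs:
            f.write("  " + " ".join(f"{v:.4f}" for v in row) + "\n")
        f.write("]\n")

# two frames over three units (blank, UNK, SPACE in this sketch)
logprobs = np.log(np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.1, 0.8]]))
write_text_ark("nnet.txt", "utt1", logprobs)
print(open("nnet.txt").read())
```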
That's so cool! Guess I can close this :)
Hi, I would like to use eesen to decode a matrix of log probabilities. I have used the following code:
```shell
cat $srcdir/$out1/nnet.ark | \
  latgen-faster --max-active=$max_active --max-mem=$max_mem --beam=$beam \
    --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=true \
    --word-symbol-table=$graphdir/words.txt \
    $graphdir/TLG.fst ark:- "ark:|gzip -c > $dir/lat.JOB.gz" || exit 1;
```
The char list used to generate the logs with my framework (which includes a CTC), and the units.txt generated with eesen, are attached: units.txt chars.txt. The decoding result is strange when using eesen, but it is good when I decode the logs directly with the char list, without a lexicon and without the FST. Should the char list and units.txt be the same? Is that the problem? The WER is very high when using FST decoding and a lexicon.