srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

decoding log-posteriors with eesen #158

Closed sanakhamekhem closed 6 years ago

sanakhamekhem commented 6 years ago

Hi, I would like to use Eesen to decode a matrix of log probabilities. I have used the following command:

```
cat $srcdir/$out1/nnet.ark | \
  latgen-faster --max-active=$max_active --max-mem=$max_mem --beam=$beam --lattice-beam=$lattice_beam \
    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \
    $graphdir/TLG.fst ark:- "ark:|gzip -c > $dir/lat.JOB.gz" || exit 1;
```

The char list used to generate the log probabilities with my framework (which includes a CTC layer) and the units.txt generated with Eesen are attached: units.txt chars.txt

The decoding result is strange when using Eesen, but it is good when I decode the logs directly with the char list, without a lexicon and without an FST. Should the char list and units.txt be the same? Is that the problem? The WER is very high with FST decoding and a lexicon.

riebling commented 6 years ago

I think you're right. You compose the FST from a token list and grammar and lexicon; these should use the same units (whether they are characters, phonemes, or tokens).

When decoding, each frame of audio features is fed into an acoustic model, whose outputs, taken together, form the matrix of log probabilities. The Nth log probability in each row corresponds to the Nth entry in units.txt (or chars.txt). So looking at a single row of the matrix, there will at times be a 'spike': one log probability much higher than all the others, corresponding to that particular phone (or char) occurring at that frame.
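A minimal sketch of that row-index correspondence (the unit names and numbers here are made up for illustration): pick the highest-scoring entry per frame and look it up in the units list.

```python
# Illustrate how each row of the log-probability matrix maps onto units.txt:
# the argmax column index of a row selects the unit at that line of the file.

# Hypothetical units, in file order (index 0 = first row entry).
units = ["<blk>", "a", "b", "c"]

# One row of toy log probabilities per frame; each row has one clear "spike".
log_probs = [
    [-0.1, -3.0, -4.0, -5.0],  # spike at <blk>
    [-4.0, -0.2, -3.5, -5.0],  # spike at "a"
    [-4.0, -0.2, -3.5, -5.0],  # "a" again
    [-0.1, -3.0, -4.0, -5.0],  # back to <blk>
]

def frame_units(rows, symbols):
    """Return the best-scoring symbol for each frame."""
    return [symbols[max(range(len(r)), key=r.__getitem__)] for r in rows]

print(frame_units(log_probs, units))  # ['<blk>', 'a', 'a', '<blk>']
```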

Your observations make perfect sense: if the units.txt used to create TLG.fst is different from the chars.txt corresponding to the matrix of log probabilities, the resulting lattice will look very strange. If they were the same, the results should conform to the lexicon. A solution would be to create TLG.fst from a lexicon and tokens based on (and corresponding to) chars.txt, and use THAT TLG.fst.

But units.txt and chars.txt are ALMOST exactly the same! They seem only to differ by one line. Could it be that the strange results are because most of the eesen-decoded chars are "off by one" ?
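To see how much damage a single extra line can do, here is a toy illustration (both symbol lists are invented): the same argmax indices, read through two tables that differ by one leading entry, decode to entirely different symbols.

```python
# Toy "off by one" scenario: identical model outputs interpreted through two
# symbol tables that differ by one inserted line near the top.
chars = ["<blk>", "a", "b", "c"]           # table the model was trained with
units = ["<blk>", "<unk>", "a", "b", "c"]  # same table with one extra entry

indices = [1, 2, 2, 3]  # per-frame argmax indices produced by the model

decoded_with_chars = [chars[i] for i in indices]
decoded_with_units = [units[i] for i in indices]

print(decoded_with_chars)  # ['a', 'b', 'b', 'c']
print(decoded_with_units)  # ['<unk>', 'a', 'a', 'b'] -- every symbol shifted
```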

sanakhamekhem commented 6 years ago

Thank you for your response, Mr. riebling. I will retrain on my data using the same units and chars, like these: chars.txt units.txt. Then I will use the same decoding stage as in Eesen. Will that be correct? Thanks again.

riebling commented 6 years ago

As long as the acoustic model training and decoding graph generation use the same units, things should work. I was looking for a code example to show where the decoding graph gets created, but generally TLG.fst is built by composing the token FST T.fst, the lexicon L.fst, and the grammar (language model) G.fst, and each of these is created from a common token set and dictionary: at the lowest level, a common units.txt. So you could build a TLG.fst decoding graph from language model sources that have your chars.txt in common; this decoding graph would then be compatible with, and make sense of, the log probability matrix in the command above.
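One quick sanity check before building the graph is to diff the two symbol tables. A small sketch (the file names come from this thread; the parsing assumes the usual one-`SYMBOL INDEX`-pair-per-line format of these files):

```python
# Compare two symbol tables (e.g. units.txt vs. chars.txt) index by index,
# assuming the common "SYMBOL INDEX" one-pair-per-line text format.
def read_symbols(path):
    """Parse a symbol-table file into {index: symbol}."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                symbol, index = parts
                table[int(index)] = symbol
    return table

def diff_tables(a, b):
    """Return (index, symbol_in_a, symbol_in_b) wherever the tables disagree."""
    mismatches = []
    for idx in sorted(set(a) | set(b)):
        if a.get(idx) != b.get(idx):
            mismatches.append((idx, a.get(idx), b.get(idx)))
    return mismatches

# Tiny in-memory demo of the comparison.
print(diff_tables({0: "<blk>", 1: "a"}, {0: "<blk>", 1: "b"}))  # [(1, 'a', 'b')]
```

An empty result from `diff_tables(read_symbols("units.txt"), read_symbols("chars.txt"))` would confirm the two files agree line for line.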

sanakhamekhem commented 6 years ago

I managed to generate log probabilities compatible with the code used in Eesen. The matrix of logs looks like:

```
utt1 [ logprob1 logprob2 logprob3 ...etc ]
```

where logprob1 must be the first entry, for blank, logprob2 for UNK, and logprob3 for SPACE. In my framework used to generate this matrix, I have adapted the matrix (sp --> Space).

tokens.txt:

```
<eps> 0
<blk> 1
<UNK> 2
<SPACE> 3
a0a 4
aaA 5
aaE 6
aeA 7
...
```

units.txt:

```
<UNK> 1
<SPACE> 2
a0a 3
aaA 4
aaE 5
aeA 6
...
```

The decoding seems to be good for now. Thanks for your help.
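With the tables aligned, the token-level logic behind such a matrix is the standard greedy CTC decode: argmax per frame, collapse repeats, drop blanks. A minimal sketch (the unit names and numbers are placeholders; Eesen itself decodes through TLG.fst rather than greedily):

```python
# Greedy (best-path) CTC decode: per-frame argmax, merge repeated symbols,
# then remove the blank symbol.
def ctc_greedy_decode(log_probs, units, blank="<blk>"):
    best = [units[max(range(len(row)), key=row.__getitem__)] for row in log_probs]
    out, prev = [], None
    for sym in best:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out

units = ["<blk>", "<UNK>", "<SPACE>", "a0a", "aaA"]  # toy output order
log_probs = [
    [-0.1, -5, -5, -4, -4],  # <blk>
    [-4, -5, -5, -0.2, -4],  # a0a
    [-4, -5, -5, -0.2, -4],  # a0a repeated
    [-0.1, -5, -5, -4, -4],  # <blk>
    [-4, -5, -0.3, -4, -4],  # <SPACE>
]
print(ctc_greedy_decode(log_probs, units))  # ['a0a', '<SPACE>']
```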

riebling commented 6 years ago

That's so cool! Guess I can close this :)