srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

Making use of EESEN only for decoding. #148

Closed lukas-lee closed 6 years ago

lukas-lee commented 6 years ago

Could I use EESEN only for decoding?

I have a matrix of log-probabilities that gives, for each frame of each test utterance, the probability of every character (a, b, c, d, e, ...). All the test files are from the WSJ corpus. This matrix does not come from EESEN but from my own acoustic model.

Using this matrix and the decoder provided by EESEN, I'd like to perform WFST decoding, but I am confused about how to use the source files. At the shell level, I would use "steps/decode_ctc_lat.sh" for decoding, which means I have to use "net-output-extract.cc" and "latgen-faster.cc". However...

  1. "net-output-extract.cc" needs a "final.nnet" file, but since I did not train with EESEN, I don't have one. Is there any way to perform decoding without "final.nnet", or anything I can substitute for it?

  2. "latgen-faster.cc" includes "decoder/decodable-matrix.h", so I think I can use my own matrix there. But I am confused about what form the "decodable matrix" takes and how I can insert my own matrix into it. Could you give me a rough explanation of these?

Thank you in advance!

riebling commented 6 years ago

You've almost got it right, except that step 1 is too far back in the processing chain: net-output-extract is the acoustic model decoding stage, using final.nnet, so its output is akin to the matrix of log probabilities your acoustic model produces. You're right that the next step you would want to perform with Eesen is latgen-faster, which takes the log probabilities of characters as one of its inputs.

As an example workflow that saves such a file to disk (as opposed to secretly piping it from one Eesen (Kaldi) command into another, with no visibility or record of it ever having existed ;) ), have a look at speech2phones.sh, which saves build/trans/${basename}/eesen/decode/phones.1.txt as an intermediate result. THIS is the file format expected by latgen-faster. See the attached phones.1.txt for an example. (This setup uses phones, not characters; these are the two different ways Eesen can be run.) Each row in this file corresponds to one frame of audio, the number of elements in each row is the number of phones, and the individual elements are the log probabilities.

Hope this helps!

riebling commented 6 years ago

I left out some description: the labels on each line in this file are utterance IDs
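
To make the layout described above concrete, here is a minimal Python sketch that writes and reads a matrix in the Kaldi-style text archive layout (an utterance ID, an opening bracket, one row per frame, a closing bracket). It assumes phones.1.txt follows that standard Kaldi text layout; the function names `write_text_matrix` and `read_text_matrices` are made up for illustration and are not part of Eesen or Kaldi.

```python
import io


def write_text_matrix(stream, utt_id, rows):
    """Write one utterance's frames-by-units matrix in Kaldi-style text layout.

    Assumed layout (standard Kaldi text archive, not verified against Eesen):
        utt_id  [
          v v v
          v v v ]
    """
    if not rows:
        raise ValueError("matrix must have at least one frame")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("every frame must have the same number of units")
    stream.write(f"{utt_id}  [\n")
    for index, row in enumerate(rows):
        line = "  " + " ".join(f"{value:.6f}" for value in row)
        # The closing bracket sits at the end of the last frame's row.
        if index == len(rows) - 1:
            line += " ]"
        stream.write(line + "\n")


def read_text_matrices(stream):
    """Parse the same layout back into {utt_id: list of rows of floats}."""
    matrices = {}
    utt_id, rows = None, []
    for raw in stream:
        line = raw.strip()
        if not line:
            continue
        if line.endswith("["):
            # Header line: utterance ID followed by the opening bracket.
            utt_id, rows = line[:-1].strip(), []
            continue
        closes = line.endswith("]")
        if closes:
            line = line[:-1]
        values = [float(token) for token in line.split()]
        if values:
            rows.append(values)
        if closes:
            matrices[utt_id] = rows
    return matrices


if __name__ == "__main__":
    # Two frames, three units; the numbers are placeholder log probabilities.
    buffer = io.StringIO()
    write_text_matrix(buffer, "utt1", [[-0.5, -1.5, -2.5], [-0.1, -3.0, -4.0]])
    print(buffer.getvalue(), end="")
```

A file written this way would normally be handed to a Kaldi-style tool through a text read specifier such as `ark,t:phones.1.txt`; whether Eesen's latgen-faster accepts it unchanged is worth checking against the attached example.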

lukas-lee commented 6 years ago

Thank you for your kind guidance! Thanks to you, I now understand the form of the matrix.

Can I be more specific? How can I insert my own matrix into "latgen-faster.cc" to get the desired output?

My guess is that the following part reads the log-likelihoods from the matrix:

===========================================================
if (DecodeUtteranceLatticeFaster(
        decoder, decodable, word_syms, utt, acoustic_scale,
        determinize, allow_partial, &alignment_writer, &words_writer,
        &compact_lattice_writer, &lattice_writer, &like)) {
  tot_like += like;
  frame_count += loglikes.NumRows();
  num_success++;
} else num_fail++;
===========================================================

Then in which part of "latgen-faster.cc" is the log-likelihood matrix that you suggested above loaded? I mean, where can I insert the values of my own matrix? Sorry for asking for so much detail. I will be really grateful for even very small hints!

riebling commented 6 years ago

I think your matrix of acoustic log likelihoods can only be directly compatible with a WFST decoding stage whose models are built up from the same units. The provided models used by Eesen transcriber use the CMUDict phone set: some 39 phones plus a few noises, and the blank symbol. If your system is character based, then the FST decoding graphs will need to have been created using an identical set of characters.

Therefore the (WFST decoding) graphs provided with Eesen transcriber would not be directly compatible, and could not accept as input the matrix of acoustic log likelihoods you produce.
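
The compatibility condition above can be checked mechanically before attempting to decode. The following is a rough Python sketch, not anything shipped with Eesen: `find_unit_mismatches` is a hypothetical helper, the unit lists are made-up examples, and the `<blk>` label and its position are assumptions about how the blank symbol is named and ordered.

```python
def find_unit_mismatches(model_units, graph_units, num_columns):
    """Return a list of problems; an empty list means the units line up.

    model_units: labels for the columns of the log-likelihood matrix, in order.
    graph_units: labels the decoding graph was built from, in order.
    num_columns: number of values per frame in the matrix.
    """
    problems = []
    if num_columns != len(model_units):
        problems.append(
            f"matrix has {num_columns} columns but the model lists "
            f"{len(model_units)} units"
        )
    missing = [unit for unit in graph_units if unit not in model_units]
    extra = [unit for unit in model_units if unit not in graph_units]
    if missing:
        problems.append(f"units expected by the graph but absent: {missing}")
    if extra:
        problems.append(f"units produced by the model but unknown: {extra}")
    # Identical sets in a different order would still map columns wrongly.
    if not missing and not extra and list(model_units) != list(graph_units):
        problems.append("same units, but in a different column order")
    return problems


if __name__ == "__main__":
    characters = ["<blk>", "a", "b", "c"]   # hypothetical character model
    phones = ["<blk>", "AA", "AE"]          # hypothetical phone-based graph
    for problem in find_unit_mismatches(characters, phones, 4):
        print(problem)
```

A character-based matrix checked against a phone-based graph reports both missing and unknown units, which is the incompatibility described above.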

BUT: to create compatible WFST decoding graphs, you could adapt the experiment https://github.com/srvk/eesen/blob/master/asr_egs/tedlium/v1/run_ctc_char.sh such that it incorporates and works with your current acoustic model decoder. (Not knowing more about your system, there is also the matter of training data.)

So I misspoke when suggesting that the log-likelihood matrix produced by your acoustic model decoder would be plug-compatible with the WFST decoding graphs provided with Eesen transcriber. It's more accurate to say that, in the same way that Eesen transcriber uses Eesen's asr_egs/tedlium/v2-30ms experiment to create models and decoding graphs from tedlium training data, you could adapt or come up with a similar experiment to create your own models and decoding graphs from your own training data. (I don't know enough about your system to be more specific.) We do know you are using characters, not phones, so tedlium/v1/run_ctc_char.sh is probably the closest fit.