srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

Network Output #129

Closed 0x454447415244 closed 7 years ago

0x454447415244 commented 7 years ago

I'm trying to integrate this system with another one built with TensorFlow. I don't have the audio data, so I would be grateful if someone could provide a sample file decoded with net-output-extract (as in decode_ctc.sh). I need to see the format of the file fed to the "decode-faster" tool for WFST decoding.

Thanks

fmetze commented 7 years ago

You can convert the output of net-output-extract to a text representation by piping it through

copy-feats ark:- ark,t:-

to see what it looks like. In your case, format the output of the TF network the same way and pipe it through

copy-feats ark,t:- ark:-

and you should be able to use it in decoding. We've done that and it works, we'll try to release that code some time soon.
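Not from the Eesen codebase, just a sketch: a minimal Python parser for the text archive that `copy-feats ark:- ark,t:-` emits, assuming the layout described in this thread (`utt-id [ rows of space-separated floats ]`, one matrix per utterance).

```python
def read_text_ark(lines):
    """Parse a Kaldi-style text archive (as produced by
    `copy-feats ark:- ark,t:-`) into a dict mapping
    utterance-id -> list of frames (each frame a list of floats).

    Assumed layout: `utt-id [` on the first line of an entry, then one
    row of floats per frame, with `]` closing the last row."""
    feats, utt, rows = {}, None, []
    for line in lines:
        toks = line.split()
        if not toks:
            continue
        if utt is None:                       # start of a new utterance entry
            utt, toks = toks[0], toks[1:]
            assert toks and toks[0] == "[", "expected '[' after utterance id"
            toks = toks[1:]
        closing = bool(toks) and toks[-1] == "]"
        if closing:
            toks = toks[:-1]
        if toks:                              # one frame of floats
            rows.append([float(t) for t in toks])
        if closing:                           # end of this utterance's matrix
            feats[utt] = rows
            utt, rows = None, []
    return feats
```

The parser is whitespace-tolerant, so it also accepts the single-line form quoted later in this thread (`utt1 [ 0.0 1.0 2.0 ]`).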

0x454447415244 commented 7 years ago

Thank you for your reply.

I don't have the output of net-output-extract. I'm asking if someone could provide a sample output.

I actually have a network generating probabilities at each frame, so I need to feed these into the "decode-faster" tool for decoding. I need my code to output the same format as "net-output-extract".

fmetze commented 7 years ago

It is something like

utterance-id [ 0.0 1.0 2.0 ] utterance-jd [ 3.0 4.0 4.0 5.0 6.0 7.0 ]

for two utterances with 1 and 2 frames and a dimensionality of 3, if I remember correctly. You can check with a feature archive; it is the same format.

0x454447415244 commented 7 years ago

I managed to reverse-engineer it just now, and it worked. Thanks for your answer anyway. As you said, it is like:

utt1 [ logprob1 logprob2 logprob3 ...etc ]

logprob1 must be the one for the blank, logprob2 for SPACE, logprob3 for UNK, etc. I'm actually using this for handwriting recognition. I was getting the output partially wrong with an acoustic scale of 0.5, but then I tried 2.0 and it worked perfectly. I'm wondering what the highest/best value is that I should use in my case (based on the acoustic scale definition).
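A minimal sketch of the reverse direction, writing per-frame log-probabilities in this text-archive layout (the format is assumed from this thread; the utterance ids and values are hypothetical):

```python
import io

def write_text_ark(utts, out):
    """Write utterance matrices in the text-archive layout quoted above
    (`utt-id [ row ... row ]`), one row of space-separated floats per
    frame. `utts` maps utterance-id -> list of frames, each frame a list
    of per-class log-probabilities (blank first, per this thread)."""
    for utt, frames in utts.items():
        out.write(f"{utt}  [\n")
        for i, frame in enumerate(frames):
            row = " ".join(f"{v:.6f}" for v in frame)
            out.write(f"  {row}{' ]' if i == len(frames) - 1 else ''}\n")

# Hypothetical example: one utterance, two frames, three classes.
buf = io.StringIO()
write_text_ark({"utt1": [[-0.1, -2.3, -4.5], [-0.2, -1.0, -3.0]]}, buf)
```

The result can then be piped through `copy-feats ark,t:- ark:-` as described above; note that the acoustic scale is passed to the decoder (e.g. `--acoustic-scale=2.0` on decode-faster) rather than baked into the archive.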

fmetze commented 7 years ago

Good. If you want to check back in a few days, we might have our TF code up. Do you have a recipe that you might be able to share? The best weights really depend on the task, I guess.

0x454447415244 commented 7 years ago

Looking forward to it. I'm going to upload my Convolutional RNN system soon.

0x454447415244 commented 7 years ago

May I ask for help with a character-level language model for handwriting recognition? I successfully decoded with the TLG.fst network created from a word lexicon and a word language model. How can I use a character language model, like TG only (without L), where G is a character language model? I did a composition of T and G, but it doesn't seem to work as expected.

There was also a strange thing happening: decoding with this tool gives better performance on one dataset, but not on another (vs. a decoding algorithm based on token passing), a difference of ~6% or so, with the same LM, dictionary and configuration.

fmetze commented 7 years ago

It's probably easiest if you simply rewrite your words as phones and then use a 1:1 mapping for your lexicon. You will then want to increase the history of your n-gram LM. You should be able to leave out L, but then you'd have to rewrite some of your scripts - although I am not sure what you are using right now.
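The 1:1 lexicon mapping suggested above can be generated mechanically; a sketch, where the character inventory is illustrative rather than taken from this thread:

```python
def make_char_lexicon(chars):
    """Build 1:1 lexicon entries: each character token 'pronounces' as
    itself, the character-level analogue of a word lexicon's
    word -> phone-sequence entries."""
    return [f"{c} {c}" for c in chars]

# Illustrative inventory (an assumption, not from the thread): a space
# token plus lowercase letters.
entries = make_char_lexicon(["<SPACE>"] + list("abcdefghijklmnopqrstuvwxyz"))
```

Each resulting line (e.g. `a a`) goes into lexicon.txt, after which the character n-gram G takes over the role the word LM played.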

Comparing different decoding architectures can be difficult - pruning is most likely what makes a difference. If you get the same result (for a specific utterance), do you also get the same score for the FST vs token passing algorithms? If so, it most likely is pruning. If not, there is some small difference in how the model is set up during decoding.

This should also answer #132, right?

0x454447415244 commented 7 years ago

Exactly. I tried before to treat the lexicon simply as the characters with a 1:1 mapping, but I'm not sure why it didn't work. For history, I'm using 10-grams. What do you mean by rewriting some of the scripts?

Thanks for your reply. I will remove #132; I thought it was not visible anymore.

Edit: I got it to work now. There was a problem in the order of classes: I was including the space and the characters in the lexicon. I will investigate the difference in performance further. Thanks for your reply!

0x454447415244 commented 7 years ago

Hello! Any idea how to implement a hybrid word/character language model? An arc would be connected to the start of the character language model whenever an out-of-vocabulary word (<UNK>) is encountered.

This is described in the papers below:

http://www-i6.informatik.rwth-aachen.de/publications/download/911/KozielskiMichalRybachDavidHahnStefanSchl%7Bu%7DterRalfNeyHermann--OpenVocabularyHwritingRecognitionUsingCombinedWord-LevelCharacter-LevelLanguageModels--2013.pdf

https://www.researchgate.net/profile/Christopher_Kermorvant/publication/264860192_Surgenerative_Finite_State_Transducer_n-gram_for_Out-Of-Vocabulary_Word_Recognition/links/550b4c9f0cf28556409706ad.pdf

Thanks

riebling commented 7 years ago

This is coming from far afield, but maybe you could start with an open-source language-modeling code base like KenLM and extend it to include hybrid character-level modeling.

0x454447415244 commented 7 years ago

I think this should be done by gluing separate LMs together at the WFST level (one for words and another for characters).