githubharald / CTCDecoder

Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.
https://towardsdatascience.com/3797e43a86c
MIT License

CTC Token Passing #2

Closed · wellescastro closed this 6 years ago

wellescastro commented 6 years ago

Hi!

I'm trying to use the token passing algorithm to decode the output of a model trained on IAM-DB. I'm using a language model built from the LOB corpus; however, there are situations in which a word passed to the wordToLabelSeq method contains a character that is not mapped to any class, e.g. '>'. What do you advise in these situations?

Thanks in advance, Dayvid.

githubharald commented 6 years ago

Hi,

I use token passing only on the word level because, according to my tests, it does not work well when "words" (entities) consisting only of punctuation marks such as "<" are added. I only take real words from the corpus, using a regular expression like "\w+". These words are fed into the token passing algorithm, which returns the most probable word sequence. Finally, I add whitespace between the words to get the text.
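As a minimal sketch of that preprocessing step (the corpus text here is just an inline example, not the actual LOB data):

```python
import re

# example corpus line; in practice this would come from the LOB corpus
text = 'He said: "move to the next line >"'

# \w+ keeps only runs of word characters, dropping punctuation entities like '>'
words = re.findall(r'\w+', text)
print(words)  # ['He', 'said', 'move', 'to', 'the', 'next', 'line']
```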

Coming back to your question: if your neural network cannot output a character, then it also cannot predict any word containing that character. Therefore, I would remove such words from the word list.
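A sketch of that filtering, assuming `charset` stands in for the set of characters your network can actually output (names and data are illustrative):

```python
# hypothetical character set of the network (stand-in for the real list of classes)
charset = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')

words = ['move', 'to', '>new', 'line']  # example words extracted from the corpus
usable = [w for w in words if all(c in charset for c in w)]
print(usable)  # ['move', 'to', 'line'] - '>new' contains an unmapped character
```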

How about the running time of your implementation? Even with a C++ implementation I was only able to use small dictionaries because the algorithm is very slow (its running time grows quadratically with dictionary size). Beam search is more flexible in my opinion: it is fast (depending on the beam width), and it can integrate a character-level language model. And of course each beam could also be checked for dictionary words, optionally giving a zero score to a beam containing unknown words.
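As a rough illustration of that last idea (not the repository's actual API, just a sketch): after beam search produces candidate texts with scores, one could zero out any beam containing an out-of-dictionary word:

```python
def score_beam(text, score, dictionary):
    """Give a beam zero score if it contains a word not in the dictionary.

    text: decoded beam text, score: its probability,
    dictionary: set of allowed words. Purely illustrative.
    """
    words = text.split(' ')
    if all(w in dictionary for w in words):
        return score
    return 0.0

# example: the second beam contains an unknown word and is discarded
dictionary = {'the', 'quick', 'brown', 'fox'}
beams = [('the quick fox', 0.4), ('the qvick fox', 0.35)]
best = max(beams, key=lambda b: score_beam(b[0], b[1], dictionary))
print(best)  # ('the quick fox', 0.4)
```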