githubharald / CTCDecoder

Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.
https://towardsdatascience.com/3797e43a86c
MIT License

CTC Token Passing #2

Closed · wellescastro closed this 6 years ago

wellescastro commented 6 years ago

Hi!

I'm trying to use the token passing algorithm to decode the output of a model trained on IAM-DB. I'm using a language model built from the LOB corpus; however, there are situations in which a word passed to the wordToLabelSeq method contains a character that is not mapped to any class, e.g. '>'. What do you advise in these situations?

Thanks in advance, Dayvid.

githubharald commented 6 years ago

Hi,

I use token passing only on the word level because, according to my tests, it does not work well when "words" (entities) consisting only of punctuation marks such as "<" are added. I only take real words from the corpus, using a regular expression like "\w+". These words are fed into the token passing algorithm, which returns the most probable word sequence. Finally, I add whitespace between the words to get the text.
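As a minimal sketch of that preprocessing step (the corpus text here is just an inline example, not the actual LOB data):

```python
import re

# example corpus line; in practice this would come from the LOB corpus
text = 'He said: "move to the next line >"'

# \w+ keeps only runs of word characters, dropping punctuation entities like '>'
words = re.findall(r'\w+', text)
print(words)  # ['He', 'said', 'move', 'to', 'the', 'next', 'line']
```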

Coming back to your question: if your neural network cannot output a character, then it also cannot predict any word containing that character. Therefore, I would remove such words from the word list.
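A sketch of that filtering, assuming `charset` stands in for the set of characters your network can actually output (names and data are illustrative):

```python
# hypothetical character set of the network (stand-in for the real list of classes)
charset = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')

words = ['move', 'to', '>new', 'line']  # example words extracted from the corpus
usable = [w for w in words if all(c in charset for c in w)]
print(usable)  # ['move', 'to', 'line'] - '>new' contains an unmapped character
```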

How about the running time of your implementation? Even with a C++ implementation I was only able to use small dictionaries because the algorithm is very slow (its running time grows quadratically with dictionary size). Beam search is more flexible in my opinion: it is fast (depending on the beam width), and it can integrate a character-level language model. And of course each beam could also be checked for dictionary words, optionally giving a zero score to a beam containing unknown words.
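As a rough illustration of that last idea (not the repository's actual API, just a sketch): after beam search produces candidate texts with scores, one could zero out any beam containing an out-of-dictionary word:

```python
def score_beam(text, score, dictionary):
    """Give a beam zero score if it contains a word not in the dictionary.

    text: decoded beam text, score: its probability,
    dictionary: set of allowed words. Purely illustrative.
    """
    words = text.split(' ')
    if all(w in dictionary for w in words):
        return score
    return 0.0

# example: the second beam contains an unknown word and is discarded
dictionary = {'the', 'quick', 'brown', 'fox'}
beams = [('the quick fox', 0.4), ('the qvick fox', 0.35)]
best = max(beams, key=lambda b: score_beam(b[0], b[1], dictionary))
print(best)  # ('the quick fox', 0.4)
```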