parlance / ctcdecode

PyTorch CTC Decoder bindings
MIT License

Support for explicit word separators and explicit dictionary for use in handwriting recognition #79

Open gwenniger opened 6 years ago

gwenniger commented 6 years ago

Hi, I have a question about how to use the decoder when there is an explicit word separator besides the blank symbol. Some background: I'm trying to use the decoder for neural handwriting recognition. As you may be aware, this application is similar to neural speech recognition, and the same technology is suitable to a large extent. However, there is one issue. In neural handwriting recognition, the common practice is to keep the word separator symbols that are in the training material, and let the model reproduce them in addition to the "normal" symbols. (See for example https://arxiv.org/abs/1312.4569, section IV C.)

When not using the language model, this is fine, and you can get output like this (using "|" as the special word separator symbol):

Without language model:

evaluate_mdrnn - output: ""|BeTle|asd|Robbe|Mamnygard|.|"|"|Whati's|he|ben" reference: ""|Better|ask|Robbie|Munyard|.|"|"|What|'s|he|been" --- wrong
evaluate_mdrnn - output: "Comuon|rerlet|,|wse|should|nok|be|elle|to" reference: "Common|Market|,|we|should|not|be|able|to" --- wrong
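For reference, the no-language-model setup is roughly the following (a sketch along the lines of the CTCBeamDecoder interface in this repo's README; the label list and tensor shapes are just illustrative, and the exact constructor arguments may differ by version):

```python
import torch
from ctcdecode import CTCBeamDecoder

# Illustrative label set: index 0 is the CTC blank, "|" is the explicit
# word-separator symbol kept from the handwriting training data.
labels = ["_", "|", "'", ".", '"', "a", "b", "c"]  # truncated for brevity

decoder = CTCBeamDecoder(
    labels,
    beam_width=100,
    blank_id=labels.index("_"),
    log_probs_input=False,  # True if the network emits log-softmax outputs
)

# probs: network output of shape (batch, time, num_labels)
probs = torch.rand(1, 50, len(labels)).softmax(dim=2)
beam_results, beam_scores, timesteps, out_lens = decoder.decode(probs)

# Best beam of the first batch element, converted back to characters.
best = beam_results[0][0][: out_lens[0][0]]
print("".join(labels[i] for i in best))  # e.g. "Common|Market|,|we|..."
```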

However, when using the language model, it is not clear how to integrate the special word separator symbol (which is not the same as the CTC blank symbol!). When training the language model on "normal" text, such as the LOB (http://ota.ox.ac.uk/desc/0167) or Brown corpus, the word separator symbol obviously won't be present, and hence the decoder won't produce it.

With language model:

evaluate_mdrnn - output: "" Bethea Robbie Munyard . " what she ben" reference: ""|Better|ask|Robbie|Munyard|.|"|"|What|'s|he|been" --- wrong
evaluate_mdrnn - output: "Common relative should not be elle to" reference: "Common|Market|,|we|should|not|be|able|to" --- wrong

This is likely to harm performance, since the "|" symbol is still produced by the model, and needs to be "consumed" by the decoder somehow.
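The mismatch is already visible in the language model itself: with the kenlm Python bindings, every "|" token in a hypothesis scores as out-of-vocabulary (a sketch; lm.binary stands for a hypothetical model trained on plain LOB/Brown text):

```python
import kenlm

# Hypothetical LM trained on plain text, i.e. without any "|" tokens.
model = kenlm.Model("lm.binary")

# full_scores() yields one (log10 prob, ngram length, is_oov) tuple per token.
for logprob, ngram_len, oov in model.full_scores("Common | Market | , | we"):
    print(logprob, ngram_len, oov)
# Every "|" comes back flagged as OOV, so hypotheses that keep the separator
# symbol cannot be scored sensibly during beam search.
```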

One hack I attempted is to train the language model with semi-artificial data, in which I add a separator between every word, for example:

gold-hunting | Kennedy | shocks | Dr | A | .
Germany | must | pay | .
offer | of | +357 | m | is | too | small | .
President | Kennedy | is | ready | to | get | tough | over | West | Germany's | cash | offer | to | help | America's | balance | of | payments | position | .
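The semi-artificial corpus above can be produced with a trivial preprocessing step, something like the following (file names are just placeholders):

```python
SEP = "|"

# Insert the explicit word-separator symbol between consecutive words so the
# language model sees "|" as an ordinary token.
with open("lob_plain.txt") as fin, open("lob_with_separators.txt", "w") as fout:
    for line in fin:
        fout.write(f" {SEP} ".join(line.split()) + "\n")
```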

However, this also has undesired side-effects, such as leading to problems with Kneser-Ney discounting during language model training.

I think that in decoders based on finite state transducers, the finite state model is sometimes tailored with special states or transitions to deal with this problem. Perhaps this issue never occurs in speech, but I suspect it could arise there as well, for example if long pauses are explicitly marked (similar to explicit separators between words).

Do you have any suggestions on how I might deal with this while using ctcdecode? So far, neither using a language model trained on the original data, which cannot produce the word separator symbols, nor hacking the language model training data seems to be a very effective solution...

Another important and somewhat related issue is that the decoder seems to use no explicit vocabulary, only the language model. If one would like to restrict the vocabulary to, say, the 50K most frequent words, would the (only) way be to change the language model training data, replacing all words outside the 50K most frequent ones with an INFREQUENT_WORD symbol or something similar? (This could work, but again it seems like quite an ugly hack, which I would rather avoid if there is a way to provide an explicit vocabulary to the decoder.)
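For concreteness, that workaround would amount to something like the sketch below (file names, the 50K cut-off, and the placeholder token are all just illustrative):

```python
from collections import Counter

VOCAB_SIZE = 50_000
UNK = "INFREQUENT_WORD"

# Count word frequencies over the language model training corpus.
counts = Counter()
with open("lm_train.txt") as fin:
    for line in fin:
        counts.update(line.split())
keep = {word for word, _ in counts.most_common(VOCAB_SIZE)}

# Rewrite the corpus, mapping every word outside the top 50K to the UNK symbol.
with open("lm_train.txt") as fin, open("lm_train_50k.txt", "w") as fout:
    for line in fin:
        fout.write(" ".join(w if w in keep else UNK for w in line.split()) + "\n")
```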

Thanks in advance for your help!

Gideon

ryanleary commented 6 years ago

Linking this with #83. Thanks for submitting the PR. I have a few requested changes, then we'll get it merged.

ryanleary commented 6 years ago

Please file separate issues for the other things you mentioned. I think some of them can be handled rather straightforwardly.