parlance / ctcdecode

PyTorch CTC Decoder bindings
MIT License

Using a corpus without punctuation always decodes strings without punctuation #154

Closed DRosemei closed 4 years ago

DRosemei commented 4 years ago

Thanks for your great work! I trained an English language model without punctuation using KenLM, and ctcdecode always outputs strings without punctuation. I also trained an English language model with punctuation, and with that model ctcdecode does output punctuation. My understanding is that beam search always predicts punctuation and the language model only applies a correction. I also found that, in ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp, the n-grams before language model scoring are the same with both language models. How should I modify the code to get punctuation when using the language model trained without punctuation?

chaitusvk commented 4 years ago

I don't think it's possible. Beam search can't always predict punctuation. A punctuation mark is just another character.
Say I have "Oh !". Beam search predicts "Oh" and "!" and feeds them to your language model to find the probability of (Oh, !). Your language model computes that probability, and the decoder uses it to decode.

If you don't have "!" in your language model, it is impossible to predict it.

Beam search may or may not predict punctuation. Take the word "late" as an example: it can be decoded as "iate". If it can mistake "l" for "i", why can't it mistake "!" for "l"?

Both are the same to the CTC decoder.
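
The scoring step described above can be sketched as a toy example. This is not ctcdecode's code: the unigram table `LM_LOGPROB`, the out-of-vocabulary penalty `OOV_LOGPROB`, the weight `alpha`, and the acoustic scores are all hypothetical values chosen for illustration. It only shows how a beam containing "!" can lose once a language model that has never seen "!" is added to the score.

```python
import math

# Hypothetical unigram log-probabilities from an LM trained WITHOUT punctuation.
LM_LOGPROB = {"oh": math.log(0.6), "late": math.log(0.4)}
# Assumed penalty for any token the LM has never seen (illustrative value).
OOV_LOGPROB = math.log(1e-6)


def rescore(tokens, acoustic_logprob, alpha=1.0):
    """Add a weighted LM score to the acoustic score of one candidate."""
    lm_score = sum(LM_LOGPROB.get(tok, OOV_LOGPROB) for tok in tokens)
    return acoustic_logprob + alpha * lm_score


# Made-up acoustic scores: the network itself prefers the punctuated beam.
candidates = {
    ("oh", "!"): math.log(0.7),
    ("oh",): math.log(0.3),
}

# With the LM, the unseen "!" drags the punctuated beam below the plain one.
best_with_lm = max(candidates, key=lambda t: rescore(t, candidates[t]))
# With the LM switched off (alpha=0), the punctuated beam wins again.
best_without_lm = max(candidates, key=lambda t: rescore(t, candidates[t], alpha=0.0))
```

In this sketch the punctuated candidate is still scored, only very poorly; a decoder that restricts beams to a dictionary would drop it outright.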

DRosemei commented 4 years ago

You are right. The code uses a dictionary, so the decoder cannot predict punctuation when the language model was trained without punctuation.
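
As a rough illustration of why a dictionary rules out punctuation, here is a minimal sketch. It assumes the dictionary is built from the language model's vocabulary, which the thread does not spell out, and the helper names `build_prefixes` and `allowed_extensions` are hypothetical, not functions from ctcdecode. A set of word prefixes stands in for a trie.

```python
def build_prefixes(vocabulary):
    """Collect every prefix of every vocabulary word (a stand-in for a trie)."""
    prefixes = {""}
    for word in vocabulary:
        for end in range(1, len(word) + 1):
            prefixes.add(word[:end])
    return prefixes


def allowed_extensions(current_word, alphabet, prefixes):
    """Return the characters that keep the current word inside the dictionary."""
    return [ch for ch in alphabet if current_word + ch in prefixes]


# Vocabulary assumed to come from an LM trained without punctuation.
vocab = ["oh", "late"]
# The acoustic model's alphabet still contains "!".
alphabet = list("aehlot!")
prefixes = build_prefixes(vocab)

# Characters that may start a word, and characters that may follow "o".
first_chars = allowed_extensions("", alphabet, prefixes)
after_o = allowed_extensions("o", alphabet, prefixes)
```

Here "!" never appears among the allowed extensions, because no vocabulary word contains it, so a beam ending in "!" could not survive under this constraint regardless of its acoustic score.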