parlance / ctcdecode

PyTorch CTC Decoder bindings
MIT License

Why is there a constant score for OOV? #62

Open ankitmundada opened 6 years ago

ankitmundada commented 6 years ago

This line gives a score of -1000 (which is declared here) to any n-gram that contains an OOV word. Is this the right approach? Wouldn't it be possible to get the score for <unk> tokens from the LM and use that instead of a hardcoded score?

joemathai commented 6 years ago

You can remove the if statement at https://github.com/parlance/ctcdecode/blob/cef6739f7370762229cf7e115e4afcc319a4f805/ctcdecode/src/scorer.cpp#L83 so that OOV words are assigned the <UNK> probability.
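
To make the difference between the two policies concrete, here is a minimal pure-Python sketch. It is not the library's C++ code: `TOY_LM`, `log_cond_prob`, and the `hardcode_oov` flag are hypothetical names, the probabilities are made up, and the toy model is a unigram table that ignores context. Only the -1000 constant comes from the discussion above.

```python
import math

OOV_SCORE = -1000.0  # the hardcoded constant discussed in this issue

# Hypothetical toy unigram LM (log10 probabilities, invented values).
# A real n-gram LM would also condition on the preceding words.
TOY_LM = {
    "the": math.log10(0.5),
    "cat": math.log10(0.3),
    "<unk>": math.log10(0.2),
}


def log_cond_prob(words, lm, hardcode_oov=True):
    """Score the last word of `words` under one of two OOV policies.

    hardcode_oov=True  : return OOV_SCORE as soon as any word is OOV.
    hardcode_oov=False : score an OOV word with the LM's <unk> entry.
    """
    if not words:
        raise ValueError("words must not be empty")
    score = 0.0
    for word in words:
        is_oov = word not in lm
        if is_oov and hardcode_oov:
            # Early return: the whole n-gram gets the constant score,
            # even if the OOV word is only part of the context.
            return OOV_SCORE
        # Without the early return, an OOV word falls back to <unk>.
        score = lm["<unk>"] if is_oov else lm[word]
    return score


ngram = ["the", "dog"]  # "dog" is not in TOY_LM
print(log_cond_prob(ngram, TOY_LM, hardcode_oov=True))
print(log_cond_prob(ngram, TOY_LM, hardcode_oov=False))
```

With the hardcoded policy, an n-gram such as `["dog", "cat"]` is also scored -1000 even though its last word is in the vocabulary. With the `<unk>` fallback, the result depends on the LM actually carrying a sensible `<unk>` probability, which is an assumption about how the model was trained.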