parlance / ctcdecode

PyTorch CTC Decoder bindings
MIT License
Does ctcdecode kenlm scorer support utf-8 characters? #56

Closed: joemathai closed this issue 6 years ago

joemathai commented 6 years ago

There is code that checks whether a character has size 1 before pushing it into the char_map. I tried characters that need 2 bytes, and they are not added to the char_map.
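To illustrate the behaviour described above, here is a minimal sketch in Python. It is not the ctcdecode code: `build_char_map` and the sample labels are hypothetical, and the sketch simply assumes a filter that keeps only labels whose UTF-8 encoding is exactly one byte. Under that assumption, any non-ASCII label is silently dropped.

```python
def build_char_map(labels):
    """Hypothetical sketch: keep only labels that are one byte in UTF-8.

    This mimics a "size == 1" check on byte strings; it is an assumption
    about the behaviour, not a copy of the ctcdecode implementation.
    """
    char_map = {}
    for index, label in enumerate(labels):
        # len() on the encoded bytes counts bytes, not code points.
        if len(label.encode("utf-8")) == 1:
            char_map[label] = index
    return char_map


# "\u00e9" is e-acute (2 bytes in UTF-8), "\u092e" is Devanagari "ma" (3 bytes).
labels = ["a", "b", "\u00e9", "\u092e"]
byte_sizes = {label: len(label.encode("utf-8")) for label in labels}
char_map = build_char_map(labels)

print(byte_sizes)
print(char_map)
```

Only the ASCII labels survive the filter here, which matches what I see with 2-byte characters.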

Update: I see that a new char_map is created when adding words to the dictionary.

https://github.com/parlance/ctcdecode/blob/7223cc4b308db76e71947f245e1e57ace1a00121/ctcdecode/src/scorer.cpp#L156