kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

No unigrams found in arpa file. #29

Closed: ivangtorre closed this issue 2 years ago

ivangtorre commented 2 years ago

Hi! Maybe I am missing something, but it seems that when using a binarized arpa file, the decoder does not have access to the unigrams and shows this log:

"No unigrams found in arpa file. Something is wrong with the file"

In fact, when using the arpa file instead of the binarized version, the WER is significantly better.

Is this behaviour specific to this library, or does it always happen and I just had not noticed it yet?

Thank you!

gkucsko commented 2 years ago

hey, so the logic is:

i suggest you either use the bare arpa file, or the binarized version and explicitly pass unigrams=[...] in the constructor
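A rough sketch of the second option, assuming you still have the text arpa file that the binary was built from. The helper names `read_unigrams_from_arpa` and `build_decoder_from_binary` are made up for illustration and are not part of pyctcdecode. The only library call used is `build_ctcdecoder` with its `kenlm_model_path` and `unigrams` arguments; check those against the version you have installed.

```python
from typing import List


def read_unigrams_from_arpa(arpa_path: str) -> List[str]:
    """Collect the words listed in the 1-grams section of a text ARPA file."""
    unigrams = []
    in_unigram_section = False
    with open(arpa_path, encoding="utf-8") as arpa_file:
        for raw_line in arpa_file:
            line = raw_line.strip()
            if line == "\\1-grams:":
                # Start of the unigram block
                in_unigram_section = True
                continue
            if in_unigram_section:
                if line.startswith("\\"):
                    # Next section header (e.g. 2-grams or end marker): stop
                    break
                # Entry layout: log10 prob, word, optional backoff weight
                fields = line.split()
                if len(fields) >= 2:
                    unigrams.append(fields[1])
    return unigrams


def build_decoder_from_binary(labels, binary_lm_path: str, arpa_path: str):
    """Hypothetical wrapper: binary LM for scoring, unigrams read from the arpa."""
    # Imported lazily so the parsing helper above works without pyctcdecode
    from pyctcdecode import build_ctcdecoder

    return build_ctcdecoder(
        labels,
        kenlm_model_path=binary_lm_path,
        unigrams=read_unigrams_from_arpa(arpa_path),
    )
```

If the arpa file is no longer around, any plain list of the vocabulary words used to train the LM can be passed as `unigrams` instead.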

ivangtorre commented 2 years ago

Thank you for the info. I have been working for a long time with binarized LMs in other CTC decoders, and I am impressed by this implementation (congrats!!).

gkucsko commented 2 years ago

Thanks! The missing ngrams in the binary is just a kenlm choice. Might even just be the python bindings that don't expose the necessary functionality. Maybe Ken will add it in a future release, or might even be worth a PR at some point. Sorry, no publication at this time. Great to hear. You could cite it as a github repo (https://www.wikihow.com/Cite-a-GitHub-Repository), or a footnote with the url if that's easier. Or worst case, no worries if you use it without citing, we won't be upset :)