No unigrams found in arpa file.

kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.

Apache License 2.0


No unigrams found in arpa file. #29

Closed: ivangtorre closed this issue 2 years ago

ivangtorre commented 2 years ago

Hi! Maybe I am missing something, but it seems that when using a binarized arpa file, the decoder does not have access to the unigrams and shows this log:

"No unigrams found in arpa file. Something is wrong with the file"

In fact, when using the arpa file instead of the binarized version, the WER is significantly better.

Is this behaviour specific to this library, or does it always happen and I just had not noticed it yet?

Thank you!

gkucsko commented 2 years ago

hey, so the logic is:

a non-binarized .arpa file is plain text, so pyctcdecode can extract the list of known unigrams for efficient decoding
binarized files don't expose the unigrams, so you have to pass them explicitly to the constructor for better performance

i suggest you either use the bare arpa file, or the binarized version and explicitly pass unigrams=[...] in the constructor
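For the second option, here is a minimal sketch of how the unigram list could be collected from the plain-text arpa file before switching to the binary. The helper name `read_arpa_unigrams`, the sample data, and the file names `lm.arpa` / `lm.bin` are illustrative assumptions, not part of pyctcdecode; the only library call referenced is `build_ctcdecoder` with its `unigrams` argument, shown commented out so the snippet runs without the package installed.

```python
from typing import Iterable, List


def read_arpa_unigrams(lines: Iterable[str]) -> List[str]:
    """Collect the words listed in the \\1-grams: section of an ARPA file.

    Hypothetical helper for illustration; it accepts any iterable of lines,
    e.g. an open text file handle.
    """
    unigrams: List[str] = []
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if not in_unigrams:
            continue
        if line.startswith("\\"):
            # Next section header (e.g. \2-grams: or \end\): unigrams are done.
            break
        if not line:
            continue
        # Each entry is: log10 probability, word, optional backoff weight.
        fields = line.split()
        if len(fields) >= 2:
            unigrams.append(fields[1])
    return unigrams


# Tiny made-up ARPA content, only to show the expected layout.
SAMPLE_ARPA = [
    "\\data\\",
    "ngram 1=4",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-1.0\t<unk>",
    "-0.8\t<s>\t-0.3",
    "-0.9\t</s>",
    "-1.2\thello\t-0.4",
    "",
    "\\2-grams:",
    "-0.5\t<s> hello",
    "",
    "\\end\\",
]

sample_unigrams = read_arpa_unigrams(SAMPLE_ARPA)

# Assumed usage with pyctcdecode (paths and `labels` are placeholders):
# from pyctcdecode import build_ctcdecoder
# with open("lm.arpa", encoding="utf-8") as f:
#     unigrams = read_arpa_unigrams(f)
# decoder = build_ctcdecoder(labels, kenlm_model_path="lm.bin", unigrams=unigrams)
```

The sketch stops reading at the first section header after `\1-grams:`, so the larger n-gram sections of a big model are never scanned.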

ivangtorre commented 2 years ago

Thank you for the info. I have been working for a long time with binarized LMs in other CTC decoders, and I am impressed by this implementation (congrats!!).

I have not found any information about this topic. Does this mean that with any binarized LM we always lose the unigram information unless it is explicitly provided (in general, not only for pyctcdecode)? If so, do you have any reference where I can check and learn about it?
On a separate note, we are extensively testing this implementation with different ASR architectures and for speech-to-text translation, and we are obtaining better results, generally in less time. We would like to know more technical details about this implementation. Additionally, we are submitting some related work for publication in which we replace the decoder with this library. We would like to cite this implementation if you can point us to a paper.

gkucsko commented 2 years ago

Thanks, great to hear! The missing ngrams in the binary are just a kenlm choice. It might even just be the python bindings that don't expose the necessary functionality. Maybe Ken will add it in a future release, or it might even be worth a PR at some point. Sorry, no publication at this time. You could cite it as a github repo (https://www.wikihow.com/Cite-a-GitHub-Repository), or use a footnote with the url if that's easier. Or worst case, no worries if you use it without citing, we won't be upset :)