Closed ivangtorre closed 2 years ago
hey, so the logic is: the .arpa file is plain text, so pyctcdecode can extract a list of known unigrams for efficient decoding. I suggest you either use the bare arpa file, or the binarized version and explicitly pass unigrams=[...] in the constructor
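For the second option, here is a minimal sketch, assuming the original .arpa file is still available alongside the binary. The helper names read_arpa_unigrams and build_decoder_from_binary are hypothetical (not part of pyctcdecode), and the parser assumes a standard ARPA layout with a \1-grams: section. The only library call used is build_ctcdecoder with its kenlm_model_path and unigrams arguments.

```python
from typing import List


def read_arpa_unigrams(arpa_path: str) -> List[str]:
    """Collect the words listed in the \\1-grams: section of a plain-text ARPA file.

    Hypothetical helper for illustration; assumes each unigram line looks like
    "<logprob> <word> [<backoff>]" separated by whitespace.
    """
    unigrams: List[str] = []
    in_unigram_section = False
    with open(arpa_path, encoding="utf-8") as arpa_file:
        for raw_line in arpa_file:
            line = raw_line.strip()
            if line == "\\1-grams:":
                in_unigram_section = True
                continue
            if not in_unigram_section or not line:
                continue
            if line.startswith("\\"):
                # Next section header (e.g. \2-grams: or \end\): unigrams are done.
                break
            fields = line.split()
            if len(fields) >= 2:
                unigrams.append(fields[1])  # fields[0] is the log-probability
    return unigrams


def build_decoder_from_binary(labels: List[str], binary_lm_path: str, arpa_path: str):
    """Build a decoder on the binarized LM while passing the unigram list explicitly."""
    # Imported here so the parser above can be used without pyctcdecode installed.
    from pyctcdecode import build_ctcdecoder

    unigrams = read_arpa_unigrams(arpa_path)
    return build_ctcdecoder(
        labels,
        kenlm_model_path=binary_lm_path,
        unigrams=unigrams,
    )
```

If the .arpa file is no longer around, the same unigrams list could be built from whatever vocabulary file was used to train the LM.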
Thank you for the info. I have been working for a long time with binarized LMs in other CTC decoders, and I am impressed by this implementation (congrats!!).
I have not found any information about this topic. Does this mean that with any binarized LM we always lose the unigram information unless it is explicitly provided (in general, not only for pyctcdecode)? If so, do you have any reference where I could check and learn about it?
On a separate note, we are extensively testing this implementation with different ASR architectures and for speech-to-text translation, and we are getting better results, generally in less time. We would like to know more technical details about this implementation. Additionally, we are submitting some related work for publication in which we replace the decoder with this library. We would like to cite this implementation if you can point us to a paper.
Thanks! The missing ngrams in the binary is just a kenlm choice. Might even just be the python bindings that don't expose the necessary functionality. Maybe Ken will add it in a future release, or might even be worth a PR at some point. Sorry, no publication at this time. Great to hear. You could cite it as a github repo (https://www.wikihow.com/Cite-a-GitHub-Repository), or a footnote with the url if that's easier. Or worst case, no worries if you use it without citing, we won't be upset :)
Hi! Maybe I am missing something, but it seems that when using a binarized arpa file, the decoder does not have access to unigrams and shows this log:
"No unigrams found in arpa file. Something is wrong with the file"
In fact, when using the arpa file instead of the binarized version, the WER is significantly better.
Is this behaviour specific to this library, or does it always happen and I had just not noticed until now?
Thank you!