Closed: amitli1 closed this issue 1 year ago
When I run my code, I am told that the unigrams are only available from the ARPA file, and I don't know why. My resources are quite constrained, so I have to use a binary file, yet the unigrams are quite important for accuracy.
The unigrams appear at the end of the binary file.
In C++, inherit from the class in https://github.com/kpu/kenlm/blob/master/lm/enumerate_vocab.hh then pass a pointer to an instance of your class through the config field at https://github.com/kpu/kenlm/blob/35f145839eca742f2402716d17542fd0546efc9d/lm/config.hh#L37 . While the model loads, your class will get a callback for every token in the vocabulary.
Currently the python wrapper does not have this, but you can add it.
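Until the wrapper exposes that callback, one possible workaround, assuming you still have the ARPA file that lmplz wrote before you ran build_binary, is to read the unigram section of the ARPA text directly. The sketch below is plain Python and does not use kenlm at all; `iter_arpa_unigrams` and `SAMPLE_ARPA` are hypothetical names made up for illustration, and the sample data is invented. It streams the file line by line, so it does not need to hold the model in memory.

```python
import io

# Tiny made-up ARPA text, used only to illustrate the layout.
SAMPLE_ARPA = (
    "\\data\\\n"
    "ngram 1=4\n"
    "ngram 2=1\n"
    "\n"
    "\\1-grams:\n"
    "-1.0\t<unk>\t0\n"
    "0\t<s>\t-0.3\n"
    "-0.7\t</s>\t0\n"
    "-0.7\thello\t-0.3\n"
    "\n"
    "\\2-grams:\n"
    "-0.2\t<s> hello\n"
    "\n"
    "\\end\\\n"
)


def iter_arpa_unigrams(lines):
    """Yield (word, log10_prob) pairs from the \\1-grams: section.

    `lines` is any iterable of text lines, e.g. an open file object.
    Assumes the usual tab-separated ARPA layout:
    log10 probability, word, optional backoff weight.
    """
    in_unigrams = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not in_unigrams:
            # Skip the header until the unigram section starts.
            if line.strip() == "\\1-grams:":
                in_unigrams = True
            continue
        if line.startswith("\\"):
            # Next section header (\2-grams:) or \end\ reached.
            break
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # malformed line, ignore it
        yield fields[1], float(fields[0])


unigrams = [word for word, _ in iter_arpa_unigrams(io.StringIO(SAMPLE_ARPA))]
```

For a real model you would pass an open file instead of the sample string, for example `with open("model.arpa", encoding="utf-8") as f: words = [w for w, _ in iter_arpa_unigrams(f)]`, where `model.arpa` is a placeholder path. This does not read the binary file itself; for that, the C++ callback described above is still the route.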
I have a binary language model (created by the lmplz and build_binary tools). Is it possible to get the list of all unigrams from the binary file using Python?