kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Is it possible to get all the unigrams from a binary file? #426

Closed amitli1 closed 9 months ago

amitli1 commented 1 year ago

I have a binary language model (created by the lmplz and build_binary tools).

Is it possible to get the list of all unigrams from the binary file using Python?

royw99 commented 9 months ago

When I run my code, I am told that the unigrams are only available from the ARPA file, and I don't know why. My resources are quite constrained, so I have to use a binary file, yet the unigrams are quite important for accuracy.

kpu commented 9 months ago

The unigrams appear at the end of the binary file.
In C++, inherit from https://github.com/kpu/kenlm/blob/master/lm/enumerate_vocab.hh then pass your class as https://github.com/kpu/kenlm/blob/35f145839eca742f2402716d17542fd0546efc9d/lm/config.hh#L37 . It will get a callback for every token in the vocabulary.
Currently the Python wrapper does not have this, but you can add it.
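
For anyone who needs the list from Python before the wrapper supports this, a possible workaround (separate from the C++ callback described above) is to read the vocabulary from the ARPA file that lmplz wrote before build_binary converted it. This is only a sketch. It assumes the ARPA file is still available, that fields on each entry line are whitespace-separated, and that the unigram section header is exactly `\1-grams:`. The name `iter_arpa_unigrams` is made up for illustration and is not part of KenLM.

```python
from typing import Iterable, Iterator


def iter_arpa_unigrams(lines: Iterable[str]) -> Iterator[str]:
    # Hypothetical helper, not a KenLM API: yields each word found in the
    # \1-grams: section of an ARPA file, in file order.
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if not in_unigrams or not line:
            # Skip the \data\ header block and blank separator lines.
            continue
        if line.startswith("\\"):
            # Next section header (\2-grams: or \end\): unigrams are done.
            break
        # Assumed entry layout: log10_prob  word  [backoff_weight]
        fields = line.split()
        if len(fields) >= 2:
            yield fields[1]


# Usage sketch; "model.arpa" is a placeholder path:
# with open("model.arpa", encoding="utf-8") as f:
#     vocab = list(iter_arpa_unigrams(f))
```

The generator reads the file line by line and stops at the first header after the unigram section, so it never loads the higher-order n-grams into memory. It does not help when only the binary file exists; in that case the C++ route is the one described in this thread.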