Closed ndvbd closed 5 years ago
5-gram counts alone, without any padding at the beginning of sentences, are insufficient to construct a language model. For example, a one-word sentence should affect probabilities but cannot be encoded in your representation.
It's not clear from your question whether you want a language model with smoothed probabilities (in which case run lmplz — see https://kheafield.com/code/kenlm/estimation/ — on plain text) or just want somewhere to store your 5-gram counts (in which case it sounds like you just want a hash table). The Python wrapper does not compile the C++ programs, such as lmplz, that you will need to build language models.
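If all you need is count lookups, a plain hash table really is enough. A minimal sketch in Python (the file layout — n-gram words space-separated, then a tab, then the count — is an assumption based on the format described in the question):

```python
def load_counts(lines):
    """Load tab-separated 'w1 w2 w3 w4 w5<TAB>count' records into a dict
    keyed by word tuples. Assumed format: words split on whitespace,
    count after a single tab."""
    counts = {}
    for line in lines:
        ngram, count = line.rstrip("\n").split("\t")
        counts[tuple(ngram.split())] = int(count)
    return counts


# Example usage with an in-memory record instead of a real file:
counts = load_counts(["the cat sat on the\t42"])
```

A dict like this gives O(1) exact-count lookups, but no smoothing and no probabilities — which is exactly the distinction drawn above.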
Top-k queries are not supported out of the box because the data structure is not laid out to execute those efficiently. You want a forward trie, while I implemented a reverse trie.
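For completeness, top-k can be brute-forced outside KenLM by aggregating continuation counts over a plain count table — it's just linear in the table size rather than trie-efficient. A sketch under that assumption (the table format is the hypothetical dict-of-tuples above, not a KenLM structure):

```python
import heapq


def top_k_continuations(counts, context, k=10):
    """Return the k most frequent next words after `context`, given a dict
    mapping n-gram word tuples to counts. Brute force: scans every entry."""
    n = len(context)
    totals = {}
    for ngram, c in counts.items():
        if len(ngram) > n and ngram[:n] == context:
            word = ngram[n]
            totals[word] = totals.get(word, 0) + c
    # nlargest sorts descending by count
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
```

This is fine for a one-off analysis but is exactly the kind of query a forward trie would answer without a full scan.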
@kpu Thanks, what I meant is that I already have the ngrams. For example, the Google Books ngrams. I want to take the already created ngrams statistics into the KenLM, so I can perform queries on it.
Google n-grams are pruned, so one can't build a Kneser-Ney model (which counts singletons etc.). But fear not: http://statmt.org/ngrams/ .
Google also published the 1-gram through 5-gram counts; wouldn't that make Kneser-Ney model building possible?
Thanks! Appreciated!
Hi, I am trying to use KenLM. I installed using
sudo pip install https://github.com/kpu/kenlm/archive/master.zip
I have a file (tab-separated, or any other format KenLM can accept) with this structure (for 5-grams): w1 w2 w3 w4 w5 frequency
How do I create a KenLM data structure from this file, preferably from Python, or by any other means? Afterwards, I want to load the KenLM structure into RAM (using
model = kenlm.Model('lm/file.arpa')
for example) and get the frequency of an n-gram. If that's not possible, getting the log frequency would suffice as well. What's the right way to start? Is the pip installation enough, or do I need another build?
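For illustration, the log frequency mentioned here can be computed directly from such a count file without KenLM — a sketch over a plain dict of counts (the dict layout is an assumption; KenLM itself deals in smoothed log10 probabilities, not raw frequencies):

```python
import math


def log10_frequency(counts, ngram):
    """log10 relative frequency of one n-gram against the total count.
    Returns -inf for unseen n-grams -- no smoothing is applied."""
    total = sum(counts.values())
    c = counts.get(ngram, 0)
    return math.log10(c / total) if c else float("-inf")
```

The -inf for unseen n-grams is precisely why a smoothed model (e.g. from lmplz) is preferable when unseen events need nonzero probability.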
In addition, does KenLM support top-k queries? For example, what are the ten most probable words after a known 3-gram context?