kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Hello world: KenLM basic usage (+python) #203

Closed · ndvbd closed this 5 years ago

ndvbd commented 5 years ago

Hi, I am trying to use KenLM. I installed it with sudo pip install https://github.com/kpu/kenlm/archive/master.zip

I have a file (tab-separated, or any other format KenLM can accept) with this structure (for a 5-gram): w1 w2 w3 w4 w5 frequency

How do I create a KenLM data structure from this, preferably from Python, or by any other means that may exist? Afterwards, I want to load the KenLM structure into RAM (using model = kenlm.Model('lm/file.arpa'), for example) and get the frequency of an n-gram. If that's not possible, getting the log frequency would suffice as well.
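For reference, the query side in Python looks roughly like this — a minimal sketch, assuming a model file already exists (the path lm/file.arpa is the one from above, the sentence is a placeholder, and KenLM returns log10 probabilities rather than raw frequencies):

```python
import kenlm

# Load an existing ARPA (or binary) model into RAM.
model = kenlm.Model('lm/file.arpa')

# Total log10 probability of a string, with begin/end-of-sentence padding.
print(model.score('w1 w2 w3 w4 w5', bos=True, eos=True))

# Per-word breakdown: (log10 probability, length of n-gram matched, is-OOV flag).
for logprob, ngram_length, oov in model.full_scores('w1 w2 w3 w4 w5'):
    print(logprob, ngram_length, oov)
```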

What's the right way to start? Would the pip installation be enough, or do I need another build?

In addition, does KenLM support top-k queries? For example, what are the ten most probable words after a known 3-gram context?

kpu commented 5 years ago

5-gram counts alone, without any padding at the beginning of a sentence, are insufficient information to construct a language model. For example, a one-word sentence should impact probabilities but cannot be encoded in your representation.

It's not clear from your question whether you want a language model with smoothed probabilities (in which case run https://kheafield.com/code/kenlm/estimation/ on plain text) or just want somebody to store your 5-gram counts for you (in which case it sounds like you just want a hash table). The Python wrapper does not compile the C++ programs, like lmplz, that you will need to build language models.
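If the goal is only to store and look up raw 5-gram counts, a plain hash table is enough — a minimal sketch, assuming the tab-separated w1 w2 w3 w4 w5 frequency layout from the question (the file name 5grams.tsv is hypothetical):

```python
import math

# Load raw 5-gram counts into a dict: no smoothing, no back-off, no padding.
counts = {}
with open('5grams.tsv', encoding='utf-8') as f:
    for line in f:
        *words, freq = line.rstrip('\n').split('\t')
        counts[tuple(words)] = int(freq)

def log_frequency(ngram):
    """Log10 of the stored count, or None if the n-gram is absent (e.g. pruned)."""
    count = counts.get(tuple(ngram))
    return math.log10(count) if count else None

print(log_frequency(('w1', 'w2', 'w3', 'w4', 'w5')))
```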

Top-k queries are not supported out of the box because the data structure is not laid out to execute those efficiently. You want a forward trie, while I implemented a reverse trie.
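For the top-k part, one workaround is a brute-force scan: rescore every candidate word after the context and keep the best k. This costs O(|vocabulary|) per query, since the reverse trie cannot accelerate it — a sketch, with a hypothetical model path and candidate vocabulary:

```python
import heapq
import kenlm

model = kenlm.Model('lm/file.arpa')  # hypothetical model path

def top_k_next_words(context, vocab, k=10):
    """Brute-force top-k continuations of a context.

    log10 P(w | context) is obtained as the difference of two full-string
    scores; bos/eos are disabled because the context is a fragment."""
    base = model.score(context, bos=False, eos=False)
    scored = ((model.score(context + ' ' + w, bos=False, eos=False) - base, w)
              for w in vocab)
    return heapq.nlargest(k, scored)

# e.g. the ten most probable words after a known 3-gram context:
# top_k_next_words('w1 w2 w3', candidate_vocabulary, k=10)
```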

ndvbd commented 5 years ago

@kpu Thanks. What I meant is that I already have the n-grams, for example the Google Books n-grams. I want to load those precomputed n-gram statistics into KenLM so I can run queries on them.

kpu commented 5 years ago

Google n-grams are pruned, so one can't build a Kneser-Ney model (which requires singleton counts, etc.). But fear not: http://statmt.org/ngrams/ .

ndvbd commented 5 years ago

Google also published the 1-gram through 5-gram counts; wouldn't that make building a Kneser-Ney model possible?

kpu commented 5 years ago

No. Read https://kheafield.com/papers/stanford/crawl_paper.pdf .

ndvbd commented 5 years ago

Thanks! Appreciated!