kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Issue loading biolab data with kenlm #380

Open mattberns opened 2 years ago

mattberns commented 2 years ago

Howdy yall. I am trying to analyze the data in the language models found here: https://bio.nlplab.org/#ngram-model

I am loading the 1-gram and 2-gram data in the ARPA format; everything looks clean, yet I get the following error:

Non-zero backoff -1.5930591 provided for an n-gram that should have no backoff in the 2-gram at byte 905261256 Byte: 905261256

I looked at the rows that contain this value and find the following:

-0.92665285 <s> The -1.5930591 -1.5930591 trypanosomes/ml [ -0.10104541

The command issued was:

echo "in primary care" | ./query ./full.arpa
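For context, the ARPA format only assigns backoff weights to n-grams below the model's top order, so a backoff value on a highest-order line (as in the 2-gram row above) is malformed and KenLM rejects it. A minimal sketch of a check that would flag such lines; this is not KenLM code, and the helper name and inline sample are hypothetical:

```python
# Sketch: flag backoff weights on highest-order n-gram lines of an
# ARPA file, which the format forbids and KenLM rejects at load time.
ARPA = """\
\\data\\
ngram 1=2
ngram 2=1

\\1-grams:
-0.5\t<s>\t-0.3
-0.7\tThe\t-0.2

\\2-grams:
-0.9\t<s> The\t-1.59

\\end\\
"""

def bad_backoff_lines(arpa_text, max_order=2):
    bad = []
    section = 0
    for lineno, line in enumerate(arpa_text.splitlines(), 1):
        if line.startswith("\\") and line.endswith("-grams:"):
            # e.g. "\2-grams:" -> current section order is 2
            section = int(line[1:line.index("-")])
            continue
        if section == max_order and line.strip():
            fields = line.split("\t")
            # highest order must be exactly "logprob<TAB>n-gram",
            # with no trailing backoff field
            if len(fields) > 2:
                bad.append((lineno, line))
    return bad

print(bad_backoff_lines(ARPA))
```

Running this on the sample flags the 2-gram line carrying a third (backoff) field, which is the same shape of problem the error message above reports.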

kpu commented 2 years ago

That page appears to have a 5-gram Kneser-Ney model and then encourages people to load it with a lower order (such as a bigram model). This is a bad idea: https://neural.mt/papers/edinburgh/rest_paper.pdf . If you want just a bigram model, train a bigram model.
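A sketch of that advice: KenLM's `lmplz` tool takes the model order via `-o`, so training a genuine bigram model from a text corpus could look like the following. The file names are placeholders, and the guard skips execution when KenLM's binaries are not installed:

```python
import shutil
import subprocess

def lmplz_bigram_cmd(corpus_path, arpa_path):
    # lmplz's -o flag sets the model order; -o 2 trains a true bigram
    # model rather than loading a 5-gram model at a lower order.
    return ["lmplz", "-o", "2", "--text", corpus_path, "--arpa", arpa_path]

cmd = lmplz_bigram_cmd("corpus.txt", "bigram.arpa")
if shutil.which("lmplz"):  # only run where KenLM is actually installed
    subprocess.run(cmd, check=True)
```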