kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Issue loading biolab data with kenlm #380

Open mattberns opened 2 years ago

mattberns commented 2 years ago

Howdy yall. I am trying to analyze the data in the language models found here: https://bio.nlplab.org/#ngram-model

I am loading the 1-gram and 2-gram data in the ARPA format; everything looks clean, yet I get the following error:

Non-zero backoff -1.5930591 provided for an n-gram that should have no backoff in the 2-gram at byte 905261256 Byte: 905261256

I looked at the rows that contain this value and find the following:

-0.92665285 <s> The -1.5930591 -1.5930591 trypanosomes/ml [ -0.10104541

The command issued was:

echo "in primary care" | ./query ./full.arpa
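For context, the ARPA format only assigns backoff weights to n-grams below the model's top order, so a backoff value on a highest-order line (as in the 2-gram row above) is malformed and KenLM rejects it. A minimal sketch of a check that would flag such lines; this is not KenLM code, and the helper name and inline sample are hypothetical:

```python
# Sketch: flag backoff weights on highest-order n-gram lines of an
# ARPA file, which the format forbids and KenLM rejects at load time.
ARPA = """\
\\data\\
ngram 1=2
ngram 2=1

\\1-grams:
-0.5\t<s>\t-0.3
-0.7\tThe\t-0.2

\\2-grams:
-0.9\t<s> The\t-1.59

\\end\\
"""

def bad_backoff_lines(arpa_text, max_order=2):
    bad = []
    section = 0
    for lineno, line in enumerate(arpa_text.splitlines(), 1):
        if line.startswith("\\") and line.endswith("-grams:"):
            # e.g. "\2-grams:" -> current section order is 2
            section = int(line[1:line.index("-")])
            continue
        if section == max_order and line.strip():
            fields = line.split("\t")
            # highest order must be exactly "logprob<TAB>n-gram",
            # with no trailing backoff field
            if len(fields) > 2:
                bad.append((lineno, line))
    return bad

print(bad_backoff_lines(ARPA))
```

Running this on the sample flags the 2-gram line carrying a third (backoff) field, which is the same shape of problem the error message above reports.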

kpu commented 2 years ago

That page appears to have a 5-gram Kneser-Ney model and then encourages people to load it with a lower order (such as a bigram model). This is a bad idea: https://neural.mt/papers/edinburgh/rest_paper.pdf . If you want just a bigram model, train a bigram model.
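A sketch of that advice: KenLM's `lmplz` tool takes the model order via `-o`, so training a genuine bigram model from a text corpus could look like the following. The file names are placeholders, and the guard skips execution when KenLM's binaries are not installed:

```python
import shutil
import subprocess

def lmplz_bigram_cmd(corpus_path, arpa_path):
    # lmplz's -o flag sets the model order; -o 2 trains a true bigram
    # model rather than loading a 5-gram model at a lower order.
    return ["lmplz", "-o", "2", "--text", corpus_path, "--arpa", arpa_path]

cmd = lmplz_bigram_cmd("corpus.txt", "bigram.arpa")
if shutil.which("lmplz"):  # only run where KenLM is actually installed
    subprocess.run(cmd, check=True)
```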