kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

ngrams have different scoring levels #258

Open raviolli opened 4 years ago

raviolli commented 4 years ago

Hi there, just curious; this isn't really a bug...

Why are the scores for 2-grams and 3-grams not normalized against each other?

In the majority of cases, the 2-grams have higher scores than the 3-grams.

This becomes problematic when trying to compare scores between 2-gram and 3-gram outputs.
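For illustration, here is a minimal sketch of the effect, assuming a phrase score is the sum of per-token log10 probabilities. The numbers are invented, not output from a real KenLM model, and the helper names `total_score` and `per_token_score` are hypothetical, not part of the KenLM API.

```python
# Toy per-token log10 probabilities. These values are invented for
# illustration and are NOT output from a real KenLM model.
bigram_token_logprobs = [-1.2, -0.9]
trigram_token_logprobs = [-1.2, -0.9, -1.1]


def total_score(token_logprobs):
    """Sum of per-token log10 probabilities (every term is <= 0)."""
    return sum(token_logprobs)


def per_token_score(token_logprobs):
    """Average log10 probability per token (a simple length normalization)."""
    return total_score(token_logprobs) / len(token_logprobs)


bigram_total = total_score(bigram_token_logprobs)
trigram_total = total_score(trigram_token_logprobs)
bigram_avg = per_token_score(bigram_token_logprobs)
trigram_avg = per_token_score(trigram_token_logprobs)

# The raw totals differ mainly because the 3-gram has one more negative term.
print("total:", bigram_total, trigram_total)
# The per-token averages are on a comparable scale.
print("per token:", bigram_avg, trigram_avg)
```

Under that assumption, the extra token alone pushes the 3-gram total lower, while dividing by the token count puts both on a similar scale. Whether averaging is the right normalization for a given use case is a separate question.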

Any insight would be great. Perhaps with a detailed explanation I can fix the issue and submit a pull request.