kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

fix count of last n-grams #434

Open ben-freist opened 1 year ago

ben-freist commented 1 year ago

I think this fixes the n-gram counting problem described in https://github.com/kpu/kenlm/issues/405.

The problem is that the last n-gram for which adjusted counts were computed had the wrong count. To check the fix, I generated a set of texts, n-gram orders and pruning thresholds, and compared the discounts from lmplz with those from this Python script: compute_discounts.txt
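
For readers without the attachment, here is a minimal sketch of the kind of check such a script can perform. It is not the attached compute_discounts.txt. It assumes the standard modified Kneser-Ney discount formula (Chen and Goodman) applied to the adjusted counts of a single n-gram order, and it assumes that "rejected" means a count-of-counts is zero or a discount falls outside its valid range. The function name `modified_kn_discounts` is hypothetical.

```python
from collections import Counter


def modified_kn_discounts(adjusted_counts):
    """Return (D1, D2, D3+) for one n-gram order, or None if rejected.

    adjusted_counts: iterable of adjusted counts, one per distinct n-gram.
    Assumption: the input is rejected when any of n1..n4 is zero or when
    a discount D_k falls outside [0, k].
    """
    # n[k] = number of distinct n-grams whose adjusted count is exactly k.
    n = Counter(c for c in adjusted_counts if 1 <= c <= 4)
    if any(n[k] == 0 for k in (1, 2, 3, 4)):
        return None

    # Y = n1 / (n1 + 2 * n2);  D_k = k - (k + 1) * Y * n_{k+1} / n_k
    y = n[1] / (n[1] + 2 * n[2])
    discounts = tuple(k - (k + 1) * y * n[k + 1] / n[k] for k in (1, 2, 3))

    if any(d < 0 or d > k for k, d in zip((1, 2, 3), discounts)):
        return None
    return discounts
```

Because every discount depends on the counts-of-counts n1 to n4, a single n-gram with the wrong adjusted count can shift the discounts or change whether the input is rejected at all.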

With this patch, out of 100 texts, 79 are rejected by both lmplz and the attached Python script, and for the remaining 21 I get the same discounts from both.

Without this patch, 78 texts are rejected by both lmplz and my script, 1 is rejected by my script but not by lmplz, and 2 are rejected by lmplz but not by my script. Among the 19 texts for which both compute discounts, lmplz and my script agree in 17 cases and differ in 2.
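
The breakdown above can be tallied with something like the following sketch. It assumes each implementation's result is either `None` (input rejected) or a tuple of discounts. The names `classify` and `tally`, the category labels, and the tolerance are all hypothetical and not part of lmplz or the attached script.

```python
from collections import Counter


def classify(lmplz_result, script_result, tol=1e-6):
    """Classify one test text by how the two implementations compare.

    Each result is None (rejected) or a tuple of discounts (assumption).
    """
    if lmplz_result is None and script_result is None:
        return "both_reject"
    if lmplz_result is None:
        return "only_lmplz_rejects"
    if script_result is None:
        return "only_script_rejects"
    same = len(lmplz_result) == len(script_result) and all(
        abs(a - b) <= tol for a, b in zip(lmplz_result, script_result)
    )
    return "agree" if same else "differ"


def tally(pairs):
    """Count categories over (lmplz_result, script_result) pairs."""
    return Counter(classify(a, b) for a, b in pairs)
```

Running `tally` over the results with and without the patch gives the five counts reported here; after the fix, only `both_reject` and `agree` should be non-zero.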

Should I add my test data here too?