kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Why do I need to add --discount_fallback ? #424

Closed amitli1 closed 1 year ago

amitli1 commented 1 year ago

I have simple English file:

I'm Harry Potter
Harry Potter is young  wizard
Hermione Granger is Harry friend
There are seven fantasy novels of Harry Potter

I'm running the following command: lmplz -o 3 <myTest.txt >myTest.arpa

And getting error:

/adjust_counts.cc:60 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
ERROR: 1-gram discount out of range for adjusted count 2: -0.5999999.  This means modified Kneser-Ney smoothing thinks something is weird about your data.  To override this error for e.g. a class-based model, rerun with --discount_fallback

If I run it with --discount_fallback parameter - it works.

  1. What is wrong with my text file?
  2. What does adding the --discount_fallback parameter do?
kpu commented 1 year ago

Your corpus is too small to have the statistical regularities that modified Kneser-Ney smoothing expects when estimating discounts. If you want to kludgily make up some discounts, --discount_fallback is there for you.
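To illustrate, here is a minimal sketch (not KenLM's actual source) of how modified Kneser-Ney discounts are estimated from count-of-counts, following Chen & Goodman's formulas. `n[j]` is the number of n-grams whose adjusted count is exactly j; the range check mirrors the condition behind the BadDiscountException in the error above:

```python
def estimate_discounts(n):
    """Chen & Goodman modified Kneser-Ney discount estimates.

    n[j] = number of n-grams with adjusted count exactly j (j = 1..4).
    Returns [D_1, D_2, D_3+].
    """
    y = n[1] / (n[1] + 2 * n[2])
    return [j - (j + 1) * y * n[j + 1] / n[j] for j in (1, 2, 3)]

def in_range(discounts):
    # lmplz requires 0 <= D_j <= j for each adjusted count j;
    # otherwise it raises BadDiscountException.
    return all(0.0 <= d <= j for j, d in zip((1, 2, 3), discounts))

# Zipf-like count-of-counts from a large corpus: discounts come out valid.
healthy = {1: 100, 2: 40, 3: 20, 4: 12}

# A tiny corpus yields skewed count-of-counts (here more n-grams seen
# 3 times than 2 times), which pushes a discount below zero.
tiny = {1: 4, 2: 2, 3: 5, 4: 1}
```

With `tiny`, D_2 = 2 - 3 * 0.5 * (5/2) = -1.75, which fails the `0 <= D_j <= j` check, just as the -0.5999999 in your error message does.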

jiangweiatgithub commented 2 weeks ago

Would that parameter have any side effect if used indiscriminately?

kpu commented 2 weeks ago

It only has an effect when the estimated discounts are out of range.
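In other words, the fallback constants are substituted only where estimation fails. A hypothetical sketch of that behavior (again not KenLM's source; the default fallback values 0.5, 1, 1.5 are from the lmplz --discount_fallback option, and per-entry substitution is an assumption here):

```python
def discounts_with_fallback(n, fallback=(0.5, 1.0, 1.5)):
    """n[j] = number of n-grams with adjusted count exactly j (j = 1..4).

    Uses the Chen & Goodman estimate where it is in range, and the
    fixed fallback constant where it is not.
    """
    y = n[1] / (n[1] + 2 * n[2])
    out = []
    for j in (1, 2, 3):
        d = j - (j + 1) * y * n[j + 1] / n[j]
        # Only an out-of-range estimate is replaced by the constant.
        out.append(d if 0.0 <= d <= j else fallback[j - 1])
    return out

# With healthy counts the fallback constants never appear in the output,
# which is why passing the flag on well-behaved data changes nothing.
```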