kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Disable smoothing #432

Open XenonMolecule opened 1 year ago

XenonMolecule commented 1 year ago

Hi! I am using KenLM on massive corpora of text to explore the properties of those datasets (e.g., Common Crawl, Wikipedia, etc.).

I am not trying to use KenLM to generate new text; I want to explore the occurrences of specific phrases and the raw counts of n-grams in the training corpus (the log probability of a sequence is fine; I don't necessarily need exact counts). As such, I want to disable smoothing so I can be sure that one phrase is more probable than another because its n-grams appear more frequently, not because of smoothing applied to out-of-vocabulary or rare tokens.

Can I disable smoothing altogether with KenLM, or is this not the right tool for my use case? If so, how? Thanks!

kpu commented 1 year ago

You can query an unsmoothed model if you can make an ARPA file for it. lmplz is hard-coded to modified Kneser-Ney smoothing, though you can override the discounts. So if you can work out discounts that reduce to what you want, fine. Otherwise you'll need something else to build the ARPA file.
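
For the "something else" route, a minimal sketch of what building such an ARPA file might look like is below. It is not part of KenLM: the function name `mle_arpa` and its parameters are hypothetical, it keeps all counts in memory (so it would not scale to Common Crawl as written), and the `-99` backoff weight is an assumption chosen so that unseen continuations stay near zero probability rather than being smoothed. The probabilities are plain maximum-likelihood estimates, count(history + word) / count(history).

```python
import math
from collections import Counter


def mle_arpa(sentences, order=2, backoff_log10=-99.0):
    """Build ARPA-format text from unsmoothed maximum-likelihood estimates.

    backoff_log10 is an illustrative choice: a very negative backoff weight
    keeps unseen continuations near zero probability instead of smoothing them.
    """
    grams = [Counter() for _ in range(order)]     # grams[n-1]: n-gram counts
    contexts = [Counter() for _ in range(order)]  # contexts[n-1]: history counts
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if gram == ("<s>",):
                    continue  # <s> is a context only; it is never predicted
                grams[n - 1][gram] += 1
                contexts[n - 1][gram[:-1]] += 1

    def entry(logp, gram, n):
        line = f"{logp:.6f}\t{' '.join(gram)}"
        # Every order except the highest carries a backoff column.
        return line + (f"\t{backoff_log10:.6f}" if n < order else "")

    out = ["\\data\\"]
    for n in range(1, order + 1):
        extra = 1 if n == 1 else 0  # the explicit <s> unigram entry
        out.append(f"ngram {n}={len(grams[n - 1]) + extra}")
    for n in range(1, order + 1):
        out += ["", f"\\{n}-grams:"]
        if n == 1:
            out.append(entry(-99.0, ("<s>",), n))  # conventional placeholder
        for gram in sorted(grams[n - 1]):
            # Maximum-likelihood estimate: count(gram) / count(history)
            p = grams[n - 1][gram] / contexts[n - 1][gram[:-1]]
            out.append(entry(math.log10(p), gram, n))
    out += ["", "\\end\\", ""]
    return "\n".join(out)
```

Writing the returned text to a file (say `counts.arpa`, a made-up name) should give something the Python bindings can load with `kenlm.Model("counts.arpa")` and query with `model.score(...)`. This sketch has not been tested against KenLM itself, and since it writes no `<unk>` entry, KenLM may complain about that when loading.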