Open XenonMolecule opened 1 year ago
You can query an unsmoothed model if you can make an ARPA file for it. lmplz is hard-coded to modified Kneser-Ney smoothing, though you can override the discounts. So if you can work out discounts that reduce to what you want, fine. Otherwise you'll need something else to build the ARPA file.
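For the "something else" route, here is a minimal sketch of what building an unsmoothed ARPA file could look like in plain Python. It is not part of KenLM: the function names (`count_ngrams`, `mle_log10`, `to_arpa`) are hypothetical, tokenization is plain whitespace splitting, and the backoff weight of 0 is a placeholder assumption, so n-grams missing from the file fall back to lower-order estimates and the model is not normalized. The output has no `<unk>` entry and has not been checked against KenLM's loader.

```python
import math
from collections import Counter


def count_ngrams(sentences, order):
    """Count every n-gram up to `order`, padding each sentence with <s> and </s>."""
    counts = [Counter() for _ in range(order)]
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n - 1][tuple(tokens[i:i + n])] += 1
    return counts


def mle_log10(ngram, counts):
    """Unsmoothed log10 of count(ngram) / count(context).

    For unigrams the denominator is the total token count, <s> included
    (a simplification made for this sketch).
    """
    n = len(ngram)
    if n == 1:
        denom = sum(counts[0].values())
    else:
        denom = counts[n - 2][ngram[:-1]]
    return math.log10(counts[n - 1][ngram] / denom)


def to_arpa(counts):
    """Render the counts as ARPA-format text with maximum-likelihood log10 probabilities."""
    lines = ["\\data\\"]
    for n, table in enumerate(counts, start=1):
        lines.append(f"ngram {n}={len(table)}")
    for n, table in enumerate(counts, start=1):
        lines.append("")
        lines.append(f"\\{n}-grams:")
        for ngram in sorted(table):
            entry = f"{mle_log10(ngram, counts):.6f}\t{' '.join(ngram)}"
            if n < len(counts):
                # Placeholder backoff weight (log10 of 1.0); an assumption, see above.
                entry += "\t0"
            lines.append(entry)
    lines.append("")
    lines.append("\\end\\")
    return "\n".join(lines) + "\n"


# Toy corpus, one sentence per string.
counts = count_ngrams(["a b", "a c"], order=2)
arpa_text = to_arpa(counts)
```

The `counts` tables hold the raw occurrence counts directly, so if counts alone are enough, the ARPA step can be skipped. For corpora the size of Common Crawl, an in-memory `Counter` will not scale; the sketch only shows the shape of the computation.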
Hi! I am using KenLM on massive text corpora (e.g., Common Crawl, Wikipedia) to explore the properties of those datasets.
I am not trying to use KenLM to generate new text; I want to explore the occurrences of specific phrases and the raw counts of n-grams in the training corpus (the log probability of a sequence is fine; I don't necessarily need exact counts). As such, I want to disable smoothing so I can be sure that one phrase is more probable than another because its n-grams appear more frequently, not because of smoothing applied to out-of-vocabulary or rare tokens.
Can I disable smoothing altogether with KenLM, or is this not the right tool for my use case? If it is possible, how? Thanks!