kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

lmplz --intermediate + --prune, or --renumber + --prune fails #307

Open nosyrev opened 3 years ago

nosyrev commented 3 years ago

Hi! I wanted to prepare several models to test interpolating them with different weights, and found that with a non-zero --prune option the command
lmplz -o 4 --intermediate inter --prune 1 < text
fails with this exception:

kenlm/lm/common/joint_order.hh:61 in void lm::JointOrder(const util::stream::ChainPositions &, Callback &) [Callback = lm::builder::(anonymous namespace)::Callback<lm::builder::(anonymous namespace)::OutputProbBackoff>, Compare = lm::SuffixOrder] threw FormatLoadException because `order != current + 1'.
Detected n-gram without matching suffix
Abort trap: 6

at the end of the === 4/4 Calculating and writing order-interpolated probabilities === stage.
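For readers unfamiliar with the message, the invariant it names can be illustrated with a toy check. This is a minimal sketch in plain Python, not KenLM's code: `find_missing_suffixes` is a hypothetical helper and the n-grams are made-up data. It assumes that "matching suffix" means every n-gram of order n > 1 should have its last n-1 words present as an (n-1)-gram, and it says nothing about why lmplz ends up in that state.

```python
def find_missing_suffixes(ngrams):
    """Return the n-grams (tuples of words) whose order n-1 suffix is absent."""
    present = set(ngrams)
    missing = []
    for gram in ngrams:
        # Unigrams have no suffix to check; for the rest, drop the first word.
        if len(gram) > 1 and gram[1:] not in present:
            missing.append(gram)
    return missing


# Toy data: the unigram ("c",) is absent, so ("b", "c") has no matching suffix.
toy = [("a",), ("b",), ("a", "b"), ("b", "c"), ("a", "b", "c")]
orphans = find_missing_suffixes(toy)
print(orphans)
```

Note that ("a", "b", "c") is not reported, because its suffix ("b", "c") is itself present; only the entry whose direct suffix is gone is flagged.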

Brief debugging showed that this happens because --intermediate automatically enables the --renumber option, and --renumber is the actual cause of the failure. For example,
lmplz -o 4 --prune 1 --renumber < text > text.arpa
fails with the same exception.

Without --renumber, the ARPA model is created without any problem for any --prune value. It looks like --renumber currently works only with the default --prune 0. The --intermediate inter --prune 1 version also finishes fine if you just comment out the line that enables --renumber.
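The combinations described above can be checked in one pass with a small driver. This is only a sketch: `build_command` and `run_case` are hypothetical helper names, the corpus file name `text` is taken from the commands in this report, and only flags quoted in this report (-o, --prune, --renumber) are used. If lmplz is not on PATH or the corpus is missing, it just prints the command lines.

```python
import os
import shutil
import subprocess


def build_command(prune, renumber, order=4):
    """Build an lmplz argument list from the flags quoted in this report."""
    cmd = ["lmplz", "-o", str(order), "--prune"] + [str(p) for p in prune]
    if renumber:
        cmd.append("--renumber")
    return cmd


def run_case(cmd, corpus="text"):
    """Run one case with the corpus on stdin; True means exit code 0."""
    with open(corpus, "rb") as src:
        result = subprocess.run(
            cmd,
            stdin=src,
            stdout=subprocess.DEVNULL,  # discard the ARPA output
            stderr=subprocess.DEVNULL,  # discard progress and error text
        )
    return result.returncode == 0


# Four cases: --prune 0 and --prune 1, each with and without --renumber.
cases = [
    build_command(prune, renumber)
    for prune in ([0], [1])
    for renumber in (False, True)
]

if __name__ == "__main__":
    can_run = shutil.which("lmplz") is not None and os.path.exists("text")
    for cmd in cases:
        if can_run:
            print(" ".join(cmd), "->", "ok" if run_case(cmd) else "FAILED")
        else:
            print(" ".join(cmd))
```

Based on the behaviour reported here, only the last case (--prune 1 with --renumber) would be expected to fail.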

Unfortunately, I wasn't able to find the root cause yet, so I don't have a PR with a fix.

That is, if this behaviour is not intended, of course :)

All of this applies to the current master branch.

locmene commented 2 years ago

This issue is still present with the latest build on the master branch.
To add to @nosyrev's description: it happens only when intermediate LMs are created with unigram pruning enabled, i.e.,
--intermediate test.intermediate --prune 0 and --intermediate test.intermediate --prune 0 2 work fine,
but --intermediate test.intermediate --prune 2 fails.
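What separates these three cases is the threshold that ends up applied to unigrams. The sketch below assumes, based on the lmplz help text as I understand it, that --prune takes one threshold per order starting at unigrams and that the last value applies to any remaining orders. `expand_prune` is a hypothetical helper, not part of KenLM, and order 4 is borrowed from the original report.

```python
def expand_prune(thresholds, order):
    """Expand --prune values to one threshold per order (assumed semantics)."""
    if not thresholds:
        return [0] * order  # no --prune given: nothing is pruned
    padded = list(thresholds[:order])
    # The last value is reused for every remaining higher order.
    padded += [padded[-1]] * (order - len(padded))
    return padded


for args in ([0], [0, 2], [2]):
    per_order = expand_prune(args, 4)
    status = "unigrams pruned" if per_order[0] > 0 else "unigrams kept"
    print("--prune", " ".join(map(str, args)), "->", per_order, status)
```

Under that assumption, only --prune 2 gives the unigrams a non-zero threshold, which matches the one failing case.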