kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Higher ppl than srilm #345

Closed: ben-8878 closed this issue 3 years ago

ben-8878 commented 3 years ago

I trained 3-gram SRILM and KenLM models on 30 GB of text, and a 4-gram KenLM model with the following parameters: `kenlm_opts="-o 4 --prune 0 0 1 1 -S 50% -T /data/temp"`

  1. Model size: KenLM 3.7G vs. SRILM 2.3G.
  2. Perplexity on the same test set:
     KenLM: 206 sentences, 3277 words, 0 OOVs, 0 zeroprobs, logprob= -10986.33 ppl= 1426.499 ppl1= 2251.937
     SRILM: 206 sentences, 3277 words, 206 OOVs, 0 zeroprobs, logprob= -9656.414 ppl= 884.5531 ppl1= 1394.401

KenLM uses a better smoothing method, so I don't understand why its perplexity is higher.
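For reference, these ppl and ppl1 values follow directly from the reported logprob and token counts; the visible difference is that SRILM leaves its 206 OOV tokens out of both the log-probability sum and the denominator, while KenLM scores every token. A small arithmetic sketch (plain Python, nothing assumed beyond the numbers above) reproduces them:

```python
def perplexities(logprob10, words, sentences, oovs=0):
    """ppl/ppl1 from a base-10 log probability, SRILM/KenLM style.

    ppl counts the end-of-sentence tokens, ppl1 does not; OOV tokens
    are excluded from the denominator (SRILM skips them entirely,
    KenLM scores them as <unk>, hence oovs=0 for the KenLM line).
    """
    ppl = 10 ** (-logprob10 / (words + sentences - oovs))
    ppl1 = 10 ** (-logprob10 / (words - oovs))
    return ppl, ppl1

# KenLM: 0 OOVs, every test token is scored (unknown words as <unk>)
print(perplexities(-10986.33, 3277, 206, oovs=0))    # ~ (1426.5, 2251.9)
# SRILM: 206 OOV tokens are left out of the computation
print(perplexities(-9656.414, 3277, 206, oovs=206))  # ~ (884.6, 1394.4)
```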
kpu commented 3 years ago

Perplexity can't be compared across models with different vocabularies. In the extreme case, a model whose vocabulary is only `<unk>` will have perplexity 1 because everything maps to `<unk>` and p(`<unk>`) = 1. It appears your SRILM model has a smaller vocabulary.
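A minimal way to check this is to count how many test tokens each model actually maps to `<unk>`; a sketch with the `kenlm` Python module (the model path and test file below are placeholders, not from the thread):

```python
import kenlm

model = kenlm.Model("4gram.kenlm.arpa")        # placeholder path

log10_total, tokens, oovs = 0.0, 0, 0
with open("test.txt", encoding="utf-8") as f:  # placeholder test set
    for line in f:
        # full_scores yields one (log10 prob, matched n-gram length, is_oov)
        # tuple per token, plus one for the implicit </s>.
        for log10_prob, ngram_len, is_oov in model.full_scores(line.strip()):
            log10_total += log10_prob
            tokens += 1
            oovs += is_oov

print(f"logprob={log10_total:.2f} tokens(incl. </s>)={tokens} OOVs={oovs}")
```

If one run reports 206 OOVs and the other reports 0, as in the outputs above, the two perplexities are averages over different sets of events.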

SRILM does "another hack" where it turns off interpolation for unigrams, resulting in a higher p(`<unk>`) than many words have. If your test corpus has a high OOV rate this will make the perplexity lower. Conversely, it should make perplexity higher on a test corpus with a low OOV rate. It also has the strange effect that a system will prefer to generate `<unk>` over words that exist in the vocabulary. If you want that behavior, pass `--interpolate_unigrams 0` to `lmplz`.
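One way to see this effect is to compare the unigram score of a deliberately out-of-vocabulary token with that of a common word; a sketch, again with the `kenlm` Python bindings (path and tokens are placeholders):

```python
import kenlm

model = kenlm.Model("4gram.kenlm.arpa")   # placeholder path

# With bos/eos disabled, score() returns the log10 unigram probability
# of the single token; an out-of-vocabulary token falls back to p(<unk>).
for token in ["the", "zzz_unseen_zzz"]:   # common word vs. guaranteed OOV
    print(token, model.score(token, bos=False, eos=False))
```

In a model built with `--interpolate_unigrams 0`, the OOV line can come out higher than many real words, which is exactly what makes a system prefer to generate `<unk>`.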

ben-8878 commented 3 years ago

@kpu It performs better with `--interpolate_unigrams 0`. I'll close the issue, thank you.