kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Higher ppl than srilm #345

Closed: ben-8878 closed this issue 3 years ago

ben-8878 commented 3 years ago

I trained 3-gram SRILM and KenLM models on 30 GB of text, and a 4-gram KenLM model with the following parameters: `kenlm_opts="-o 4 --prune 0 0 1 1 -S 50% -T /data/temp"`

  1. Model size: KenLM 3.7G vs. SRILM 2.3G.
  2. Perplexity on the same test set:
     KenLM: 206 sentences, 3277 words, 0 OOVs, 0 zeroprobs, logprob= -10986.33 ppl= 1426.499 ppl1= 2251.937
     SRILM: 206 sentences, 3277 words, 206 OOVs, 0 zeroprobs, logprob= -9656.414 ppl= 884.5531 ppl1= 1394.401

KenLM uses a better smoothing method, so I don't understand why its perplexity is higher.
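For reference, these ppl and ppl1 values follow directly from the reported logprob and token counts; the visible difference is that SRILM leaves its 206 OOV tokens out of both the log-probability sum and the denominator, while KenLM scores every token. A small arithmetic sketch (plain Python, nothing assumed beyond the numbers above) reproduces them:

```python
def perplexities(logprob10, words, sentences, oovs=0):
    """ppl/ppl1 from a base-10 log probability, SRILM/KenLM style.

    ppl counts the end-of-sentence tokens, ppl1 does not; OOV tokens
    are excluded from the denominator (SRILM skips them entirely,
    KenLM scores them as <unk>, hence oovs=0 for the KenLM line).
    """
    ppl = 10 ** (-logprob10 / (words + sentences - oovs))
    ppl1 = 10 ** (-logprob10 / (words - oovs))
    return ppl, ppl1

# KenLM: 0 OOVs, every test token is scored (unknown words as <unk>)
print(perplexities(-10986.33, 3277, 206, oovs=0))    # ~ (1426.5, 2251.9)
# SRILM: 206 OOV tokens are left out of the computation
print(perplexities(-9656.414, 3277, 206, oovs=206))  # ~ (884.6, 1394.4)
```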
kpu commented 3 years ago

Perplexity can't be compared across models with different vocabularies. In the extreme case, a model whose vocabulary is only `<unk>` will have perplexity 1 because everything maps to `<unk>` and p(`<unk>`) = 1. It appears your SRILM model has a smaller vocabulary.
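A minimal way to check this is to count how many test tokens each model actually maps to `<unk>`; a sketch with the `kenlm` Python module (the model path and test file below are placeholders, not from the thread):

```python
import kenlm

model = kenlm.Model("4gram.kenlm.arpa")        # placeholder path

log10_total, tokens, oovs = 0.0, 0, 0
with open("test.txt", encoding="utf-8") as f:  # placeholder test set
    for line in f:
        # full_scores yields one (log10 prob, matched n-gram length, is_oov)
        # tuple per token, plus one for the implicit </s>.
        for log10_prob, ngram_len, is_oov in model.full_scores(line.strip()):
            log10_total += log10_prob
            tokens += 1
            oovs += is_oov

print(f"logprob={log10_total:.2f} tokens(incl. </s>)={tokens} OOVs={oovs}")
```

If one run reports 206 OOVs and the other reports 0, as in the outputs above, the two perplexities are averages over different sets of events.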

SRILM does "another hack" where it turns off interpolation for unigrams, resulting in a higher p(`<unk>`) than many words have. If your test corpus has a high OOV rate this will make the perplexity lower. Conversely, it should make perplexity higher on a test corpus with a low OOV rate. It also has the strange effect that a system will prefer to generate `<unk>` over words that exist in the vocabulary. If you want that behavior, pass `--interpolate_unigrams 0` to `lmplz`.
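One way to see this effect is to compare the unigram score of a deliberately out-of-vocabulary token with that of a common word; a sketch, again with the `kenlm` Python bindings (path and tokens are placeholders):

```python
import kenlm

model = kenlm.Model("4gram.kenlm.arpa")   # placeholder path

# With bos/eos disabled, score() returns the log10 unigram probability
# of the single token; an out-of-vocabulary token falls back to p(<unk>).
for token in ["the", "zzz_unseen_zzz"]:   # common word vs. guaranteed OOV
    print(token, model.score(token, bos=False, eos=False))
```

In a model built with `--interpolate_unigrams 0`, the OOV line can come out higher than many real words, which is exactly what makes a system prefer to generate `<unk>`.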

ben-8878 commented 3 years ago

@kpu It performs better with `--interpolate_unigrams 0`. I'll close the issue, thank you.