kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Computing perplexity on different sized text corpuses #305

Open tomassykora opened 4 years ago

tomassykora commented 4 years ago

Hello, I'd like to compute perplexity on several text corpora of different sizes using an n-gram model built with KenLM. I found in some old issues that the --vocab_pad parameter should be set to a large number in similar situations, but I'm not sure I understood it correctly or whether it applies to my case.

Can I just build the n-gram model with lmplz using this option and then run query with the resulting model on each corpus, or does something else need to be done? Currently it seems that the bigger the corpus, the higher the perplexity, which makes me wonder whether the normalization by corpus size is being done correctly.
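
For reference, a minimal sketch of how corpus-level perplexity can be accumulated with the kenlm Python bindings, assuming the kenlm module is installed and using hypothetical file names `model.arpa` and `corpus.txt` (tokenized, one sentence per line). Because the log probability is divided by the total number of scored words, perplexity should not grow just because the corpus is larger.

```python
# Sketch only: assumes the kenlm Python module is available and that
# "model.arpa" and "corpus.txt" are hypothetical file names.
import math
import kenlm

model = kenlm.Model("model.arpa")

total_log10 = 0.0   # sum of log10 probabilities over the whole corpus
total_words = 0     # number of scored tokens, including </s> per sentence

with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        sentence = line.strip()
        if not sentence:
            continue
        # score() returns the total log10 probability of the sentence,
        # including the end-of-sentence token.
        total_log10 += model.score(sentence, bos=True, eos=True)
        total_words += len(sentence.split()) + 1  # +1 for </s>

# Perplexity is normalized per word, so corpus size alone should not drive it up.
ppl = 10 ** (-total_log10 / total_words)
print(f"corpus perplexity: {ppl:.2f}")
```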

rnajim commented 2 years ago

Hi, I have the same issue. Any help?