Hello, I'd like to compute perplexity on several different text corpora using a single n-gram model built with KenLM. I found in some old issues that the `--vocab_pad` parameter should be set to a large number in similar situations, but I'm not sure I understood that correctly or whether it applies to my case.
Can I just build the n-gram model with `lmplz` using this option and then run `query` with that model on each text corpus, or does something else need to be done? Currently it seems that the bigger the corpus, the higher the perplexity, which makes me wonder whether the normalization by corpus size is being done correctly.
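For reference, this is the normalization I would expect, written as a minimal Python sketch rather than anything taken from KenLM itself. The function name `corpus_perplexity` and its arguments are hypothetical names of my own. I'm assuming the per-sentence scores are log10 probabilities and that each sentence's token count includes the end-of-sentence token.

```python
def corpus_perplexity(sentence_log10_probs, sentence_token_counts):
    """Corpus-level perplexity from per-sentence log10 probabilities.

    Hypothetical helper, not part of KenLM. Assumes each score is the
    total log10 probability of one sentence and each count is the number
    of scored tokens in that sentence (including end-of-sentence).
    """
    total_log10 = sum(sentence_log10_probs)
    total_tokens = sum(sentence_token_counts)
    if total_tokens == 0:
        raise ValueError("cannot compute perplexity over zero tokens")
    # Divide by the token count so the result is a per-token quantity
    # and does not grow just because the corpus is longer.
    return 10.0 ** (-total_log10 / total_tokens)


# Two corpora with the same average per-token probability (0.1)
# but very different sizes.
small = corpus_perplexity([-3.0], [3])
large = corpus_perplexity([-3.0] * 100, [3] * 100)
print(small, large)
```

If the normalization works like this, corpus size alone should not change the perplexity, so the growth I'm seeing would have to come from something else, for example more out-of-vocabulary words in the larger corpora.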