kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Computing perplexity on different sized text corpuses #305

Open tomassykora opened 4 years ago

tomassykora commented 4 years ago

Hello, I'd like to compute perplexity on several text corpora of different sizes using an n-gram model built with KenLM. I found in some old issues that the --vocab_pad parameter should be set to a large number in similar situations, but I'm not sure I understood it correctly or whether it applies to my case.

Can I just build the n-gram model with lmplz using this option and then run query with the resulting model on each corpus, or does something else need to be done? Currently it seems that the bigger the corpus, the higher the perplexity, which makes me wonder whether the normalization by corpus size is being done correctly.
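
For reference, a minimal sketch of how corpus-level perplexity can be accumulated with the kenlm Python bindings, assuming the kenlm module is installed and using hypothetical file names `model.arpa` and `corpus.txt` (tokenized, one sentence per line). Because the log probability is divided by the total number of scored words, perplexity should not grow just because the corpus is larger.

```python
# Sketch only: assumes the kenlm Python module is available and that
# "model.arpa" and "corpus.txt" are hypothetical file names.
import math
import kenlm

model = kenlm.Model("model.arpa")

total_log10 = 0.0   # sum of log10 probabilities over the whole corpus
total_words = 0     # number of scored tokens, including </s> per sentence

with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        sentence = line.strip()
        if not sentence:
            continue
        # score() returns the total log10 probability of the sentence,
        # including the end-of-sentence token.
        total_log10 += model.score(sentence, bos=True, eos=True)
        total_words += len(sentence.split()) + 1  # +1 for </s>

# Perplexity is normalized per word, so corpus size alone should not drive it up.
ppl = 10 ** (-total_log10 / total_words)
print(f"corpus perplexity: {ppl:.2f}")
```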

rnajim commented 2 years ago

Hi, I have the same issue. Any help?