kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

How are samples padded? #408

Closed MagedSaeed closed 1 year ago

MagedSaeed commented 1 year ago

Greetings @kpu

Another question about KenLM models.

I am working on some experiments where I report the perplexity of statistical LMs (using KenLM) and neural LMs (using GRUs). When the pad token is included in the training loss of the neural LMs, their results are significantly better than the statistical LMs, which is expected. However, when the loss ignores the pad token, the results are worse than the statistical LMs.
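To make the setup concrete, here is a minimal sketch of the two loss variants I am comparing (PyTorch, with a toy vocabulary and a hypothetical `PAD_ID`; in my experiments the logits come from the GRU):

```python
import torch
import torch.nn.functional as F

PAD_ID = 0          # hypothetical pad index
VOCAB_SIZE = 100    # toy vocabulary size

# Toy model outputs and targets; real values would come from the GRU LM.
logits = torch.randn(2, 8, VOCAB_SIZE)            # (batch, seq_len, vocab)
targets = torch.randint(1, VOCAB_SIZE, (2, 8))    # (batch, seq_len)
targets[:, 5:] = PAD_ID                           # last positions are padding

flat_logits = logits.reshape(-1, VOCAB_SIZE)
flat_targets = targets.reshape(-1)

# Loss averaged over every position, padding included.
loss_with_pad = F.cross_entropy(flat_logits, flat_targets)

# Loss averaged over real tokens only; pad positions are masked out.
loss_without_pad = F.cross_entropy(flat_logits, flat_targets, ignore_index=PAD_ID)

print('ppl counting pad:', loss_with_pad.exp().item())
print('ppl ignoring pad:', loss_without_pad.exp().item())
```

With a trained model the pad positions are trivially predictable, so averaging over them deflates the loss and makes the perplexity look better.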

I reviewed your paper and found that samples are padded with <s> at the beginning, possibly more than once. However, it is not clear when this happens and how many padding tokens are added. A screenshot of the relevant statement is below:

[screenshot of the relevant passage from the paper]

Can you please elaborate on this? It matters for perplexity, since the number of these padding tokens affects the probabilities.

kpu commented 1 year ago

Padding with multiple <s> tokens is an implementation detail used for counting n-grams; it does not persist past the first round of lmplz. A sentence probability is always p(A sentence . </s> | <s>)
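As a concrete illustration of that convention via the Python binding (the model path and example sentence below are placeholders):

```python
import kenlm

# Placeholder path; any ARPA/binary model produced by lmplz works here.
model = kenlm.Model('model.arpa')

sentence = 'this is a sentence .'

# With the default bos=True, eos=True this is log10 p(this is a sentence . </s> | <s>).
# No extra <s> padding is involved at query time.
log10_prob = model.score(sentence, bos=True, eos=True)
print('log10 p =', log10_prob)

# Per-token breakdown: one (log10 prob, matched n-gram length, OOV flag) per word,
# plus a final entry for </s>.
for prob, ngram_length, oov in model.full_scores(sentence):
    print(prob, ngram_length, oov)

# perplexity() uses the same score, normalized over the words plus </s>.
print('perplexity =', model.perplexity(sentence))
```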

MagedSaeed commented 1 year ago

Thanks for your reply and clarification