kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

How are samples padded? #408

Closed MagedSaeed closed 1 year ago

MagedSaeed commented 1 year ago

Greetings @kpu

Another question about KenLM models.

I am working on some experiments where I report the perplexity of statistical LMs (using KenLM) and neural LMs (using GRUs). When the pad token is included in the training loss of the neural LMs, their results are significantly better than the statistical LMs, which is expected. However, when the loss ignores the pad token, the results are worse than the statistical LMs.
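To make the setup concrete, here is a minimal sketch of the two loss variants I am comparing (PyTorch, with a toy vocabulary and a hypothetical `PAD_ID`; in my experiments the logits come from the GRU):

```python
import torch
import torch.nn.functional as F

PAD_ID = 0          # hypothetical pad index
VOCAB_SIZE = 100    # toy vocabulary size

# Toy model outputs and targets; real values would come from the GRU LM.
logits = torch.randn(2, 8, VOCAB_SIZE)            # (batch, seq_len, vocab)
targets = torch.randint(1, VOCAB_SIZE, (2, 8))    # (batch, seq_len)
targets[:, 5:] = PAD_ID                           # last positions are padding

flat_logits = logits.reshape(-1, VOCAB_SIZE)
flat_targets = targets.reshape(-1)

# Loss averaged over every position, padding included.
loss_with_pad = F.cross_entropy(flat_logits, flat_targets)

# Loss averaged over real tokens only; pad positions are masked out.
loss_without_pad = F.cross_entropy(flat_logits, flat_targets, ignore_index=PAD_ID)

print('ppl counting pad:', loss_with_pad.exp().item())
print('ppl ignoring pad:', loss_without_pad.exp().item())
```

With a trained model the pad positions are trivially predictable, so averaging over them deflates the loss and makes the perplexity look better.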

I reviewed your paper and found that samples are padded with <s> at the beginning, possibly more than once. However, it is not clear when this happens and how many padding tokens are added. A screenshot of the relevant statement is below:

[screenshot of the relevant passage from the paper]

Can you please elaborate on this? It matters for perplexity, since the number of these padding tokens affects the probabilities.

kpu commented 1 year ago

Padding with multiple <s> tokens is an implementation detail used for counting n-grams; it does not persist past the first round of lmplz. A sentence probability is always p(A sentence . </s> | <s>)
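As a concrete illustration of that convention via the Python binding (the model path and example sentence below are placeholders):

```python
import kenlm

# Placeholder path; any ARPA/binary model produced by lmplz works here.
model = kenlm.Model('model.arpa')

sentence = 'this is a sentence .'

# With the default bos=True, eos=True this is log10 p(this is a sentence . </s> | <s>).
# No extra <s> padding is involved at query time.
log10_prob = model.score(sentence, bos=True, eos=True)
print('log10 p =', log10_prob)

# Per-token breakdown: one (log10 prob, matched n-gram length, OOV flag) per word,
# plus a final entry for </s>.
for prob, ngram_length, oov in model.full_scores(sentence):
    print(prob, ngram_length, oov)

# perplexity() uses the same score, normalized over the words plus </s>.
print('perplexity =', model.perplexity(sentence))
```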

MagedSaeed commented 1 year ago

Thanks for your reply and clarification