Closed — MagedSaeed closed this issue 1 year ago
Multiple <s>
Padding is an implementation detail used to count n-grams; it does not last past the first round of lmplz. A sentence's probability is always p(A sentence . </s> | <s>)
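To illustrate the statement above, here is a minimal sketch with a made-up toy bigram model (the probabilities are invented for illustration, not from KenLM): <s> serves only as conditioning context for the first word and is never itself scored, while </s> is a scored event like any other token.

```python
import math

# Toy bigram probabilities p(word | context); values are made up for illustration.
bigram = {
    ("<s>", "A"): 0.4,
    ("A", "sentence"): 0.5,
    ("sentence", "."): 0.6,
    (".", "</s>"): 0.9,
}

def sentence_logprob(words):
    """log p(w1 ... wn </s> | <s>): <s> is context only, </s> is an event."""
    tokens = words + ["</s>"]
    context = "<s>"
    logp = 0.0
    for w in tokens:
        logp += math.log(bigram[(context, w)])
        context = w
    return logp

lp = sentence_logprob(["A", "sentence", "."])
```

Note that the probability of <s> itself never enters the product; only the four events A, sentence, ., and </s> contribute, which matches p(A sentence . </s> | <s>).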
Thanks for your reply and clarification.
Greetings @kpu
Another question here with KenLM models.
I am working on some experiments where the perplexity of statistical LMs (using KenLM) and neural LMs (using GRUs) should be reported. What I noticed is that when the pad token is included in the training loss of the NLM experiments, the results are significantly better than the statistical LMs, which is expected. However, when the loss ignores the pad token, the results are worse than the statistical LMs.
I reviewed your paper and found that samples are padded with <s> at the beginning, possibly more than once. However, it is not clear when and how many padding tokens to add. A screenshot of this statement is below. Can you please elaborate on this? Perplexity turns out to be sensitive to this choice, since the frequency of these padding tokens affects the probabilities.
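The sensitivity described above can be sketched as follows. This is a hypothetical example, not KenLM or GRU output: the per-token log probabilities are invented, and the pad tokens are assumed to receive log probability 0 (probability 1), as a trained model typically does for trivially predictable padding. Whether those pad positions are counted in the averaging step changes the reported perplexity.

```python
import math

# Hypothetical per-token log probabilities for "A sentence . </s>" (invented values).
content_logprobs = [-1.2, -0.9, -0.5, -0.1]
# Hypothetical pad positions that the model predicts with near certainty (log p = 0).
pad_logprobs = [0.0, 0.0]

def perplexity(logprobs):
    """Perplexity = exp of the negative mean log probability per token."""
    return math.exp(-sum(logprobs) / len(logprobs))

ppl_without_pads = perplexity(content_logprobs)
ppl_with_pads = perplexity(content_logprobs + pad_logprobs)
# Including the near-certain pad tokens in the average dilutes the loss,
# so the reported perplexity drops even though the model is unchanged.
```

This is why the two setups are not directly comparable: counting easy pad tokens in the denominator systematically lowers the NLM's perplexity relative to a KenLM score that only averages over real tokens plus </s>.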