huggingface / pytorch-openai-transformer-lm

🐥A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI

Encoder paddings influence results? #45

Open OanaMariaCamburu opened 6 years ago

OanaMariaCamburu commented 6 years ago

Hi,

I noticed that if I just increase n_ctx (it is 77 in https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/train.py#L214, and I tried several larger values), I get different results. For example:

- n_ctx=77 (unmodified): ROCStories Valid Accuracy 90.37, Test Accuracy 86.00
- n_ctx=100: ROCStories Valid Accuracy 90.11, Test Accuracy 86.58
- n_ctx=200: ROCStories Valid Accuracy 91.18, Test Accuracy 86.10

That is almost a 1% difference on the validation set and a 0.58% difference on the test set. Running twice with the same n_ctx gives identical results, so the differences are not coming from randomness between runs.
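
Here is a tiny standalone sketch of what I suspect is going on (the numbers and variable names are mine, not from the repo): if the attention softmax is not masked over padding positions, the padding columns soak up some probability mass, and how much they soak up depends on how many of them there are, so the same example encoded with a larger n_ctx produces slightly different activations.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Attention scores of one real token over 5 real positions (arbitrary values).
real_scores = torch.randn(5)

# Extra padding columns; with no padding mask they still enter the softmax.
for n_pad in (0, 23, 123):  # purely illustrative padding counts
    scores = torch.cat([real_scores, torch.zeros(n_pad)])
    weights = F.softmax(scores, dim=0)
    print(f"n_pad={n_pad:3d}  attention mass on the real tokens: {weights[:5].sum().item():.3f}")
```

The mass left on the real tokens shrinks as the number of unmasked padding columns grows, which would explain why the accuracy moves when only n_ctx changes.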

I also couldn't (quickly) find the code that sets the attention scores corresponding to padding positions to -INF. I could only find the mask at https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/model_pytorch.py#L87, which prevents the decoder from looking ahead, but nothing that prevents the attention from going over the paddings.
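
For reference, this is the kind of mask I was expecting to find; a minimal sketch of how a padding mask could be combined with the existing causal mask (the helper name, the `lengths` argument and the wiring are my own assumptions, not code from this repo):

```python
import torch

def combined_attention_mask(lengths, n_ctx):
    """Causal (no look-ahead) mask multiplied by a padding mask.

    lengths: (batch,) tensor with the number of real tokens per example
    returns: (batch, 1, n_ctx, n_ctx) tensor of 1s where attention is allowed
    """
    # Causal part: position i may only attend to positions <= i
    # (same shape as the lower-triangular buffer registered in model_pytorch.py).
    causal = torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx)
    # Padding part: key positions >= length are padding and get masked out.
    positions = torch.arange(n_ctx).view(1, 1, 1, n_ctx)
    padding = (positions < lengths.view(-1, 1, 1, 1)).float()
    return causal * padding

# Inside the attention one would then do, analogously to the causal-only line:
#   w = w * mask + -1e9 * (1 - mask)
# before the softmax.
```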

As a side question, I was wondering about the choice of -1e9 for -INF: could its magnitude be too small, so that the model still gets a tiny bit of information from the positions ahead?
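
To make that concrete, a tiny check (plain PyTorch, not the repo's code) suggests that with scores in a normal float32 range, -1e9 already pushes the softmax weight all the way to zero:

```python
import torch
import torch.nn.functional as F

# One row of attention scores where the last position was masked with -1e9.
scores = torch.tensor([2.0, 1.0, -1e9])
print(F.softmax(scores, dim=0))
# tensor([0.7311, 0.2689, 0.0000])  -> the masked position contributes nothing,
# unless the unmasked scores themselves ever grew to a comparable magnitude.
```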

Thanks, Oana