OanaMariaCamburu opened this issue 6 years ago
Hi,
In the article, the authors use the transpose of the embedding matrix as the linear layer just before the softmax. This explains the shape of the softmax layer.
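A minimal sketch of what this tying implies, assuming the usual OpenAI GPT sizes (n_vocab=40478, n_special=3, n_ctx=77, n_embd=768; the variable names are mine, not the repo's exact code): because the embedding matrix also holds the special and positional embeddings, the tied decoder necessarily produces n_vocab + n_special + n_ctx logits.

```python
import torch
import torch.nn as nn

# Sketch of weight tying: `embed` stores token, special, AND positional embeddings,
# so tying the decoder to it yields logits over n_vocab + n_special + n_ctx slots.
n_vocab, n_special, n_ctx, n_embd = 40478, 3, 77, 768  # assumed values
embed = nn.Embedding(n_vocab + n_special + n_ctx, n_embd)

decoder = nn.Linear(n_embd, n_vocab + n_special + n_ctx, bias=False)
decoder.weight = embed.weight  # tied: the decoder is the transpose of the embedding matrix

h = torch.randn(1, 5, n_embd)  # hidden states from the transformer
lm_logits = decoder(h)
print(lm_logits.shape)         # torch.Size([1, 5, 40558]) = n_vocab + n_special + n_ctx
```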
Thanks for the answer, @rodgzilla, but that shouldn't be the main reason. To tie the weights, we could just use the sub-part of the embedding matrix that corresponds to the n_vocab tokens. We never want to output positional tokens, and a quick check shows that the model is putting significant probability on them:
output1 tensor([[[ -6.4144, -2.8540, -11.5565, ..., 4.0194, 3.9853, 3.0146],
[ -8.3058, -7.6292, -19.2639, ..., 1.0065, 1.1153, 0.5842],
[ -7.3526, -5.5124, -16.6829, ..., 1.7537, 1.4622, 1.0649],
...,
[ 8.8013, 1.9527, -15.8394, ..., 1.5357, 1.5170, 1.4211],
[ 8.7922, 1.9531, -15.8468, ..., 1.5477, 1.5284, 1.4295],
[ 8.7885, 1.9515, -15.8496, ..., 1.5536, 1.5351, 1.4346]]],
device='cuda:0')
output1 is the output of the LMHead when the loaded pretrained model is run on the sentence "An apple is a fruit" with n_ctx=64.
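For reference, a check along these lines can be reproduced with something like the following (a hypothetical sketch, not the exact script that produced output1; the random tensor just stands in for the real lm_logits):

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the "quick check": measure how much probability mass
# the LM head puts on the last n_ctx (positional) slots, which should never be predicted.
n_vocab, n_special, n_ctx = 40478, 3, 64                    # n_ctx=64 as in the run above
lm_logits = torch.randn(1, 6, n_vocab + n_special + n_ctx)  # stand-in for output1

probs = F.softmax(lm_logits, dim=-1)
pos_mass = probs[..., n_vocab + n_special:].sum(dim=-1)     # mass on positional slots
print(pos_mass)  # non-negligible values mean probability is wasted on positions
```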
The problem is that adding these n_ctx logits to the output vocab creates an incorrect dependency on n_ctx. This is also why I get different results when setting n_ctx to different, larger values. For example, n_ctx ends up being 77 at https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/train.py#L214; when I try values larger than that, I get different results:
with n_ctx=77 (unmodified): ROCStories Valid Accuracy: 90.37, ROCStories Test Accuracy: 86.00
with n_ctx=100: ROCStories Valid Accuracy: 90.11, ROCStories Test Accuracy: 86.58
with n_ctx=200: ROCStories Valid Accuracy: 91.18, ROCStories Test Accuracy: 86.10
That is almost a 1% difference on the validation set, and 0.58% on the test set. Running twice with the same n_ctx gives the same result, so the differences don't seem to come from any source other than n_ctx. I've reported this separately in https://github.com/huggingface/pytorch-openai-transformer-lm/issues/45#issue-382624469, and I believe it is due to the output vocabulary containing the positional embeddings.
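One way to see why the extra columns could matter (my own illustration, not code from the repo): the same real-token logits get different probabilities once additional positional columns enter the softmax, so the LM loss used during fine-tuning shifts with n_ctx.

```python
import torch
import torch.nn.functional as F

# Illustration (not from the repo): identical real-token logits receive different
# probabilities when more positional columns are appended before the softmax,
# so losses computed over this output vocabulary depend on n_ctx.
real_logits = torch.tensor([2.0, 1.0, 0.5])          # logits for three actual tokens

for extra in (77, 100, 200):                         # pretend these are the n_ctx slots
    padded = torch.cat([real_logits, torch.full((extra,), -1.0)])
    p = F.softmax(padded, dim=-1)[:3]
    print(extra, [round(x, 4) for x in p.tolist()])  # probabilities shrink as n_ctx grows
```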
I will look soon into whether taking the subset of the matrix solves the n_ctx dependency problem and let you know.
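In case it helps, here is a sketch of what I mean by taking the subset (hypothetical, not tested on the actual model; the names are mine): slice the tied weight to the first n_vocab + n_special rows so the positional slots never appear in the output distribution, and the softmax size no longer depends on n_ctx.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the proposed fix: project hidden states only onto the
# rows of the embedding matrix for real + special tokens, dropping positional rows.
n_vocab, n_special, n_ctx, n_embd = 40478, 3, 77, 768  # assumed values
embed = nn.Embedding(n_vocab + n_special + n_ctx, n_embd)

h = torch.randn(1, 5, n_embd)                    # transformer hidden states
w_tokens = embed.weight[: n_vocab + n_special]   # still tied: a view into embed.weight
lm_logits = torch.matmul(h, w_tokens.t())        # shape: (1, 5, n_vocab + n_special)
print(lm_logits.shape)                           # independent of n_ctx
```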
Best, Oana
Hi,
I was wondering why the output softmax has dimension n_vocab + n_special + n_ctx as opposed to just n_vocab + n_special? We don't really need to output "tokens" for the positional encodings, do we? I also had a look at some outputs and the values on the last n_ctx lm_logits were not negligible. Thanks!