openai / finetune-transformer-lm

Code and model for the paper "Improving Language Understanding by Generative Pre-Training"
https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
MIT License

How to deal with logits from position indices in the output layer? #22

Open xiaoda99 opened 6 years ago

xiaoda99 commented 6 years ago

Dear guys,

I found that the position embeddings are concatenated with the word embeddings into a single embedding matrix (https://github.com/openai/finetune-transformer-lm/blob/bd1cf7d678926041e6d19193cab7e5cd8ce2fce6/train.py#L411), and the output layer shares weights with this embedding matrix (https://github.com/openai/finetune-transformer-lm/blob/bd1cf7d678926041e6d19193cab7e5cd8ce2fce6/train.py#L176), so it outputs logits for both word indices and position indices.
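To make sure I am reading this right, here is a tiny sketch of that weight tying (made-up NumPy code with roughly GPT-1-sized dimensions, not the actual train.py):

```python
import numpy as np

# Shared matrix: the first n_vocab + n_special rows are word/special-token
# embeddings, the last n_ctx rows are position embeddings (sizes illustrative).
n_vocab, n_special, n_ctx, n_embd = 40478, 3, 512, 768
we = 0.02 * np.random.randn(n_vocab + n_special + n_ctx, n_embd)

# The output layer reuses the same matrix, so the logits vector has one entry
# per row of `we`, including the position rows, which can never be a valid
# next token.
h = np.random.randn(n_embd)        # final hidden state for one token
logits = we @ h                    # shape: (n_vocab + n_special + n_ctx,)
position_logits = logits[n_vocab + n_special:]
```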

My questions are:

  1. During LM pretraining, did you mask out the logits at those position indices when computing the loss?
  2. If I use the pretrained model as an LM to generate text, do I need to mask out the logits at these position indices before the softmax when sampling the next word?

BTW, I am using the PyTorch port by Hugging Face: https://github.com/huggingface/pytorch-openai-transformer-lm. FYI, I also posted an issue there describing some details of my experiments: https://github.com/huggingface/pytorch-openai-transformer-lm/issues/36

madisonmay commented 5 years ago

@xiaoda99 In https://github.com/IndicoDataSolutions/finetune/blob/development/finetune/base.py#L544 we found that masking out the logits for the positional embedding indices helped produce more reasonable generated text.
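Concretely, that masking looks roughly like this (an illustrative sketch, not the exact code in the linked repo): set the position and special-token logits to -inf before the softmax so sampling can only pick real word indices.

```python
import numpy as np

def sample_next_token(logits, n_vocab, temperature=1.0, rng=np.random):
    """logits: shape (n_vocab + n_special + n_ctx,); indices >= n_vocab are
    special tokens and positions, which should never be sampled as words."""
    masked = logits.astype(float)
    masked[n_vocab:] = -np.inf                     # mask special + position logits
    z = (masked - masked[:n_vocab].max()) / temperature
    probs = np.exp(z)                              # masked entries become 0
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```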