huggingface / pytorch-openai-transformer-lm

🐥A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI

How does position embedding implementation work? #44

Closed · bcserna closed this issue 6 years ago

bcserna commented 6 years ago

There's the TransformerModel's forward method, and I just can't wrap my head around the position embedding part (and I might be wrong about the other steps too). As far as I can tell, step by step it goes like this:

  1. Reshape our input to have 3 dimensions -> [ ? x sequences (?) x tokens (512) ]
  2. Get the individual token embeddings -> [ ? x sequences (?) x tokens (512) x emb_dim (768) ]
  3. Sum up those embeddings along axis 2 (summing token embeddings element-wise for each sequence?) -> [ ? x sequences x emb_dim (768) ]
  4. Shouldn't we have [ sequences x tokens (512) x emb_dim (768) ] here?
def forward(self, x):
    x = x.view(-1, x.size(-2), x.size(-1))
    e = self.embed(x)
    # Add the position information to the input embeddings
    h = e.sum(dim=2)
    for block in self.h:
        h = block(h)
    return h

My questions are the ones marked above, mainly: what exactly does the sum along axis 2 do, and where does the position information actually come from?

Thank you in advance!

rodgzilla commented 6 years ago

As you can see in train.py, x is created with the following shape:

xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)

So, before the reshaping, x has shape (n_batch, n_sequence, n_tokens, seq_or_pos), where the last axis holds the token id in slot 0 and the position id in slot 1.

Indeed, in this version of the transformer network, the positional embeddings are learned, so positions in the input sequence are treated just like normal tokens and have a corresponding embedding in the embedding matrix. You can see that the position embeddings are located at the end of the embedding table (starting at index n_vocab + n_special).
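For concreteness, here is a minimal numpy sketch of that idea (the sizes n_vocab = 40478, n_special = 3 and n_ctx = 77 are only illustrative, and only the position slot is shown, not the token filling):

import numpy as np

n_vocab, n_special, n_ctx = 40478, 3, 77   # illustrative sizes
n_batch = 4

# Last axis: slot 0 holds the token id, slot 1 holds the position id.
xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)

# Token ids for each sequence would be written into slot 0 here.

# Position ids are offset past the token and special-token ids, so they
# index into the tail of the shared embedding table.
xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)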

So, if we analyse the forward method line by line we get:

x = x.view(-1, x.size(-2), x.size(-1))

We first flatten (remove) the n_sequence dimension, since inference on each sequence of an input is independent. We get a tensor of shape (n_batch * n_sequence, n_tokens, seq_or_pos).
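A quick shape check of that line, as a small sketch (the sizes 4, 2 and 77 are just illustrative):

import torch

n_batch, n_sequence, n_tokens = 4, 2, 77
x = torch.zeros(n_batch, n_sequence, n_tokens, 2, dtype=torch.long)

x = x.view(-1, x.size(-2), x.size(-1))
print(x.shape)  # torch.Size([8, 77, 2]), i.e. (n_batch * n_sequence, n_tokens, seq_or_pos)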

e = self.embed(x)

We fetch the embeddings for the tokens AND the positions at the same time, so we get a tensor of shape (n_batch * n_sequence, n_tokens, seq_or_pos, dim_emb), with dim_emb being the dimension of the embedding vectors (here 768).

h = e.sum(dim=2)

Then, as described in the research paper, we simply add each token embedding to its corresponding position embedding. We get a tensor of shape (n_batch * n_sequence, n_tokens, dim_emb) that will be the input to the transformer blocks (Block).
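To check that the sum over dim 2 really is "token embedding + position embedding", here is a small self-contained sketch (the embedding sizes are illustrative, and a plain nn.Embedding table stands in for self.embed):

import torch
import torch.nn as nn

n_vocab, n_special, n_ctx, dim_emb = 40478, 3, 77, 768

# One table holds the token, special-token and position embeddings.
embed = nn.Embedding(n_vocab + n_special + n_ctx, dim_emb)

x = torch.zeros(8, n_ctx, 2, dtype=torch.long)  # (n_batch * n_sequence, n_tokens, seq_or_pos)
x[:, :, 0] = torch.randint(0, n_vocab, (8, n_ctx))                           # token ids
x[:, :, 1] = torch.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)  # position ids

e = embed(x)       # (8, n_ctx, 2, dim_emb)
h = e.sum(dim=2)   # (8, n_ctx, dim_emb)

# Summing over dim 2 adds each token embedding to its position embedding.
assert torch.allclose(h, e[:, :, 0, :] + e[:, :, 1, :])
print(e.shape, h.shape)  # torch.Size([8, 77, 2, 768]) torch.Size([8, 77, 768])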

The rest of the function is just the application of the blocks to the input so I won't detail it.

bcserna commented 6 years ago

Great explanation, it's clear now, thank you!