I followed what tensor2tensor did. https://github.com/tensorflow/tensor2tensor/blob/v.1.12.0/tensor2tensor/models/transformer.py#L1209
Actually, it does not change the size like that. Line 282 removes the last word in the sequence, but line 283 also adds a padding value at the first position of the sequence. When I insert print statements, the output looks like this:
print(target_embedded.size()) # torch.Size([30, 33, 512])
target_embedded = target_embedded[:, :-1]
print(target_embedded.size()) # torch.Size([30, 32, 512])
target_embedded = F.pad(target_embedded, (0, 0, 1, 0))
print(target_embedded.size()) # torch.Size([30, 33, 512])
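To make this explicit, here is a minimal, self-contained sketch (not the repository's code; shift_right is just an illustrative name) showing that both the slice and the pad act on the sequence dimension, since F.pad reads its padding tuple starting from the last dimension:

import torch
import torch.nn.functional as F

def shift_right(target_embedded):
    # Drop the last time step: (batch, seq_len, d_model) -> (batch, seq_len - 1, d_model).
    shifted = target_embedded[:, :-1]
    # F.pad's pad tuple is read from the last dimension backwards:
    # (0, 0) leaves dim -1 (embedding) untouched, and (1, 0) prepends one
    # zero vector along dim -2 (the sequence dimension).
    return F.pad(shifted, (0, 0, 1, 0))

x = torch.randn(30, 33, 512)
print(shift_right(x).size())  # torch.Size([30, 33, 512]); the batch dimension is untouched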
The reason why we remove the last word is that we don't need to predict the word after the last word. For example, consider the case where we use "abcd" as the input. Then we have to predict a from nothing, and we have nothing to predict from d. For this, we add a padding value in line 283 and remove the last word in line 282 to eliminate unnecessary computation.
input                     output
| <pad> | --(predict)--> | a |
| a     | --(predict)--> | b |
| b     | --(predict)--> | c |
| c     | --(predict)--> | d |
| d     |  (removed: nothing left to predict)
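As a concrete illustration of the table above, here is a small sketch assuming integer token ids with <pad> = 0 (the tensor values and variable names are made up for illustration): the decoder input is the target shifted one step to the right, so position i of the input is used to predict position i of the target.

import torch
import torch.nn.functional as F

target = torch.tensor([[1, 2, 3, 4]])          # token ids for "a b c d"
decoder_input = F.pad(target[:, :-1], (1, 0))  # [[0, 1, 2, 3]] = "<pad> a b c"
# Position i of decoder_input predicts position i of target:
# <pad> -> a, a -> b, b -> c, c -> d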
Got it, and I appreciate your quick reply!
https://github.com/tunz/transformer-pytorch/blob/5cf29c06ebd067e0e0274904b386ae388a191745/model/transformer.py#L282
I noticed that you slice on the second dimension but pad on the first dimension (top padding).
For example, if the size of target_embedded is 100x100, then after doing this its size would be 101x99.
I am a little confused.