I followed what tensor2tensor did. https://github.com/tensorflow/tensor2tensor/blob/v.1.12.0/tensor2tensor/models/transformer.py#L1209
Actually, it does not change the size like that. Line 282 removes the last word in the sequence, but line 283 also adds a padding value at the first position of the sequence. When I insert print statements, the output looks like this:
print(target_embedded.size()) # torch.Size([30, 33, 512])
target_embedded = target_embedded[:, :-1]
print(target_embedded.size()) # torch.Size([30, 32, 512])
target_embedded = F.pad(target_embedded, (0, 0, 1, 0))
print(target_embedded.size()) # torch.Size([30, 33, 512])
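To make this explicit, here is a minimal, self-contained sketch (not the repository's code; shift_right is just an illustrative name) showing that both the slice and the pad act on the sequence dimension, since F.pad reads its padding tuple starting from the last dimension:

import torch
import torch.nn.functional as F

def shift_right(target_embedded):
    # Drop the last time step: (batch, seq_len, d_model) -> (batch, seq_len - 1, d_model).
    shifted = target_embedded[:, :-1]
    # F.pad's pad tuple is read from the last dimension backwards:
    # (0, 0) leaves dim -1 (embedding) untouched, and (1, 0) prepends one
    # zero vector along dim -2 (the sequence dimension).
    return F.pad(shifted, (0, 0, 1, 0))

x = torch.randn(30, 33, 512)
print(shift_right(x).size())  # torch.Size([30, 33, 512]); the batch dimension is untouched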
The reason why we remove the last word is that we don't need to predict the word after the last word. For example, consider the case where we use "abcd" as the input. Then we have to predict a from nothing, and we have nothing to predict from d. For this, we add a padding value in line 283 and remove the last word in line 282 to eliminate unnecessary computation.
input                     output
| <pad> | --(predict)--> | a |
| a     | --(predict)--> | b |
| b     | --(predict)--> | c |
| c     | --(predict)--> | d |
| d     |  (removed: nothing left to predict)
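As a concrete illustration of the table above, here is a small sketch assuming integer token ids with <pad> = 0 (the tensor values and variable names are made up for illustration): the decoder input is the target shifted one step to the right, so position i of the input is used to predict position i of the target.

import torch
import torch.nn.functional as F

target = torch.tensor([[1, 2, 3, 4]])          # token ids for "a b c d"
decoder_input = F.pad(target[:, :-1], (1, 0))  # [[0, 1, 2, 3]] = "<pad> a b c"
# Position i of decoder_input predicts position i of target:
# <pad> -> a, a -> b, b -> c, c -> d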
Got it, and I appreciate your quick reply!
https://github.com/tunz/transformer-pytorch/blob/5cf29c06ebd067e0e0274904b386ae388a191745/model/transformer.py#L282
I noticed that you slice on the second dimension but pad on the first dimension (top padding).
For example, if the size of target_embedded is 100x100, then after doing this its size would be 101x99.
I am a little confused.