@V3RGANz for the longest time I had the suspicion that there was something going on with the mechanisms related to tgt_mask. What exactly was wrong with generate_square_subsequent_mask, by the way?
I am still getting repeated tokens even during the training phase. Perhaps the issue for others is that they also get repeated tokens during training, in which case there is no real way for the model to perform correctly at inference.
@mathematicsofpaul a short example:
torch-1.3.0
>>> torch.nn.Transformer().generate_square_subsequent_mask(5)
tensor([[0., 0., 0., 0., 0.],
[-inf, 0., 0., 0., 0.],
[-inf, -inf, 0., 0., 0.],
[-inf, -inf, -inf, 0., 0.],
[-inf, -inf, -inf, -inf, 0.]])
torch-1.6.0+cu101
>>> torch.nn.Transformer().generate_square_subsequent_mask(5)
tensor([[0., -inf, -inf, -inf, -inf],
[0., 0., -inf, -inf, -inf],
[0., 0., 0., -inf, -inf],
[0., 0., 0., 0., -inf],
[0., 0., 0., 0., 0.]])
I think this was caused by some internal representation property of ByteTensor in torch 1.3.0, because we have this function:
def generate_square_subsequent_mask(self, sz):
    r"""Generate a square mask for the sequence. The masked positions are filled with float('-inf').
    Unmasked positions are filled with float(0.0).
    """
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask
In the first line we have .transpose(0, 1), which in torch 1.3.0 in fact had no effect:
>>> (torch.triu(torch.ones(5, 5)) == 1) == 0
tensor([[False, False, False, False, False],
[ True, False, False, False, False],
[ True, True, False, False, False],
[ True, True, True, False, False],
[ True, True, True, True, False]])
>>> (torch.triu(torch.ones(5, 5)) == 1).transpose(0, 1) == 0
tensor([[False, False, False, False, False],
[ True, False, False, False, False],
[ True, True, False, False, False],
[ True, True, True, False, False],
[ True, True, True, True, False]])
In the latest torch everything works as expected.
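As a quick sanity check (not from the original comment), you can print the mask produced by your installed version and compare it against a version-independent construction built directly with torch.triu; this is just a sketch of the expected pattern, not library code:

import torch
import torch.nn as nn

# What the installed version produces:
print(nn.Transformer().generate_square_subsequent_mask(5))

# A version-independent equivalent: -inf strictly above the diagonal, 0.0 on and below it,
# so position i can only attend to positions <= i.
def causal_mask(sz):
    return torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

print(causal_mask(5))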
@V3RGANz thank you for that. I have been reading quite a lot about why I and many others are still getting repeated tokens when using nn.Transformer. Would you have any clue where to start? I am using the most up-to-date PyTorch build too.
@mathematicsofpaul if, as you mentioned, you get repeated tokens during the training phase, you should:
- make a dataset with 1-2 simple, small batches that are easy to visualize
- try to overfit your model on one batch, to make sure the problem is caused by the model (a rough sketch of such a check follows below)
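For illustration only, here is one way such a single-batch overfitting check could look; the model, sizes and learning rate are all made up, and positional encoding is omitted for brevity:

import torch
import torch.nn as nn

# A tiny throwaway seq2seq model, just for the sanity check (hypothetical sizes).
vocab, d_model = 50, 32
emb = nn.Embedding(vocab, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2, num_decoder_layers=2)
proj = nn.Linear(d_model, vocab)
params = list(emb.parameters()) + list(transformer.parameters()) + list(proj.parameters())

src = torch.randint(vocab, (7, 2))   # one small fixed batch, shape (S, N)
tgt = torch.randint(vocab, (5, 2))   # shape (T, N)

optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.CrossEntropyLoss()

for step in range(500):              # train on the same batch over and over
    optimizer.zero_grad()
    dec_in = tgt[:-1]                # teacher forcing: decoder input is the target shifted by one
    tgt_mask = transformer.generate_square_subsequent_mask(dec_in.size(0))
    logits = proj(transformer(emb(src), emb(dec_in), tgt_mask=tgt_mask))
    loss = criterion(logits.reshape(-1, vocab), tgt[1:].reshape(-1))
    loss.backward()
    optimizer.step()

# If this loss does not go to ~0 on a single fixed batch, the problem is in the model or the
# masking rather than in the data.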
Lately, I just got Transformer (Seq2Seq) working for a dialogue generation task. Hope this helps. Tested on PyTorch 1.2.0.
class TransformerSeq2Seq(nn.Module):
    def __init__(self):
        super(TransformerSeq2Seq, self).__init__()
        self.embedding = nn.Embedding(VOCAB_SIZE, INPUT_DIM)
        self.pos_encoder = PositionalEncoding(INPUT_DIM, dropout)
        encoder_layer = nn.TransformerEncoderLayer(d_model=INPUT_DIM, nhead=NUM_HEADS)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=NUM_LAYERS)
        decoder_layer = nn.TransformerDecoderLayer(d_model=INPUT_DIM, nhead=NUM_HEADS)
        self.transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=NUM_LAYERS)
        self.linear = nn.Linear(INPUT_DIM, VOCAB_SIZE)
        self.softmax = nn.Softmax(dim=-1)

    def get_mask(self, sz):
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def forward(self, src, tgt, src_key_padding_mask=None, tgt_key_padding_mask=None):
        src = self.embedding(src)
        src = self.pos_encoder(src)
        src = self.transformer_encoder(src, src_key_padding_mask=src_key_padding_mask)

        tgt_mask = self.get_mask(tgt.size(0)).to(device)
        tgt = self.embedding(tgt)
        tgt = self.pos_encoder(tgt)
        output = self.transformer_decoder(
            tgt=tgt,
            memory=src,
            tgt_mask=tgt_mask,                            # to avoid looking at the future tokens (the ones on the right)
            tgt_key_padding_mask=tgt_key_padding_mask,    # to avoid working on padding
            memory_key_padding_mask=src_key_padding_mask  # to avoid looking at padding of the src
        )
        output = self.linear(output)
        return output

    def generate(self, src, src_key_padding_mask=None):
        ''' src has dimension of LEN x 1 '''
        src = self.embedding(src)
        src = self.pos_encoder(src)
        src = self.transformer_encoder(src, src_key_padding_mask=src_key_padding_mask)

        inputs = [sos_idx]
        for i in range(MAX_TGT_LEN):
            tgt = torch.LongTensor([inputs]).view(-1, 1).to(device)
            tgt_mask = self.get_mask(i + 1).to(device)
            tgt = self.embedding(tgt)
            tgt = self.pos_encoder(tgt)
            output = self.transformer_decoder(
                tgt=tgt,
                memory=src,
                tgt_mask=tgt_mask,
                memory_key_padding_mask=src_key_padding_mask)
            output = self.linear(output)
            output = self.softmax(output)
            output = output[-1]  # the last timestep
            values, indices = output.max(dim=-1)
            pred_token = indices.item()
            inputs.append(pred_token)
        return inputs[1:]
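For reference, a minimal usage sketch of the class above (not part of the original post); the hyperparameters and special-token values are hypothetical, and it assumes the PositionalEncoding class shown further down this thread is already defined:

import torch
import torch.nn as nn

# Hypothetical globals used inside TransformerSeq2Seq.
VOCAB_SIZE, INPUT_DIM, NUM_HEADS, NUM_LAYERS = 1000, 64, 4, 2
dropout = 0.1
device = torch.device('cpu')
sos_idx, MAX_TGT_LEN = 1, 20

model = TransformerSeq2Seq().to(device)

# Teacher-forced training step: src is (S, N) and tgt is (T, N) of token ids.
src = torch.randint(VOCAB_SIZE, (10, 2)).to(device)
tgt = torch.randint(VOCAB_SIZE, (12, 2)).to(device)
logits = model(src, tgt[:-1])        # decoder input: tokens 0..T-2
loss = nn.CrossEntropyLoss()(logits.reshape(-1, VOCAB_SIZE), tgt[1:].reshape(-1))
loss.backward()

# Greedy decoding for a single source sequence of shape (LEN, 1).
generated = model.generate(src[:, :1])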
As per @V3RGANz's suggestion, I have uploaded a small example that trains on purely one sample. At the moment, the model seems to be giving basically repeated tokens during the training phase (the slight differences are likely only due to the dropout).
So basically this example attempts to copy and paste the inputs from the decoder, where the decoder inputs are simple arrays of random integers from 0 to 100. Since the tokens predicted in the output are so similar, I suspect there is something wrong with the mechanisms involved in masking "future values" on the decoder end.
In terms of the overall model, it is actually for the purpose of time series generation, and so:
- I have left out the embedding component since it is a time series/number array already,
- left out the softmax layer and replaced it with a linear layer outputting from 42 to 42 dimensions,
- swapped out the cross entropy loss for nn.MSELoss (the problem should not be here since both loss functions are really "nearness" evaluators).
If anyone has any suggestions, that would be great! I am really keen on getting it to work and eventually sharing the code for everyone else to use on their own time series. (A sketch of this kind of setup follows below.)
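For anyone attempting the same kind of setup, here is a rough sketch (with made-up dimensions, not the code from the attached example) of a continuous-input transformer where a Linear projection replaces the embedding, MSELoss replaces cross entropy, and the decoder input is the target shifted right by one step; positional encoding is omitted for brevity:

import torch
import torch.nn as nn

class TimeSeriesTransformer(nn.Module):
    """Continuous inputs: a Linear projection stands in for nn.Embedding."""
    def __init__(self, n_features=42, d_model=42, nhead=7, num_layers=2):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers)
        self.output_proj = nn.Linear(d_model, n_features)

    def forward(self, src, tgt):
        # src: (S, N, n_features), tgt: (T, N, n_features)
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(0)).to(tgt.device)
        out = self.transformer(self.input_proj(src), self.input_proj(tgt), tgt_mask=tgt_mask)
        return self.output_proj(out)

# Toy copy check: the decoder input is the target shifted right by one step.
model = TimeSeriesTransformer()
src = torch.rand(8, 1, 42)
series = torch.rand(5, 1, 42)
decoder_in = torch.cat([torch.zeros(1, 1, 42), series[:-1]], dim=0)
pred = model(src, decoder_in)
loss = nn.MSELoss()(pred, series)
loss.backward()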
We have a BERT example for TransformerEncoder (link). I will work on a TransformerDecoder example later in October. I think a translation task could be a good topic for decoder example. Feel free to contribute to pytorch/text repo under examples folder.
@zhangguanheng66 Just to get a better understanding, since I'm seeing a lot of Encoder examples, does the BERT TransformerEncoder have both the encoder and decoder components of the Transformer? In other words, does it do teacher forcing? Thanks Zhang!
No, it doesn't have decoder component so that's why I want to have a translation task, which is a good application for decoder model. If someone would like to contribute a PR for that, I'm very happy to review (I'm just very busy for resetting torchtext, which will be released in October).
@mathematicsofpaul BERT does not have the decoder component of Transformers. It uses only the encoder part, and it is not suitable for decoding because on every timestep t, it has access to tokens at t-1 and t+1.
No, it doesn't have decoder component so that's why I want to have a translation task, which is a good application for decoder model. If someone would like to contribute a PR for that, I'm very happy to review (I'm just very busy for resetting torchtext, which will be released in October).
I wanna help with that. Where do I start?
You can contribute the PR next to BERT example in torchtext/examples. And here is an active PR for translation task but using RNN. https://github.com/pytorch/text/pull/864
Thank you. I'll look into it.
Hi! I am using nn.Transformer for time series predictions based on "Sequence-to-Sequence Modeling with nn.Transformer and torchtext", and I am wondering why the decoder is only a linear layer. Why don't they use the whole Transformer, i.e. TransformerEncoder + TransformerDecoder, for the task? If someone could help me with this I would really appreciate it a lot!
That "decoder" is not we are talking about here. In the word language model, embedding layer is considered as "encoder" while the last layer (or "decoder") is to project embedding to the word id. See figure 1 in the BERT paper (link).
@zhangguanheng66 Would you know of any complete implementations of the vanilla transformer in PyTorch aside from the annotated transformer from Harvard?
@mathemage People probably referred to this tutorial before. We have another BERT example, which trains the model from scratch and fine-tune the model for question-answer task.
@zhangguanheng66 thanks again. In an earlier comment, I mentioned that I was getting repeated tokens for my copy and paste task with nn.MSELoss; would you have a clue as to where to look to identify this bug? Seeing as there have been very few successful implementations using the "Attention is all you need" Transformer (the nn.Transformer module), could it be an inherent bug in the nn.Transformer module that is causing these repeated tokens?
It is a super simple copy and pasting of vectors task, and for some reason I am still getting repeated tokens.
@zhangguanheng66 could you explain in more detail for which tasks the transformer decoder is useful and when it is not?
my current understanding is that you need the decoder for autoregressive tasks where your output at time step t is dependent on or should be consistent with the previous outputs at steps 1,...,t-1. However, I feel like this is the case for any sequence-to-sequence task. In particular, why is it not used for the language modelling task in the tutorial, shouldn't knowledge of the previous words in a predicted sequence also help there?
For those who are getting repeated tokens, it could be that the issue lies in the nn.TransformerDecoder module. Here is an example where the decoder gives repeated tokens for a copy & paste task. (There is no fully connected layer at the end of this one, since I wanted to show that it was strictly the decoder that was spitting out repeated tokens.) When you run it, it will give out repeated tokens from the second epoch till the end of training.

import torch
import torch.nn as nn
import math

pos_encoder = PositionalEncoding(d_model=42, dropout=0.0)

torch.manual_seed(0)
memory = torch.rand(4, 1, 42)            # src: (S, N, E) https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html
tgt = pos_encoder(torch.rand(4, 1, 42))  # tgt: (T, N, E)
tgtmask = torch.nn.Transformer().generate_square_subsequent_mask(4).float()

decoder_layer = nn.TransformerDecoderLayer(d_model=42, nhead=7, dropout=0.0)
transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(transformer_decoder.parameters(), lr=0.05)

num_epochs = 100
transformer_decoder.train()

for epoch in range(num_epochs):
    out = transformer_decoder(tgt, memory, tgt_mask=tgtmask)
    loss = criterion(out.permute(1, 0, 2), tgt.permute(1, 0, 2))  # (N, other dimensions)
    print("predicted", out)
    print("target", tgt)

    # zero out the "old" parameter gradients and backprop
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch + 1} | Loss for this epoch: {loss.item():.4f}')
Something happened with the code posting formatting sorry.
@zhangguanheng66 could you explain in more detail for which tasks the transformer decoder is useful and when it is not?
my current understanding is that you need the decoder for autoregressive tasks where your output at time step t is dependent on or should be consistent with the previous outputs at steps 1,...,t-1. However, I feel like this is the case for any sequence-to-sequence task. In particular, why is it not used for the language modelling task in the tutorial, shouldn't knowledge of the previous words in a predicted sequence also help there?
Seq2Seq converts one sequence to another. For example, machine translations, dialogue conversation, and summarization are some of the examples of Seq2Seq. Decoder, in this case, is a language model that is conditioned on the source sentence. Regular language model does not depend on any source sentence, and we can use it for sentence generation. Given a few starting words, the LM will try to complete that sentence, or sometimes even write a story out of it.
@vitouphy Thank you for the explanation. So if I understand that correctly you use the transformer decoder for all Seq2Seq tasks, and for language modelling in particular it is not used because that is not a real Seq2Seq task? In that case the headline of the language modelling tutorial "Sequence-to-Sequence Modeling with nn.Transformer" would seem a bit misleading.
Thanks again for the help, highly appreciated!
@fa9r, you can think of the Decoder as a type of Language Model that receives additional information from the Encoder. After getting that information, the decoder behaves exactly like a language model, such that the prediction of token t relies on tokens 1...t-1.
About the tutorial, it is based (if I'm not mistaken) on the concept from OpenAI GPT-2. GPT-2 is a BIG language model (1.5B parameters) and it is trained with input like: translate EN to FR. [SRC] Hi there [TARGET] salut. In other words, lately people have been experimenting with using a big language model to perform Seq2Seq tasks.
@fa9r For tasks where token i depends on tokens 0, 1, 2, ..., i-1, the TransformerEncoder module is enough. The tutorial on the PyTorch website is a typical word language modeling task, which predicts the next word. I think machine translation is a good example for the TransformerDecoder model.
@mathematicsofpaul I don't quite understand your implementation. For the input of TransformerDecoder, what are the tgt and memory sequences in your copy/paste task? For the transformer architecture, the memory sequence usually comes from an encoder.
IMO, you need to follow the tutorial and set up a TransformerEncoder model. During training, the input of TransformerEncoder (a.k.a. src) is `this is a copy paste task <EOS>` and the output sequence is compared against `<BOS> this is a copy paste task`. Then, during inference, with the input sentence `inference task <EOS>`, the pretrained TransformerEncoder should output `<BOS> inference task`. I hope I have explained the whole process well here for your copy/paste task.
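For what it's worth, a rough sketch of that encoder-only setup, in the spirit of the word-language-model tutorial (hypothetical sizes, no padding or special tokens, positional encoding omitted); the causal mask lets position i attend only to positions <= i, and the target is the input shifted by one:

import torch
import torch.nn as nn

ntokens, d_model, nhead = 100, 64, 4
embedding = nn.Embedding(ntokens, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
proj = nn.Linear(d_model, ntokens)

seq = torch.randint(ntokens, (10, 1))     # (S, N) token ids of the sequence to reproduce
src = seq[:-1]                            # input:  tokens 0..S-2
target = seq[1:]                          # output: tokens 1..S-1 (shifted by one)
src_mask = nn.Transformer().generate_square_subsequent_mask(src.size(0))

logits = proj(encoder(embedding(src), mask=src_mask))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, ntokens), target.reshape(-1))
loss.backward()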
@zhangguanheng66 I really appreciate that you got back to me!
So what I mean by a copy and paste task is that I actually feed tgt = `this is a copy paste task` into the decoder input and then force it to learn to copy by training it (teacher forcing) to eventually give out = `this is a copy paste task` on the decoder output. (The reason I don't have the <EOS> or <BOS> tokens is that I'm just trying to get the copy and paste task to work.) In addition, for the memory I simply kept that input fixed, since the focus was to check whether the decoder module was actually working.
One thing to keep in mind is that I am doing a copy and paste task for a sequence of vectors, which is the same as having words in embedded form.
Here is an example giving repeated tokens from the second epoch onwards.
At the moment the pure decoder is just giving me out = `this this this this this this`. Really, it's more like out = `an an an an an an`, since it's not even copying the first token from tgt.
I am not a fan of the transformer example in the docs since it doesn't actually use the decoder depicted in the original "Attention is all you need". It only has the encoder, which from what I read is more of a BERT model? :( In the nicest way possible, I really hope someone will be more explicit in saying that the tutorial is not the same Transformer as in the "Attention is all you need" paper.
@mathematicsofpaul I tried your example and made a few changes based on my understanding of your "copy/paste" task. Although it's not on the word vector level, it works to predict different tokens in the output (see print(out[0][0][:3], out[1][0][:3])). I also see the loss value drop significantly, so the training process is effective.
Again, since you are not using memory here, switching to TransformerEncoder makes sense to me. And I still feel TransformerEncoder is consistent with the idea in "Attention is all you need". Let me know if you still have questions about this.
import torch
import torch.nn as nn
import math
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.0, max_len=60000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

pos_encoder = PositionalEncoding(d_model=42, dropout=0.0)
ntokens = 200
embedding = nn.Embedding(200, 42)
output_project = nn.Linear(42, ntokens)

torch.manual_seed(0)
memory = torch.rand(4, 1, 42)  # src: (S, N, E) https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html
tgtmask = torch.nn.Transformer().generate_square_subsequent_mask(4).float()

decoder_layer = nn.TransformerDecoderLayer(d_model=42, nhead=7, dropout=0.0)
transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=1)

class CopyPasteModel(nn.Module):
    def __init__(self, transformer_decoder, positional, embedding, out_proj):
        super(CopyPasteModel, self).__init__()
        self.transformer_decoder = transformer_decoder
        self.positional = positional
        self.embedding = embedding
        self.out_proj = out_proj

    def forward(self, tgt, memory):
        output = self.embedding(tgt)
        output = self.positional(output)
        output = self.transformer_decoder(output, memory)
        return self.out_proj(output)

criterion = nn.CrossEntropyLoss()
model = CopyPasteModel(transformer_decoder, pos_encoder, embedding, output_project)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
num_epochs = 50
tgt_length = 400

for epoch in range(num_epochs):
    tgt = torch.randint(ntokens, (tgt_length, 1))  # tgt: (T, N)
    optimizer.zero_grad()
    out = model(tgt, memory)
    loss = criterion(out.view(-1, ntokens), tgt.view(-1))  # (N, other dimensions)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    optimizer.step()
    print(f'Epoch {epoch + 1} | Loss for this epoch: {loss.item():.4f}')

tgt = torch.randint(ntokens, (4, 1))  # tgt: (T, N)
out = model(tgt, memory)
print(out[0][0][:3], out[1][0][:3])
@zhangguanheng66 Thanks again, this is almost what I am after; however, my intention was to use nn.MSELoss() instead of nn.CrossEntropyLoss() in order to serve my purpose of time series prediction. Here is my example where I copied your code and essentially turned it into a time series "copy and paste" task (pardon me, I should have been clearer that I was using nn.MSELoss() for time series purposes instead of word vectors). Despite the differences, one could draw a parallel between the two, since the word embedding representation of a sentence is basically a sequence of vectors; in this case, it is a sequence of vectors that represents time series data.
Here is a near replica of your code above, with some minor adjustments: I left out the embedding component since we are estimating continuous time series values instead of tokens from a vocabulary.
To be clear about the copy and paste task, I am trying to copy tgt = 0.1, 0.567, 0.4259, 0.5612 and have the decoder module give out = 0.1, 0.567, 0.4259, 0.5612. However, the decoder module is giving repeated tokens of out = 0.7513, 0.7513, 0.7513, 0.7513. In addition, the numbers don't have to be strictly one-dimensional, since time series data is often multivariate, e.g. tgt = [0.1, 0.56], [0.567, 0.45], [0.4259, 0.1230], [0.5612, 0.12385].
In terms of the earlier example you provided, it is a strong example for those who wish to use the decoder for NLP purposes in the future. The only adjustment missing is really the tgt_mask; I will leave this code here as editable by you and quickly turn off editing once you are done!
Fingers crossed that the nn.MSELoss() repeated tokens issue will be smoothed out so that I can make a version of the transformer for time series and share it with everyone!
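To make the missing pieces concrete, here is a small sketch (with made-up sizes, not the code linked above) of the same continuous-valued copy task with two adjustments applied: a causal tgt_mask and a decoder input that is the target shifted right by one step, so position t never sees the value it has to predict; positional encoding is again omitted for brevity:

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead, T = 42, 7, 4
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, dropout=0.0)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
out_proj = nn.Linear(d_model, d_model)

memory = torch.rand(6, 1, d_model)     # (S, N, E), kept fixed, as in the comment above
series = torch.rand(T, 1, d_model)     # the sequence of vectors to be copied

# Decoder input: the target shifted right by one step.
dec_in = torch.cat([torch.zeros(1, 1, d_model), series[:-1]], dim=0)
tgt_mask = nn.Transformer().generate_square_subsequent_mask(T)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(list(decoder.parameters()) + list(out_proj.parameters()), lr=1e-3)

for epoch in range(200):
    optimizer.zero_grad()
    pred = out_proj(decoder(dec_in, memory, tgt_mask=tgt_mask))
    loss = criterion(pred, series)
    loss.backward()
    optimizer.step()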
@mathemage People probably referred to this tutorial before. We have another BERT example, which trains the model from scratch and fine-tune the model for question-answer task.
@zhangguanheng66 I guess this was only a typo in the mention / the handle, right? I don't recognise this issue at all...
@mathematicsofpaul I don't know if MSE is the best loss function in this case. If you add torch.set_printoptions(precision=8), you will be able to print out the tensor with more precision and see slight differences (but of course, it's still not copying the tgt sequence).
mmm @zhangguanheng66 What would you recommend instead of using nn.MSELoss()?
I have a similar question. I am also trying to use the nn.Transformer tutorial for summarization, and like the tutorial I tried to use the encoder only. I changed the function "get_batch" to return the source article as data and the summary as target to train the model, but I have the same problem: it always outputs the same word at test time. The reason I tried to use the encoder only for summarization is the paper "Generating Wikipedia by Summarizing Long Sequences", which used only the transformer decoder for the summarization task by combining the inputs (the source article) and outputs (the summary) into a single sequence and training it as a language model. Can someone tell me whether it is possible to use this tutorial (nn.Transformer encoder only) to do summarization, and how I should modify it to make it better?
Hi Paul,
Have you ever figured this out? I am working on time-series predictions using transformers and am a tad bit confused about what I should remove (the embedding bit) and what I should add (linear layers). I was just looking at your GitHub repo; you forked some repo, but I wanted to see if you got it working.
Thank you, Roza
Quite busy, I will reply to this shortly! Regards, Paul.
Sure, I am happy to collaborate on figuring this out. Sooner the better, so if you give me leads that would be great! Thank you.
@rgbayrak Looks like we have a similar problem: https://discuss.pytorch.org/t/my-multi-dimensional-transformer-doesnt-seem-to-learn-anything/103122/2
Been trying to figure this out with time-series, but no dice. Maybe we can pair up and solve both of our issues?
@shamoons @rgbayrak did you guys manage to get any further with this issue?
Hey @mathematicsofpaul,
My network (adapted from @maxjcohen) is intact and working, but it is not beating a bi-LSTM baseline. So I am thinking maybe I could not tailor the parameters well for my task, or I am just hitting a bottleneck with this network.
Unless I have a smart way of (1) doing positional encoding, (2) setting the attention size in a way that helps with my problem, and (3) figuring out how to translate Q, K, V for continuous input, it does not seem like the transformer is moving mountains.
It appears that this is too hard to get working, and that the transformer does not lend itself to the time series task. Have you checked out https://github.com/jdb78/pytorch-forecasting? Really amazing stuff over there. Although it is not the original Transformer architecture, I believe that for the transformer to perform as required, there need to be modifications to the original!
P.S Sorry everybody for hijacking this post!
Hi, I agree that the original Transformer architecture is not well suited for time series problems, mainly due to:
- its quadratic complexity in the time dimension
- its attention mechanism, which computes all time steps in parallel, ignoring the close relationship between a time step and the next
Both of them can be addressed; there are quite a few papers out there, although traditional RNN based architectures still seem to outperform them in most sequence to sequence tasks. You can find a comparison of a few architectures here, with a slightly modified Transformer.
I’m not sure I understand why the attention mechanism would be considerably different for time series vs discrete (word) embeddings?
Attention mechanisms allow computing every time step in parallel, which improves computation time, but they result in a network that has no notion of the closeness of time steps. This is one reason why positional encodings were added in the original paper.
This is not a problem in NLP, as the order of words in a sentence is important, but each word does not necessarily depend more on the previous one than on the next, or on other words much earlier in the sequence. In time series, the opposite is often the case: we can get very accurate predictions by assuming that the prediction for a time step only requires knowledge of the previous one (Markov assumption).
Word embedding is used in most NLP tasks; the Transformer is no exception here.
Thanks for the explanation. In my particular application (audio spectrograms), I don't think that only the previous (or next) timestep matters. To get proper tone, etc, we'll have to look back (and forward) in time a bit.
I also don't understand the quadratic complexity - isn't it the same for time-series or NLP (via embeddings)?
Again, word embedding has nothing to do with the complexity of the model; it's a simple matrix multiplication.
RNN-like models (RNN, GRU, LSTM, etc.) have linear complexity in time, i.e. their computation time is proportional to the number of steps in the sequence. The Transformer architecture is quadratic, meaning that the computation time is proportional to the square of the number of steps in the sequence. This is one of the limitations I try to address in my repo, by dividing the sequence into chunks.
I've been struggling to get the TransformerDecoder to work, like those above. I think my issue is similar to that of @tylerroost, although my target data doesn't have a [SOS] token.
My target data is as follows:
[['this', 'is', 'an', 'example', '[EOS]', '[PAD]'],
['another', 'target', 'sentence', '[EOS]', '[PAD]', '[PAD]']]
My encoders work fine, but the decoder seems to be playing up slightly. During the forward pass, after encoding my source, I actually add an [SOS] token to the target (I'm doing Variational Inference and this is a strategy that one of the papers uses):
target_shifted = torch.cat((sos_token, target[:, :-1]), 1)
This is followed by generating some target masks:
trg_key_padding_mask = self.generate_pad_mask(target_shifted)
trg_mask = self.generate_square_subsequent_mask(target_shifted.size(-1))
I permute the embeddings so it's [S, N, E], add my latent information to the SOS token and then run the data through the decoder:
target_embedding = self.get_embedding(target_shifted).permute(1, 0, 2) # [S, N, E]
target_embedding[0] = target_embedding[0] + z # z = latent information
decoder_outputs = self.transformer_decoder(target_embedding, encoder_outputs, tgt_mask=trg_mask, tgt_key_padding_mask=trg_key_padding_mask, memory_key_padding_mask=src_mask)
I then pass these outputs to a linear layer to calculate loss etc:
decoder_outputs = decoder_outputs.permute(1, 0, 2) # [N, S, E]
output = self.output(decoder_outputs)
Loss is as follows:
output = model(batch)
target = batch["target"]
loss = self.criterion(output.reshape(-1, output.size(-1)), target.reshape(-1)) # CrossEntropyLoss
When I train with the shifted target (target_shifted), my loss stays almost fixed/hovers around a non-zero value and I get no output (not even a repeated token) returned to me. Annoyingly, the code works as expected with my own implementation of MHA and a Transformer decoder - the issue occurs when using PyTorch's nn.TransformerDecoder().
Interesting... issue turned out to be that the model dimensionality was too big. Reducing it fixed my problem
Glad you were able to solve your issue. Regarding training when you don't shift the target, I believe it makes sense that the Transformer's loss goes to 0 without learning anything, as you're feeding it the very target value you're asking the model to predict. Shifting the target values and masking ensures the Transformer only has access to the previous (and not the current) prediction at each time step.
It's been mentioned here that the encoder-only variant as shown in the example should be sufficient to address a text generation task.
Has anybody succeeded in doing so with meaningful results? It would be a joy to hear of some. My doubts stem from the understanding that BERT, an encoder representation of a transformer, seems to be pretty bad at text generation.
Also, is there still no example with a TransformerDecoder included?
Lately, I just got Transformer (Seq2Seq) working for a dialogue generation task. Hope this helps. Tested on PyTorch 1.2.0.
Hi @zhangguanheng66, in the code above, @vitouphy prepared a model with TransformerDecoder for a translation task. Is it a good example? Could it be used in a tutorial?
Also I have another question. What is the use case of memory_mask? It masks between source and target sentence tokens. As I understand it, in a translation task it should be None, but I wonder when it can be needed? (Padding masks are handled by src_key_padding_mask, tgt_key_padding_mask and memory_key_padding_mask.)
Thanks.
In the transformer encoder-decoder architecture, memory_mask is applied to the sequence passed from the encoder to the decoder, a.k.a. src in this example.
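To illustrate the shapes (hypothetical sizes, not from the reply above): per the nn.Transformer docs, memory_mask has shape (T, S), where T is the target length and S is the source length, and it controls which memory positions each target position may attend to:

import torch
import torch.nn as nn

S, T, N, E = 6, 4, 1, 16            # source length, target length, batch size, embedding dim
layer = nn.TransformerDecoderLayer(d_model=E, nhead=4)
decoder = nn.TransformerDecoder(layer, num_layers=1)

memory = torch.rand(S, N, E)        # encoder output
tgt = torch.rand(T, N, E)

# Additive float mask of shape (T, S): -inf means "do not attend".
# Purely as an illustration, block every target position from the last source position.
memory_mask = torch.zeros(T, S)
memory_mask[:, -1] = float('-inf')

out = decoder(tgt, memory, memory_mask=memory_mask)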
I can review a PR to convert the RNN translation tutorial (link) to a transformer-based one, if someone is interested in submitting one.
Hi, the existing tutorials for nn.Transformer, e.g. https://pytorch.org/tutorials/beginner/transformer_tutorial.html and https://github.com/pytorch/examples/tree/master/word_language_model, both use only nn.TransformerEncoder. There is no tutorial using nn.TransformerDecoder. I think it would be better to add an example that includes nn.TransformerDecoder so that users can easily get started with it. From my own experience, I have had some trouble using nn.TransformerDecoder in the inference process.