pytorch / tutorials

PyTorch tutorials.
https://pytorch.org/tutorials/
BSD 3-Clause "New" or "Revised" License

[HELP WANTED]Tutorial for nn.Transformer with nn.TransformerDecoder #719

Closed xdwang0726 closed 3 years ago

xdwang0726 commented 4 years ago

Hi, the existing tutorials for nn.Transformer, e.g. https://pytorch.org/tutorials/beginner/transformer_tutorial.html and https://github.com/pytorch/examples/tree/master/word_language_model, only use nn.TransformerEncoder. There is no tutorial using nn.TransformerDecoder. I think it would be better to add an example that includes nn.TransformerDecoder so that users can get started with it easily. In my own experience, I have had some trouble using nn.TransformerDecoder in the inference process.

zhangguanheng66 commented 4 years ago

Could you propose a problem for TransformerDecoder? Based on my experience, the applications of nn.TransformerDecoder should be very similar to those of nn.TransformerEncoder.

xdwang0726 commented 4 years ago

Could you propose a problem for TransformerDecoder? Based on my experience, the applications of nn.TransformerDecoder should be very similar to those of nn.TransformerEncoder.

Thank you for your reply! I have some trouble with nn.TransformerDecoder's inference process. I am using a transformer for a summarization task. My training goes well (I get a smoothly converging loss), but the inference process has some problems (i.e. it always generates the same token at inference time). I have followed the official tutorial, but it didn't really help. I am wondering whether there is any resource I can refer to? Thank you!

zhangguanheng66 commented 4 years ago

@xdwang0726 Did you mask your target and memory inputs to TransformerDecoder properly?

zhangguanheng66 commented 4 years ago

BTW, if you would like to publish your work on pytorch/tutorial, I'm happy to take a look.

xdwang0726 commented 4 years ago

Thank you for your reply! I used generate_square_mask to obtain the tgt_mask, but for memory_mask I have some problems with the dimensions, as generate_square_mask can only generate square masks, which do not fit the memory_mask's dimensions. So I set memory_mask = None.
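
For reference, the additive memory_mask expected by the decoder is rectangular, with shape (T, S) (target length by source length), so generate_square_subsequent_mask cannot produce it. Below is a minimal sketch of the two usual options, assuming a hypothetical PAD_IDX; this is my illustration, not code from this thread:

    import torch

    def build_memory_masks(src_ids, tgt_len, pad_idx):
        # src_ids: (N, S) batch of source token ids
        S = src_ids.size(1)
        # additive memory_mask of shape (T, S); all zeros means every source position is visible
        memory_mask = torch.zeros(tgt_len, S)
        # boolean memory_key_padding_mask of shape (N, S); True marks padded positions to ignore
        memory_key_padding_mask = src_ids.eq(pad_idx)
        return memory_mask, memory_key_padding_mask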

xdwang0726 commented 4 years ago

BTW, if you would like to publish your work on pytorch/tutorial, I'm happy to take a look.

Here is my code; thank you for your kind help! Forward steps:

    encoder_vec = self.bert_encoder(src_input_ids, src_token_type_ids, src_attention_mask)
    tgt_mask = self.generate_square_subsequent_mask(tgt_input_ids.shape[1]).to(self.device)  # tgt_mask: (T, T)
    #memory_mask = self.generate_subsequent_mask(len(encoder_vec), tgt_input_ids.shape[1]).to(self.device)  # memory_mask: (T,S)
    decoder_outputs = self.bert_decoder(encoder_vec, tgt_input_ids, tgt_mask, inference)
    score, lsm_score = self.generator(decoder_outputs)
    score = score.transpose(0, 1).squeeze(0)  # shape [num_tgt_tokens, num_classes]

Inference Steps:

    with torch.no_grad():
        batch_size = 1

        # encoder
        encoder_vec = self.bert_encoder(src_input_ids, src_token_type_ids, src_attention_mask)
        # create initial tokens: shape tensor([[101]])
        generated_seq = torch.full((batch_size, 1), Constants.BOS, dtype=torch.long, device=self.device)
        generated_seq4 = tgt_input_ids[:, 0].unsqueeze(0)

        for i in range(1, max_tgt_seq_len):
            tgt_mask = self.generate_square_subsequent_mask(generated_seq.shape[1]).to(self.device)
            decoder_outputs = self.bert_decoder(encoder_vec, generated_seq, tgt_mask, inference)

            _, lsm_score = self.generator(decoder_outputs)
            # Take token with largest probability and use it for the next words
            generated_token_index = torch.topk(lsm_score, 1, dim=-1)[1][-1, :, :]  # lsm_score shape [num_tokens, bs, vocab_size]

            # Concatenate generated token with sequence
            generated_seq = torch.cat((generated_seq, generated_token_index), dim=-1)

zhangguanheng66 commented 4 years ago

Thank you for your reply! I used generate_square_mask to obtain the tgt_mask, but for memory_mask I have some problems with the dimensions, as generate_square_mask can only generate square masks, which do not fit the memory_mask's dimensions. So I set memory_mask = None.

I'm not sure that it is correct to set the memory mask to None. If you take a look at the generate_square_subsequent_mask function, you will see how to generate an attention mask. It should be very straightforward.

xdwang0726 commented 4 years ago

Thank you for your reply! I used generate_square_mask to obtain the tgt_mask, but for memory_mask I have some problems with the dimensions, as generate_square_mask can only generate square masks, which do not fit the memory_mask's dimensions. So I set memory_mask = None.

I'm not sure that it is correct to set the memory mask to None. If you take a look at the generate_square_subsequent_mask function, you will see how to generate an attention mask. It should be very straightforward.

Thank you for your reply! Since the official documentation says the memory_mask argument is optional, I assumed that it could be set to None. I will try generating the memory_mask and see whether it solves the problem. Thanks again for your help!

zhangguanheng66 commented 4 years ago

If you set memory_mask to None, then why not just use the TransformerEncoder model?

xdwang0726 commented 4 years ago

If you set memory_mask to None, then why not just use the TransformerEncoder model?

I went through the detailed implementation and it seems that masks are a MUST for nn.TransformerDecoder (however, the documentation says that masks are optional). To my understanding, there are two types of masks: attn_mask and padding_mask. The attn_mask is a matrix of -inf and 0 values, where -inf corresponds to True and 0 to False. The padding_mask contains 1s and 0s to represent the paddings. However, the documentation says

key_padding_mask: if provided, specified padding elements in the key will be ignored by the attention. This is a binary mask. When the value is True, the corresponding value on the attention layer will be filled with -inf. attn_mask: mask that prevents attention to certain positions. This is an additive mask (i.e. the values will be added to the attention layer). https://pytorch.org/docs/stable/_modules/torch/nn/functional.html

I would appreciate it if you could answer this question. Thank you!
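
To illustrate the two kinds of masks described in that passage, here is a small sketch with a hypothetical PAD_IDX; it is my own example, not code from the docs:

    import torch

    T, PAD_IDX = 5, 0
    tgt_ids = torch.tensor([[3, 7, 9, PAD_IDX, PAD_IDX],
                            [4, 8, 2, 6, PAD_IDX]])      # (N, T) batch of target ids

    # additive attn_mask / tgt_mask: float, 0 where attention is allowed, -inf where it is not
    attn_mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)

    # boolean key_padding_mask: True marks padding positions to be ignored
    tgt_key_padding_mask = tgt_ids.eq(PAD_IDX)           # (N, T)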

zhangguanheng66 commented 4 years ago

If you set memory_mask to None, then why not just use the TransformerEncoder model?

I went through the detailed implementation and it seems that masks are a MUST for nn.TransformerDecoder (however, the documentation says that masks are optional). To my understanding, there are two types of masks: attn_mask and padding_mask. The attn_mask is a matrix of -inf and 0 values, where -inf corresponds to True and 0 to False. The padding_mask contains 1s and 0s to represent the paddings. However, the documentation says

key_padding_mask: if provided, specified padding elements in the key will be ignored by the attention. This is a binary mask. When the value is True, the corresponding value on the attention layer will be filled with -inf. attn_mask: mask that prevents attention to certain positions. This is an additive mask (i.e. the values will be added to the attention layer). https://pytorch.org/docs/stable/_modules/torch/nn/functional.html

I would appreciate it if you could answer this question. Thank you!

According to the paper Attention Is All You Need, a mask is not a MUST. When it comes to word language modeling (WLM), we have to use a mask to prevent the model from simply "copying" the next word in the sequence, IMO.

hadaev8 commented 4 years ago

Same here; I am trying to build an encoder-decoder model and can't find a tutorial.

zhangguanheng66 commented 4 years ago

Which kind of problem are you working on?

hadaev8 commented 4 years ago

I have a seq2seq model for translating graphemes to phonemes. Here it is: https://colab.research.google.com/drive/1g4ZFCGegOmD-xXL-Ggu7K5LVoJeXYJ75

The loss immediately drops to zero and the model generates the same token at inference. From what I have read on Google, it looks like something is wrong with the decoder mask and the model just takes the future token from the target input. Still, I passed a mask, so everything should work.

zhangguanheng66 commented 4 years ago

I think you should mask the inputs of both the encoder and the decoder. Based on what you observe, the masks were definitely not set up correctly.

xdwang0726 commented 4 years ago

I have a seq2seq model for translating graphemes to phonemes. Here it is: https://colab.research.google.com/drive/1g4ZFCGegOmD-xXL-Ggu7K5LVoJeXYJ75

The loss immediately drops to zero and the model generates the same token at inference. From what I have read on Google, it looks like something is wrong with the decoder mask and the model just takes the future token from the target input. Still, I passed a mask, so everything should work.

I had the same problem when I used seq2seq to do text generation.

zhangguanheng66 commented 4 years ago

I have a seq2seq model for translating graphemes to phonemes. Here it is: https://colab.research.google.com/drive/1g4ZFCGegOmD-xXL-Ggu7K5LVoJeXYJ75 The loss immediately drops to zero and the model generates the same token at inference. From what I have read on Google, it looks like something is wrong with the decoder mask and the model just takes the future token from the target input. Still, I passed a mask, so everything should work.

I had the same problem when I used seq2seq to do text generation.

I think for text generation you could use TransformerEncoder; there is no need for a decoder. For the problem mentioned above, it is usually due to the mask.

hadaev8 commented 4 years ago

Masking should be easy, right? I get around the same loss with and without masking the decoder inputs. I also tried masking everything. It still doesn't work.

hadaev8 commented 4 years ago

@xdwang0726 I have solved my problem; maybe it will be a helpful example for you: https://colab.research.google.com/drive/1g4ZFCGegOmD-xXL-Ggu7K5LVoJeXYJ75

zhangguanheng66 commented 4 years ago

@hadaev8 Could you briefly explain what you have changed? Helpful for future users.

hadaev8 commented 4 years ago

@zhangguanheng66 I'm not sure why it works this way; I took it from the tutorial.

output = model(src, trg[:-1,:])
loss = criterion(output.view(-1, output_dim), trg[1:,:].view(-1))

I passed the target without the last index to the model and the target without the first index to the loss.

xdwang0726 commented 4 years ago

@hadaev8 I did pass the target without the last index, but I still have the problem. Will change the loss and see whether it helps. Thank you!

zeeshansayyed commented 4 years ago

Thank you for sharing the Colab link @hadaev8. I tried to use the same concepts in my own seq2seq problem. The loss goes down, meaning the system learns something, but at inference time the model produces arbitrary values.

Were you able to make it work @xdwang0726 ?

MrShininnnnn commented 4 years ago

@hadaev8 Thanks for sharing. Based on your suggestions, I solved the same problem (training loss going to 0 in only a few epochs and the model predicting the start symbol all the time). For those with the same issue in seq2seq learning, I did the following (a rough sketch follows the list):

  1. right shift the decoder input
  2. pass tgt_mask and src_key_padding_mask to the nn.Transformer in the training phase
  3. for inference encoding, provide src_key_padding_mask to the encoder
  4. for inference auto-regressive decoding, provide tgt_mask and memory_key_padding_mask (the same as the src_key_padding_mask) to the decoder
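
A sketch of these four steps, assuming (S, N)/(T, N) shaped inputs, pre-computed embeddings src_emb/tgt_emb, and a hypothetical PAD_IDX; this is only my illustration of the recipe, not a drop-in implementation:

    import torch
    import torch.nn as nn

    model = nn.Transformer(d_model=32, nhead=4)
    PAD_IDX = 0

    def train_step(src_emb, tgt_emb, src_ids, tgt_ids):
        # 1. right-shift: feed the target without its last token, predict it without its first
        tgt_in_emb = tgt_emb[:-1]
        # 2. causal mask on the decoder input plus a padding mask on the source
        tgt_mask = model.generate_square_subsequent_mask(tgt_in_emb.size(0))
        src_key_padding_mask = src_ids.eq(PAD_IDX).t()   # (N, S), True = padded
        out = model(src_emb, tgt_in_emb,
                    tgt_mask=tgt_mask,
                    src_key_padding_mask=src_key_padding_mask,
                    memory_key_padding_mask=src_key_padding_mask)
        return out                                       # compare against tgt_ids[1:]

    # 3./4. at inference time, run model.encoder once with src_key_padding_mask, then
    # decode auto-regressively, passing tgt_mask and the same padding mask as
    # memory_key_padding_mask at every step (see the decoding loop further below).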

hadaev8 commented 4 years ago

@MrShininnnnn @zeeshansayyed Well, now I know better what it all means. Basically, the decoder takes the target as both input and output. So we need to concatenate a zero row (or an SOS token) to it so that it can't just copy its inputs to its outputs. If we pass the zero row plus the target without its last index and ask for the target, the model learns to predict the next index based on all previous ones. tgt_mask is the only necessary mask, because it prevents the decoder from looking at future timesteps.

In my first experiment I did inference without any masking and it worked fine. In the second, it works better with tgt_mask. Still, masking in PyTorch is not perfect, since it cannot mask query values.
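
In other words, the target is used twice, shifted by one position. A tiny sketch of that shift (my illustration, assuming tgt is a (T, N) batch of target ids that already starts with an SOS/BOS token):

    import torch
    import torch.nn as nn

    tgt = torch.randint(2, 100, (7, 2))                  # dummy (T, N) target batch
    decoder_input = tgt[:-1, :]                          # drop the last token
    decoder_target = tgt[1:, :]                          # drop the first token
    tgt_mask = nn.Transformer().generate_square_subsequent_mask(decoder_input.size(0))
    # position i of the decoder output is trained to predict decoder_target[i]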

demi6od commented 4 years ago

@hadaev8 Thank you for sharing. I have the same problem: the model generates the same token at inference. The memory embeddings are different, the tgt embeddings are different, but the predictions are always the same tokens. I tried your way, but it still doesn't work.

tgt_mask = nn.Transformer().generate_square_subsequent_mask(len(tgt_emb)).to(g_device)
decoder_output = self.transformer_decoder(tgt=tgt_emb, memory=enc_menory, tgt_mask=tgt_mask)
predictions = self.fc_out(decoder_output)

#train
output = model(src, tgt[:-1])
output_dim = output.shape[-1]        
loss = criterion(output.view(-1, output_dim), tgt[1:].view(-1))

@zhangguanheng66 a TransformerDecoder example would be really helpful to beginners!

MrShininnnnn commented 4 years ago

@demi6od If you are dealing with auto-regressive decoding, as I mentioned above, remember to pass both the tgt_mask and the memory_key_padding_mask to your decoder.

...
for i in range(tgt_seq_len): 
    decoder_input = tgt_embedding_layer(decoder_inputs[:, :i+1]).transpose(0, 1) 
    decoder_input = pos_encoder(decoder_input)
    tgt_mask = transformer_model.generate_square_subsequent_mask(i+1).to(device) 
    decoder_output = transformer_model.decoder(
        tgt=decoder_input, 
        memory=encoder_hidden_states, 
        tgt_mask=tgt_mask, 
        memory_key_padding_mask=src_key_padding_mask) 
    decoder_output = self.generator(decoder_output)[-1] 
    decoder_inputs[:, i+1] = decoder_output.max(1)[1]
return decoder_inputs

zhangguanheng66 commented 4 years ago

@hadaev8 Thank you for sharing. I have the same problem: the model generates the same token at inference. The memory embeddings are different, the tgt embeddings are different, but the predictions are always the same tokens. I tried your way, but it still doesn't work.

tgt_mask = nn.Transformer().generate_square_subsequent_mask(len(tgt_emb)).to(g_device)
decoder_output = self.transformer_decoder(tgt=tgt_emb, memory=enc_menory, tgt_mask=tgt_mask)
predictions = self.fc_out(decoder_output)

#train
output = model(src, tgt[:-1])
output_dim = output.shape[-1]        
loss = criterion(output.view(-1, output_dim), tgt[1:].view(-1))

@zhangguanheng66 a TransformerDecoder example would be really helpful to beginners!

@demi6od As @MrShininnnnn mentioned, you didn't pass the masks to the model during training, so the model will see all the information in src.

Regarding your request for transformer decoder, we can convert the Language Translation tutorials from RNN models to Transformer (both encoder and decoder). I would be able to review some PRs from OSS.

demi6od commented 4 years ago

@MrShininnnnn Thank you for your help, I will try to add the padding mask.

@zhangguanheng66 I have also tried testing my model with the src exactly as in training, so I think it shouldn't output the same tokens (e.g. "of of of of of of"), since it can see all the information of my input in the compressed memory. I guess maybe I haven't trained on enough data for the LM, but the loss does converge. I will put my project on GitHub after finishing it. It's great that you're planning a new tutorial! Thanks! I think it will help a lot with better understanding the Transformer module and being more confident about the TransformerDecoder part.

zhangguanheng66 commented 4 years ago

@MrShininnnnn Thank you for your help, I will try to add the padding mask.

@zhangguanheng66 I have also tried testing my model with the src exactly as in training, so I think it shouldn't output the same tokens (e.g. "of of of of of of"), since it can see all the information of my input in the compressed memory. I guess maybe I haven't trained on enough data for the LM, but the loss does converge. I will put my project on GitHub after finishing it. It's great that you're planning a new tutorial! Thanks! I think it will help a lot with better understanding the Transformer module and being more confident about the TransformerDecoder part.

@demi6od you should check the loss curve to see whether it has indeed converged.

demi6od commented 4 years ago

@MrShininnnnn Thank you for your help, I will try to add the padding mask. @zhangguanheng66 I have also tried testing my model with the src exactly as in training, so I think it shouldn't output the same tokens (e.g. "of of of of of of"), since it can see all the information of my input in the compressed memory. I guess maybe I haven't trained on enough data for the LM, but the loss does converge. I will put my project on GitHub after finishing it. It's great that you're planning a new tutorial! Thanks! I think it will help a lot with better understanding the Transformer module and being more confident about the TransformerDecoder part.

@demi6od you should check the loss curve to see whether it has indeed converged.

@zhangguanheng66 Thank you for your advice; I will check it. Here is my project, a ChatBot based on the transformer; the second part is an nn.TransformerDecoder example. I hope it's helpful to others.

hadaev8 commented 4 years ago

Btw, why does the encoder tutorial have this line: src = self.encoder(src) * math.sqrt(self.ninp)?

jahutwb commented 4 years ago

Regarding your request for transformer decoder, we can convert the Language Translation tutorials from RNN models to Transformer (both encoder and decoder). I would be able to review some PRs from OSS.

Hi! I'm looking forward to such a converted Language Translation tutorial. Meanwhile, I thought I could try it myself, but I encountered several problems at the beginning, just with running this notebook.

  1. There is something wrong with opening the notebook in Colab from the tutorial's page https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html?highlight=transformer
  2. So I downloaded it and ran it from my Google Drive, but then I faced another problem, TypeError: __init__() got an unexpected keyword argument 'tokenizer_language', while declaring the SRC Field. I handled it as in the Attention is All You Need.ipynb tutorial https://github.com/bentrevett/pytorch-seq2seq/blob/master/6%20-%20Attention%20is%20All%20You%20Need.ipynb
  3. And finally I get another error, RuntimeError: Sizes of tensors must match except in dimension 2. Got 30 and 31 in dimension, in rnn_input = torch.cat((embedded, weighted_encoder_rep), dim = 2)

And I kind of gave up and decided to write this post. I'm wondering whether someone smarter is also working on it and will finish before I would.

zhangguanheng66 commented 4 years ago

Regarding your request for transformer decoder, we can convert the Language Translation tutorials from RNN models to Transformer (both encoder and decoder). I would be able to review some PRs from OSS.

Hi! I'm looking forward to such a converted Language Translation tutorial. Meanwhile, I thought I could try it myself, but I encountered several problems at the beginning, just with running this notebook.

  1. There is something wrong with opening the notebook in Colab from the tutorial's page https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html?highlight=transformer
  2. So I downloaded it and ran it from my Google Drive, but then I faced another problem, TypeError: __init__() got an unexpected keyword argument 'tokenizer_language', while declaring the SRC Field. I handled it as in the Attention is All You Need.ipynb tutorial https://github.com/bentrevett/pytorch-seq2seq/blob/master/6%20-%20Attention%20is%20All%20You%20Need.ipynb
  3. And finally I get another error, RuntimeError: Sizes of tensors must match except in dimension 2. Got 30 and 31 in dimension, in rnn_input = torch.cat((embedded, weighted_encoder_rep), dim = 2)

And I kind of gave up and decided to write this post. I'm wondering whether someone smarter is also working on it and will finish before I would.

@jlin27 and @brianjo Just wondering if you could help with the first piece of feedback and double-check the links. Maybe also other tutorials, like transformer and text classification.

@jahutwb There is an ongoing PR to re-write the translation datasets in torchtext, and it should be ready soon: https://github.com/pytorch/text/pull/751. If you don't want to wait for it to be merged, you could start with @akurniawan's branch. I assume the final API won't be very different from the current version.

tylerroost commented 4 years ago

Running into a similar problem: the loss goes to zero, and the output does in fact turn out to be correct; I am just practicing on a simple copy task.

The output, however, is just the start-of-sentence token over and over again.

When I try to use greedy decoding for inference, I am passing tgt_key_padding_mask, memory_key_padding_mask, and tgt_mask, all correctly, into model.transformer.decoder. Each mask is constructed correctly. I'm also passing tgt_mask, src_key_padding_mask, and tgt_key_padding_mask during training. My encoder is also passed src_key_padding_mask, which is the same mask as memory_key_padding_mask in the decoder.

I have also tried right-shifting the output and putting in the full output, both of which result in the same error.

zhangguanheng66 commented 4 years ago

Running into a similar problem: the loss goes to zero, and the output does in fact turn out to be correct; I am just practicing on a simple copy task.

The output, however, is just the start-of-sentence token over and over again.

When I try to use greedy decoding for inference, I am passing tgt_key_padding_mask, memory_key_padding_mask, and tgt_mask, all correctly, into model.transformer.decoder. Each mask is constructed correctly. I'm also passing tgt_mask, src_key_padding_mask, and tgt_key_padding_mask during training. My encoder is also passed src_key_padding_mask, which is the same mask as memory_key_padding_mask in the decoder.

I have also tried right-shifting the output and putting in the full output, both of which result in the same error.

Which kind of masks are you using? Do you use the mask handler in nn.Transformer? You need a triangular mask. What is your task? Are you using the transformer to predict the next token? If so, you need to shift the source sequence by one for the target in the training process.

tylerroost commented 4 years ago

tgt_mask is created using nn.Transformer.generate_square_subsequent_mask; the key_padding_masks are made using (src == pad).unsqueeze(-2).

The task is seq2seq for solving simple linear equations, but for the purpose of testing the greedy decoding function I am just using a simple copy task.

Are you using the transformer to predict the next token? If so, you need to shift the source sequence by one for the target in the training process.

I was prepending with <sos> and appending with <eos> only for the target; I thought this had the effect of shifting, i.e. src = "I love pytorch", tgt = "<sos>I love pytorch<eos>". I'm working at the character level, and <sos> and <eos> are interpreted as single chars.

tylerroost commented 4 years ago

I just tried shifting the source sequence by one for the target in the training process, even with the prepended <sos> and appended <eos>, and now it just predicts s over and over again. Not s because <sos> shifted to the left would be sos>; I am shifting after converting <sos> to an index.

tylerroost commented 4 years ago

output = model(src, trg[:-1,:])
loss = criterion(output.view(-1, output_dim), trg[1:,:].view(-1))

This seems like it's working for me; at least it works for the copy task.

zhangguanheng66 commented 4 years ago

tgt_mask is created using nn.Transformer.generate_square_subsequent_mask; the key_padding_masks are made using (src == pad).unsqueeze(-2).

The task is seq2seq for solving simple linear equations, but for the purpose of testing the greedy decoding function I am just using a simple copy task.

Are you using the transformer to predict the next token? If so, you need to shift the source sequence by one for the target in the training process.

I was prepending with <sos> and appending with <eos> only for the target; I thought this had the effect of shifting, i.e. src = "I love pytorch", tgt = "<sos>I love pytorch<eos>". I'm working at the character level, and <sos> and <eos> are interpreted as single chars.

If you want to predict the next token, shouldn't your target be "love pytorch" with a source of "I love"?

tylerroost commented 4 years ago

Thanks, I think I figured out the training and greedy decoding. It works for the copy task at least, but the linear equations are proving more difficult, because the model sees the number 1 in the first spot of most solutions, since they are normally distributed around 0 with a std dev of around 50. Is there some way to deal with lopsided classes? The analogy to translation or language modeling would be if the model saw the starting word "I" or "the" over and over again and would therefore predict "I" or "the" as the starting word constantly. I am using nn.NLLLoss and tried using weights, to no avail.

Considering the problem has a right answer, would it be possible to create a loss function, or maybe one already exists, that penalizes wrong answers based on their distance from the gold answer? I.e., say the gold is 15 but the model predicts -1; I want to penalize the -1 prediction proportionally to how far off from the gold it is, i.e. alpha * abs(gold - pred).
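
One possible realization of that alpha * abs(gold - pred) idea, purely as a sketch and not something discussed further in this thread: add to the NLL term the expected absolute distance between the predicted and gold numeric values, where class_values is an assumed mapping from vocabulary index to the number it represents.

    import torch
    import torch.nn as nn

    class DistanceAwareNLLLoss(nn.Module):
        def __init__(self, class_values, alpha=0.1):
            super().__init__()
            # class_values: (num_classes,) tensor giving the numeric value of each class
            self.register_buffer('class_values', class_values.float())
            self.alpha = alpha
            self.nll = nn.NLLLoss()

        def forward(self, log_probs, target):
            # log_probs: (N, num_classes) log-softmax scores, target: (N,) gold class indices
            nll = self.nll(log_probs, target)
            probs = log_probs.exp()
            gold_values = self.class_values[target]                                   # (N,)
            dist = (self.class_values.unsqueeze(0) - gold_values.unsqueeze(1)).abs()  # (N, C)
            penalty = (probs * dist).sum(dim=1).mean()
            return nll + self.alpha * penalty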

masonreznov commented 4 years ago

@hadaev8

@xdwang0726 I have solved my problem; maybe it will be a helpful example for you: https://colab.research.google.com/drive/1g4ZFCGegOmD-xXL-Ggu7K5LVoJeXYJ75

@hadaev8, you have used mask = mask.masked_fill(mask==1, float('-inf')) in the function:

    def generate_square_subsequent_mask(self, sz):
        mask = torch.triu(torch.ones(sz, sz), 1)
        mask = mask.masked_fill(mask == 1, float('-inf'))
        return mask

Generally, mask==0 is used. But you've used mask==1. Is there any specific reason? P.S.: I'm a newbie in this field.

him4318 commented 4 years ago

@hadaev8

@xdwang0726 I have solved my problem; maybe it will be a helpful example for you: https://colab.research.google.com/drive/1g4ZFCGegOmD-xXL-Ggu7K5LVoJeXYJ75

@hadaev8, you have used mask = mask.masked_fill(mask==1, float('-inf')) in the function:

    def generate_square_subsequent_mask(self, sz):
        mask = torch.triu(torch.ones(sz, sz), 1)
        mask = mask.masked_fill(mask == 1, float('-inf'))
        return mask

Generally, mask==0 is used. But you've used mask==1. Is there any specific reason? P.S.: I'm a newbie in this field.

There are two types of masks. The one above is used to stop the decoder from looking at future values when using the attention mechanism. The padding mask is the other mask, where an id equal to zero is used.

hadaev8 commented 4 years ago

@manuvazquez At the end you should have a matrix with zeros on and below the diagonal and -inf above it. I'm using ==1 just because it's one step fewer than the default PyTorch implementation; the result is the same.
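
For what it's worth, a quick check (my own paraphrase of roughly what the stock generate_square_subsequent_mask did at the time, not the verbatim library source) that the shorter variant above produces the same matrix:

    import torch

    sz = 4

    # roughly the stock recipe: build a boolean lower-triangular matrix, then fill it
    stock = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1).float()
    stock = stock.masked_fill(stock == 0, float('-inf')).masked_fill(stock == 1, 0.0)

    # the one-step-shorter variant used in the notebook
    short = torch.triu(torch.ones(sz, sz), 1)
    short = short.masked_fill(short == 1, float('-inf'))

    assert torch.equal(stock, short)   # zeros on/below the diagonal, -inf above it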

masonreznov commented 4 years ago

@hadaev8

@xdwang0726 I have solved my problem; maybe it will be a helpful example for you: https://colab.research.google.com/drive/1g4ZFCGegOmD-xXL-Ggu7K5LVoJeXYJ75

@hadaev8, you have used mask = mask.masked_fill(mask==1, float('-inf')) in the function:

    def generate_square_subsequent_mask(self, sz):
        mask = torch.triu(torch.ones(sz, sz), 1)
        mask = mask.masked_fill(mask == 1, float('-inf'))
        return mask

Generally, mask==0 is used. But you've used mask==1. Is there any specific reason? P.S.: I'm a newbie in this field.

There are two types of masks. The one above is used to stop the decoder from looking at future values when using the attention mechanism. The padding mask is the other mask, where an id equal to zero is used.

Thanks!! Now I got it.

V3RGANz commented 4 years ago

During inference for text generation, nn.TransformerDecoder is slow, because each time it generates a sequence of the same length (for a sequence of length n, at inference I just want to generate the single (n+1)-th token, without the others). Are there any examples of using it without regenerating already-generated tokens? Is it possible? I was looking at the nn.Transformer implementation; it seems that the query, key, and value for attention are always equal, but I think for inference you need a different query (with shape (seq_len=1, batch_size, hidden_size)).

Update: by slow I mean that you need to reduce the batch size significantly to compute it.

zhangguanheng66 commented 4 years ago

During inference for text generation, nn.TransformerDecoder is slow, because each time it generates a sequence of the same length (for a sequence of length n, at inference I just want to generate the single (n+1)-th token, without the others). Are there any examples of using it without regenerating already-generated tokens? Is it possible? I was looking at the nn.Transformer implementation; it seems that the query, key, and value for attention are always equal, but I think for inference you need a different query (with shape (seq_len=1, batch_size, hidden_size)).

Update: by slow I mean that you need to reduce the batch size significantly to compute it.

Just keep in mind that the transformer needs the input/output to have the same length. If your input has a length of n and you want to predict token n+1, you have to feed the whole sequence into the transformer. Ideally, I wish the mask could skip the computation when a position is masked, but that's not how it currently works.

mathematicsofpaul commented 4 years ago

@tylerroost Care to post your code? I am attempting something numerical-related too; in my case it is time series.

Asdf11x commented 4 years ago

@demi6od If you are dealing with auto-regressive decoding, as I mentioned above, remember to pass both the tgt_mask and the memory_key_padding_mask to your decoder.

...
for i in range(tgt_seq_len): 
    decoder_input = tgt_embedding_layer(decoder_inputs[:, :i+1]).transpose(0, 1) 
    decoder_input = pos_encoder(decoder_input)
    tgt_mask = transformer_model.generate_square_subsequent_mask(i+1).to(device) 
    decoder_output = transformer_model.decoder(
        tgt=decoder_input, 
        memory=encoder_hidden_states, 
        tgt_mask=tgt_mask, 
        memory_key_padding_mask=src_key_padding_mask) 
    decoder_output = self.generator(decoder_output)[-1] 
    decoder_inputs[:, i+1] = decoder_output.max(1)[1]
return decoder_inputs

Hi, I'm dealing with the same problem: the loss goes down quickly and only one token is repeated during inference. Maybe @demi6od or @tylerroost could post a bit more of the code for your solutions? I would appreciate it a lot.

I am adding this response, which is similar to the one quoted here, as it states a similar approach. Maybe more explanation would help.

V3RGANz commented 4 years ago

@Asdf11x which PyTorch version are you using? I had a similar problem and found out that torch had a bug where generate_square_subsequent_mask was incorrect and attention could see the tokens it was supposed to generate, which is why the loss went down very quickly. But in the latest version it is fine.