jadore801120 / attention-is-all-you-need-pytorch

A PyTorch implementation of the Transformer model in "Attention is All You Need".
MIT License

Decoder input #8

Closed munkim closed 7 years ago

munkim commented 7 years ago

Hi, I am not sure if you are feeding the right input to the decoder.

(pg. 2) "Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next."

I believe your decoder input is a batch of target sequences.

jadore801120 commented 7 years ago

Hi @mun94 ,

I think it is not clear whether this paragraph is talking about the training phase or the inference phase.

Therefore, I follow the classic recurrent sequence-to-sequence training style. Feeding the target sequence into the decoder during the training phase is called "teacher forcing", which can make the model converge faster. Training with the last output from the decoder may also work well, but it may take longer to train. In the inference phase, I use the last output from the decoder as its next input.
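
Roughly, a training step with teacher forcing looks like the sketch below. These are placeholder names (train_step, a transformer whose forward takes the source and the shifted decoder input, a pad index), not the exact code in this repo:

```python
import torch.nn.functional as F

def train_step(transformer, optimizer, src_seq, gold, pad_idx):
    """One training step with teacher forcing: the decoder always sees the
    shifted gold target, never its own previous predictions."""
    decoder_input = gold[:, :-1]   # <s>, w1, ..., w_{n-1}  (shifted right)
    target        = gold[:, 1:]    # w1, w2, ..., </s>      (what we predict)

    logits = transformer(src_seq, decoder_input)          # (batch, len-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target.reshape(-1), ignore_index=pad_idx)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```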

However, I may be wrong. If you find out that the authors do train without teacher forcing, please let me know. Thank you!

Yu-Hsiang

munkim commented 7 years ago

Yeah, it was not clear to me as well, but when I looked at their TensorFlow implementation, I realized that their decoder works like a word-based RNN. They feed the output word token back and append it to the decoder input at each step, with a shift.

For example, the input at the very first step looks like y_0 = [[0, 0, 0, 0, 2], [0, 0, 0, 0, 2]] (batch=2, max_seqlen=5, '2'='<s>'). Then, once you get the output ('4'='hi') from the decoder, it gets fed back into the decoder input sequence, which gets shifted by one. The new resulting sequence that will now be fed into the decoder embedding is y_1 = [[0, 0, 0, 2, 4], [0, 0, 0, 2, 4]].
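
In PyTorch terms, that shift-and-append step would look roughly like this (the token ids are just the made-up ones from my example, where 0 is padding, 2 is '<s>' and 4 is 'hi'):

```python
import torch

y = torch.tensor([[0, 0, 0, 0, 2],
                  [0, 0, 0, 0, 2]])           # step 0: left-padded, ending in <s>
next_token = torch.tensor([[4], [4]])         # token the decoder just produced
y = torch.cat([y[:, 1:], next_token], dim=1)  # shift left by one, append new token
print(y)  # tensor([[0, 0, 0, 2, 4],
          #         [0, 0, 0, 2, 4]])
```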

munkim commented 7 years ago

And, wow, 'teacher forcing' seems powerful.

1) How do you deploy the model? In other words, during testing/deployment (not validation), what do you put in place of the target sequences?

2) Would you mind linking the paper that you referenced? Is it this?

Thank you!

jadore801120 commented 7 years ago

Hi @mun94 ,

Would you mind providing the link and the exact code from the TensorFlow implementation you mentioned? I haven't had time to dig into the official TF code.

munkim commented 7 years ago

I believe this does it. Let me know if I am wrong :) https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/data_reader.py#L227

Meanwhile, I would really appreciate it if you could let me know what you think about the questions above!

jadore801120 commented 7 years ago

Hi @mun94 ,

  1. I think I have already answered this part in the earlier comment: I feed the target sequence into the decoder during the training phase; in the inference phase, I use the last output from the decoder as its next input. You should read the code from spro/practical-pytorch to get more background knowledge about the classic RNN seq2seq training process and teacher forcing. It will help a lot.

  2. The teacher forcing concept was first named in "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks". However, it is well described in the introduction of "Professor Forcing: A New Algorithm for Training Recurrent Networks" (the paper you provided).

jadore801120 commented 7 years ago

Hi @mun94 ,

I am sorry. I did not see where the decoder takes the last output instead of the ground truth in your link. The part you posted only seems to pad the data instances to the same length for the mini-batch operation.

I am not an expert in TensorFlow, so please highlight the exact code statements for me instead of just the function name. Thank you.

munkim commented 7 years ago

I am not an expert either, so I am not sure if my explanation will make sense to you lol.

But basically, in the function def infer_step(recent_output, _), it does extra padding and adds the 'recent output' (line 273): padded = tf.pad(recent_output, [[0, 0], [0, 1], [0, 0], [0, 0]]). Then they slide the window of the decoder input (lines 305-309): result = tf.foldl(infer_step, tf.range(decode_length), initializer=initial_output, back_prop=False, parallel_iterations=1)

where 'decode_length=50' (line 140) is the window size.

They do it token-wise. I guess you are giving the whole sequence (which you referred to as the 'last output from the decoder') to the decoder, right?

jadore801120 commented 7 years ago

Excuse me.

I guess you are giving the whole sequence (which you referred to as the 'last output from the decoder') to the decoder, right?

Yes, I did. And I also said that teacher forcing (the whole target sequence into the decoder) is only used in the training phase.

The code you pasted is under the function called infer, and it is for the inference phase. In the inference phase, I do the exact same thing as they do.

I think we are discussing the teacher forcing method in the training phase.

munkim commented 7 years ago

Sorry, I just wasn't clear about the term inference phase.

jadore801120 commented 7 years ago

The inference phase is simply the prediction phase (or test phase). You could take a look at the following resources to get more background knowledge:

- TensorFlow tutorial: step by step, with the original paper.
- OpenNMT-py: very well written, a little bit complex.

Since the misunderstanding is cleared up, let me close this issue.

munkim commented 7 years ago

Can you please send me the link to the part of your code where the decoder behaves in an auto-regressive manner? Thank you :)

jadore801120 commented 7 years ago

The code is here: https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Translator.py#L81-L102

This part also includes the beam search (not a greedy inference), so it may be hard to read.

Long story short: the decoder output is stored in beam b, and the last output from the decoder is retrieved with get_current_state and taken as the next input. https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Translator.py#L81-L82
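
Stripped of the beam search, a greedy version of that inference loop would look roughly like the sketch below. The names are assumptions (a model whose forward takes the source and the tokens decoded so far and returns per-position logits), not the exact signatures in Translator.py:

```python
import torch

def greedy_decode(model, src_seq, bos_idx, eos_idx, max_len=50):
    """Auto-regressive inference: start from <s>, repeatedly run the decoder on
    everything generated so far, and append the most likely next token."""
    dec_seq = torch.full((src_seq.size(0), 1), bos_idx,
                         dtype=torch.long, device=src_seq.device)
    for _ in range(max_len):
        logits = model(src_seq, dec_seq)                    # (batch, cur_len, vocab)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)   # only the newest position
        dec_seq = torch.cat([dec_seq, next_tok], dim=1)     # feed it back as input
        if (next_tok == eos_idx).all():
            break
    return dec_seq
```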

ZedYeung commented 6 years ago

Why not just set a teacher forcing probability, as spro did?

ylmeng commented 5 years ago

It might be good to stop using teacher forcing after a certain number of epochs, if we can put such a hyperparameter in the scheduler.
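
For an RNN-style seq2seq decoder (the setting covered in spro/practical-pytorch), that kind of schedule could be sketched as below; the names are hypothetical, and it does not drop directly into the Transformer's parallel training loop, which never decodes step by step during training:

```python
import random

def pick_decoder_input(gold_prev, pred_prev, epoch,
                       tf_ratio=0.5, tf_stop_epoch=10):
    """Choose the next decoder input token: the gold token (teacher forcing) or
    the model's own previous prediction. Teacher forcing is applied with
    probability tf_ratio and disabled entirely after tf_stop_epoch."""
    use_teacher_forcing = epoch < tf_stop_epoch and random.random() < tf_ratio
    return gold_prev if use_teacher_forcing else pred_prev
```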

stonyhu commented 5 years ago

@jadore801120 Hi, it is very nice of you to provide the PyTorch code of the Transformer, great work! But I have a question for you: do you implement the teacher-forcing strategy in your training process below? https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py#L205 It seems you did not use teacher forcing in the training, right?