ruotianluo / self-critical.pytorch

Unofficial PyTorch implementation of Self-Critical Sequence Training for Image Captioning and related methods.
MIT License

Evaluating Transformer [teacher-forcing during evaluation] #242

Closed nilinykh closed 3 years ago

nilinykh commented 3 years ago

Hej,

I have a question regarding the evaluation of the transformer model. As far as I understand, it is trained in the teacher-forcing fashion: no matter what it predicts, its inputs are always the ground-truth words. Every next input word in the sequence is masked depending on the current timestep. This way, we ignore the outputs of the model during training and always feed the input sequence plus the next ground-truth token. However, it was not clear to me whether the same scheme is applied during evaluation of the model. Could you please point me to the lines in your code where evaluation takes the previously predicted token, appends it to the generated sequence, and feeds it back to the model? Or is evaluation here also teacher-forced?
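To make the distinction concrete, here is a minimal sketch of teacher forcing, using a hypothetical stand-in model (`toy_model` and `teacher_forcing_inputs` are illustrative names, not from this repo): at every training step the input is the ground-truth prefix, and the model's own predictions are scored but never fed back in.

```python
def toy_model(prefix):
    # Stand-in for the transformer decoder: maps a prefix to a next-token id.
    return (sum(prefix) + 1) % 10

def teacher_forcing_inputs(ground_truth, bos=0):
    """Teacher forcing: the input at step t is always the ground-truth
    prefix (shifted right by <bos>), regardless of what the model predicts."""
    inputs, predictions = [], []
    shifted = [bos] + list(ground_truth[:-1])   # right-shifted targets
    for t in range(len(ground_truth)):
        prefix = shifted[:t + 1]                # ground-truth prefix, never predictions
        predictions.append(toy_model(prefix))   # prediction is scored, not re-fed
        inputs.append(prefix)
    return inputs, predictions
```

In a real transformer this loop is parallelized with a causal attention mask, but the data flow is the same: the decoder only ever sees ground-truth tokens as inputs during training.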

ruotianluo commented 3 years ago

These are two different ways of evaluation; people usually focus on the former and evaluate with the CIDEr, BLEU, and METEOR metrics.

nilinykh commented 3 years ago

Oh, I meant a different thing: how is the transformer evaluated internally? It is an auto-regressive model, so during evaluation it should take the previously generated word, append it to the input sequence, and generate the next word, right? Or does it use the ground-truth text just as during training, when teacher forcing is used?

ruotianluo commented 3 years ago

I don't understand what "evaluated internally" means. From a high level, the transformer is a sequence model just like an RNN.

nilinykh commented 3 years ago

I am just trying to find the place in the code where previously generated words are used as input to the model to generate the next words during evaluation with beam search. If it is a sequence model, then there should be a place in the evaluation code where every newly generated word is fed back into the model...

ruotianluo commented 3 years ago

I see. It's using AttModel's beam search; Transformer is a child of AttModel. To reduce redundancy, I made the Transformer expose the same interface as an RNN.

nilinykh commented 3 years ago

Thanks! I have found the lines I was looking for; correct me if I am wrong:

Lines 200-202 in CaptionModel.py take the previously generated words as the next input and extract new probabilities for the next words:

```python
it = beam_seq_table[divm][:, :, t-divm].reshape(-1)
logprobs_table[divm], state_table[divm] = self.get_logprobs_state(it.cuda(), *(args[divm] + [state_table[divm]]))
logprobs_table[divm] = F.log_softmax(logprobs_table[divm] / temperature, dim=-1)
```

And self.get_logprobs_state comes from AttModel.py; this function computes the log-probabilities for the next word.

So evaluation is done properly: no teacher forcing here.