microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

❓ Question: Training/evaluation discrepancy in Abstractive Summarization #5

Closed astariul closed 4 years ago

astariul commented 4 years ago

Thanks for open-sourcing the code!

After reading your paper, I have a question about the fine-tuning procedure for abstractive summarization (and, more generally, any seq2seq task).


[attached figure from the paper]

I understand the idea: similarly to BERT and to UniLM pre-training, fine-tuning on abstractive summarization masks some tokens and predicts them, in order to learn a bidirectional representation of the tokens.

But at inference time, since we don't have access to the whole summary (it is yet to be generated), we can only apply a left-to-right LM.

This seems like a pretty big discrepancy between training and testing.
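
(For concreteness, the inference side of this is plain autoregressive decoding. Below is a toy sketch with a dummy scorer standing in for the model; nothing here is the repo's code, and the names are made up for illustration.)

```python
# Toy sketch of left-to-right decoding: each summary token is chosen from the
# article plus the prefix generated so far, so no "future" summary tokens are
# ever visible at inference time. The scorer below is a random stand-in.
import torch

VOCAB_SIZE, EOS_ID = 10, 9  # illustrative values

def dummy_next_token_logits(article_ids, prefix_ids):
    # Stand-in for a real model forward pass; returns random vocabulary scores.
    return torch.randn(VOCAB_SIZE)

def greedy_decode(article_ids, max_len=5):
    summary = []
    for _ in range(max_len):
        logits = dummy_next_token_logits(article_ids, summary)
        next_id = int(torch.argmax(logits))
        if next_id == EOS_ID:
            break
        summary.append(next_id)
    return summary

print(greedy_decode(article_ids=[3, 1, 4, 1, 5]))
```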


What I don't understand is that people have already tried to use BERT (trained as a bidirectional encoder) as a left-to-right LM, and the results were quite poor.

Yet in your case, the results are very strong!

So my question is: how does UniLM avoid this discrepancy between fine-tuning and inference?

donglixp commented 4 years ago

Hi @Colanim,

Only the encoder is pre-trained in a bidirectional manner, while the decoder is left-to-right, which is controlled by the attention mask matrix. So the fine-tuning process is the same as inference in terms of decoding.

-li
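
(To make the attention-mask point concrete, here is a minimal PyTorch sketch of a seq2seq self-attention mask of the kind described above: source positions attend bidirectionally over the source, and target positions attend to the full source plus only their left context. The function name and shapes are illustrative, not the repo's actual implementation.)

```python
# Minimal sketch of a seq2seq self-attention mask for one Transformer that
# processes the concatenated [source ; target] sequence (1 = may attend).
import torch

def seq2seq_attention_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    total = src_len + tgt_len
    mask = torch.zeros(total, total, dtype=torch.long)
    # Source positions: bidirectional attention within the source segment.
    mask[:src_len, :src_len] = 1
    # Target positions: attend to the whole source...
    mask[src_len:, :src_len] = 1
    # ...and causally (lower-triangular) within the target segment.
    mask[src_len:, src_len:] = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.long))
    return mask

print(seq2seq_attention_mask(src_len=3, tgt_len=2))
# tensor([[1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])
```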

astariul commented 4 years ago

Thanks for the quick response @donglixp !

So if I understood correctly, for abstractive summarization there are 3 tasks:

  1. a left-to-right LM on the summary part, for the decoder
  2. a bidirectional LM on the article part, for the encoder
  3. an extractive task based on the first token

Is that right?

donglixp commented 4 years ago

Because the source side is given, during fine-tuning we only compute the generation loss for the decoder, which is similar to previous seq2seq models. In the paper we added an extractive loss on the encoder side, but we didn't use it in the repo's example. The released checkpoint achieves better results even without the extractive loss.
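
(To illustrate "generation loss only": in the sketch below, cross-entropy is evaluated on summary positions only, and the given article positions are excluded via an ignore label. This is a minimal hypothetical example, not the repo's code, and it omits details such as how target tokens are selected or shifted for prediction.)

```python
# Minimal sketch of computing loss only on the decoder (summary) side:
# article positions get an ignore label and contribute nothing to the loss.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are skipped by cross_entropy

def generation_loss(logits, gold_ids, src_len):
    # logits: (seq_len, vocab) over the concatenated [article ; summary] input
    # gold_ids: (seq_len,) gold token ids for the same positions
    labels = gold_ids.clone()
    labels[:src_len] = IGNORE_INDEX  # no loss on the given article tokens
    return F.cross_entropy(logits, labels, ignore_index=IGNORE_INDEX)

# Toy usage: 3 article tokens + 2 summary tokens, vocabulary of 10.
logits = torch.randn(5, 10)
gold_ids = torch.tensor([4, 1, 7, 2, 9])
print(generation_loss(logits, gold_ids, src_len=3))
```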

astariul commented 4 years ago

OK, so in the actual code there is only one loss: the generation loss for the decoder (i.e., a left-to-right LM on the summary).

Thank you very much for your answers!