google-research / pegasus


Finetuning Loss not decreasing on Custom Summarization Task [Help wanted] #47

Closed: rohitsroch closed this issue 4 years ago

rohitsroch commented 4 years ago

Hi, first of all, great paper! Lately, I have been working on abstractive summarization separately for an agent and a customer, given a conversation transcript between the two. We have around 700-1000 labeled data points (conversation transcripts) in total.

Currently, I am fine-tuning the released C4 + HugeNews checkpoint to perform abstractive summarization for speaker-1 (the agent). Following is the input/output format fed to the encoder/decoder:

# only the sentences corresponding to speaker-1 (agent); each sentence separated by a full stop
 Input: This is agent sentence-1. This is agent sentence-2. This is agent sentence-3.

# corresponding ground-truth summary
 Output: This is the agent summary
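For reference, here is a minimal sketch of how such input/output pairs could be serialized for fine-tuning, assuming a TFRecord-based pipeline with "inputs"/"targets" string features; the feature names and file path are assumptions and should be matched to whatever your pegasus data registry entry expects.

```python
import tensorflow as tf

def make_example(agent_sentences, summary):
    # Join the speaker-1 (agent) sentences into a single source document
    # and pair it with the ground-truth summary.
    inputs = " ".join(agent_sentences)
    feature = {
        # Assumed feature keys; align them with your dataset registration.
        "inputs": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[inputs.encode("utf-8")])),
        "targets": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[summary.encode("utf-8")])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Hypothetical output path for the training split.
with tf.io.TFRecordWriter("agent_summaries.train.tfrecord") as writer:
    example = make_example(
        ["This is agent sentence-1.", "This is agent sentence-2.",
         "This is agent sentence-3."],
        "This is the agent summary")
    writer.write(example.SerializeToString())
```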
JingqingZ commented 4 years ago

Hi, thanks for the question! Could you elaborate on what you mean by:

The loss is not decreasing after this point and not converging or stuck to local minima.

Does the loss ever decrease in the first 20 epochs? What ROUGE scores have you achieved by fine-tuning PEGASUS and T5?

Any plans to release PEGASUS_base?

Sorry, there is currently no plan to release the base models due to checkpoint incompatibility.

rohitsroch commented 4 years ago

Does the loss ever decrease in the first 20 epochs? What ROUGE scores have you achieved by fine-tuning PEGASUS and T5?

@JingqingZ Apologies for the confusion. Yes, the loss decreases smoothly for the first 15-20 epochs, but it doesn't converge. Below is the training loss plot for reference (learning rate 2e-4).

[Training loss curve plot]

I didn't use the beam search algorithm for decoding; these were my decoding parameters:

  beam_size = 1
  top_p = 0.95
  top_k = 50
  temperature = 0.5

NOTE: The scores below are averaged across 78 data points in the eval set.

PEGASUS_large

           ROUGE-1  ROUGE-2  ROUGE-L
precision  0.493    0.237    0.368
recall     0.532    0.263    0.403
fmeasure   0.486    0.237    0.365

T5_small

           ROUGE-1  ROUGE-2  ROUGE-L
precision  0.507    0.211    0.363
recall     0.443    0.189    0.322
fmeasure   0.455    0.192    0.329
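(For context, a small sketch of how per-example precision/recall/f-measure like the above can be computed and then averaged over the eval set with the rouge_score package; this is an assumption about the evaluation setup, not necessarily the exact script used here.)

```python
import numpy as np
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def average_rouge(references, predictions):
    # Score each (reference, prediction) pair and average precision,
    # recall and f-measure per ROUGE variant across the eval set.
    totals = {m: np.zeros(3) for m in ["rouge1", "rouge2", "rougeL"]}
    for ref, pred in zip(references, predictions):
        for metric, score in scorer.score(ref, pred).items():
            totals[metric] += [score.precision, score.recall, score.fmeasure]
    n = len(predictions)
    return {m: dict(zip(["precision", "recall", "fmeasure"], totals[m] / n))
            for m in totals}
```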
JingqingZ commented 4 years ago

Hi, thanks for the information!

I think the overall performance of PEGASUS (given the learning curve and ROUGE scores) looks reasonable, so I don't think anything is wrong there. But it can likely be improved by tuning some hyper-parameters, which requires some empirical experiments.

the loss decreases smoothly for the first 15-20 epochs, but it doesn't converge. Below is the training loss plot for reference (learning rate 2e-4).

It seems the loss is still decreasing, so the fine-tuning may need more steps. In Appendix C of our paper, we provide a full table of the hyper-parameters we used to fine-tune each dataset; most of them use more fine-tuning steps (and possibly a larger batch size) than yours. The learning rate can also be made smaller if the loss keeps fluctuating.
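As a rough illustration of the kind of adjustment this suggests, here is a hedged sketch of fine-tuning overrides; the values are hypothetical, and the parameter names should be checked against the params file for your dataset config and the table in Appendix C.

```python
# Hypothetical fine-tuning overrides, in the spirit of Appendix C:
# train longer than the current 15-20 epochs, use a larger batch if
# memory allows, and lower the learning rate below the current 2e-4.
finetune_overrides = {
    "train_steps": 10000,     # more fine-tuning steps
    "batch_size": 8,          # larger batch size, if it fits in memory
    "learning_rate": 1e-4,    # smaller learning rate if the loss fluctuates
}

# Could be serialized into a comma-separated override string for the
# training script, e.g. --param_overrides=<this string>.
param_overrides = ",".join(f"{k}={v}" for k, v in finetune_overrides.items())
```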

Considering the relatively small eval set of 78 examples, some slight fluctuation of the loss on the eval set is to be expected.

I didn't use the beam search algorithm for decoding

Beam search can actually improve ROUGE quite significantly, by a couple of points.
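As a minimal sketch of that change relative to the decoding settings above (parameter names are illustrative and should be matched to the decoding config actually in use):

```python
# Illustrative decoding overrides: replace beam_size = 1 (single-hypothesis
# sampled decoding) with a small beam and a length penalty, and drop sampling.
decode_overrides = {
    "beam_size": 5,      # beam search instead of single-hypothesis decoding
    "beam_alpha": 0.8,   # assumed length-penalty parameter; tune on the dev set
}
```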

Hope this may answer your questions!

rohitsroch commented 4 years ago

@JingqingZ, thanks a lot for the quick help. I will check Appendix C in the paper :) Closing this issue!