google-research / pegasus


Batch size effect on evaluation results #161

Closed agenius5 closed 3 years ago

agenius5 commented 3 years ago

Hi, @JingqingZ

Weirdly, I'm getting better results with a bigger batch size. I ran evaluation for the AESLC dataset on the fine-tuned checkpoint available on Google Cloud (32000 steps). I ran the code on a TPU with batch sizes of 8, 32, 128 and 256 and got the following results.

Batch size 8: 37.52 / 21.23 / 36.35
Batch size 32: 37.62 / 21.43 / 36.56
Batch size 128: 37.80 / 21.52 / 36.67
Batch size 256: 37.84 / 21.57 / 36.67

I looked it up on Google and found that this kind of behaviour usually has something to do with batch normalization. As far as I can tell the model uses layer normalization, but I am not really sure. I observed similar behaviour with other datasets when I fine-tuned on them, too.

So, what is the reason for such a big difference? Moreover, which result is more accurate, and what batch size did you guys use for evaluation?

JingqingZ commented 3 years ago

We observed the same and finally selected 256 as the batch size for fine-tuning. Please refer to Appendix C of our paper for details.

A larger batch size seems to generate better gradients.

agenius5 commented 3 years ago

Hey, sorry, I wrote my issue very poorly.

I mean that, irrespective of the batch size I use for fine-tuning, evaluation gives the best results only with a higher batch size. For example, I fine-tuned PEGASUS on AESLC with a batch size of 128. I would expect the evaluation results to be almost the same irrespective of the batch size I use for evaluation, but the ROUGE scores improved when I evaluated with a higher batch size.

I understand that during training a bigger batch size generates better gradients, but during evaluation the batch size shouldn't matter. Apparently it does. The numbers I posted in my first comment were obtained by evaluating the AESLC test set with various batch sizes.

So, how come the batch size (that I set before running evaluation) is affecting the evaluation results?

JingqingZ commented 3 years ago

In theory, batch size should not affect results in the evaluation phase.

Some ideas, but I am not sure:

  1. Decoding may not be deterministic.
  2. If the number of test samples is not a multiple of the batch size, the remaining samples could be discarded (see the sketch below).
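
For illustration, here is a minimal sketch of point 2 in plain TensorFlow 2.x (not the PEGASUS input pipeline; the dataset size and batch size are made up) showing how `drop_remainder` can silently shrink the set of examples that actually get scored:

```python
import tensorflow as tf

# Illustrative numbers only: 1000 test examples, batch size 256.
num_test_examples = 1000
batch_size = 256

ds = tf.data.Dataset.range(num_test_examples)

# drop_remainder=True throws away the final partial batch.
dropped = ds.batch(batch_size, drop_remainder=True)
kept = ds.batch(batch_size, drop_remainder=False)

print(sum(int(b.shape[0]) for b in dropped))  # 768  -> 232 examples never scored
print(sum(int(b.shape[0]) for b in kept))     # 1000 -> the full test set is scored
```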

The difference in ROUGE scores across the batch sizes you listed is lower than 1%, which I think is acceptable.

agenius5 commented 3 years ago

Yep, the remaining samples after the last batch were getting discarded. Setting drop_remainder=False fixed the issue. Many thanks!

A little off topic, but when can we say that we beat the results published in a previous paper? I have seen new papers achieving slightly better scores than PEGASUS (say, on the CNN/DailyMail dataset), but do they count? How much better is actually better? Also, I have seen some papers report scores saying "scores within 0.15...". What does that actually mean?

I have no prior experience in research, so if you can shed some light on this, it'll be a great help.

JingqingZ commented 3 years ago

I suppose, strictly, you may do statistical hypothesis testing to measure whether the difference is statistically significant. The 0.15 that appears in our paper and some other papers is, I assume, mostly empirical.
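
For concreteness, one common recipe is paired bootstrap resampling over per-example scores. A minimal sketch (my own illustration, not code from this repo; `scores_a` and `scores_b` are hypothetical per-example ROUGE-L F1 values for two systems on the same test set):

```python
import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10000, seed=0):
    """Fraction of bootstrap resamples in which system A does not beat system B."""
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a)
    scores_b = np.asarray(scores_b)
    n = len(scores_a)
    reversals = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample test examples with replacement
        if scores_a[idx].mean() - scores_b[idx].mean() <= 0:
            reversals += 1
    return reversals / n_resamples

# Hypothetical per-example ROUGE-L F1 scores for two systems on the same test set.
rng = np.random.default_rng(1)
scores_b = rng.uniform(0.2, 0.6, size=500)                  # baseline system
scores_a = scores_b + rng.normal(0.002, 0.05, size=500)     # new system, ~0.2-point mean gain
# A value well above 0.05 suggests the small gain is not statistically significant.
print(paired_bootstrap_pvalue(scores_a, scores_b))
```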

agenius5 commented 3 years ago

Thanks a lot.