Closed agenius5 closed 3 years ago
We observed the same and finally selected 256 as the batch size for fine-tuning. Please refer to Appendix C of our paper for details.
A larger batch size seems to generate better gradients.
Hey, sorry, I wrote my issue very poorly.
I mean, irrespective of the batch size I use for fine-tuning, evaluation gives the best results only with a higher batch size. For example, I fine-tuned Pegasus on AESLC with a batch size of 128. I would expect the evaluation results to be almost the same irrespective of the batch size I use for evaluation, but the ROUGE score improved when I evaluated with a higher batch size.
I understand that during training a bigger batch size generates better gradients, but during evaluation the batch size shouldn't matter, yet apparently it does. The numbers I posted in my first comment were obtained when I evaluated the AESLC test set with various batch sizes.
So, how come the batch size (that I set before running evaluation) is affecting the evaluation results?
In theory, batch size should not affect results in the evaluation phase.
Some ideas, but I am not sure:
The difference in ROUGE scores across the batch sizes you listed is lower than 1%, which I think is acceptable.
Yep, remainder samples after the last batch were getting discarded. Setting drop_remainder=False fixed the issue. Many thanks!
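For anyone else hitting this, here is a minimal sketch of the kind of tf.data input pipeline where the flag matters; load_test_examples() is a hypothetical loader standing in for the real test-set input function, not code from the repo:

```python
import tensorflow as tf

def make_eval_dataset(batch_size):
    # load_test_examples() is a hypothetical stand-in for the actual
    # AESLC test-set loader.
    dataset = tf.data.Dataset.from_tensor_slices(load_test_examples())
    # drop_remainder=False keeps the final, smaller batch so every test
    # example gets scored; with drop_remainder=True, up to (batch_size - 1)
    # examples at the end of the dataset are silently discarded, which is
    # why larger batch sizes were evaluating a slightly different subset.
    return dataset.batch(batch_size, drop_remainder=False)
```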
A little off topic, but when can we say that we beat the results published in a previous paper? I have seen new papers achieving slightly better scores than Pegasus (say, on the CNN/DailyMail dataset), but do they count? How much better is actually better? Also, I have seen some papers report scores saying "scores within 0.15...". What does that actually mean?
I have no prior experience in research, so if you can shed some light on this, it'll be a great help.
Strictly speaking, I suppose you could run a statistical hypothesis test to measure whether the difference is statistically significant. The 0.15 that appears in our paper and some other papers is mostly empirical, I assume.
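If it helps, one common way to do this is a paired bootstrap test over per-example ROUGE scores; the sketch below assumes `scores_a` and `scores_b` hold per-document scores for two systems on the same test set, and is only an illustration, not the procedure used in the paper:

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate how often system A beats system B under resampling."""
    rng = np.random.default_rng(seed)
    # Paired per-example score differences on the same test documents.
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(diffs)
    wins = 0
    for _ in range(n_resamples):
        sample = rng.choice(diffs, size=n, replace=True)
        if sample.mean() > 0:
            wins += 1
    # Fraction of resamples where A outperforms B; a value close to 1.0
    # suggests the improvement is unlikely to be a test-set artifact.
    return wins / n_resamples
```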
Thanks a lot.
Hi, @JingqingZ
Weirdly, I'm getting better results with a bigger batch size. I ran evaluation for the AESLC dataset on the fine-tuned checkpoint available on Google Cloud (32000 steps). I ran the code on a TPU with batch sizes of 8, 32, 128 and 256 and got the following results.
Batch size 8: 37.52 / 21.23 / 36.35
Batch size 32: 37.62 / 21.43 / 36.56
Batch size 128: 37.80 / 21.52 / 36.67
Batch size 256: 37.84 / 21.57 / 36.67
I looked it up on Google and found that it might have something to do with batch normalization, but I saw the model uses layer normalization, so I am not really sure. I observed similar behaviour with other datasets too when I fine-tuned them.
So, what's the reason for such a difference? Moreover, which result is more accurate, and what batch size did you guys use for evaluation?