neulab / compare-mt

A tool for holistic analysis of language generation systems
BSD 3-Clause "New" or "Revised" License

Consider model variance in bootstrap resampling test #126

Open odashi opened 2 years ago

odashi commented 2 years ago

This issue raises a problem with using the so-called "bootstrap resampling test" to evaluate the "statistical significance" of machine translation methods (especially neural MT), and of similar generation tasks that are evaluated with MT metrics.

In this test, the evaluator randomly resamples subsets of the generated sentences to simulate the distribution of model outputs, but does not consider the variance of the trained model itself.
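To make the procedure concrete, here is a minimal sketch of a paired bootstrap over sentence-level scores (the function name, inputs, and sample count are illustrative assumptions, not compare-mt's actual implementation). Note that the only source of randomness is which test sentences are drawn; the single trained model is fixed throughout.

```python
import random

def paired_bootstrap(sys_scores, base_scores, n_samples=1000, seed=0):
    """Paired bootstrap over per-sentence metric scores of ONE system and ONE baseline.

    sys_scores / base_scores: hypothetical per-sentence scores on the same test set.
    Only the test sentences are resampled; the trained models never change.
    """
    assert len(sys_scores) == len(base_scores)
    rng = random.Random(seed)
    n = len(sys_scores)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentence indices with replacement
        sys_mean = sum(sys_scores[i] for i in idx) / n
        base_mean = sum(base_scores[i] for i in idx) / n
        if sys_mean > base_mean:
            wins += 1
    return wins / n_samples  # fraction of resamples in which this particular system wins
```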

Consider that we have a baseline model to beat and a champion model of the proposed method. The champion model may be produced without the authors recognizing it as such; e.g., if the model was trained only a few times, nobody can judge whether that model is an outlier of the model distribution or not.

In this situation, the "bootstrap resampling test" may judge the proposed method's model to be significantly better, but the evaluation was actually performed on only one model variant, which may be a champion, and did not consider any distributional properties of the proposed method.

The "bootstrap resampling test" was introduced on the era of statistical MT, and I guessed the method historically produced reasonable judgements for SMT systems because their study was usually investigating some additions of almost-fixed systems such as Moses (note that I said "almost-fixed" here because they also had random tuning for hyperparameters). In neural MT systems, this assumption had gone because the systems were randomly trained from scratch, and the "bootstrap resampling test" may no longer produce meaningful results but rather give the model a wrong authority.

I have continuously observed that the "bootstrap resampling test" is still used in many papers to claim "statistical significance" of a model, and I am strongly worried that this misleads this line of research.

neubig commented 2 years ago

I agree, but at the same time I think any statistical testing is probably better than none. Rather than removing testing altogether, it would probably be better to implement testing that also accounts for optimizer instability, such as that described in this paper: https://aclanthology.org/P11-2031/
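To sketch the idea (this is not the exact procedure from Clark et al., and not an existing compare-mt API): assuming we already have corpus-level scores from several independent training runs of each system, a simple randomization test over run-level scores could look like this.

```python
import random
import statistics

def multi_run_randomization_test(sys_run_scores, base_run_scores, n_perm=10000, seed=0):
    """Illustrative significance test over corpus-level scores from several training
    runs (e.g. different random seeds) per system, instead of a single checkpoint.

    Only a sketch of controlling for optimizer instability; see Clark et al. (2011)
    for the procedure actually proposed there.
    """
    rng = random.Random(seed)
    observed = statistics.mean(sys_run_scores) - statistics.mean(base_run_scores)
    pooled = list(sys_run_scores) + list(base_run_scores)
    k = len(sys_run_scores)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign run scores to the two systems
        diff = statistics.mean(pooled[:k]) - statistics.mean(pooled[k:])
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_perm  # approximate p-value under "both methods have the same mean score"
```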

odashi commented 2 years ago

> I think any statistical testing is probably better than none.

I don't fully agree with this. A problem with employing statistical testing is that users and reviewers sometimes believe the results regardless of their appropriateness (this also happens with other tests; see the so-called "p-value faith"). Many papers have used the "bootstrap resampling test" despite the property I noted in the first comment, and I think "not sure" is much better than a false authority in this case.

neubig commented 2 years ago

That's a fair point. For the time being I've explained this in a little more detail in the README: https://github.com/neulab/compare-mt#significance-tests

The longer-term solution would be to implement tests such as the ones proposed by Clark et al. above in compare-mt. For the time being, anyone who finds this issue and wants to control for random seed selection can use multeval instead.

odashi commented 2 years ago

Thanks for clarifying! I think it would be helpful to add a link to this issue in the README, since further discussion may happen here, and to change the heading "statistical testing" to "... for single models" or something similar.

kpu commented 2 years ago

> I agree, but at the same time I think any statistical testing is probably better than none.

I think a statistical test that always claims significance (and bootstrap does, in my experience) is worse than none at all. The papers that do run tests usually have an effect size too small to be useful and gussy it up behind a probably-random significance test. I find it especially annoying when reviewers demand significance tests, when they should know there isn't really one that works.

odashi commented 2 years ago

Every statistical test is meaningful if and only if the underlying hypothesis is suitable. For bootstrap resampling, the H0 is that this particular system produces the same accuracy as this particular baseline, so it may be usable if the authors really want to reject that H0. But some papers inadvertently use this test to argue the significance of a method, which involves some distribution over models. That kind of judgement is infeasible unless the method produces the same system every time.

> too small of an effect size

Yes, this is also a problem caused by ignoring model variance...
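To illustrate how these two problems interact, here is a toy example with made-up numbers (nothing below comes from a real experiment): when the seed-to-seed spread of the method's scores is comparable to the single-run gap over the baseline, a test run on one checkpoint says little about the method as a whole.

```python
# All numbers are hypothetical, purely to contrast the two hypotheses being tested.
single_checkpoint_bleu = 27.4                      # the one run of the proposed method that was evaluated
baseline_bleu = 26.9                               # the one baseline run
method_bleu_over_seeds = [26.5, 27.4, 26.8, 27.1]  # what a method-level claim would actually need

gap = single_checkpoint_bleu - baseline_bleu
spread = max(method_bleu_over_seeds) - min(method_bleu_over_seeds)
print(f"single-run gap: {gap:.1f} BLEU, seed-to-seed spread: {spread:.1f} BLEU")
# Here the spread (0.9) exceeds the gap (0.5), so "significance" measured on the
# single checkpoint can flip depending on which seed happened to be chosen.
```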