neulab / ExplainaBoard

Interpretable Evaluation for AI Systems

Potential issue with spearman R bootstrapping #537

Closed: neubig closed this issue 1 year ago

neubig commented 1 year ago

We observed the following test failure when integrating another PR:

======================================================================
FAIL: test_sample_level_spearmanr_bootstrap (integration_tests.meta_eval_wmt_da_test.MetaEvalNLGCITest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ExplainaBoard/ExplainaBoard/integration_tests/meta_eval_wmt_da_test.py", line 191, in test_sample_level_spearmanr_bootstrap
    self.assertAlmostEqual(ci[0], 0.6488, 2)
AssertionError: 0.7325904563487001 != 0.6488 within 2 places (0.08379045634870008 difference)

----------------------------------------------------------------------

We are not sure whether this is an issue with the test or the underlying code, but as a temporary measure we reduced the sensitivity of the test. We should go back and check whether this is just due to bootstrapping variance or due to a bug in the test itself.
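
For reference, the failing assertion compares the lower bound of the bootstrapped confidence interval against a fixed value to 2 decimal places. A minimal sketch of what loosening that tolerance can look like; the exact change merged into the codebase may differ:

```python
# Sketch only, based on the assertion shown in the traceback above.

# Original assertion: fails whenever bootstrap noise pushes ci[0] more
# than ~0.005 away from the reference value.
#     self.assertAlmostEqual(ci[0], 0.6488, 2)

# A more lenient variant, comparing to 1 decimal place ...
self.assertAlmostEqual(ci[0], 0.6488, 1)
# ... or with an explicit absolute tolerance:
# self.assertAlmostEqual(ci[0], 0.6488, delta=0.1)
```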

odashi commented 1 year ago

The randomness may come from the fact that we don't fix the random seed:

https://github.com/neulab/ExplainaBoard/blob/9ba998c125587bb5a9b421706b1ca50028af220b/explainaboard/metrics/metric.py#L571

I think a possible fix is to:

  1. Provide a way to fix the seed from outside.
  2. Implement a test case with almost-equal assertions over several seeds (see the sketch below).
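
A self-contained sketch of both points, assuming a hypothetical bootstrap helper rather than ExplainaBoard's actual metric API: the seed is passed in from outside, and the resulting confidence interval is checked for stability across several fixed seeds.

```python
import numpy as np
from scipy import stats


def bootstrap_spearman_ci(x, y, n_samples=1000, alpha=0.05, seed=None):
    """Bootstrap a confidence interval for Spearman's rho.

    The `seed` argument is point 1 above ("fix the seed from outside");
    seed=None keeps the current non-deterministic behaviour.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    rhos = []
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        rho, _ = stats.spearmanr(x[idx], y[idx])
        rhos.append(rho)
    return np.quantile(rhos, [alpha / 2, 1 - alpha / 2])


# Point 2: almost-equal assertions over several seeds. The interval
# should stay stable (within bootstrap noise) whichever seed is fixed.
data_rng = np.random.default_rng(0)
x = data_rng.normal(size=200)
y = x + data_rng.normal(scale=0.5, size=200)
lo_ref, hi_ref = bootstrap_spearman_ci(x, y, seed=0)
for seed in (1, 2, 3):
    lo, hi = bootstrap_spearman_ci(x, y, seed=seed)
    assert abs(lo - lo_ref) < 0.05 and abs(hi - hi_ref) < 0.05
```
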
odashi commented 1 year ago

NumPy provides SeedSequence functionality, which allows us to hierarchically initialize random seeds across the whole library. I think it would be better if every class that calculates something took a SeedSequence argument in __init__.
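
A minimal sketch of the idea; `BootstrapMetric` below is illustrative, not ExplainaBoard's actual class:

```python
from typing import Optional

import numpy as np


class BootstrapMetric:
    """Illustrative component that owns its own random generator."""

    def __init__(self, seed_sequence: Optional[np.random.SeedSequence] = None):
        # Each object derives its own Generator from the SeedSequence it
        # was handed; None falls back to fresh OS entropy.
        self._rng = np.random.default_rng(seed_sequence)


# One top-level seed is enough: spawn() yields independent children, so
# components get reproducible but non-overlapping random streams.
root = np.random.SeedSequence(12345)
child_a, child_b = root.spawn(2)
metric_a = BootstrapMetric(child_a)
metric_b = BootstrapMetric(child_b)
```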

neubig commented 1 year ago

That seems reasonable to me.

tetsuok commented 1 year ago

I'm seeing this failure with high probability. It's time-consuming to re-run the integration tests just to make a PR green. Is someone already working on a fix?

neubig commented 1 year ago

We have a temporary fix that makes the test more lenient, but it is mixed in with this PR: https://github.com/neulab/ExplainaBoard/pull/532

It might be worth pulling just that fix out into a separate PR.

tetsuok commented 1 year ago

@neubig I think we should create a separate PR to fix this flaky test as soon as possible, given that the PR is still in review.

neubig commented 1 year ago

I agree. I'm not able to do it right now, but I welcome anyone else who can.

tetsuok commented 1 year ago

@neubig I'll create a PR to cherry-pick the test change from #532. I will add comments to #532, too.

odashi commented 1 year ago

Plan to apply the change:

  1. Implement a seed argument in Metric.__init__, defaulting to None. If seed is None, it is initialized randomly.
  2. Implement Metric.get_seed to return the seed that the object holds.
  3. In inherited classes, implement the randomizer by initializing it with a seed spawned from the seed returned by get_seed (see the sketch below).
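
A hedged sketch of how those three steps could fit together; the class hierarchy and surrounding details are assumptions rather than ExplainaBoard's actual code:

```python
from typing import Optional

import numpy as np


class Metric:
    def __init__(self, seed: Optional[np.random.SeedSequence] = None):
        # Step 1: accept a seed defaulting to None; None means the
        # SeedSequence is initialized randomly from OS entropy.
        self._seed = seed if seed is not None else np.random.SeedSequence()

    def get_seed(self) -> np.random.SeedSequence:
        # Step 2: return the seed that the object holds.
        return self._seed


class BootstrappedMetric(Metric):
    def __init__(self, seed: Optional[np.random.SeedSequence] = None):
        super().__init__(seed)
        # Step 3: build the randomizer from a seed spawned off the seed
        # returned by get_seed, keeping streams independent yet
        # reproducible when the top-level seed is fixed.
        (child,) = self.get_seed().spawn(1)
        self._rng = np.random.default_rng(child)
```
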
tetsuok commented 1 year ago

Assigning to me as per the offline discussion with @odashi.