neulab / ExplainaBoard

Interpretable Evaluation for AI Systems
MIT License

Fix confidence interval calculation using t-test #599

Closed neubig closed 1 year ago

neubig commented 1 year ago

Overview

There was an error in how confidence intervals were calculated using Student's t-test, which made them far too wide. This PR fixes it.

Details

When calculating the confidence interval of a sample mean using Student's t-test, the t distribution must be scaled by the standard deviation of the sample mean (the standard error, i.e. the sample standard deviation divided by the square root of the sample size). However, we were scaling by the standard deviation of the individual samples, which made the intervals incorrect.
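To illustrate the difference, here is a minimal sketch using only the Python standard library. It is not the ExplainaBoard implementation: the function names, the example scores, and the hard-coded critical value are all assumptions made for illustration. The critical value 2.262 is the two-sided 95% value for 9 degrees of freedom; in practice it would typically come from something like `scipy.stats.t.ppf(0.975, df)`.

```python
import math
import statistics


def mean_confidence_interval(values, t_crit):
    """Interval scaled by the standard error of the mean (s / sqrt(n))."""
    n = len(values)
    mean = statistics.fmean(values)
    sample_std = statistics.stdev(values)  # uses the n - 1 denominator
    std_err = sample_std / math.sqrt(n)
    return mean - t_crit * std_err, mean + t_crit * std_err


def too_wide_interval(values, t_crit):
    """Interval scaled by the per-sample standard deviation (the bug described above)."""
    mean = statistics.fmean(values)
    sample_std = statistics.stdev(values)
    return mean - t_crit * sample_std, mean + t_crit * sample_std


# Hypothetical per-example scores, for illustration only.
scores = [0.71, 0.68, 0.74, 0.69, 0.73, 0.70, 0.72, 0.67, 0.75, 0.71]

# Two-sided 95% critical value of the t distribution with n - 1 = 9 degrees of freedom.
T_CRIT_95_DF9 = 2.262

correct = mean_confidence_interval(scores, T_CRIT_95_DF9)
wide = too_wide_interval(scores, T_CRIT_95_DF9)

print("scaled by standard error:  ", correct)
print("scaled by sample std. dev.:", wide)
```

Under these assumptions the second interval is wider than the first by a factor of the square root of the sample size, so the error grows as more samples are evaluated.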

Fixes https://github.com/neulab/explainaboard_web/issues/541 which also provides a bit more context.

Also see discussion here for mathematical justification.

Blocked by #598