huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
https://huggingface.co/docs/evaluate
Apache License 2.0

[FR] Confidence intervals for metrics #581

Open NightMachinery opened 2 months ago

NightMachinery commented 2 months ago

It seems that currently simple metrics such as

evaluate.load("accuracy")

do not compute a confidence interval. This could easily be fixed by computing the mean and the STD of the per-sample scores, and then dividing the STD by the square root of the sample count (to get the standard error of the mean estimate). (See, e.g., here.)
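For illustration, a minimal NumPy sketch of that computation (the function name is just for illustration; it assumes per-sample 0/1 correctness and a normal approximation, with 1.96 giving a 95% interval):

import numpy as np

def accuracy_with_ci(predictions, references, z=1.96):
    # Per-sample correctness: 1.0 if the prediction matches the reference, else 0.0.
    correct = (np.asarray(predictions) == np.asarray(references)).astype(float)
    n = correct.size
    mean = correct.mean()
    std = correct.std(ddof=1)        # sample standard deviation
    stderr = std / np.sqrt(n)        # standard error of the mean estimate
    # Normal-approximation confidence interval around the mean accuracy.
    return {
        "accuracy": mean,
        "stderr": stderr,
        "ci_low": mean - z * stderr,
        "ci_high": mean + z * stderr,
    }

print(accuracy_with_ci([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))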

Even just returning the variance (or STD) would be enough; users can do their own computations from there.
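As a stopgap, something like this user-side wrapper (a hypothetical helper, not part of the library) could attach the standard error to the dict that evaluate already returns:

import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_with_stderr(predictions, references):
    # Keep the library's point estimate, then add the standard error
    # computed from the per-sample correctness scores.
    result = accuracy.compute(predictions=predictions, references=references)
    correct = (np.asarray(predictions) == np.asarray(references)).astype(float)
    result["accuracy_stderr"] = correct.std(ddof=1) / np.sqrt(correct.size)
    return result

print(compute_with_stderr([0, 1, 1, 0], [0, 1, 0, 0]))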