neulab / ExplainaBoard

Interpretable Evaluation for AI Systems
MIT License

aggregate_stats and calc_metric_from_aggregate don't work in some cases. #497

Closed odashi closed 1 year ago

odashi commented 1 year ago

Metric.aggregate_stats and Metric.calc_metric_from_aggregate do not guarantee that each implementation returns an ndarray with the correct shape. These functions must return the following arrays:

But several implementations return other shapes, including the default implementations in Metric.

This has several harmful consequences. A serious one is that the bootstrapped CI never returns correct data, because the inner sort() does not operate along the correct axis (I observed this when adding a unit test for calc_confidence_interval).
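To illustrate the kind of failure described above, here is a minimal sketch in plain NumPy. It is not the ExplainaBoard implementation: `naive_ci`, the sample values, and the array shapes are all hypothetical, chosen only to show how a sort along the default (last) axis silently does nothing when the bootstrap samples arrive with an unexpected trailing axis.

```python
import numpy as np


def naive_ci(samples, alpha=0.05, axis=-1):
    """Hypothetical percentile CI: sort, then index along axis 0."""
    # np.sort uses the last axis by default. If `samples` has shape
    # (n, 1), that axis has length 1 and nothing is reordered.
    sorted_samples = np.sort(samples, axis=axis)
    n = sorted_samples.shape[0]
    lo_idx = int(round(n * alpha / 2))
    hi_idx = n - lo_idx
    return sorted_samples[lo_idx].item(), sorted_samples[hi_idx].item()


# Hypothetical bootstrap results, deliberately in descending order
# so that an ineffective sort is easy to spot.
scores = np.arange(1000, dtype=float)[::-1] / 1000

# Expected 1-D shape (n,): the sort runs over the samples.
good = naive_ci(scores)

# Unexpected 2-D shape (n, 1): the sort runs over the length-1 axis,
# so the "interval" is read from unsorted data and comes out inverted.
bad = naive_ci(scores.reshape(-1, 1))

# Sorting explicitly along axis 0 restores the correct interval.
fixed = naive_ci(scores.reshape(-1, 1), axis=0)

print("good :", good)
print("bad  :", bad)
print("fixed:", fixed)
```

In this sketch the lower bound of `bad` is larger than its upper bound, which is the sort of result a unit test on the confidence interval would catch. Pinning down the returned shape in every implementation, or sorting along an explicit axis, would avoid it.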