aggregate_stats and calc_metric_from_aggregate don't work in some cases.

Metric.aggregate_stats and Metric.calc_metric_from_aggregate do not guarantee that each implementation returns an ndarray with the correct shape. These functions must return arrays of the following shapes:

aggregate_stats ... [num_stats] or [num_batches, num_stats]
calc_metric_from_aggregate ... [] or [num_batches]
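The contract above can be written down as a small shape check. This is only a sketch: the helper names `check_aggregate_shape` and `check_metric_shape` are hypothetical and are not part of ExplainaBoard. It relies solely on the two shape rules listed above.

```python
import numpy as np


def check_aggregate_shape(agg: np.ndarray) -> None:
    """Hypothetical helper: aggregate stats must be [num_stats] or [num_batches, num_stats]."""
    if agg.ndim not in (1, 2):
        raise ValueError(f"aggregate stats must be 1-D or 2-D, got shape {agg.shape}")


def check_metric_shape(agg: np.ndarray, metric) -> None:
    """Hypothetical helper: the metric must drop the trailing num_stats axis.

    [num_stats]              -> []
    [num_batches, num_stats] -> [num_batches]
    """
    check_aggregate_shape(agg)
    metric = np.asarray(metric)  # also accepts Python/NumPy scalars (shape ())
    expected = agg.shape[:-1]
    if metric.shape != expected:
        raise ValueError(f"metric must have shape {expected}, got {metric.shape}")
```

For example, a metric of shape [num_batches, 1] computed from aggregates of shape [num_batches, num_stats] would be rejected by this check, while a metric of shape [num_batches] would pass.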

But several implementations return other shapes, including the default implementations in Metric.

This has several bad consequences. A serious one is that the bootstrapped CI never returns correct data, because the inner sort() does not operate along the correct axis (I observed this when adding a unit test for calc_confidence_interval).
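The sort problem can be reproduced outside the library. The snippet below is not the code of calc_confidence_interval; `percentile_bounds` is a hypothetical stand-in that assumes the interval is read off a sorted array of bootstrap metric values. It shows how one extra trailing axis silently disables the sort.

```python
import numpy as np


def percentile_bounds(samples: np.ndarray, alpha: float = 0.05):
    """Hypothetical stand-in for a bootstrap percentile interval.

    Expects `samples` of shape [num_batches].
    """
    ordered = np.sort(samples)  # np.sort defaults to axis=-1
    n = ordered.shape[0]
    lo_idx = int(np.floor(n * alpha / 2))
    hi_idx = int(np.ceil(n * (1 - alpha / 2))) - 1
    return ordered[lo_idx], ordered[hi_idx]


samples = np.array([5.0, 1.0, 4.0, 2.0, 3.0])

# Correct shape [num_batches]: the values are sorted before indexing.
good = percentile_bounds(samples, alpha=0.4)

# Wrong shape [num_batches, 1]: the sort runs along the length-1 axis,
# so the rows stay in their original order and the bounds are meaningless.
bad = percentile_bounds(samples[:, np.newaxis], alpha=0.4)
```

With the correct shape the bounds come from the sorted values; with the extra axis they are simply whatever happens to sit at those row positions, and no error is raised.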
