neulab / ExplainaBoard

Interpretable Evaluation for AI Systems
MIT License

Allowed specification of the metric #dimensions #528

Closed: neubig closed this 1 year ago

neubig commented 1 year ago

This PR loosens the restriction that sufficient statistics must be a vector, and allows them to be a tensor with the dimension equal to Metric.stats_ndim().

It also demonstrates how this works on the NLGMetaEvaluation metric.
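Concretely, the relaxed contract can be sketched as follows. This is an illustrative sketch, not ExplainaBoard's actual API: only `stats_ndim()` is named in the PR, and the class and variable names here are made up.

```python
import numpy as np

class VectorStatsMetric:
    """Illustrative: the classic case, one statistics vector per example."""
    def stats_ndim(self) -> int:
        return 1

class MatrixStatsMetric:
    """Illustrative: the relaxed case, a 2-D statistics tensor per example,
    e.g. for a meta-evaluation metric like NLGMetaEvaluation."""
    def stats_ndim(self) -> int:
        return 2

# With stats_ndim() == 2, stacking n examples yields shape (n, d1, d2)
# instead of the vector case's (n, d).
stats = np.zeros((5, 3, 4))
assert stats.ndim == MatrixStatsMetric().stats_ndim() + 1  # +1 for the example axis
```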

@pfliu-nlp and @odashi : could you please check this PR as a potential solution to the discussion in https://github.com/neulab/ExplainaBoard/pull/527 ?

(sorry, after sending the review request I made a change of naming from dim->ndim, which I think is more in line with the naming in numpy)

pfliu-nlp commented 1 year ago

In general, I think if we want to keep stats as-is, this statement should be adjusted:

assert result.shape == result_shape

https://github.com/neulab/ExplainaBoard/blob/e3bca2acc4392a1b56f242e3109e8384974a5019/explainaboard/metrics/metric.py#L415

Note that result.shape will be a tuple with multiple values.
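To illustrate the point with made-up arrays (not the library's actual code): once per-example statistics are tensors, the aggregated result keeps extra dimensions, so a shape check built for the vector case no longer matches.

```python
import numpy as np

# Vector statistics: aggregating over the example axis yields a 1-D result,
# so comparing against a one-element shape tuple works.
vector_stats = np.random.rand(10, 4)       # (num_examples, num_statistics)
vector_result = vector_stats.mean(axis=0)  # shape (4,)
assert vector_result.shape == (4,)

# Tensor statistics: the aggregated result's shape is a tuple with multiple
# values, which the original single-shape assertion would reject.
tensor_stats = np.random.rand(10, 3, 4)    # (num_examples, d1, d2)
tensor_result = tensor_stats.mean(axis=0)  # shape (3, 4)
assert tensor_result.shape == (3, 4)
```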

neubig commented 1 year ago

Thanks! I had actually already fixed that in https://github.com/neulab/ExplainaBoard/pull/528/commits/8d5471d40ab8b2a4b90dabbdc0b3b158d08e661a because tests were failing.

pfliu-nlp commented 1 year ago

Cool! I found it! Last comment:

Do you think the following code will ignore the case where uses_customized_aggregate() is true but we don't perform the significance test (i.e., the stats are not batched)? Or is this on purpose?

    if self.uses_customized_aggregate():
        if stats.is_batched():
            assert stats.get_batch_data().shape[0] == result.shape[0], (
                "BUG: invalid operation: "
                f"{type(self).__name__}._aggregate_stats(): "
                f"Expected batch dimension {stats.get_batch_data().shape[0]}, but "
                f"got {result.shape[0]}."
            )
    else:
        result_shape = (
            (stats.get_batch_data().shape[0], stats.num_statistics())
            if stats.is_batched()
            else (stats.num_statistics(),)
        )
        assert result.shape == result_shape, (
            "BUG: invalid operation: "
            f"{type(self).__name__}._aggregate_stats(): "
            f"Expected shape {result_shape}, but got {result.shape}."
        )
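For reference, the branch coverage of the snippet above can be summarized like this. This is a standalone sketch that mirrors the snippet's structure; it is not library code, and the helper name is made up.

```python
def shape_check_performed(uses_customized_aggregate: bool, is_batched: bool) -> str:
    """Summarize which check the snippet above applies in each case."""
    if uses_customized_aggregate:
        if is_batched:
            return "batch dimension only"
        # The case in question: customized aggregate without batching
        # falls through with no check at all.
        return "no check"
    return "full shape check"

assert shape_check_performed(True, False) == "no check"
```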
neubig commented 1 year ago

Yeah, that is intentional. It catches when the batch dimension differs, but doesn't do any checks otherwise. We might want to add additional checks, but I'm not sure what they would be.

pfliu-nlp commented 1 year ago

> Yeah, that is intentional. It catches when the batch dimension differs, but doesn't do any checks otherwise. We might want to add additional checks, but I'm not sure what they would be.

OK, then I think it should be fine. (not sure if @odashi has other comments)

odashi commented 1 year ago

@neubig I think this mitigation is not necessary at this point. Please take a look at alternatives in #527

neubig commented 1 year ago

Thanks @odashi ! To clarify: yes, I saw the mitigation strategies there, but I feel they're relatively complicated. Even this part feels a bit hacky to me: https://github.com/neulab/ExplainaBoard/pull/528/files#diff-dc342aa901c2256e1da8308da557c5b23a17aab2d0383c0a2a795057ecbe61bdL116, and adding the dimensions as stats seems even more hacky.

The PR here seems to be a reasonable middle ground. It relaxes the dimension matching requirement a little bit but only in the case of use_customized_aggregate(), which we will only use sparingly.

odashi commented 1 year ago

@neubig This change grants unnecessarily broad permissions to Metric (not only for the problems in #527, but for any Metric in the future), which looks too dangerous to me. I understand that the problem in #527 is strictly metric-specific, so it should be resolved within the specific metric implementation itself, as long as that is possible.

neubig commented 1 year ago

To be clear, it only gives more permission to metrics that use use_customized_aggregate(), not to every Metric. I think use_customized_aggregate() should be used infrequently, and its presence is itself a reason for increased scrutiny of the implementation's correctness during code review (maybe the function should be commented as such?).

We could also perhaps make the check above more stringent if there is some part of the relaxed checks that is particularly worrisome.

odashi commented 1 year ago

I think nlg_meta_evaluation is no longer suitable to be implemented as a Metric; the requirement of the underlying data is completely different.

odashi commented 1 year ago

@neubig It means that use_customized_aggregate() can become a way to hack the Metric.

neubig commented 1 year ago

Thanks @odashi , and yes, I totally see what you mean. I didn't think about Metrics that rely on sufficient statistics that aren't a "reduce" (sum or mean) of the sufficient statistics of each example when I first designed the Metric interface, and I think a larger refactoring is in order. We can think of this PR as a temporary fix.
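To illustrate the distinction with hypothetical functions (not ExplainaBoard code): a metric like accuracy aggregates by a simple reduce over examples, while a meta-evaluation metric such as a rank correlation needs all of the paired scores at once and cannot be expressed as a per-example sum or mean.

```python
import numpy as np

def accuracy_aggregate(stats: np.ndarray) -> np.ndarray:
    """Fits the original interface: a mean over the example axis."""
    return stats.mean(axis=0)

def rank_correlation_aggregate(scores: np.ndarray) -> float:
    """Not a reduce: a Spearman-style correlation between human and system
    scores (columns 0 and 1) needs the full set of examples jointly.
    Assumes no ties in either score list."""
    human, system = scores[:, 0], scores[:, 1]
    hr = human.argsort().argsort().astype(float)   # ranks of human scores
    sr = system.argsort().argsort().astype(float)  # ranks of system scores
    hr -= hr.mean()
    sr -= sr.mean()
    # Pearson correlation of the centered ranks.
    return float((hr @ sr) / np.sqrt((hr @ hr) * (sr @ sr)))
```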