neulab / ExplainaBoard

Interpretable Evaluation for AI Systems
MIT License

Add task metaevaluation for nlg #527

Closed pfliu-nlp closed 1 year ago

pfliu-nlp commented 1 year ago

Blocked by: https://github.com/neulab/ExplainaBoard/pull/526

Building on the evaluation metrics implemented in PR 526, this PR introduces the task processor.

Notably, the shape check in aggregate_stats() is still too strict: https://github.com/neulab/ExplainaBoard/blob/b31d5d6506bdb6fb633b836ef798f25488f4052d/explainaboard/metrics/metric.py#L406 I relax it further in this PR. We can discuss this more.

neubig commented 1 year ago

Comment: I wonder if you'd be able to use a method for aggregation like I did here? https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/metrics/nlg_meta_evaluation.py#L115-L141

It seems that it would then be possible to avoid removing that check.

pfliu-nlp commented 1 year ago

@neubig (@odashi) I thought about this, but I don't think it would work, because once we perform

data.reshape((data.shape[0], data.shape[-2] * data.shape[-1]))

it's hard for us to recover the data's original shape. (In the case above, data.shape[-1] has a fixed size, hard-coded as 4. In the current case, it's dynamic.)
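To illustrate the concern, here is a minimal sketch using plain NumPy. It is not ExplainaBoard code: the helper name `flatten_last_two` and the array sizes are assumptions chosen for illustration. It shows that two arrays with different trailing shapes collapse to the same 2-D shape, so the flattened array alone does not say how to undo the reshape.

```python
import numpy as np


def flatten_last_two(data: np.ndarray) -> np.ndarray:
    """Merge the last two axes, mirroring the reshape quoted above."""
    return data.reshape((data.shape[0], data.shape[-2] * data.shape[-1]))


# Two batches with different trailing shapes (sizes are made up).
a = np.zeros((2, 3, 4))
b = np.zeros((2, 6, 2))

flat_a = flatten_last_two(a)
flat_b = flatten_last_two(b)

# Both collapse to the same 2-D shape, so the flattened array alone
# cannot tell us whether the last axis had size 4 or size 2.
print(flat_a.shape, flat_b.shape)  # (2, 12) (2, 12)
```

When the last dimension is a known constant, the inverse reshape is unambiguous. When it varies, that size has to come from somewhere else.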

We need to figure out a way to fix this since it also blocks the PR: https://github.com/neulab/ExplainaBoard/pull/526.

odashi commented 1 year ago

@pfliu-nlp For a quick fix, you can also store the size of the dimension as an additional stat.
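One way to read this suggestion, sketched with plain NumPy: append the size of the last dimension as an extra column of the flattened stats, then use it to restore the shape. The names `pack_stats` and `unpack_stats` are hypothetical and are not part of ExplainaBoard. The sketch also assumes every row shares the same trailing size.

```python
import numpy as np


def pack_stats(data: np.ndarray) -> np.ndarray:
    """Flatten (n, rows, cols) to (n, rows * cols + 1).

    The extra last column stores `cols` so the shape can be restored.
    """
    n, rows, cols = data.shape
    flat = data.reshape((n, rows * cols))
    size_col = np.full((n, 1), cols, dtype=flat.dtype)
    return np.concatenate([flat, size_col], axis=1)


def unpack_stats(packed: np.ndarray) -> np.ndarray:
    """Invert pack_stats using the size stored in the last column."""
    cols = int(round(packed[0, -1]))
    flat = packed[:, :-1]
    return flat.reshape((packed.shape[0], -1, cols))


original = np.arange(24, dtype=float).reshape((2, 4, 3))
packed = pack_stats(original)      # shape (2, 13)
restored = unpack_stats(packed)    # shape (2, 4, 3)
print(np.array_equal(restored, original))  # True
```

Because the stored column is constant across rows, it is unchanged by a mean over rows, so the size should also survive that kind of aggregation.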

pfliu-nlp commented 1 year ago

Yeah, that would be another solution we can consider. But if we go this way, I feel that the functions _aggregate_stats and calc_stats_from_data will have been hacked too much. What do you think, @neubig?

odashi commented 1 year ago

@pfliu-nlp Yes, it is definitely a hack, but it looks better than relaxing a restriction that the methods presuppose.