Previously, our meta-evaluation metrics (and tasks) were defined specifically for WMT, making them too dataset-dependent to generalize to other NLG tasks such as summarization and data-to-text. This PR introduces meta-evaluation metrics for those generation tasks.
@odashi, I have addressed most of the comments. The one remaining item can be fixed after we reach a consensus (in PR https://github.com/neulab/ExplainaBoard/pull/527) on how to store the number of samples.