huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
https://huggingface.co/docs/evaluate
Apache License 2.0

Tokenized BLEU considered harmful - Discussion on community-based process #105

Open kpu opened 4 years ago

kpu commented 4 years ago

https://github.com/huggingface/nlp/blob/7d1526dfeeb29248d832f1073192dbf03ad642da/metrics/bleu/bleu.py#L76 assumes the inputs are tokenized by the user. This is bad practice because the user's tokenizer is usually not the same as the one used by mteval-v13a.pl, the closest thing we have to a standard. Moreover, tokenizers are like window managers: they can be endlessly customized and nobody has quite the same options.

As @mjpost reported in https://www.aclweb.org/anthology/W18-6319.pdf, BLEU scores can vary by as much as 1.8 points depending on these configuration choices. Yet people are incorrectly putting non-comparable BLEU scores in the same table, such as Table 1 in https://arxiv.org/abs/2004.04902.

There are a few use cases for tokenized BLEU, such as Thai. For Chinese, people seem to use character-level BLEU, for better or worse.

The default, easy option should be the one that's correct more often, and that is sacrebleu. Please don't make it easy for people to run what is usually the wrong option; the default definitely shouldn't be bleu.
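
To make that concrete, here is a toy sketch (assuming sacrebleu's module-level corpus_bleu and its tokenize keyword; this is not the wrapper in this repo) of how the same outputs give different numbers depending on whether sacrebleu tokenizes raw text itself or a user-side tokenization is taken at face value:

```python
# Toy data; assumes sacrebleu's corpus_bleu and its `tokenize` keyword.
import sacrebleu

# Detokenized (raw) output and reference: sacrebleu applies its own v13a
# tokenization internally, so the score is comparable across papers.
hyps_raw = ["The cat sat on the mat, didn't it?"]
refs_raw = [["The cat was sitting on the mat, wasn't it?"]]
print(sacrebleu.corpus_bleu(hyps_raw, refs_raw).score)

# The same pair pre-split by some user-side tokenizer and scored with
# tokenization disabled: the n-grams are now defined by that tokenizer,
# so this number is not comparable to the one above.
hyps_tok = ["The cat sat on the mat , did n't it ?"]
refs_tok = [["The cat was sitting on the mat , was n't it ?"]]
print(sacrebleu.corpus_bleu(hyps_tok, refs_tok, tokenize="none").score)
```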

Also, I know this is inherited from TensorFlow and, paging @lmthang, they should discourage it too.

mjpost commented 4 years ago

I second this request. The bottom line is that scores produced with different reference tokenizations are not comparable. To discourage (even inadvertent) cheating, the user should never touch the reference. The v13a tokenization standard is not ideal, but at least it has been consistently used at matrix.statmt.org, facilitating comparisons.

Sacrebleu exposes all its data sources and additionally provides an API for accessing the references, both of which seem to fit within the spirit of your codebase.

bittlingmayer commented 4 years ago

Didn't we have a slide and discussion at WMT admitting that, for production-quality models, BLEU doesn't correlate with human eval anyway?

mjpost commented 4 years ago

Yes, there are slides like that at WMT every year :) BLEU correlates with human judgment only at coarse levels, and it seems to be getting worse when people try to use it to do model selection among high-performing neural systems.

However, the point isn't whether BLEU is a good metric, but whether your BLEU score can be compared to other BLEU scores. They can only be compared if you use the same reference tokenization (similar to how you can't compare LM perplexities across different segmentations). sacrebleu was an attempt to get everyone to use WMT's reference tokenization (meaning, your system has to first remove its own tokenization) so that you could just compare across papers. This also prevents scores from being gamed.
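
As a quick illustration (toy data; assuming sacrebleu's corpus_bleu and its tokenize option), the reference tokenization is effectively part of the metric definition, so a score is only comparable to other scores computed with the same setting:

```python
# Toy sketch: the same detokenized output scored under two different
# reference tokenizations. Assumes sacrebleu's `tokenize` option.
import sacrebleu

hyps = ['He said: "state-of-the-art systems still fail badly."']
refs = [['He said: "state-of-the-art systems can still fail badly."']]

for tok in ("13a", "intl"):
    score = sacrebleu.corpus_bleu(hyps, refs, tokenize=tok)
    # The two tokenizers define different n-gram sets, so these numbers
    # should never be mixed in the same table.
    print(tok, score.score)
```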

tholiao commented 4 years ago

I do not consider switching this library's default metric from BLEU to the wrapper around SacreBLEU to be a sufficient solution.

As currently implemented, the wrapper allows end users to toggle SacreBLEU options, but it doesn't pass along the SacreBLEU signature. As @mjpost showed in Post (2018), it's simply not credible to assume that people will stick to the defaults; the signature is therefore necessary to make explicit which options were used.

In addition to the choice between the v13a and intl options for SacreBLEU's tokenize argument, which was pointed out earlier, papers frequently differ on whether they lowercase text before scoring (lowercase) and on the smoothing method used (smooth_method). BLEU scores can differ substantially (by over 1 BLEU point) just from changing these options.

Losing the SacreBLEU signature is a regression in reproducibility and clarity.

(Perhaps this should belong in a separate issue?)
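
For reference, here is roughly what surfacing the signature could look like, assuming sacrebleu 2.x's object API (BLEU, corpus_score, get_signature); the data is made up and this is not the current wrapper's interface:

```python
# Sketch only: assumes sacrebleu 2.x; toy data, not this repo's wrapper.
from sacrebleu.metrics import BLEU

hyps = ["The quick brown fox jumped over the lazy dog."]
refs = [["The quick brown fox jumps over the lazy dog."]]

for kwargs in (
    {},                          # defaults: tok:13a, case:mixed, smooth:exp
    {"lowercase": True},         # case-insensitive scoring
    {"smooth_method": "floor"},  # a different smoothing method
):
    bleu = BLEU(**kwargs)
    result = bleu.corpus_score(hyps, refs)
    # The signature string (e.g. "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|...")
    # is what the wrapper would need to surface so readers know exactly which
    # variant of BLEU a reported number is.
    print(result.score, bleu.get_signature())
```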

thomwolf commented 4 years ago

Thanks for sharing your thoughts. This is a very important discussion.

Also, one of the first items on our mid-term roadmap (which we will try to clean up and share soon) is to introduce mechanisms that provide high-quality traceability and reproducibility for all the processes related to the library.

So having the signature for sacrebleu is really important!

Regarding BLEU, I guess we can just remove it from the canonical metrics included in the repo itself (that won't prevent people from adding it as a "user metric", but at least we won't be promoting it).

On a more general note (definitely too large for the scope of this issue), we are wondering, with @srush in particular, how we could handle the selection of metrics/datasets in the most community-based and bottom-up way possible. If you have opinions on this, please share!

srush commented 4 years ago

Yeah, I would love to have discussions about ways this project can have a community-based, transparent process to arrive at strong default metrics. @kpu / @mjpost, do you have any suggestions for how that might work, or pointers to places where this is done right? Perhaps this question can be a template for what is likely to be repeated for other datasets.

kpu commented 4 years ago

I think @bittlingmayer is referring to Figure 6 in http://statmt.org/wmt19/pdf/53/WMT02.pdf. When you look at Appendix A, there are some cases where metrics fall apart at the high end and some where they correlate well. en-zh is arguably production-quality.

This could evolve into a metrics Bazaar where the value added is really the packaging and consistency: it installs/compiles the metrics for me, gives each metric a reproducible name to use in publication (involve the authors; you don't want a hash scheme that differs from sacrebleu's) and a version number, and provides evaluation of the metrics themselves, like http://ufallab.ms.mff.cuni.cz/~bojar/wmt19-metrics-task-package.tgz, but run whenever the code changes rather than once a year.

srush commented 4 years ago

While a Bazaar setup works for models/datasets, I am not sure it is ideal for metrics? Ideal from my perspective would be to have tasks with metrics moderated by experts who document, cite, and codify known pitfalls (as above^) and make it non-trivial for beginners to mess things up.

bittlingmayer commented 4 years ago

@srush @thomwolf

ModelFront could provide (automated, "QE-based") evaluation for all the pretrained translation models you host. Not bottom-up and not valid for claiming SoTA, but independent, practical for builders and not top-down.

For that I would also suggest some diverse benchmarks (so split it out into datasets with only user-generated data, or only constants, or only UI strings, or only READMEs) that tease out known trade-offs. Even a hypothetical magic eval is limited if we always reduce it to a single number.

Realistically people want to know how a model compares to an API like Google Translate, Microsoft Translator, DeepL or Yandex (especially for a language pair like EN:RU, or for the many languages that only Yandex supports), and that could be done too.

kaivu1999 commented 3 years ago

Very important discussion. I am trying to understand the effects of tokenization, and I wanted to ask about good practice: should sacrebleu be used on top of tokenized output, or on detokenized (raw) text?

kpu commented 3 years ago

Use sacrebleu on detokenized output and raw unmodified references.
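
In code, that workflow might look like this (file names are placeholders; assumes sacrebleu's module-level corpus_bleu):

```python
# Sketch of the recommended workflow; file names are placeholders.
import sacrebleu

# System output after undoing any BPE/internal tokenization, one segment per line.
with open("output.detok.txt", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]

# References exactly as distributed, never re-tokenized or otherwise modified.
with open("ref.txt", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

# sacrebleu applies its own reference tokenization (v13a by default) internally.
print(sacrebleu.corpus_bleu(hyps, [refs]).score)
```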