Evaluation with multi-bleu.perl or multi-bleu-detok.perl

bzhangGo commented 7 years ago

Hi all,

I have a question about the evaluation script for machine translaitons.

Personally, I believe that the multi-bleu.perl is the most authorized script (almost all companies and universities use it), and I used it for all my MT experiments. This situation stops until a reviewer strongly rejects the use of multi-bleu.perl, and states rather clearly that I should not use this script because of its heavy dependency on tokenizer. I agree this view, too.

I notice that the moses project provides another script, named multi-bleu-detok.perl which uses an in-built tokenizer. And I tested it on my datasets, for En-De WMT14, the result is:

For multi-bleu.perl, 19.71; For multi-bleu-detok.perl, 24.01 (case-sensitive);

The difference reaches 4 points!! So, I want to ask, could anyone tell me which script sould be used for evaluation? And obviously, the "deok" version gives rather promising results. So if I use the latter script, can I still compare my work with previous publications?

Thanks very much.

rsennrich commented 7 years ago

There's a long tradition of using the NIST evaluation scripts (mteval-vX.perl) for shared translation tasks such as WMT, IWSLT and OpenMT. Using multi-bleu.perl in publications is an unfortunate recent trend.

multi-bleu.perl does not do its own tokenization, and you need to first tokenize both the reference and the output of your system to get similar performance to the NIST evaluation scripts. The problem is that this makes evaluation very sensitive to the tokenization you use, and on many test sets, you can easily win a couple of BLEU points by tokenizing more aggressively (for instance by doing hyphen-splitting). This makes BLEU scores hard to compare across different groups, especially if the groups don't specify their tokenization.

The NIST evaluation scripts, and multi-bleu-detok.perl (and some recent implementations like SacreBLEU) expect the hypothesis to be in the original, untokenized format, and the hypothesis to be detokenized to match this format. The scripts then apply an internal tokenization (which is the same across scripts) before computing BLEU. This makes BLEU easier to compare across groups (with the usual caveat that BLEU is limited as a measure of translation quality).

Long story short, your low scores with multi-bleu.perl may be because you didn't tokenize the reference and output. If you compare tokenized multi-bleu.perl to untokenized multi-bleu-detok.perl, you will usually find a difference of 1-2 BLEU. Using multi-bleu-detok.perl makes your results comparable to shared task submissions to WMT and IWSLT, and other papers using the NIST evaluation scripts. If previous work published results with multi-bleu.perl, you will have a hard time comparing your results to it anyway, unless you're using the same tokenization,

bzhangGo commented 7 years ago

Thanks so much!

Firstly, my reference and output are all well-tokenized, and my training and testing data are all downloaded from the stanford NMT project (the WMT'14 English-German data, notice that all datasets are preprocessed.) Thus, my BLEU score should be normal.

To enable others recognize this problem, I upload the hypothesis from our recent work: The used WMT14 reference file Our translation file When evaluated with multi-bleu.perl, our BLEU score is 22.57; while it increases to 27.60 (a margin of 5 points!!) with multi-bleu-detok.perl.

This problem also happens for WMT14 English-French translation tasks, where our model can achieve over 37 BLEU points with multi-bleu.perl, and over 42 BLEU points with multi-bleu-detok.perl. As far as I know, 42 points may be a new state-of-the-art result. But, can I declare that?

In short, this is not an issue of only 1~2 BLEU points. Through my experiments, there are significant difference between these two scripts.

Secondly, based on all these, what should I do to evaluate my system and report my results?

If I only report the results produced with multi-bleu.perl, reviewers may criticize the tokenization problem;
But if I only report with multi-bleu-detok.perl, researchers may not accept the results considering the difference between the two scripts as well as that almost all previous NMT-related papers reported with multi-bleu.perl.

Therefore, what's the suggested solution? Reporting both results? And how to claim state-of-the-art performance?

rsennrich commented 7 years ago

two things:

do not run multi-bleu-detok.perl on tokenized text. This defeats the purpose of relying on the script's internal tokenization to have a fairer comparison. In particular, your high BLEU score probably comes from the internal tokenization of "##AT##-##AT##" by multi-bleu-detok.perl. Your reference should be the official one (with no preprocessing) from http://www.statmt.org/wmt14/test-filtered.tgz . Your output should be postprocessed (detokenization, removing the weird AT-AT markup) to look like "normal" text.
you have some inconsistencies in tokenization between your reference and your output (e.g. how you escape quotation marks etc.). This will affect your scores negatively.

rsennrich commented 7 years ago

by the way, it's probably easiest if you just download sacreBLEU ( https://github.com/mjpost/sacreBLEU ) and run this:

cat your_output | sacreBLEU -t wmt14 -l en-de

you'll still need to postprocess your output first to be detokenized properly, but at least there's no risk that you use a reference file that has some weird preprocessing.

bzhangGo commented 7 years ago

@rsennrich Thanks a lot. I catch the point. After a simple process of AT-AT and detokenization, I obtained a BLEU score of 21.99, very similar to the score generated by multi-bleu.perl.

By the way, do you have any suggestions about how to report results on the papers? It seems that these two scripts produce different results, so how to declare the state-of-the-art performance? And how to compare with previous publications? (under the assumption that exactly reimplementing all previous results is impossible)

mjpost commented 7 years ago

Even easier, you can just run

pip3 install sacrebleu

and then you'll have sacrebleu in your path. In a day or so I'll push up version 1.1 which will add bootstrap resampling (thanks to a contribution by Christian Federmann) and a number of other features.

The problem that this code is trying to fix is that you can't compare BLEU scores across papers. Usually papers do not include enough information about their setup and the parameters that went into their BLEU computation. As Rico points out, tokenization is one of the most important ones, but there are many others. Some papers are not even clear about whether they did lowercasing. There are even more complications: for example, in WMT14, there are two different en-de test sets, one used for the evaluation (with 2,737 lines) and another released afterwards "for further research" (with 3,003) lines. (These are available as "-t wmt14" and "-t wmt14/full" in sacreBLEU; note that WMT14/en-de is one of the few or only datasets with this distinction, due to a problem encountered during the evaluation). All of these have a large effect on BLEU score.

If you are using WMT test sets, and you use sacreBLEU, then you can safely compare to numbers on the matrix (column: BLEU-cased) or the numbers in the WMT overview papers (e.g., WMT17). If you use sacreBLEU (which essentially just puts a convenient wrapper around mteval-v13a.pl, the official WMT scoring script), then others can compare to you.

bzhangGo commented 7 years ago

@mjpost The toolkit 'sacrebleu' is cool! 👍 It indeed reduces the efforts of other researchers. And I like it very much.

But, unfortunately, recent NMT works often use the multi-bleu.perl script with unclear tokenizations, and new researches usually follow this setting. In particular, the Facebook and Google research group claimed state-of-the-art results evaluated by multi-bleu.perl. And in my mind, if we want develop a state-of-the-art model, we need demonstrate the model using the multi-bleu.perl rather than multi-bleu-detok.perl for fair comparison. This is rather confusing.

alvations commented 7 years ago

No, it's not true that using multi-bleu.pl gives you a fair comparison. Given different tokenization before calling multi-bleu.pl it'll give you inconsistent results.

Actually any MT paper using BLEU without describing explicit tokenization methods should be taken with a pinch of salt.

Sacrebleu and multi-bleu-detok.pl has made it a point that most text for WMT supported languages English, German, French, Turkish, Finnish, Spanish) should be detokenized first and that'll standardize tokenization.

It's an issue for other languages that requires explicit tokenization which mutli-bleu-detok and sacrebleu hasn't support yet but that's not a big problem as for comparative results as long as the outputs are shared online and people can easily tokenized/detokenized and recompute the BLEU score with whichever tokenizer they like.

Still I find it important to describe tokenization/detokenization steps taken before computing BLEU, whether or not sacrebleu or multi-bleu-detok is used.

In brief: Using the same evaluation script don't make the results comparable if the gold standards input to the script isn't consistent.

bzhangGo commented 7 years ago

@alvations Yes, your consideration is right. Actually, by "fair comparison", I mean a relative case instead of an absolute case. But suppose we use the multi-bleu-detok.perl, we would still have two different settings compared with exsiting work: 1) the inputs to the scripts and 2) the script itself.

In addition, I observe that multi-bleu-detok.perl is sensitive to the detokenization procedure, or more generally, the postprocessing. A stronger postprocessing model would enable a higher BLEU score. So, the fairness is only relative, unless authors state clearly what kinds of postprocessing are used.

By the way, for WMT supported languages, such as English, German and French, is using the moses tokenizer the standard practice? If we all use this script for tokenization (at least, we did this), I think the comparison would be fair.

alvations commented 7 years ago

I think the comparable result comes when we use the detokenized/natural version of the gold standard reference translation which is distributed by WMT organizers (because natural text comes detokenized by default), regardless of how tokenized/detokenized the translated hypothesis outputs looks like. Thus sacrebleu.

mjpost commented 7 years ago

If we knew that everyone used the same tokenizer, and with the same flags (there are lots of options to tokenizer.pl), and the same other preprocessing, the comparison would be fair.

Postprocessing can affect things, but the point is that your post-processing is a function of your model, and not the evaluation. The basic problem is that using different tokenizations on the reference changes the denominator in precision calculations for BLEU, and you cannot compare precisions that were computed across different sets (similar to how you cannot compare perplexities across different vocabulary sizes).

It is true that many groups have recently been reporting tokenized BLEU scores, but it doesn't matter. You can't compare unless you know the exact tokenization they used (and remember that very minor differences can make a big difference), as well as casing, normalization, smoothing, and other preprocessing and parameters. Likely most of these were the default — but you don't know that. You unfortunately cannot compare to paper results that used multi-bleu.perl (unless you can somehow get all the required details from them or from their paper). You can, though, make it so that others can easily compare to your numbers.

rsennrich commented 7 years ago

Using the moses tokenizer is relatively common, but even that has different options (like the "-a" option for aggressive tokenization, which does hyphen splitting, or the "-penn" option) that will affect tokenization.

alvations commented 7 years ago

In short, anything not uploaded to http://matrix.statmt.org/ and http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/index.html and reports their own scores based on multi-bleu.pl without documentation of what steps were taken to preprocess the gold standards is sort of iffy =)

(I'm not sure about IWSLT but I think they have standardized evaluation processing + BLEU steps too)

bzhangGo commented 7 years ago

Well, thanks very much.

As I stated before, I agree that tokenization affects the evaluation of multi-bleu.perl, and comparisons under this script are somewhat unfair.

But my question is: based on all these information, what should we do next?

Directly using multi-bleu-detok.perl means that the results will become completely incomparable with previous publications. And I do not think my model is convincing if I do not compare with them. There are so many excellent publications, and I can not just ignore them.

I believe that the discussion here is just a beginning, but I am not sure whether I can receive an abosolute answer here, and even whether this is the right place to ask this quesiton. What I can do, I suppose, is to make my comparison more careful and fair.

rsennrich commented 7 years ago

Your results will likely not be directly comparable to previous publications anyway (unless you know that you use the exact same preprocessing). Some options are to report BLEU with multiple scripts (see table 3 and footnote 10: http://aclweb.org/anthology/E17-2060.pdf ), train your own baseline (many papers come with open source implementations), or report other scores, but note that results are not directly comparable.

In the long run, people will hopefully move away from reporting tokenized BLEU (towards detokenized BLEU or something more meaningful than BLEU).

bzhangGo commented 7 years ago

@rsennrich Thanks, that's an option.

moses-smt / mosesdecoder

Evaluation with multi-bleu.perl or multi-bleu-detok.perl #186