lena-voita / good-translation-wrong-in-context

This is a repository with the data and code for the ACL 2019 paper "When a Good Translation is Wrong in Context: ..." and the EMNLP 2019 paper "Context-Aware Monolingual Repair for Neural Machine Translation"

Request clarification on BLEU calculation #7

Closed: Archonsh closed this issue 4 years ago

Archonsh commented 4 years ago

Dear authors,

Since you have kindly published all your code on GitHub, I tried to implement and reproduce the DocRepair model. However, I am having some trouble reproducing the scores.

In the given dataset (https://www.dropbox.com/s/06i1yz5zxy2o1ve/emnlp19_docrepair.zip?dl=0), test.dst and test.src contain groups of 4 sentences separated by the _eos token.

I am able to reproduce the baseline MT BLEU score by keeping the tokenization, removing the BPE and the _eos tokens between the sentences, and using each whole 4-sentence segment as a single input, i.e. treating it as one long sentence. I used test.src as the hypothesis and test.dst as the reference.
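For concreteness, here is a minimal sketch of the evaluation I described (the "@@ " BPE markers and the use of sacrebleu are my assumptions; presumably any tokenized BLEU script gives comparable numbers):

```python
# Sketch of the segment-level evaluation: undo BPE, drop the _eos separators,
# and score each 4-sentence group as one long "sentence".
import sacrebleu

def detok_segment(line):
    line = line.replace("@@ ", "")      # merge subword-nmt style BPE pieces
    line = line.replace("_eos", " ")    # drop the sentence separators
    return " ".join(line.split())       # normalize whitespace, keep tokenization

with open("test.src") as f:             # MT output to be repaired (hypothesis)
    hyps = [detok_segment(l) for l in f]
with open("test.dst") as f:             # reference translations
    refs = [detok_segment(l) for l in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="none")  # text is already tokenized
print(bleu.score)
```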

However, I am not able to reproduce the DocRepair BLEU in the same setting. I trained a DocRepair model with the default parameters on your dataset, but I can only obtain 29.81 BLEU.

Therefore, I just wanted to clarify whether I should remove the _eos tokens from the DocRepair output before calculating the BLEU score, so that I can check whether my re-training was done correctly. I want to be sure, since I would like to re-train the model for a different language pair.

Also, have you considered splitting the segments into individual sentences and calculating BLEU at the single-sentence level?

I’d appreciate any help you can give me, and I’m looking forward to hearing back from you.

Thank you very much!

lena-voita commented 4 years ago

Dear Archonsh,

Your description of how you evaluate the baseline looks correct; the same should be fine for the DocRepair model.

Please check your total batch size (this is important). Also, for how long did you train DocRepair? We trained for about 1M batches; at 750k the BLEU was already good.

Best, Lena.

lena-voita commented 4 years ago

"Also, have you considered splitting the segments into individual sentences and calculate BLEU at a single sentence level?"

(Sorry, I forgot about this question.) No, we haven't evaluated DocRepair at the sentence level. However, in the paper we also trained a sentence-level version of DocRepair (SentRepair) and evaluated that model. For more details, take a look at the paper.

Archonsh commented 4 years ago

Thank you for your explanation!

Have you encountered a situation where the DocRepair model's output contains more or fewer than 4 sentences, i.e. a different number of '_eos' tokens in the output than expected?

Would you remove all the '_eos' tokens and compute BLEU as usual in that case?
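For concreteness, I have in mind something like this hypothetical check (docrepair.out is a placeholder name for the model output file):

```python
# Split each group on _eos, verify hypothesis and reference contain the same
# number of sentences, and score the remaining groups at sentence level.
import sacrebleu

def split_segment(line):
    line = line.replace("@@ ", "")                       # undo BPE first
    return [s.strip() for s in line.split("_eos")]

hyp_sents, ref_sents, mismatched = [], [], 0
with open("docrepair.out") as h, open("test.dst") as r:  # placeholder file names
    for hyp, ref in zip(h, r):
        hs, rs = split_segment(hyp), split_segment(ref)
        if len(hs) != len(rs):                           # unexpected _eos count
            mismatched += 1
            continue                                     # or fall back to segment level
        hyp_sents += hs
        ref_sents += rs

print("mismatched groups:", mismatched)
print(sacrebleu.corpus_bleu(hyp_sents, [ref_sents], tokenize="none").score)
```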

lena-voita commented 4 years ago

Hi Archonsh,

No, I haven't had such examples.

Lena.

lena-voita commented 4 years ago

By the way, just in case: for sentence-level repair, there is a WMT 2019 paper which looks at this in more detail and has lots of cool analysis of translations.

APE at Scale and its Implications on MT Evaluation Biases: https://www.aclweb.org/anthology/W19-5204.pdf

Archonsh commented 4 years ago

Sorry for bothering you again.

I trained the Russian model for ~140k steps and then stopped training, since the dev BLEU was consistently stable at around 35/36. I used batchlen*gpu*sync_every_steps=32000.
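(For concreteness, one hypothetical combination of settings that gives this product:)

```python
# Hypothetical per-GPU settings whose product is the 32000-token effective batch:
batchlen = 4000            # tokens per GPU per step (example value)
gpu = 4                    # number of GPUs (example value)
sync_every_steps = 2       # gradient-accumulation steps (example value)
assert batchlen * gpu * sync_every_steps == 32000
```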

Is it the case that I did not train the model long enough?

lena-voita commented 4 years ago

Yes, this is not enough.