voidful / BDG

Code for "A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies."
https://voidful.github.io/DG-Showcase/

the computation of BLEU score #12

Closed sunhaozhepy closed 2 years ago

sunhaozhepy commented 2 years ago

Hi,

I'm wondering how you computed the BLEU score in your paper. Did you take one generated distractor as input and the three actual distractors as golden answers?

voidful commented 2 years ago

Hi, that's correct, and then I take the highest score as the result.

sunhaozhepy commented 2 years ago

Hi,

Thanks for the reply! What do you mean by "highest score"? I suppose you just take the three golden distractors as references and calculate the BLEU score over the whole validation dataset...?

voidful commented 2 years ago

> Hi,
>
> Thanks for the reply! What do you mean by "highest score"? I suppose you just take the three golden distractors as references and calculate the BLEU score over the whole validation dataset...?

Hi

Model-generated distractors, for example:

- he didn't want to return the money
- peter didn't give the money
- he did not want to pay back the money

The ground-truth distractors are:

- he had paid Peter all the money
- his wife didn't let him do so
- he did not want to pay back the money

We calculate a score for each generated distractor against every reference:

- he didn't want to return the money —> he had paid Peter all the money / his wife didn't let him do so / he did not want to pay back the money
- peter didn't give the money —> he had paid Peter all the money / his wife didn't let him do so / he did not want to pay back the money
- he did not want to pay back the money —> he had paid Peter all the money / his wife didn't let him do so / he did not want to pay back the money

For each generated distractor, we take the highest of these scores as that distractor's score:

- he didn't want to return the money —> he had paid Peter all the money (X)
- he didn't want to return the money —> his wife didn't let him do so (X)
- he didn't want to return the money —> he did not want to pay back the money (highest; we take this score as the distractor's score)
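A minimal sketch of this max-over-references scoring, assuming NLTK's `sentence_bleu` with smoothing (the paper's exact BLEU implementation and n-gram settings may differ), using the example above:

```python
# Hypothetical sketch: per-distractor BLEU, keeping the best-matching
# reference as that distractor's score (assumes NLTK; the repo may use
# a different BLEU implementation or smoothing).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

generated = [
    "he didn't want to return the money",
    "peter didn't give the money",
    "he did not want to pay back the money",
]
references = [
    "he had paid Peter all the money",
    "his wife didn't let him do so",
    "he did not want to pay back the money",
]

smooth = SmoothingFunction().method1
for hyp in generated:
    # Score the hypothesis against each reference separately,
    # then keep the highest score as this distractor's score.
    best = max(
        sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
        for ref in references
    )
    print(f"{hyp!r} -> {best:.4f}")
```

Averaging these per-distractor scores over the whole test set would then give the dataset-level number discussed below.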

sunhaozhepy commented 2 years ago

I see. So you choose the one reference that maximizes the BLEU score with the generated distractor, and you take the average over the whole test dataset, is that right? PS: do you know whether your predecessors, e.g. Gao et al., 2019, did the same thing when computing the BLEU score? Because to me it is not obvious to do it like this; they could have done it differently...

voidful commented 2 years ago

> I see. So you choose the one reference that maximizes the BLEU score with the generated distractor, and you take the average over the whole test dataset, is that right? PS: do you know whether your predecessors, e.g. Gao et al., 2019, did the same thing when computing the BLEU score? Because to me it is not obvious to do it like this; they could have done it differently...

Hi

That's correct~ For the PS part, I double-checked the predecessors' results. Their calculation treats all the references as ground truth, with the closest reference length used for the brevity penalty.

https://github.com/Yifan-Gao/Distractor-Generation-RACE/blob/a7c958f4018ed5fa8f21421813d4ff0083387cab/distractor/eval/bleu/bleu_scorer.py#L237 This is the same evaluation method as in Microsoft COCO Captions: Data Collection and Evaluation Server, Section 3.3.

Both evaluations take all the references into account. My perspective is that taking the best-matching reference, compared to considering all references with a length penalty, better shows how close a generated distractor is to the human result. However, we should have reported both calculations; sorry we missed this part.
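For contrast, a sketch of that multi-reference calculation under the same NLTK assumption: all references are passed together, n-gram matches may come from any of them, and NLTK's brevity penalty uses the closest reference length, matching the COCO-style evaluation linked above.

```python
# Hypothetical sketch of the predecessors' multi-reference BLEU
# (assumes NLTK): one joint score per hypothesis instead of a max
# over per-reference scores.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "he had paid Peter all the money".split(),
    "his wife didn't let him do so".split(),
    "he did not want to pay back the money".split(),
]
hypothesis = "he didn't want to return the money".split()

smooth = SmoothingFunction().method1
# n-gram counts are clipped against all references at once, and the
# brevity penalty uses the reference length closest to the hypothesis.
print(sentence_bleu(references, hypothesis, smoothing_function=smooth))
```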

sunhaozhepy commented 2 years ago

Thank you for your detailed explanation, that's a great help. I'm currently working on distractor generation, so wish me luck! xD