Closed sunhaozhepy closed 2 years ago

Hi,
I'm wondering how you computed the BLEU scores in your paper. Did you take one generated distractor as input and the three actual distractors as golden answers?
Hi,
That's correct, and then I take the highest score as the result.
Hi,
Thanks for the reply! What do you mean by "highest score"? I suppose you just take the three golden distractors as references and calculate the BLEU score over the whole validation dataset...?
Hi,
The model-generated distractors are, for example:
he didn't want to return the money
peter didn't give the money
he did not want to pay back the money

The ground-truth distractors are:
he had paid Peter all the money
his wife didn't let him do so
he did not want to pay back the money

We calculate a score for each generated distractor against every reference:
he didn't want to return the money —> he had paid Peter all the money / his wife didn't let him do so / he did not want to pay back the money
peter didn't give the money —> he had paid Peter all the money / his wife didn't let him do so / he did not want to pay back the money
he did not want to pay back the money —> he had paid Peter all the money / his wife didn't let him do so / he did not want to pay back the money

For each generated distractor, we take the highest of these scores as that distractor's score. For the first one:
he didn't want to return the money —> he had paid Peter all the money (X)
he didn't want to return the money —> his wife didn't let him do so (X)
he didn't want to return the money —> he did not want to pay back the money (highest, and we take this score as the distractor's score)
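To make the procedure above concrete, here is a minimal sketch of this best-match calculation (not the authors' actual evaluation script); it uses NLTK's `sentence_bleu`, and the whitespace tokenization and smoothing choice are assumptions made only for illustration.

```python
# Sketch: score each generated distractor against every reference,
# keep the best-matching reference's BLEU, then average per example.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

generated = [
    "he didn't want to return the money",
    "peter didn't give the money",
    "he did not want to pay back the money",
]
references = [
    "he had paid Peter all the money",
    "his wife didn't let him do so",
    "he did not want to pay back the money",
]

smooth = SmoothingFunction().method1  # smoothing choice is an assumption

scores = []
for hyp in generated:
    hyp_tokens = hyp.split()
    # Score the hypothesis against each reference separately and keep the maximum.
    best = max(
        sentence_bleu([ref.split()], hyp_tokens, smoothing_function=smooth)
        for ref in references
    )
    scores.append(best)

# The per-distractor scores would then be averaged over the whole test set;
# here we only average within this single example.
print(sum(scores) / len(scores))
```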
I see. So you choose the one reference that maximizes the BLEU score with the generated distractor, and then you take the average over the whole test dataset, is that right? PS: do you know whether your predecessors, e.g. Gao et al., 2019, did the same thing when computing the BLEU score? Because to me it is not obvious that it should be done like this; they could have done it differently...
Hi
That's correct~ For the PS part, I double-checked the predecessors' results. Their calculation treats all the references as the ground truth, using the closest reference length for the brevity penalty.
https://github.com/Yifan-Gao/Distractor-Generation-RACE/blob/a7c958f4018ed5fa8f21421813d4ff0083387cab/distractor/eval/bleu/bleu_scorer.py#L237
This is the same evaluation method as in Microsoft COCO Captions: Data Collection and Evaluation Server, Section 3.3.
Our evaluation also takes all the references into account; my perspective is that taking the best-matching reference, compared with considering all references together with a length penalty, better shows how close the generated distractor is to the human result. However, we should report both calculation results; sorry we missed this part.
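For comparison, here is a small sketch that contrasts the two calculations on one generated distractor. It uses NLTK rather than the linked coco-style `bleu_scorer.py`, which is an assumption: NLTK's multi-reference `sentence_bleu` also uses the closest reference length for the brevity penalty, so it approximates the predecessors' setup described above, while the second value is the best-match score discussed earlier.

```python
# Sketch: multi-reference BLEU (all references at once, closest-length brevity
# penalty) versus best-match BLEU (max over single-reference scores).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "he had paid Peter all the money".split(),
    "his wife didn't let him do so".split(),
    "he did not want to pay back the money".split(),
]
hypothesis = "he didn't want to return the money".split()

smooth = SmoothingFunction().method1  # smoothing choice is an assumption

# All references passed together: clipped n-gram counts are taken against the
# whole reference set, with the closest reference length as the penalty length.
multi_ref = sentence_bleu(references, hypothesis, smoothing_function=smooth)

# Best-match variant: score against each reference separately, keep the maximum.
best_match = max(
    sentence_bleu([ref], hypothesis, smoothing_function=smooth) for ref in references
)

print("multi-reference BLEU:", multi_ref)
print("best-match BLEU:", best_match)
```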
Thank you for your detailed explanation, that's a great help. I'm currently working on distractor generation, so wish me luck! xD