Hi, thanks for your interest in our work! Please see Section 4.2 of the paper, where we define Self-BLEU as the BLEU score of the predicted outputs when compared to the original inputs. It therefore measures how similar the output is to the input, to identify models that just reproduce the input rather than paraphrasing it. We are not aware of previous work that has used a different definition of Self-BLEU; please let me know if I have missed something.
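For illustration, here is a minimal sketch of that definition, i.e. BLEU(predictions, inputs). It uses sacrebleu and made-up example sentences; it is not the repository's actual implementation.

```python
# Hedged sketch of Self-BLEU as defined in this paper:
# BLEU of the predicted paraphrases measured against the *inputs*,
# so a high score indicates copying rather than paraphrasing.
# Uses sacrebleu; variable names and data are illustrative only.
import sacrebleu

def self_bleu(predictions, inputs):
    # sacrebleu expects a list of reference streams; here the single
    # "reference" for each prediction is its own input sentence.
    return sacrebleu.corpus_bleu(predictions, [inputs]).score

preds  = ["how do i learn the piano quickly ?"]
inputs = ["what is the fastest way to learn piano ?"]
print(self_bleu(preds, inputs))  # low score -> output differs from its input
```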
As the authors of Self-BLEU say:
We propose Self-BLEU, a metric to evaluate the diversity of the generated data. Since BLEU aims to assess how similar two sentences are, it can also be used to evaluate how one sentence resembles the rest in a generated collection. Regarding one sentence as hypothesis and the others as reference, we can calculate BLEU score for every generated sentence, and define the average BLEU score to be the Self-BLEU of the document.
If I understand correctly, this metric is meant to be used only on the generated collection. I just want to make sure of that.
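For contrast, here is a minimal sketch of the Texygen-style Self-BLEU quoted above, where each generated sentence is scored against all the others in the collection. It uses NLTK's sentence_bleu and toy data; it is illustrative only, not code from this repository.

```python
# Hedged sketch of the Texygen-style Self-BLEU (diversity of a generated
# collection): each sentence is the hypothesis, all other sentences are
# the references, and the per-sentence BLEU scores are averaged.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def texygen_self_bleu(generated):
    smooth = SmoothingFunction().method1
    tokenised = [s.split() for s in generated]
    scores = []
    for i, hyp in enumerate(tokenised):
        refs = tokenised[:i] + tokenised[i + 1:]  # every other generated sentence
        scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    return sum(scores) / len(scores)  # higher value -> less diverse collection

collection = [
    "the cat sat on the mat",
    "a cat was sitting on the mat",
    "my dog likes the park",
]
print(texygen_self_bleu(collection))
```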
Hi, which paper is that from? I am not aware of any work within paraphrasing that has defined Self-BLEU in the past, but it's possible that someone else has used the name before in another field. To be clear, the Self-BLEU reported in the paper is the same as in the code, as described in our paper: BLEU(predictions, inputs). Our definition is the same as the one used by DivGAN, who also define a metric 'p-BLEU' for evaluating the diversity of multiple predictions.
I found the definition in this paper: Texygen, in Section 2.2.3. OK, so this is a rename. And since Self-BLEU is just a metric for diversity, it also works when applied to the inputs.
Thanks for bringing that to my attention, I hadn't seen it before!
Sorry to bother you again, but I have some other questions about the metrics in your paper:
1) The alpha value for iBLEU in sep_ae.py is 0.8. Is that the final choice used for the experiments in the paper?
2) Why did you choose BLEU-based metrics rather than ROUGE or Meteor?
That's all, thanks for your nice reply!
1) The values in the paper used alpha = 0.7; we calculated iBLEU from the BLEU and Self-BLEU scores, which are also reported separately. But yes, the iBLEU scores reported by the code will use 0.8.
2) ROUGE is based around recall and so is suitable for summarisation tasks, not paraphrasing. Meteor is potentially applicable, but it does some lexical replacement to allow for different word choices. This is useful if you're trying to evaluate semantic consistency (e.g. for machine translation), but less useful when you have multiple references or are just trying to evaluate string similarity.
I hope that helps!
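For reference, here is a small sketch of recomputing iBLEU from the separately reported BLEU and Self-BLEU scores, assuming the standard formulation iBLEU = alpha * BLEU - (1 - alpha) * Self-BLEU; the score values below are placeholders, not numbers from the paper.

```python
# Hedged sketch: recombine iBLEU from reported BLEU and Self-BLEU scores,
# assuming iBLEU = alpha * BLEU - (1 - alpha) * Self-BLEU.
def ibleu(bleu, self_bleu, alpha=0.7):
    return alpha * bleu - (1 - alpha) * self_bleu

bleu_score, self_bleu_score = 20.0, 30.0              # placeholder values
print(ibleu(bleu_score, self_bleu_score, alpha=0.7))  # alpha used for the paper's tables
print(ibleu(bleu_score, self_bleu_score, alpha=0.8))  # alpha used by the released code (sep_ae.py)
```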
OK, I understand. That's helpful to me, thanks!
Hello, I noticed the Self-BLEU metric in your code:
You treat "sem_input" as the reference sentences, but according to the definition of Self-BLEU, it should calculate the BLEU score between the different predicted sentences. Could you explain this?
Thanks!