tomhosking / separator

Code for the paper "Factorising Meaning and Form for Intent-Preserving Paraphrasing", Tom Hosking & Mirella Lapata (ACL 2021)
MIT License

About self-bleu metric #9

Closed · yaoing closed this 2 years ago

yaoing commented 2 years ago

Hello, I noticed the Self-BLEU metric in your code:

        refs = [q["paras"] for q in rows]
        inputs = [[q["sem_input"]] for q in rows]

        # refs = [x["paras"] for x in qs_by_para_split]
        max_num_refs = max([len(x) for x in refs])
        refs_padded = [x + [x[0]] * (max_num_refs - len(x)) for x in refs]

        tgt_bleu = sacrebleu.corpus_bleu(output, list(zip(*refs_padded))).score
        self_bleu = sacrebleu.corpus_bleu(output, list(zip(*inputs))).score

You treat "sem_input" as the reference sentences, but according to the definition of Self-BLEU, it should be calculated between different predicted sentences. Could you explain this?

Thanks!

tomhosking commented 2 years ago

Hi, thanks for your interest in our work! Please see Section 4.2 of the paper, where we define Self-BLEU as the BLEU score of the predicted outputs when compared to the original inputs. It therefore measures how similar the output is to the input, to identify models that just reproduce the input rather than paraphrasing it. We are not aware of previous work that has used a different definition of Self-BLEU, please let me know if I have missed this.
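For concreteness, here is a minimal sketch of that computation with sacrebleu (the sentences below are made-up placeholders, not data from the paper):

    import sacrebleu

    # Illustrative model outputs and the original inputs they were generated from
    outputs = ["how do i reset my password", "which city is the capital of france"]
    inputs = ["how can i change my password", "what is the capital of france"]

    # Self-BLEU as defined in Section 4.2: BLEU(predictions, inputs).
    # A high score means the model is mostly copying its input.
    self_bleu = sacrebleu.corpus_bleu(outputs, [inputs]).score
    print(f"Self-BLEU = {self_bleu:.2f}")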

yaoing commented 2 years ago

As the authors of Self-BLEU say:

We propose Self-BLEU, a metric to evaluate the diversity of the generated data. Since BLEU aims to assess how similar two sentences are, it can also be used to evaluate how one sentence resembles the rest in a generated collection. Regarding one sentence as hypothesis and the others as reference, we can calculate BLEU score for every generated sentence, and define the average BLEU score to be the Self-BLEU of the document.

If I understand correctly, this metric is applied only to a generated collection. I just want to make sure of that.
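In code, I believe that Texygen-style computation looks roughly like this (illustrative sentences only, and using sacrebleu for convenience rather than Texygen's own implementation):

    import sacrebleu

    # An illustrative generated collection
    generated = [
        "how do i reset my password",
        "how can i change my password",
        "what is the capital of france",
    ]

    # Texygen-style Self-BLEU: score each sentence against the rest of the
    # collection, then average over the collection.
    scores = []
    for i, hyp in enumerate(generated):
        refs = [s for j, s in enumerate(generated) if j != i]
        scores.append(sacrebleu.sentence_bleu(hyp, refs).score)

    texygen_self_bleu = sum(scores) / len(scores)
    print(f"Texygen-style Self-BLEU = {texygen_self_bleu:.2f}")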

tomhosking commented 2 years ago

Hi, which paper is that from? I am not aware of any work within paraphrasing that has defined Self-BLEU in the past, but it's possible that someone else has used the name before in another field. To be clear, the Self-BLEU reported in our paper is the same as in the code: BLEU(predictions, inputs). Our definition is the same as the one used by DivGAN, who also define a metric 'p-BLEU' for evaluating the diversity of multiple predictions.

yaoing commented 2 years ago

I found the definition in this paper: Texygen, in Section 2.2.3. OK, so this is a case of the same name being used for a different metric. And since Self-BLEU is just a metric for diversity, it also works when applied to the inputs.

tomhosking commented 2 years ago

Thanks for bringing that to my attention, I hadn't seen it before!

yaoing commented 2 years ago

Sorry to bother you again, but I have some other questions about the metrics in your paper:

  1. The alpha mentioned in your paper is 0.7, but I found the value in your code (sep_ae.py) is 0.8. Which one is the final choice for the experiments?
  2. Have you considered ROUGE or METEOR scores?

That's all, thanks for your helpful replies!

tomhosking commented 2 years ago

1. The values in the paper used \alpha = 0.7 - we calculated iBLEU from the BLEU and Self-BLEU scores, which are also reported separately. But yes, the iBLEU scores reported by the code will use 0.8.
2. ROUGE is based around recall and so is suitable for summarisation tasks, not paraphrasing. METEOR is potentially applicable, but does some lexical replacement to allow for different word choices. This is useful if you're trying to evaluate semantic consistency (e.g. for machine translation) but less useful when you have multiple references or if you're just trying to evaluate string similarity.
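For reference, a minimal sketch of that recombination, assuming the standard Sun & Zhou (2012) formulation iBLEU = alpha * BLEU - (1 - alpha) * Self-BLEU (the scores below are placeholders, not results from the paper):

    def ibleu(tgt_bleu: float, self_bleu: float, alpha: float = 0.7) -> float:
        """Combine reference BLEU and Self-BLEU into a single iBLEU score."""
        return alpha * tgt_bleu - (1 - alpha) * self_bleu

    # Recomputing with the paper's alpha = 0.7 vs the code default of 0.8
    print(ibleu(25.0, 10.0, alpha=0.7))  # 14.5
    print(ibleu(25.0, 10.0, alpha=0.8))  # 18.0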

I hope that helps!

yaoing commented 2 years ago

OK, I understand. That's helpful, thanks!