google-research-datasets / xsum_hallucination_annotations

Faithfulness and factuality annotations of XSum summaries from our paper "On Faithfulness and Factuality in Abstractive Summarization" (https://www.aclweb.org/anthology/2020.acl-main.173.pdf).

Entailment Score #5

Closed aliisakroe closed 3 years ago

aliisakroe commented 3 years ago

Hi, could you please provide more detail on the entailment score methodology? It doesn't appear to be described anywhere in this repository's documentation, so a pointer or citation would be useful.

Thank you!

shashiongithub commented 3 years ago

Please see Section 5.4 in our paper.

dleve123 commented 2 years ago

@shashiongithub

Thanks for the awesome paper and dataset!

I also would like some more clarity on the entailment score calculation.

In particular, I see that Entailment is included in eval_scores_xsum_summaries.csv, but I don't see any data in this repository for the neutral and contradiction labels.

I am trying to "connect the dots" between Table 3 in your paper, the huggingface xsum_factuality dataset, and the data reported in CSVs in this repository.
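For reference, this is roughly how I'm loading the two sources side by side; the Hugging Face config/split names and the local CSV path are my assumptions, not something documented in this repo:

import pandas as pd
from datasets import load_dataset

# Assumed default config and "train" split for the HF dataset; adjust if the dataset card differs.
hf_factuality = load_dataset("xsum_factuality", split="train").to_pandas()
print(hf_factuality.columns.tolist())

# Per-summary scores CSV from this repository (placeholder local path).
eval_scores = pd.read_csv("local/path/to/eval_scores_xsum_summaries.csv")
print(eval_scores.columns.tolist())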

For instance, when you report that PtGen entails 38.4: is it correct to say that 38.4% of the PtGen-generated summaries (out of your 500 human-annotated test set) have a probability >= 0.50 of entailing the document? If so, should it be possible to reproduce this 38.4% from your eval scores CSV via:

  1. Filtering that CSV by system (PtGen in this case)
  2. Counting the filtered rows where the entailment probability is >= 0.5
  3. Computing the fraction of (2) over the total row count of (1)

In code:

import pandas as pd

# Per-summary evaluation scores shipped in this repository.
reported_results = pd.read_csv('local/path/to/xsum-eval-scores.csv')
# Keep only the PtGen rows (the system name is embedded in system_bbcid).
ptgen_results = reported_results.query('system_bbcid.str.contains("ptgen")', engine="python")
len(ptgen_results)  # => 498
# Fraction of PtGen summaries with entailment probability >= 0.5.
len(ptgen_results[ptgen_results.Entailment >= 0.5]) / len(ptgen_results)  # => 0.367469...

This computed value of 36.7 is close to, but meaningfully different from, the 38.4 reported in the paper.

Given this, a couple of questions:

  1. Assuming I'm approaching this reproduction incorrectly, can you let me know a better path here?
  2. Is there any more-raw data on neutral and contradicting entailment?
  3. Can you provide any more information on the BERT-Large model fine-tuned on MultiNLI? I'm attempting to reproduce your results with https://huggingface.co/madlag/bert-large-uncased-mnli (a minimal sketch of my current approach follows this list), but am getting quite different results. If you didn't use an off-the-shelf model, details on the fine-tuning would be deeply appreciated; I don't think the paper reports hyperparameters for the entailment model fine-tuning.
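
For context, here is a minimal sketch of how I'm currently scoring entailment; the checkpoint, the premise/hypothesis ordering, and the label lookup are my assumptions, not details taken from the paper:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed stand-in checkpoint; the paper's fine-tuned BERT-Large MNLI model isn't released here.
model_name = "madlag/bert-large-uncased-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

def entailment_probability(document: str, summary: str) -> float:
    # Premise = source document, hypothesis = generated summary (my assumption about the ordering).
    inputs = tokenizer(document, summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    # MNLI label order differs across checkpoints, so look it up instead of hard-coding an index.
    label_to_id = {label.lower(): idx for idx, label in model.config.id2label.items()}
    return probs[label_to_id["entailment"]].item()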