google-research-datasets / xsum_hallucination_annotations

Faithfulness and factuality annotations of XSum summaries from our paper "On Faithfulness and Factuality in Abstractive Summarization" (https://www.aclweb.org/anthology/2020.acl-main.173.pdf).

Entailment Score #5

Closed aliisakroe closed 3 years ago

aliisakroe commented 3 years ago

Hi, could you please provide more detail on the entailment score methodology? It doesn't appear to be described anywhere in this repository's documentation, so a pointer or citation would be useful.

Thank you!

shashiongithub commented 3 years ago

Please see Section 5.4 in our paper.

dleve123 commented 2 years ago

@shashiongithub

Thanks for the awesome paper and dataset!

I also would like some more clarity on the entailment score calculation.

In particular, I see that Entailment is included in eval_scores_xsum_summaries.csv, but I don't see any data in this repository for the neutral and contradiction labels.

I am trying to "connect the dots" between Table 3 in your paper, the huggingface xsum_factuality dataset, and the data reported in CSVs in this repository.
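For reference, this is roughly how I'm loading the two sources side by side; the Hugging Face config/split names and the local CSV path are my assumptions, not something documented in this repo:

import pandas as pd
from datasets import load_dataset

# Assumed default config and "train" split for the HF dataset; adjust if the dataset card differs.
hf_factuality = load_dataset("xsum_factuality", split="train").to_pandas()
print(hf_factuality.columns.tolist())

# Per-summary scores CSV from this repository (placeholder local path).
eval_scores = pd.read_csv("local/path/to/eval_scores_xsum_summaries.csv")
print(eval_scores.columns.tolist())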

For instance, when you report that PtGen entails 38.4: is it correct to say that 38.4% of the PtGen-generated summaries (out of your 500 human-annotated test set) have a probability >= 0.50 of entailing the document? If so, should it be possible to reproduce this 38.4% from your eval scores CSV via:

  1. Filtering that CSV by system (PtGen in this case)
  2. Counting the filtered rows where the entailment probability is >= 0.5
  3. Computing the fraction of (2) over the total row count of (1)

In code:

import pandas as pd

# Per-summary evaluation scores shipped in this repository.
reported_results = pd.read_csv('local/path/to/xsum-eval-scores.csv')
# Keep only the PtGen rows (the system name is embedded in system_bbcid).
ptgen_results = reported_results.query('system_bbcid.str.contains("ptgen")', engine="python")
len(ptgen_results)  # => 498
# Fraction of PtGen summaries with entailment probability >= 0.5.
len(ptgen_results[ptgen_results.Entailment >= 0.5]) / len(ptgen_results)  # => 0.367469...

This computed value of 36.7 is close to, but meaningfully different from, the 38.4 reported in the paper.

Given this, a couple of questions:

  1. Assuming I'm approaching this reproduction incorrectly, can you let me know a better path here?
  2. Is there any more-raw data on neutral and contradicting entailment?
  3. Can you provide any more information on the BERT-Large model fine-tuned on MultiNLI? I'm attempting to reproduce your results with https://huggingface.co/madlag/bert-large-uncased-mnli (a minimal sketch of my current approach follows this list), but am getting quite different results. If you didn't use an off-the-shelf model, details on the fine-tuning would be deeply appreciated; I don't think the paper reports hyperparameters for the entailment model fine-tuning.
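
For context, here is a minimal sketch of how I'm currently scoring entailment; the checkpoint, the premise/hypothesis ordering, and the label lookup are my assumptions, not details taken from the paper:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed stand-in checkpoint; the paper's fine-tuned BERT-Large MNLI model isn't released here.
model_name = "madlag/bert-large-uncased-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

def entailment_probability(document: str, summary: str) -> float:
    # Premise = source document, hypothesis = generated summary (my assumption about the ordering).
    inputs = tokenizer(document, summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    # MNLI label order differs across checkpoints, so look it up instead of hard-coding an index.
    label_to_id = {label.lower(): idx for idx, label in model.config.id2label.items()}
    return probs[label_to_id["entailment"]].item()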