Closed: aliisakroe closed this issue 3 years ago
Please see Section 5.4 in our paper.
@shashiongithub
Thanks for the awesome paper and dataset!
I also would like some more clarity on the entailment score calculation.
In particular, I see that an Entailment column is included in eval_scores_xsum_summaries.csv, but I don't see any data in this repository for the neutral and contradiction labels.
I am trying to "connect the dots" between Table 3 in your paper, the huggingface xsum_factuality dataset, and the data reported in the CSVs in this repository.
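For reference, here is the quick check I am running to see which score columns the released CSV actually contains (the path is a local placeholder, and I am only going by the column names I see in the file):

```python
import pandas as pd

# Inspect the released eval-scores CSV (local path placeholder).
scores = pd.read_csv('local/path/to/xsum-eval-scores.csv')
print(scores.columns.tolist())  # is Entailment the only NLI-related column?
print(scores[['system_bbcid', 'Entailment']].head())  # column names assumed from the file
```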
For instance, when you report that PtGen entails 38.4, is it correct to say that 38.4% of the PtGen-generated summaries (out of your 500 human-annotated test documents) have a >= 0.50 probability of entailing the document? If so, should it be possible to reproduce this 38.4% from your eval scores CSV by filtering to the rows for the relevant system (PtGen in this case)? In code:
```python
import pandas as pd

# Load the released eval scores (local path placeholder) and keep the PtGen rows.
reported_results = pd.read_csv('local/path/to/xsum-eval-scores.csv')
ptgen_results = reported_results.query('system_bbcid.str.contains("ptgen")', engine="python")
len(ptgen_results)  # => 498
len(ptgen_results[ptgen_results.Entailment >= 0.5]) / len(ptgen_results)  # => 0.367469...
```
This computed 36.7% value is close to, but meaningfully different from, the 38.4% reported in the paper.
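To try to narrow down where the gap comes from, here is a small follow-up sketch that recomputes the figure under a few alternative conventions (strict vs. non-strict threshold, and dividing by all 500 annotated documents rather than the 498 PtGen rows I find). These are only guesses at the convention used, not a claim about the paper's actual method:

```python
import pandas as pd

# Recompute the PtGen entailment percentage under a few alternative conventions.
# The thresholds/denominators below are guesses, not the paper's documented method.
reported_results = pd.read_csv('local/path/to/xsum-eval-scores.csv')
ptgen = reported_results.query('system_bbcid.str.contains("ptgen")', engine="python")

candidates = {
    'p >= 0.5 over the 498 PtGen rows': (ptgen.Entailment >= 0.5).mean(),
    'p > 0.5 over the 498 PtGen rows': (ptgen.Entailment > 0.5).mean(),
    'p >= 0.5 over all 500 annotated docs': (ptgen.Entailment >= 0.5).sum() / 500,
}
for label, value in candidates.items():
    print(f'{label}: {100 * value:.1f}%')
```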
Given this, a couple of questions: (1) is my reading of the Entailment column above correct, and (2) what explains the gap between the 36.7% I compute and the 38.4% in Table 3?
Hi, can you please provide more detail on the entailment score methodology? At the moment it seems it can only be inferred from sources outside the documentation, so a citation would be useful.
Thank you!
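For anyone else landing here: Section 5.4 of the paper is where the entailment scoring is described. The exact classifier is not released in this repository, so purely as an illustration, here is a minimal sketch of how a per-summary entailment probability could be computed with an off-the-shelf MultiNLI checkpoint (roberta-large-mnli is my stand-in assumption, not the authors' model):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in NLI checkpoint; the paper's own entailment classifier is not released here.
model_name = 'roberta-large-mnli'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

document = '...'  # premise: the source article (may need truncating to 512 tokens)
summary = '...'   # hypothesis: the system-generated summary

inputs = tokenizer(document, summary, truncation=True, return_tensors='pt')
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()

# Check model.config.id2label for the label order before trusting the index.
entailment_idx = model.config.label2id.get('ENTAILMENT', 2)
print(probs[entailment_idx].item())  # probability that the document entails the summary
```

If the resulting probabilities land in the same ballpark as the released Entailment column, that would at least confirm the general shape of the pipeline; the exact numbers will differ unless the same checkpoint, input ordering, and truncation are used.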