google-research-datasets / xsum_hallucination_annotations

Faithfulness and factuality annotations of XSum summaries from our paper "On Faithfulness and Factuality in Abstractive Summarization" (https://www.aclweb.org/anthology/2020.acl-main.173.pdf).

How to Reproduce Table 2 Result #1

Closed yuhui-zh15 closed 3 years ago

yuhui-zh15 commented 4 years ago

Hello, awesome work, and congratulations! I'm wondering how to reproduce the numbers in Table 2.

First, I assume the base number for Table 2 is 500, but multiplying 500 by many of the percentages in Table 2 gives non-integer counts (e.g., the number of faithful summaries produced by BERTS2S = 500 * 26.9% = 134.5?). Can you explain what base number this table uses?

We elicited judgments from three different annotators for each of 2500 (500x5) document-summary pairs... Results from the full assessment are shown in Table 2, which shows the percentage of documents per system that were annotated as faithful or hallucinated (faithful = 100 - hallucinated).

I also wrote a script following the description in the Table 2 caption:

The numbers in “Hallucinated” columns show the percentage of summaries where at least one word was annotated by all three annotators as an intrinsic (I) or extrinsic (E) hallucination. When a summary is not marked with any hallucination, it is “faithful” (100 - I∪E), column “Faith.”

However, my results differ from Table 2. I'm not sure which part I misunderstood; could you share your script?

Looking forward to your reply.

Many thanks, Yuhui


My script:

import pandas
from collections import Counter, defaultdict

# Load the annotation file; each row is one annotated span.
data = pandas.read_csv('hallucination_annotations_xsum_summaries.csv')
data = data.values.tolist()

docids = set(item[0] for item in data)

# agg[system][docid] = (has_intrinsic, has_extrinsic)
agg = {}
for system in ['PtGen', 'TConvS2S', 'TranS2S', 'BERTS2S', 'Gold']:
    for docid in docids:
        anns = [x for x in data if x[0] == docid and x[1] == system]
        wids = set(ann[-1] for ann in anns)
        # Report summaries that do not have exactly three distinct annotators.
        if len(wids) != 3:
            print(system, docid, wids)
        # Words marked by each annotator, keyed by worker id.
        spans_ex = defaultdict(set)
        spans_in = defaultdict(set)
        for ann in anns:
            # hal_type avoids shadowing the built-in `type`.
            _, _, ref, hal_type, span, wid = ann
            if hal_type == 'intrinsic':
                spans_in[wid] |= set(span.split())
            elif hal_type == 'extrinsic':
                spans_ex[wid] |= set(span.split())
        # A summary counts as hallucinated if all three annotators marked at least
        # one common word. Note: this compares word strings, not word positions.
        is_in = len(set.intersection(spans_in['wid_0'], spans_in['wid_1'], spans_in['wid_2'])) > 0
        is_ex = len(set.intersection(spans_ex['wid_0'], spans_ex['wid_1'], spans_ex['wid_2'])) > 0
        agg.setdefault(system, {})[docid] = (is_in, is_ex)

# I = intrinsic only, E = extrinsic only, I+E = both, O = neither (faithful).
labels = {(True, False): 'I', (False, True): 'E', (True, True): 'I+E', (False, False): 'O'}
for system in agg:
    print(system)
    for key, value in Counter(agg[system].values()).items():
        print(labels[key], value / 500)
    print()

My results (here I treat I, E, and I+E as three mutually exclusive categories, so the I∪E in the table should be the sum of I, E, and I+E):

PtGen
E 0.556
I+E 0.09
O 0.232
I 0.122

TConvS2S
E 0.618
O 0.186
I+E 0.122
I 0.074

TranS2S
E 0.604
I+E 0.088
O 0.194
I 0.114

BERTS2S
E 0.58
O 0.24
I+E 0.086
I 0.094

Gold
O 0.218
E 0.696
I+E 0.048
I 0.038
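
To make these numbers easier to compare against Table 2, here is a minimal sketch that folds the four mutually exclusive categories into table-style columns. It assumes the table's I and E columns count every summary with any intrinsic (respectively extrinsic) hallucination, so the I+E category contributes to both; the `results` dictionary and the `summarize` helper are illustrative names, and the proportions are copied from the output above.

```python
# Proportions copied from the script output above; keys follow its labels.
results = {
    "PtGen":    {"I": 0.122, "E": 0.556, "I+E": 0.090, "O": 0.232},
    "TConvS2S": {"I": 0.074, "E": 0.618, "I+E": 0.122, "O": 0.186},
    "TranS2S":  {"I": 0.114, "E": 0.604, "I+E": 0.088, "O": 0.194},
    "BERTS2S":  {"I": 0.094, "E": 0.580, "I+E": 0.086, "O": 0.240},
    "Gold":     {"I": 0.038, "E": 0.696, "I+E": 0.048, "O": 0.218},
}


def summarize(row):
    """Return (intrinsic, extrinsic, union, faithful) as percentages."""
    intrinsic = row["I"] + row["I+E"]          # any intrinsic hallucination
    extrinsic = row["E"] + row["I+E"]          # any extrinsic hallucination
    union = row["I"] + row["E"] + row["I+E"]   # any hallucination at all
    return tuple(round(100 * x, 1) for x in (intrinsic, extrinsic, union, row["O"]))


table = {system: summarize(row) for system, row in results.items()}
for system, (i, e, u, f) in table.items():
    print(f"{system:9s} I={i:5.1f} E={e:5.1f} I∪E={u:5.1f} Faith.={f:5.1f}")
```

For each system the union and faithful columns sum to 100, which matches the "faithful = 100 - hallucinated" definition quoted from the paper.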

shashiongithub commented 4 years ago

Hi Yuhui, thanks for your questions! The base value is 498: we had to ignore two articles in the end because our annotators found they were in Gaelic and left them unassessed.
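
A quick way to check that a base of 498 is consistent with a reported percentage is to search for integer counts that round to it. The sketch below uses the 26.9% figure for BERTS2S quoted earlier in the thread; `consistent_counts` is a hypothetical helper, and the tolerance of 0.05 assumes percentages are reported to one decimal place.

```python
def consistent_counts(pct, base, tol=0.05):
    """Integer counts k whose percentage of `base` lies within `tol` of `pct`."""
    return [k for k in range(base + 1) if abs(100 * k / base - pct) < tol]


for base in (500, 498):
    print(base, consistent_counts(26.9, base))
# 500 []      (134/500 = 26.8%, 135/500 = 27.0%)
# 498 [134]   (134/498 is about 26.91%)
```

With a base of 500 no whole number of summaries gives 26.9%, while with 498 a count of 134 does.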

Thanks for pointing this out. We neglected to provide span IDs, which are needed to count "at least one word that was annotated by all three annotators". I will address these issues and also release the script we used to compute our scores in a couple of days.
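
To illustrate why span positions matter: intersecting sets of word strings treats two occurrences of the same word at different places in a summary as the same word, whereas a position-based check does not. The following is only a sketch under assumed conventions, not the authors' script. It supposes each span is given as hypothetical `(start, end)` token offsets with `end` exclusive; the released file may represent spans differently.

```python
from collections import Counter


def jointly_marked_positions(spans_by_annotator, n_annotators=3):
    """Token positions covered by at least one span from every annotator.

    `spans_by_annotator` maps an annotator id to a list of (start, end)
    token offsets, end exclusive. The offsets are an assumed format.
    """
    votes = Counter()
    for spans in spans_by_annotator.values():
        covered = set()
        for start, end in spans:
            covered.update(range(start, end))
        votes.update(covered)  # each annotator votes at most once per position
    return sorted(pos for pos, n in votes.items() if n >= n_annotators)


# Toy example with made-up offsets for one summary.
example = {
    "wid_0": [(2, 5)],          # positions 2, 3, 4
    "wid_1": [(4, 7)],          # positions 4, 5, 6
    "wid_2": [(0, 1), (4, 6)],  # positions 0, 4, 5
}
print(jointly_marked_positions(example))  # [4]
```

Under this reading, a summary would count as hallucinated for a given type when the returned list is non-empty.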

Alex-Fabbri commented 4 years ago

Hi Shashi,

I was wondering if you had an update on this as I was hoping to replicate the faithfulness correlations in Table 4. It looks like there are three articles in Gaelic with ids 39553812, 39497668, and 40254741. Thanks!

shashiongithub commented 4 years ago

Hi Alex, sorry for the delay on this! I have not had time to work on it properly. I will be adding this code soon. Please bear with me. Thanks!