Cannot reproduce the IAA results in Table 3 from paper

Hi, @sanjanaramprasad

As the title says, I was trying to verify the data and reproduce the inter-annotator agreement (IAA) reported in Table 3 of the paper. However, I cannot obtain agreement values that high from the annotations provided here.

At the summary level, you mentioned that "The agreement at the summary level includes all cases where both annotators marked at least one sentence in the summary as inconsistent." Did you mean that the agreement is calculated as (1) (number of summaries on which the annotators agree) / (total number of summaries, i.e., 100), or (2) (number of summaries labeled "non-factual" by both annotators) / (number of summaries labeled "non-factual" by either annotator)?
Similarly, at the sentence level, do you mean (1) (number of sentences on which the annotators agree) / (total number of sentences), or (2) (number of sentences labeled "non-factual" by both annotators) / (number of sentences labeled "non-factual" by either annotator)?

Under either definition, (1) or (2), I cannot obtain the reported agreement for BillSum and PubMed (e.g., 0.93 for PubMed at the sentence level).
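To make the two readings concrete, here is a minimal sketch of both on a list of per-item labels. The function names and the toy labels are invented for illustration, and the "factual" label value is an assumption (only "non_factual" appears in my script below).

```python
def percent_agreement(ann1, ann2):
    """Reading (1): items with identical labels / all items."""
    assert len(ann1) == len(ann2)
    same = sum(a == b for a, b in zip(ann1, ann2))
    return same / len(ann1)


def non_factual_overlap(ann1, ann2, positive="non_factual"):
    """Reading (2): items both marked non-factual / items either marked non-factual."""
    both = sum(a == positive and b == positive for a, b in zip(ann1, ann2))
    either = sum(a == positive or b == positive for a, b in zip(ann1, ann2))
    return both / either if either else float("nan")


# Toy labels, invented purely for illustration (not taken from the annotation files).
ann1 = ["non_factual", "factual", "factual", "non_factual", "factual"]
ann2 = ["non_factual", "factual", "non_factual", "factual", "factual"]

print(percent_agreement(ann1, ann2))    # 3 of the 5 labels match
print(non_factual_overlap(ann1, ann2))  # 1 item flagged by both, 3 flagged by either
```

Reading (1) rewards agreement on the (typically much larger) factual class, so it will usually be far higher than reading (2) on the same data.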

Can you shed some light on how you calculated the IAA? Thanks.

Here is my code for IAA:

import pandas as pd

data_path = "./annotations/billsum_annotations.csv"
data = pd.read_csv(data_path)

# Boolean masks over rows (one row per sentence): which sentences each annotator marked non-factual
ann1 = data["label_type_ann1"] == "non_factual"
ann2 = data["label_type_ann2"] == "non_factual"

print("summary level")
# Number of distinct summaries with at least one non-factual sentence, per annotator
n_ann1 = len(data.loc[ann1, "summary_uuid"].unique())
n_ann2 = len(data.loc[ann2, "summary_uuid"].unique())
print(n_ann1, n_ann2)

print("-----")
print("agreement:")
# Summaries containing a sentence that BOTH annotators marked non-factual
n_both = len(data.loc[ann1 & ann2, "summary_uuid"].unique())
# Interpretation (2): |both| / |either|, with |either| = |ann1| + |ann2| - |both|
print(n_both / (n_ann1 + n_ann2 - n_both))

print()

print("sentence level")
n_sent_both = int((ann1 & ann2).sum())    # sentences marked non-factual by both annotators
n_sent_either = int((ann1 | ann2).sum())  # sentences marked non-factual by at least one annotator
print(n_sent_both)
print(n_sent_either)
print(n_sent_both / n_sent_either)
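
Note that the summary-level ratio above counts a summary as agreed only when both annotators flagged the same sentence. If the definition quoted from the paper only requires each annotator to flag at least one sentence in the summary (not necessarily the same one), the computation might instead look like the sketch below. This is a guess at the intended definition, not the authors' method; the toy rows and the helper `flagged_summaries` are invented for illustration, and only the column names come from my script.

```python
import pandas as pd

# Toy rows, invented for illustration; column names follow the script above.
toy = pd.DataFrame({
    "summary_uuid":    ["s1", "s1", "s2", "s2", "s3"],
    "label_type_ann1": ["non_factual", "factual", "factual", "non_factual", "factual"],
    "label_type_ann2": ["factual", "non_factual", "factual", "non_factual", "factual"],
})


def flagged_summaries(df, col):
    """Summaries in which the annotator marked at least one sentence non-factual."""
    return set(df.loc[df[col] == "non_factual", "summary_uuid"])


flagged1 = flagged_summaries(toy, "label_type_ann1")
flagged2 = flagged_summaries(toy, "label_type_ann2")
all_summaries = set(toy["summary_uuid"])

# Reading (2), set-based: summaries flagged by both / summaries flagged by either.
overlap = len(flagged1 & flagged2) / len(flagged1 | flagged2)

# Reading (1): summaries whose summary-level labels coincide / all summaries.
percent = sum((u in flagged1) == (u in flagged2) for u in all_summaries) / len(all_summaries)

# Row-based variant used in the script above: requires the SAME sentence to be flagged.
both_rows = (toy["label_type_ann1"] == "non_factual") & (toy["label_type_ann2"] == "non_factual")
n_both = toy.loc[both_rows, "summary_uuid"].nunique()
row_based = n_both / (len(flagged1) + len(flagged2) - n_both)

print(overlap, percent, row_based)
```

On these toy rows the set-based ratios come out higher than the row-based one, because summary s1 is flagged by both annotators but on different sentences.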
