shmsw25 / AmbigQA

An original implementation of EMNLP 2020, "AmbigQA: Answering Ambiguous Open-domain Questions"
https://arxiv.org/abs/2004.10645

Repeated Items in dev_light.json #14

Closed · NoviScl closed this issue 4 years ago

NoviScl commented 4 years ago

I was browsing through the dataset (dev_light.json). I find it weird that some examples in the json file contain multiple repeated items, for example:

[{'type': 'singleAnswer', 'answer': ['Nick Robinson']}, {'type': 'singleAnswer', 'answer': ['Nick Robinson']}, {'type': 'singleAnswer', 'answer': ['Nick Robinson']}]
[{'type': 'singleAnswer', 'answer': ['21']}, {'type': 'singleAnswer', 'answer': ['21']}, {'type': 'singleAnswer', 'answer': ['twenty-one', '21']}]

while some others contain only a single item, for example:

[{'type': 'singleAnswer', 'answer': ['November 16, 2003', '16 November 2003', 'November 16th, 2003']}]

The same dict is repeated three times in the first list. Is this intentional, as part of the annotation process? And for evaluation, can we just treat the set of all unique answer strings as the correct answers?

shmsw25 commented 4 years ago

Hi @NoviScl,

Each item in annotations is a valid annotation produced by one annotator. If two annotations are exactly the same, that means two different annotators wrote the exact same annotation. Often they will differ, meaning the annotators wrote different annotations but both are valid. If there is only one annotation, the other annotation was marked as invalid during the validation stage.
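
For example, the annotations field of one question could look like the following (a made-up example, not a real datapoint): one annotator treated the question as having a single answer, while the other disambiguated it into multiple question-answer pairs, and both annotations are kept.

# Made-up example of the "annotations" field (structure only, not real data)
[
  {"type": "singleAnswer", "answer": ["Nick Robinson"]},
  {"type": "multipleQAs", "qaPairs": [
    {"question": "...", "answer": ["..."]},
    {"question": "...", "answer": ["..."]}
  ]}
]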

You can also see that the evaluation script computes the metrics against each annotation and takes the maximum.
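
(Just to illustrate the "take the maximum" part, here is a rough sketch for singleAnswer annotations only, reusing normalize_answer from the evaluation script; this is not the official evaluation code, which also handles multipleQAs annotations.)

# Rough illustration (not the official evaluation script): exact match of a
# single predicted string against each singleAnswer annotation, taking the
# maximum over annotations.
def em_over_annotations(prediction, annotations):
    scores = []
    for annotation in annotations:
        if annotation["type"] == "singleAnswer":
            gold = [normalize_answer(a) for a in annotation["answer"]]
            scores.append(float(normalize_answer(prediction) in gold))
    return max(scores) if scores else 0.0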

Is that clear? Let me know if you have any other questions.

NoviScl commented 4 years ago

Thanks for the detailed explanation! Yep, I understand the annotation scheme now.

One further question: how exactly do you measure the retriever's recall in this case, given that there are multiple correct answers for multipleQAs examples? Do the retrieved passages need to contain all the correct answers of a multiple-answer question for the retrieval to count as successful?

shmsw25 commented 4 years ago

Great question. Yes, it is hard to measure, and there is no official metric for it. But I think you can compute a "macro-average" recall (the average over answers of whether each answer is retrieved) and an "all" recall (whether every answer is retrieved). I wrote brief pseudocode below.

# dp is a single datapoint from dev_light.json
# normalize_answer is the normalization function used in the evaluation script
# retrieved_text is a list of the top-K retrieved passages (strings)
import numpy as np

retrieved_text = [normalize_answer(p) for p in retrieved_text]
recall_macro_avg_list, recall_all_list = [], []
for annotation in dp["annotations"]:
    # collect the gold answer sets for this annotation
    if annotation["type"] == "singleAnswer":
        answers = [annotation["answer"]]
    else:  # "multipleQAs"
        answers = [pair["answer"] for pair in annotation["qaPairs"]]
    answers = [[normalize_answer(answer) for answer in answer_set] for answer_set in answers]
    # recall[i] is True iff some retrieved passage contains some string from answer set i
    recall = [any(answer in p for answer in answer_set for p in retrieved_text)
              for answer_set in answers]
    recall_macro_avg_list.append(np.mean(recall))    # fraction of answer sets retrieved
    recall_all_list.append(float(np.all(recall)))    # 1.0 iff every answer set is retrieved
# take the best annotation, as the evaluation script does
recall_macro_avg = np.max(recall_macro_avg_list)
recall_all = np.max(recall_all_list)
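
To report a single number over the dev set, you could then average these per-question values (this aggregation is just my suggestion, not an official metric). Assuming you wrap the snippet above in a hypothetical passage_recall(dp, retrieved_text) helper:

# Hypothetical dataset-level aggregation (not part of the repo): run the
# snippet above for every datapoint and average the per-question values.
per_q_macro, per_q_all = [], []
for dp, retrieved_text in zip(data, retrieved_passages):  # assumed variable names
    macro, covers_all = passage_recall(dp, retrieved_text)
    per_q_macro.append(macro)
    per_q_all.append(covers_all)
print("answer recall (macro-average): %.3f" % np.mean(per_q_macro))
print("answer recall (all answers):   %.3f" % np.mean(per_q_all))
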
NoviScl commented 4 years ago

Ok, I get the idea. This is very helpful. Thanks!!