The relaxed_correctness metric compares whether the predicted answer matches the ground-truth answer. According to its definition:
“Following Methani et al. (2020), we use a relaxed accuracy measure for the
numeric answers to allow a minor inaccuracy that may result from the automatic
data extraction process. We consider an answer to be correct if it is within
5% of the gold answer. For non-numeric answers, we still need an exact match
to consider an answer to be correct.”
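For reference, here is a minimal sketch of how such a relaxed metric is commonly implemented (function and helper names here are mine, not necessarily those in the repo; the percent-stripping detail is an assumption):

```python
def relaxed_correctness(prediction: str, target: str,
                        max_relative_change: float = 0.05) -> bool:
    """Relaxed match: numeric answers within 5% of gold, else exact match."""
    def to_float(text: str):
        try:
            # Assumption: percent signs are stripped so "12%" parses as 12.0.
            return float(text.strip().rstrip("%"))
        except ValueError:
            return None

    pred_float = to_float(prediction)
    target_float = to_float(target)
    if pred_float is not None and target_float is not None:
        if target_float == 0.0:
            return pred_float == target_float
        relative_change = abs(pred_float - target_float) / abs(target_float)
        return relative_change <= max_relative_change
    # Non-numeric answers require an exact (here case-insensitive) match.
    return prediction.strip().lower() == target.strip().lower()
```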
The implementation here seems to treat year names as numeric answers, which causes most year-prediction questions to be counted as correct and thus inflates the reported performance. For example, {pred: 2008, GT: 2010} is counted as correct because it falls within the 5% tolerance, which shouldn't be the case.
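One possible workaround, sketched below under the assumption that gold answers matching a four-digit calendar-year pattern should be matched exactly (the heuristic and names are mine, not part of the existing code):

```python
import re

def looks_like_year(text: str) -> bool:
    """Heuristic: a bare four-digit integer in a plausible calendar range."""
    return bool(re.fullmatch(r"[12]\d{3}", text.strip()))

def relaxed_correctness_with_year_guard(prediction: str, target: str) -> bool:
    # Require an exact match when the gold answer looks like a year,
    # so {pred: 2008, GT: 2010} is no longer accepted.
    if looks_like_year(target):
        return prediction.strip() == target.strip()
    return relaxed_correctness(prediction, target)

# Plain relaxed metric: abs(2008 - 2010) / 2010 ≈ 0.001 <= 0.05 -> counted correct.
# With the guard: "2008" != "2010" -> counted incorrect, as expected.
```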
Could anyone confirm this?