The relaxed_correctness metric compares whether the predicted answer matches the ground-truth answer. According to its definition:
“Following Methani et al. (2020), we use a relaxed accuracy measure for the
numeric answers to allow a minor inaccuracy that may result from the automatic
data extraction process. We consider an answer to be correct if it is within
5% of the gold answer. For non-numeric answers, we still need an exact match
to consider an answer to be correct.”
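For reference, here is a minimal sketch of how such a relaxed metric is commonly implemented (function and helper names here are mine, not necessarily those in the repo; the percent-stripping detail is an assumption):

```python
def relaxed_correctness(prediction: str, target: str,
                        max_relative_change: float = 0.05) -> bool:
    """Relaxed match: numeric answers within 5% of gold, else exact match."""
    def to_float(text: str):
        try:
            # Assumption: percent signs are stripped so "12%" parses as 12.0.
            return float(text.strip().rstrip("%"))
        except ValueError:
            return None

    pred_float = to_float(prediction)
    target_float = to_float(target)
    if pred_float is not None and target_float is not None:
        if target_float == 0.0:
            return pred_float == target_float
        relative_change = abs(pred_float - target_float) / abs(target_float)
        return relative_change <= max_relative_change
    # Non-numeric answers require an exact (here case-insensitive) match.
    return prediction.strip().lower() == target.strip().lower()
```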
The implementation here seems to treat year names as numeric answers, which causes most year-prediction questions to be counted as correct and thus inflates the reported performance. For example, {pred: 2008, GT: 2010} is counted as correct because it falls within the 5% tolerance, which shouldn't be the case.
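One possible workaround, sketched below under the assumption that gold answers matching a four-digit calendar-year pattern should be matched exactly (the heuristic and names are mine, not part of the existing code):

```python
import re

def looks_like_year(text: str) -> bool:
    """Heuristic: a bare four-digit integer in a plausible calendar range."""
    return bool(re.fullmatch(r"[12]\d{3}", text.strip()))

def relaxed_correctness_with_year_guard(prediction: str, target: str) -> bool:
    # Require an exact match when the gold answer looks like a year,
    # so {pred: 2008, GT: 2010} is no longer accepted.
    if looks_like_year(target):
        return prediction.strip() == target.strip()
    return relaxed_correctness(prediction, target)

# Plain relaxed metric: abs(2008 - 2010) / 2010 ≈ 0.001 <= 0.05 -> counted correct.
# With the guard: "2008" != "2010" -> counted incorrect, as expected.
```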
Could anyone confirm this?