Evaluation Metric - Githubissues

meaningful96 commented 3 months ago

Hi. Thanks for the great paper.

While reviewing your paper and code, I have a question. In the paper, you mentioned, "we sample 10 hard candidate answers with the same type of the truth for prediction on KBQA." I'm curious about the calculation method of the Exact Match (EM) that you showed in the main results.

Does it mean that if the correct answer is among the 10 candidate samples, it is considered True? In other words, does EM become Hits@10? Or, is it that we only consider the top 10 candidate samples, and if the top-1 with the highest score matches the ground truth, it is considered True?

In addition, how do you extract the 10 hard candidates?

Could you please clarify this for me? Thank you for your time and assistance.

l-xin commented 3 months ago

There may be some misunderstanding about our evaluation. We convert the QA evaluation into a multiple-choice format. For each question we sample 10 candidates as the options (candidates = ground truth + sampled negatives), and evaluate the models on selecting the correct answers from these options.

EM is used in the multi-choice setting, where a question may have more than one ground-truth answer (e.g., ComplexWebQuestions). The model selects an option if its predicted matching score is greater than the threshold. If the model selects all ground-truth options and no negative option from the candidates (i.e., its selection exactly matches the ground truth), the EM score is 1; otherwise (the model misses a ground-truth answer or selects a negative option), the EM score is 0.
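
As a rough illustration of the multi-choice rule just described (this is a sketch, not the repository's code), the per-question EM could be computed as follows. The function name `exact_match_multi`, the dictionary layout, and the default threshold of 0.5 are assumptions made for the example; the thread does not state the actual threshold.

```python
from typing import Dict, Set


def exact_match_multi(
    scores: Dict[str, float],
    truths: Set[str],
    threshold: float = 0.5,  # assumed value; the real threshold is not given in the thread
) -> int:
    """Return 1 if the selected options are exactly the ground-truth set, else 0.

    `scores` maps each candidate option to its predicted matching score.
    """
    # An option is selected when its score is strictly greater than the threshold.
    selected = {option for option, score in scores.items() if score > threshold}
    # EM is 1 only when no ground-truth option is missed and no negative is selected.
    return int(selected == truths)


example_scores = {"Paris": 0.9, "Lyon": 0.8, "Berlin": 0.1}
print(exact_match_multi(example_scores, {"Paris", "Lyon"}))  # both truths selected, no negative
print(exact_match_multi(example_scores, {"Paris"}))          # "Lyon" is a selected negative
```

Averaging this 0/1 value over all questions would give the dataset-level EM.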

In the single-choice setting (e.g., FreebaseQA), EM reduces to accuracy (ACC). The model selects only the option with the highest score, and we check whether it matches the single ground-truth answer.
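
For the single-choice case, a minimal sketch under the same caveat (the names `single_choice_correct` and `accuracy` are hypothetical, not taken from the repository) might look like this:

```python
from typing import Dict, List, Tuple


def single_choice_correct(scores: Dict[str, float], truth: str) -> int:
    """Return 1 if the highest-scoring candidate is the single ground truth, else 0."""
    # Pick the option with the highest predicted matching score (top-1).
    predicted = max(scores, key=scores.get)
    return int(predicted == truth)


def accuracy(examples: List[Tuple[Dict[str, float], str]]) -> float:
    """Average the per-question 0/1 results over a list of (scores, truth) pairs."""
    return sum(single_choice_correct(s, t) for s, t in examples) / len(examples)
```

Under this reading, the metric is top-1 accuracy over the 10 candidates rather than Hits@10: the ground truth being somewhere among the candidates is guaranteed by construction and does not count as a hit on its own.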

The hard negatives are sampled from entities with the same type as the ground truth. For example, if the ground truth is a height entity (e.g., "100 m"), we sample other height entities as negatives (e.g., "120 m", "1 km", "1.8 m"). We determine the type of an entity from the relation in the triple: if there exists a triple (A, "height", B), then the type of B is height. For more details, you can refer to our code in sample_negatives.py.
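
The idea can be sketched roughly as below. This is an illustrative approximation only and is not the contents of sample_negatives.py; the function names, the use of the relation string as the type label, and the fixed random seed are all assumptions for the example.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def build_type_index(triples: Iterable[Triple]) -> Dict[str, Set[str]]:
    """Group tail entities by relation, treating the relation as the tail's type."""
    by_type: Dict[str, Set[str]] = defaultdict(set)
    for _head, relation, tail in triples:
        by_type[relation].add(tail)
    return by_type


def sample_candidates(
    truths: List[str],
    relation: str,
    by_type: Dict[str, Set[str]],
    num_candidates: int = 10,
    seed: int = 0,  # assumed, only to make the sketch reproducible
) -> List[str]:
    """Return the ground truths plus same-type negatives, shuffled."""
    rng = random.Random(seed)
    # Same-type entities that are not themselves ground-truth answers.
    pool = sorted(by_type.get(relation, set()) - set(truths))
    # Fill the remaining slots, or take the whole pool if it is too small.
    k = max(0, min(num_candidates - len(truths), len(pool)))
    candidates = list(truths) + rng.sample(pool, k)
    rng.shuffle(candidates)
    return candidates
```

With triples such as `("Tower A", "height", "100 m")` and `("Tower B", "height", "120 m")`, a question whose answer is "100 m" would draw its negatives only from other tails of the "height" relation.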

meaningful96 commented 3 months ago

Hi.

Thank you so much for your detailed and clear explanation. It really helped me understand the evaluation process better. I appreciate the effort you put into breaking it down in a simpler way. Thank you again for your support!

l-xin / KICP

Evaluation Metric #1