Closed: osoblanco closed this issue 4 years ago
Hi,
The goal of our experiments is to evaluate a model's performance in finding non-trivial answers. During evaluation, we first build a validation KG (train edges + valid edges) and a test KG (train edges + valid edges + test edges). Then, given a test query q, we can obtain the answers to this query on both KGs, denoted [q]_val and [q]_test. By non-trivial answers, we mean the set difference [q]_test \ [q]_val, i.e., the answers that you can only reach on the test KG but not on the validation KG.
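The two evaluation graphs described above can be sketched as simple edge sets (toy edges purely for illustration; the repo loads the real splits from its dataset files):

```python
# Sketch: constructing the two evaluation KGs as unions of edge sets.
# Edges are (head, relation, tail) triples; these are toy values.
train_edges = {("a", "r1", "b"), ("b", "r2", "c")}
valid_edges = {("a", "r1", "c")}
test_edges  = {("c", "r2", "d")}

valid_kg = train_edges | valid_edges   # train + valid edges
test_kg  = valid_kg | test_edges       # train + valid + test edges
print(len(valid_kg), len(test_kg))     # prints: 3 4
```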
Back to your question: the test_hard_ans dictionary contains (q, [q]_test \ [q]_val), and the test_ans dictionary contains (q, [q]_test). During experiments we only evaluate the data points in the test_hard_ans dictionary; the reason we also keep the test_ans dictionary is that we filter all of [q]_test during ranking in order to calculate the metrics. This filtered setting is also standard in KG link-prediction tasks. (The same applies to valid_hard_ans and valid_ans.)
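The relation between the two dictionaries is just a per-query set difference. A minimal sketch (the function name and toy data are illustrative, not the repo's actual preprocessing code):

```python
# Sketch: deriving the "hard" (non-trivial) answer dictionary from the
# full one. test_ans maps each query q to its answer set on the test KG;
# val_ans maps q to its answer set on the validation KG.

def build_hard_answers(test_ans, val_ans):
    """For each query, keep only answers reachable on the test KG
    but not on the validation KG: [q]_test \\ [q]_val."""
    return {q: answers - val_ans.get(q, set())
            for q, answers in test_ans.items()}

test_ans = {"q1": {"e1", "e2", "e3"}}
val_ans = {"q1": {"e1", "e2"}}
hard = build_hard_answers(test_ans, val_ans)
print(hard)  # prints: {'q1': {'e3'}}
```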
Kindly let me know if you have additional questions.
Thanks for the swift reply,
I understood the first part: here you use [q]_test \ [q]_val as the answer set.
I am not exactly sure if I understand what you mean by saying "the reason why we also keep the test_ans dictionary is that we filter all [q]_test during ranking in order to calculate the numbers".
By filtering do you mean that you remove the answers in [q]_val from the overall answer set?
Also, why do you need the "false_answers" for filtering here?
As I understand it, "false_answers" is everything not in [q]_test, and every such "false_answer" edge is activated while filtering.
Doesn't this set the score for every edge in [q]_val to 0?
Yes, exactly: we filter [q]_val by setting its entities' scores to zero. Let me give you an example. Suppose a KG has 7 entities, e1-e7, with [q]_val = {e1, e2} and [q]_test \ [q]_val = {e3}. Then we only rank e3 against e4 to e7, masking the scores of e1 and e2 by setting them to zero.
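The masking step in this example can be sketched as follows (illustrative code, not the repo's exact implementation; zeroing works here because all raw scores are positive):

```python
import numpy as np

# Sketch of filtered ranking. 7 entities e1..e7 -> indices 0..6.
# Higher score = better; these scores are made up for the example.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])

val_ans = {0, 1}   # [q]_val = {e1, e2}
hard_ans = {2}     # [q]_test \ [q]_val = {e3}

for a in hard_ans:
    filt = scores.copy()
    # Mask every other known answer (here all of [q]_val) with score 0,
    # so e3 is ranked only against e4..e7.
    mask = list((val_ans | hard_ans) - {a})
    filt[mask] = 0.0
    rank = 1 + int((filt > filt[a]).sum())
    print(f"entity e{a + 1} rank: {rank}")  # prints: entity e3 rank: 1
```

Without the mask, e1 and e2 (with higher raw scores) would push e3 down to rank 3 even though they are correct answers, which is exactly what the filtered setting avoids.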
Hi Ren:
I was wondering about the evaluation/answer sets used to obtain the reported metrics, i.e. MRR and Hits@K.
There are 2 pickled answer dictionaries for each type of chain, e.g. "test_ans_1c.pkl" and "test_ans_1c_hard.pkl".
According to https://github.com/hyren/query2box/blob/99dc9f54aa98183976dd73f077033a2886d01891/codes/model.py#L1017-L1021
Only the "_hard" version of the answers is used for evaluation. I wanted to clarify the meaning and origin of the "_hard" answers, as I couldn't find it in the paper.
It seems that the normal answers are only used to find "false_answers", which in turn are used to filter the scores in
https://github.com/hyren/query2box/blob/99dc9f54aa98183976dd73f077033a2886d01891/codes/model.py#L1005-L1015
If possible, could you elaborate a bit on this chunk?