Closed: osoblanco closed this issue 4 years ago
Hi,
The goal of our experiments is to evaluate a model's performance in finding non-trivial answers. During evaluation, we first build a validation KG (train edges + valid edges) and a test KG (train edges + valid edges + test edges). Then, given a test query q, we can obtain the answers to this query on both KGs, denoted [q]_val and [q]_test. By non-trivial answers, we mean the set difference [q]_test \ [q]_val, i.e., the answers that you can only reach on the test KG but not on the validation KG.
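The two evaluation graphs described above can be sketched as simple edge sets (toy edges purely for illustration; the repo loads the real splits from its dataset files):

```python
# Sketch: constructing the two evaluation KGs as unions of edge sets.
# Edges are (head, relation, tail) triples; these are toy values.
train_edges = {("a", "r1", "b"), ("b", "r2", "c")}
valid_edges = {("a", "r1", "c")}
test_edges  = {("c", "r2", "d")}

valid_kg = train_edges | valid_edges   # train + valid edges
test_kg  = valid_kg | test_edges       # train + valid + test edges
print(len(valid_kg), len(test_kg))     # prints: 3 4
```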
Back to your question: the test_hard_ans dictionary contains (q, [q]_test \ [q]_val), and the test_ans dictionary contains (q, [q]_test). During experiments we only evaluate the data points in the test_hard_ans dictionary; the reason we also keep the test_ans dictionary is that we filter all of [q]_test during ranking in order to calculate the metrics. This filtered setting is also standard in KG link-prediction tasks. (The same applies to valid_hard_ans and valid_ans.)
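The relation between the two dictionaries is just a per-query set difference. A minimal sketch (the function name and toy data are illustrative, not the repo's actual preprocessing code):

```python
# Sketch: deriving the "hard" (non-trivial) answer dictionary from the
# full one. test_ans maps each query q to its answer set on the test KG;
# val_ans maps q to its answer set on the validation KG.

def build_hard_answers(test_ans, val_ans):
    """For each query, keep only answers reachable on the test KG
    but not on the validation KG: [q]_test \\ [q]_val."""
    return {q: answers - val_ans.get(q, set())
            for q, answers in test_ans.items()}

test_ans = {"q1": {"e1", "e2", "e3"}}
val_ans = {"q1": {"e1", "e2"}}
hard = build_hard_answers(test_ans, val_ans)
print(hard)  # prints: {'q1': {'e3'}}
```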
Kindly let me know if you have additional questions.
Thanks for the swift reply,
I understood the first part: here you use [q]_test \ [q]_val as the answer set.
I am not exactly sure if I understand what you mean by saying "the reason why we also keep the test_ans dictionary is that we filter all [q]_test during ranking in order to calculate the numbers".
By filtering do you mean that you remove the answers in [q]_val from the overall answer set?
Also, why do you need the "false_answers" for filtering here?
As I understand it, "false_answers" is everything not in [q]_test, and every such "false_answer" edge is activated while filtering.
Doesn't this set the score for every edge in [q]_val to 0?
Yes, exactly: we filter [q]_val by setting its entities' scores to zero. Let me give you an example. Suppose a KG has 7 entities, e1-e7, with [q]_val = {e1, e2} and [q]_test \ [q]_val = {e3}. Then we only rank e3 against e4 to e7, masking the scores of e1 and e2 by setting them to zero.
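The masking step in this example can be sketched as follows (illustrative code, not the repo's exact implementation; zeroing works here because all raw scores are positive):

```python
import numpy as np

# Sketch of filtered ranking. 7 entities e1..e7 -> indices 0..6.
# Higher score = better; these scores are made up for the example.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])

val_ans = {0, 1}   # [q]_val = {e1, e2}
hard_ans = {2}     # [q]_test \ [q]_val = {e3}

for a in hard_ans:
    filt = scores.copy()
    # Mask every other known answer (here all of [q]_val) with score 0,
    # so e3 is ranked only against e4..e7.
    mask = list((val_ans | hard_ans) - {a})
    filt[mask] = 0.0
    rank = 1 + int((filt > filt[a]).sum())
    print(f"entity e{a + 1} rank: {rank}")  # prints: entity e3 rank: 1
```

Without the mask, e1 and e2 (with higher raw scores) would push e3 down to rank 3 even though they are correct answers, which is exactly what the filtered setting avoids.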
Hi Ren:
I was wondering about the evaluation/answer sets used to obtain the reported metrics, i.e. MRR and Hits@K.
There are 2 pickled answer dictionaries for each type of chain, e.g. "test_ans_1c.pkl" and "test_ans_1c_hard.pkl".
According to https://github.com/hyren/query2box/blob/99dc9f54aa98183976dd73f077033a2886d01891/codes/model.py#L1017-L1021
Only the "_hard" version of the answers is used for evaluation. I wanted to clarify the meaning and origin of the "_hard" answers, as I couldn't find it in the paper.
It seems that the normal answers are only used to find "false_answers", which in turn are used to filter the scores in
https://github.com/hyren/query2box/blob/99dc9f54aa98183976dd73f077033a2886d01891/codes/model.py#L1005-L1015
If possible, could you elaborate a bit on this chunk?