snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

[question] Help understanding hits@20 #256

Closed felipemello1 closed 3 years ago

felipemello1 commented 3 years ago

Hi, I am having a hard time understanding the hits@20 metric for DDI link prediction. Could someone please help me with it? This is the evaluator code:

kth_score_in_negative_edges = torch.topk(y_pred_neg, self.K)[0][-1]
hitsK = float(torch.sum(y_pred_pos > kth_score_in_negative_edges).cpu()) / len(y_pred_pos)

So I tried to build an example with K=3:

import torch

y_pred_pos = torch.tensor([0.8, 0.9, 0.4, 0.5, 0.3])
y_pred_neg = torch.tensor([0.7, 0.45, 0.2, 0.1, 0.6])

#ranked
#[0.9, 0.8, 0.7, 0.6, 0.5, 0.45, 0.4, 0.3, 0.2, 0.1]
#[1.0, 1.0, 0.0, 0.0, 1.0, 0.00, 1.0, 1.0, 0.0, 0.0]

K = 3
kth_score_in_negative_edges = torch.topk(y_pred_neg, K)[0][-1]
print('kth_score_in_negative_edges', kth_score_in_negative_edges)
#>>> kth_score_in_negative_edges tensor(0.4500)

hitsK = float(torch.sum(y_pred_pos > kth_score_in_negative_edges).cpu()) / len(y_pred_pos)
print('hitsK', hitsK)
#>>> hitsK 0.6

What I understand the algorithm to be doing is: take the top-K highest-scoring negatives (the worst errors). Then, to count how many positives we got right, all we have to do is check how many positive scores are larger than the K-th highest negative score. Therefore, score = 3 / 5 = 0.6.

What I don't get is how this applies to each specific drug. In my mind, if I have a triple u,e,v, the corruptions should always have the form u,e,corrupted_node, without ever changing u. But the algorithm completely disregards that, since it evaluates the whole validation/test set at once: the corrupted triples can be r,e,t where r != u. If I had to predict 1M triples and, for a specific node r, the model wrongly gave very high scores to K of them, then node u would be penalized by those triples of the form r,e,_. Is that correct?

I was thinking that the right approach would be to create a set of corrupted triples specific to each node u, of the form u,e,corrupted_node, something like the snippet below. Could someone please explain why I am wrong and the current method works?

import numpy as np

scores = []
for u in all_drugs:
    y_pred_pos = get_positives(starts_with=u)
    y_pred_neg = get_negatives(starts_with=u)
    score = evaluator.eval({'y_pred_pos': y_pred_pos, 'y_pred_neg': y_pred_neg})
    scores.append(score)
mean_score = np.mean(scores)
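For concreteness, here is a runnable sketch of this per-node averaging, with hypothetical hand-picked scores standing in for `get_positives` / `get_negatives` (numpy in place of torch):

```python
import numpy as np

def hits_at_k(y_pred_pos, y_pred_neg, k):
    # fraction of positives scoring above the k-th highest negative
    kth = np.sort(y_pred_neg)[-k]
    return float(np.mean(y_pred_pos > kth))

# hypothetical per-head-node scores: {u: (positive scores, negative scores)}
per_node_scores = {
    'u1': (np.array([0.9, 0.4]), np.array([0.95, 0.2, 0.1])),
    'u2': (np.array([0.8]),      np.array([0.6, 0.5, 0.3])),
}

k = 1
per_node_hits = [hits_at_k(p, n, k) for p, n in per_node_scores.values()]
print(np.mean(per_node_hits))  # 0.5: u1 scores 0.0, u2 scores 1.0
```

Under this scheme, one badly scored negative for u1 only hurts u1's hits; u2 is unaffected.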
weihua916 commented 3 years ago

Hi! Good point. It's true that hits@20 used here is not the standard "hits@20" used in the KG completion literature. In DDI, the task is more of mining the positive edges in the pool of all possible edges (i.e., pairs of nodes). Hence, we are evaluating the positive edges against the pool of negative edges (without fixing the end node).
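The shared-pool computation can be sketched like this (same toy numbers as the example above, numpy in place of torch):

```python
import numpy as np

def hits_at_k(y_pred_pos, y_pred_neg, k):
    # shared-pool Hits@K: every positive edge is compared against the
    # same negative pool, without fixing the head node
    kth_neg = np.sort(y_pred_neg)[-k]   # k-th highest negative score
    return float(np.mean(y_pred_pos > kth_neg))

pos = np.array([0.8, 0.9, 0.4, 0.5, 0.3])
neg = np.array([0.7, 0.45, 0.2, 0.1, 0.6])
print(hits_at_k(pos, neg, k=3))  # 0.6
```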

felipemello1 commented 3 years ago

Thanks for answering @weihua916! Out of ~111k valid+test edges, 33% are positive and 66% are negative. If the model mistakenly scores 20 negative edges as 1, then my hits@20 will be 0, even if my F1 score is 0.95, for example. Did I get it right?

weihua916 commented 3 years ago

You are welcome! No that's not correct. We are scoring each positive edge against the pool of provided negative edges. The metric is the ratio of positive edges that are ranked within the top 20 among the negative edges.
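Equivalently (assuming no ties between scores), one can compute each positive edge's rank within the negative pool and count it as a hit if the rank is at most K; this matches the threshold computation in the evaluator. A sketch with hypothetical scores:

```python
import numpy as np

def hits_at_k_by_rank(y_pred_pos, y_pred_neg, k):
    # rank of each positive among the negatives:
    # 1 + number of negatives that score strictly higher
    ranks = 1 + np.sum(y_pred_neg[None, :] > y_pred_pos[:, None], axis=1)
    return float(np.mean(ranks <= k))

pos = np.array([0.8, 0.9, 0.4, 0.5, 0.3])
neg = np.array([0.7, 0.45, 0.2, 0.1, 0.6])
print(hits_at_k_by_rank(pos, neg, k=3))  # 0.6
```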

felipemello1 commented 3 years ago

Ok, I guess my problem then is understanding what the "pool of provided negative edges" is.

In my first comment, I thought that ideally there would be, for each positive triple u,e,v, a tailored subset of negative edges of the form u,e,corrupted. However, what I understood from your first answer is that the pool is the same for every positive edge, and is composed of all 66% of the 111k negative edges (~70k edges). In other words, we compare each positive edge against the same set of ~70k negative edges.

Thanks for your patience.

weihua916 commented 3 years ago

Yes, that's exactly what we did.

felipemello1 commented 3 years ago

Then, I am not sure I understand why my previous comment was incorrect. Let's say I get an F1 score of 0.95 and predict the whole dataset pretty well.

If in this negative pool of 70k there are 20 negative edges the model wrongly scores as 1, then my hits@20 will be 0 for the whole dataset, because no positive score can be > 1, so the term float(torch.sum(y_pred_pos > kth_score_in_negative_edges).cpu()) would be zero.

weihua916 commented 3 years ago

I see what you mean. Yes, if there are 20 negative edges on which your model predicts super high scores (higher than any of the positive edges), then the model performance would be 0.
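This failure mode is easy to demonstrate with synthetic scores (hypothetical data, sized roughly like the thread's numbers):

```python
import numpy as np

k = 20
rng = np.random.default_rng(0)
pos = rng.uniform(0.6, 0.99, size=1_000)    # positives predicted well
neg = rng.uniform(0.0, 0.4, size=70_000)    # negatives predicted well...
neg[:k] = 1.0                               # ...except k confident mistakes

kth_neg = np.sort(neg)[-k]                  # the k-th highest negative is 1.0
print(float(np.mean(pos > kth_neg)))        # 0.0: no positive can beat it
```

With only k-1 such mistakes, the threshold would fall back below 0.4 and Hits@20 would be 1.0 for these scores, which is what makes the metric so sensitive to a handful of high-confidence false positives.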

felipemello1 commented 3 years ago

Got it! Thanks @weihua916