snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

not sure if it's the issue for computing the hit in the ogb.Evaluator #406

Closed: Juanhui28 closed this issue 10 months ago

Juanhui28 commented 1 year ago

Hi, thanks for providing a nice package for the OGB datasets! I have a concern about the evaluation used to compute Hits@K. In your implementation, the positive sample is always ranked last among tied scores. I am not sure if that is unfair. For the MRR, the implementation considers both ranking last and ranking first, which seems more reasonable to me.

Should we also consider both ranking last and ranking first for Hits@K?

Thanks!

weihua916 commented 1 year ago

Thanks for raising this issue. For Hits@K, the "pessimistic" implementation makes more sense to me. Suppose you predict exactly the same score, say alpha, for all positive and negative edges, i.e., the entries of y_pred_pos and y_pred_neg are all alpha. Then kth_score_in_negative_edges would be alpha. If we were "optimistic" and changed this line to

hitsK = float(torch.sum(y_pred_pos >= kth_score_in_negative_edges).cpu()) / len(y_pred_pos)

then, hitsK would be 1, which is clearly not ideal. Let me know if this makes sense.
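(For concreteness, here is a minimal, self-contained sketch of the pessimistic rule and the all-alpha scenario. The function mirrors the evaluator line quoted above, but the name hits_at_k and the tensor shapes are illustrative, not OGB's exact API.)

import torch

def hits_at_k(y_pred_pos, y_pred_neg, k):
    # Pessimistic Hits@K: a positive edge counts as a hit only if its score
    # STRICTLY exceeds the k-th highest negative score, so ties go to the negatives.
    kth_score_in_negative_edges = torch.topk(y_pred_neg, k)[0][-1]
    return float((y_pred_pos > kth_score_in_negative_edges).sum()) / len(y_pred_pos)

# Degenerate case from the comment above: every score is the same alpha.
alpha = 0.5
y_pred_pos = torch.full((100,), alpha)
y_pred_neg = torch.full((1000,), alpha)
print(hits_at_k(y_pred_pos, y_pred_neg, k=50))  # 0.0 under the pessimistic rule
# With `>=` instead of `>`, this constant predictor would score a perfect 1.0.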

Juanhui28 commented 1 year ago

Thanks for your reply! Yes, ranking the positive sample first when there is a tie doesn't make sense. Then how about combining ranking last and ranking first, the way the MRR computation does?

optimistic_rank = (y_pred_neg >= y_pred_pos).sum(dim=1)
pessimistic_rank = (y_pred_neg > y_pred_pos).sum(dim=1)
ranking_list = 0.5 * (optimistic_rank + pessimistic_rank) + 1

Doesn't that seem more reasonable?
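(To make the proposal concrete, here is a small self-contained example of how that averaged rank feeds into MRR. The toy tensors are illustrative; the shapes follow _eval_mrr, where y_pred_neg holds one row of negatives per positive edge.)

import torch

# One row of negatives per positive edge, as in Evaluator._eval_mrr.
y_pred_pos = torch.tensor([0.9, 0.5, 0.5]).view(-1, 1)
y_pred_neg = torch.tensor([[0.1, 0.2, 0.3],
                           [0.5, 0.5, 0.4],
                           [0.6, 0.5, 0.1]])

# Note: despite its name, optimistic_rank counts ties AGAINST the positive
# edge (>=), so it is never smaller than pessimistic_rank.
optimistic_rank = (y_pred_neg >= y_pred_pos).sum(dim=1)
pessimistic_rank = (y_pred_neg > y_pred_pos).sum(dim=1)
ranking_list = 0.5 * (optimistic_rank + pessimistic_rank) + 1
print(ranking_list)                         # tensor([1.0000, 2.0000, 2.5000])
print((1.0 / ranking_list.float()).mean())  # MRR = tensor(0.6333)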

weihua916 commented 1 year ago

Hi! I am a bit confused. In these lines, aren't optimistic and pessimistic swapped? It looks like optimistic_rank is always at least as large as pessimistic_rank. In general, I think we should only consider the pessimistic ranking for the resulting link prediction system to be useful, so the MRR should only use the pessimistic rank, in my opinion. What do you think?

Juanhui28 commented 1 year ago

Actually, the code in my last comment is from the official OGB implementation for computing the MRR: the _eval_mrr function in the Evaluator class. Yes, I think the optimistic/pessimistic names are confusing. But the MRR does consider both rankings in the code.

And I also think combining the two is more reasonable, because ranking strictly first or strictly last is unfair. The line ranking_list = 0.5 * (optimistic_rank + pessimistic_rank) + 1 effectively ranks the positive sample in the middle of its tie group. It is as if we broke ties randomly: the expected rank under random tie-breaking is the middle, which makes more sense to me.
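(A quick empirical check of that expectation argument, as an illustrative simulation rather than OGB code: breaking ties with tiny random noise and averaging the resulting rank recovers the mid-rank formula.)

import torch

torch.manual_seed(0)
pos = torch.tensor([0.5])
negs = torch.tensor([0.5, 0.5, 0.5, 0.2])  # three negatives tied with the positive

ranks = []
for _ in range(20000):
    scores = torch.cat([pos, negs]) + 1e-6 * torch.rand(5)  # random tie-break
    ranks.append(1 + (scores[1:] > scores[0]).sum().item())
print(sum(ranks) / len(ranks))  # ~2.5 on average

# The mid-rank formula gives the same value in closed form:
optimistic_rank = (negs >= pos).sum()   # 3
pessimistic_rank = (negs > pos).sum()   # 0
print(0.5 * (optimistic_rank + pessimistic_rank) + 1)  # tensor(2.5000)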

weihua916 commented 1 year ago

The MRR code came from a PR, so I am a bit confused myself (even though I approved it back then) :)

For Hits@K, the optimistic rule does not make sense to me because you can achieve a score of 1 by predicting the same score everywhere. Even after averaging with the pessimistic rule, the score would be 0.5. I lean towards the pessimistic rule because I don't think it is right to predict the same score for many links to cheat the metric. What if you predicted the same score across all pairs of nodes? That would still give you a Hits@K of 0.5.
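(Reusing the all-alpha sketch from earlier in the thread, with the same illustrative names: averaging the optimistic and pessimistic rules pins a constant predictor at exactly 0.5.)

import torch

alpha = 0.7
y_pred_pos = torch.full((100,), alpha)   # same constant score everywhere
y_pred_neg = torch.full((1000,), alpha)
kth = torch.topk(y_pred_neg, 50)[0][-1]

optimistic = float((y_pred_pos >= kth).sum()) / len(y_pred_pos)   # 1.0
pessimistic = float((y_pred_pos > kth).sum()) / len(y_pred_pos)   # 0.0
print(0.5 * (optimistic + pessimistic))  # 0.5 without learning anything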