snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Inconsistent Results of LinkPrediction Evaluator for MRR #321

Closed VeritasYin closed 2 years ago

VeritasYin commented 2 years ago

Hello,

I was using the OGB evaluator on the ogbl-citation2 dataset with the MRR metric. However, I found that the evaluator gives inconsistent results for NumPy arrays versus PyTorch tensors when the inputs are integers rather than floats. I tested the following code on versions 1.3.0 and 1.3.3; both show the behavior below.

from ogb.linkproppred import Evaluator
import numpy as np
import torch

dataset = 'ogbl-citation2'
evaluator = Evaluator(name=dataset)

# Random 0/1 integer scores: one positive per example, 1000 negatives.
yp = np.random.randint(2, size=(100,))
yn = np.random.randint(2, size=(100, 1000))

# Same data, evaluated once as numpy arrays and once as torch tensors.
print(evaluator.eval({"y_pred_pos": yp, "y_pred_neg": yn})['mrr_list'].mean())
print(evaluator.eval({"y_pred_pos": torch.from_numpy(yp), "y_pred_neg": torch.from_numpy(yn)})['mrr_list'].mean())

The corresponding outputs are 0.52068675 and tensor(0.0049), respectively.
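
For reference, the MRR computation ranks each positive score against its 1,000 negative scores and averages the reciprocal ranks. Below is a rough paraphrase of the evaluator's torch code path (a sketch of the logic, not the exact source; yp_t and yn_t are names I introduce here for the torch copies of yp and yn):

yp_t = torch.from_numpy(yp)
yn_t = torch.from_numpy(yn)

# Prepend the positive score to its negatives, sort each row
# descending, and locate where the positive (column 0) landed.
y_pred = torch.cat([yp_t.view(-1, 1), yn_t], dim=1)
argsort = torch.argsort(y_pred, dim=1, descending=True)
ranking = torch.nonzero(argsort == 0, as_tuple=False)[:, 1] + 1
mrr_list = 1. / ranking.to(torch.float)

So the printed number is the mean reciprocal rank of the positive candidate, and identical inputs should give identical values on both paths.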

weihua916 commented 2 years ago

Hi! Interesting. The issue does not happen on my end.

Python 3.8.3 (default, Jul  2 2020, 11:26:31)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from ogb.linkproppred import Evaluator
>>> import numpy as np
>>> import torch
>>> dataset = 'ogbl-citation2'
>>> evaluator = Evaluator(name=dataset)
>>> yp = np.random.randint(2, size=(100,))
>>> yn = np.random.randint(2, size=(100, 1000))
>>> print(evaluator.eval({"y_pred_pos": yp, "y_pred_neg": yn})['mrr_list'].mean())
0.48075244
>>> print(evaluator.eval({"y_pred_pos": torch.from_numpy(yp), "y_pred_neg": torch.from_numpy(yn)})['mrr_list'].mean())
tensor(0.4808)
VeritasYin commented 2 years ago

Hello, I tested again on two servers with the same code; the results still seem off.

Python 3.8.5 (default, Sep  4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ogb.linkproppred import Evaluator
Using backend: pytorch
RDFLib Version: 5.0.0
>>> import numpy as np
>>> import torch
>>> dataset = 'ogbl-citation2'
>>> evaluator = Evaluator(name=dataset)
>>> yp = np.random.randint(2, size=(100,))
>>> yn = np.random.randint(2, size=(100, 1000))
>>> print(evaluator.eval({"y_pred_pos": yp, "y_pred_neg": yn})['mrr_list'].mean())
0.5306459
>>> print(evaluator.eval({"y_pred_pos": torch.from_numpy(yp), "y_pred_neg": torch.from_numpy(yn)})['mrr_list'].mean())
tensor(0.0051)
>>> torch.__version__
'1.8.0'
>>> np.__version__
'1.21.2'
Python 3.8.12 | packaged by conda-forge | (default, Jan 30 2022, 23:42:07)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ogb.linkproppred import Evaluator
WARNING:root:The OGB package is out of date. Your version is 1.3.2, while the latest version is 1.3.3.
>>> import numpy as np
>>> import torch
>>> dataset = 'ogbl-citation2'
>>> evaluator = Evaluator(name=dataset)
>>> yp = np.random.randint(2, size=(100,))
>>> yn = np.random.randint(2, size=(100, 1000))
>>> print(evaluator.eval({"y_pred_pos": yp, "y_pred_neg": yn})['mrr_list'].mean())
0.4807206
>>> print(evaluator.eval({"y_pred_pos": torch.from_numpy(yp), "y_pred_neg": torch.from_numpy(yn)})['mrr_list'].mean())
tensor(0.0038)
>>> torch.__version__
'1.8.2'
>>> np.__version__
'1.21.2'
weihua916 commented 2 years ago

I checked my torch version. Can you try updating yours?

Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import torch
>>> torch.__version__
'1.10.1+cu102'
>>> np.__version__
'1.19.5'
VeritasYin commented 2 years ago

I updated the torch package, but nothing changed.

Python 3.8.5 (default, Sep  4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ogb.linkproppred import Evaluator
Using backend: pytorch
RDFLib Version: 5.0.0
>>> import numpy as np
>>> import torch
>>> dataset = 'ogbl-citation2'
>>> evaluator = Evaluator(name=dataset)
>>> yp = np.random.randint(2, size=(100,))
>>> yn = np.random.randint(2, size=(100, 1000))
>>> print(evaluator.eval({"y_pred_pos": yp, "y_pred_neg": yn})['mrr_list'].mean())
0.46074632
>>> print(evaluator.eval({"y_pred_pos": torch.from_numpy(yp), "y_pred_neg": torch.from_numpy(yn)})['mrr_list'].mean())
tensor(0.0038)
>>> torch.__version__
'1.10.1'
>>> np.__version__
'1.21.2'
rusty1s commented 2 years ago

I cannot reproduce this either, with torch==1.10.2 and numpy==1.21.2. Any chance you can debug the code on your own to find out where the change in behavior occurs?
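
One way to localize it (a sketch, assuming the divergence is in the ranking step rather than in input validation): feed the same integer data to both backends' argsort directly and compare the orderings, e.g.

import numpy as np
import torch

np.random.seed(0)
y = np.random.randint(2, size=(5, 10))  # 0/1 scores, so heavy ties

# Descending argsort on identical data via each backend.
order_np = np.argsort(-y, axis=1)
order_th = torch.argsort(torch.from_numpy(y), dim=1, descending=True)

print(order_np)
print(order_th.numpy())
print((order_np == order_th.numpy()).all())  # False would implicate the sort

If the orders disagree only where scores tie, the sort's tie-breaking is the culprit.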

weihua916 commented 2 years ago

Will close this for now. Let us know if the problem still persists.

VeritasYin commented 2 years ago

@weihua916 @rusty1s

Hello, I found the source of the difference between the torch and numpy inputs.

For the MRR metric, _eval_mrr() at ogb/linkproppred/evaluate.py(252) uses a different argsort for numpy arrays and torch tensors. However, the behaviors of torch.argsort and numpy.argsort are not consistent when the inputs are integers, as shown below:

y_pred (y_pred_pos, y_pred_neg)
array([[0, 0, 1, ..., 0, 1, 1],
       [1, 0, 0, ..., 0, 1, 1],
       [1, 1, 1, ..., 0, 0, 1],
       ...,
       [1, 1, 0, ..., 0, 0, 1],
       [0, 0, 1, ..., 0, 1, 0],
       [1, 0, 1, ..., 1, 0, 0]])

argsort_numpy
array([[ 500,  546,  547, ...,  594,  206,    0],
       [   0,  531,  532, ...,  643,  743,  330],
       [   0,  569,  570, ...,  596,  217,  500],
       ...,
       [   0,  549,  550, ...,  663,  433,  348],
       [ 330,  345,  344, ...,  448,  422, 1000],
       [   0,  394,  803, ...,  480,  462, 1000]])

argsort_torch
tensor([[562, 545, 546,  ..., 594, 205, 596],
        [558, 529, 531,  ..., 590, 591, 593],
        [580, 564, 565,  ..., 436, 651, 650],
        ...,
        [561, 542, 543,  ..., 664, 663, 662],
        [409, 792, 791,  ..., 449, 451, 452],
        [388, 812, 380,  ..., 477, 480, 481]])
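
This is consistent with tie-breaking: with only the values 0 and 1, each row is essentially two long runs of tied scores, and neither numpy.argsort nor torch.argsort guarantees a particular order within a tie (numpy's default quicksort is not stable, and torch's default sort is likewise unstable). A minimal demonstration (the index orders in the comments are examples and may vary by version and platform):

import numpy as np
import torch

scores = np.array([1, 0, 1, 0, 1])

# Both orderings are valid descending sorts; only the positions
# of tied elements differ, which is exactly what shifts the ranks.
print(np.argsort(-scores))                                       # e.g. [0 2 4 1 3]
print(torch.argsort(torch.from_numpy(scores), descending=True))  # ties may come out in another order

On recent torch versions, torch.sort(..., descending=True, stable=True) makes the ordering deterministic, but a deterministic tie order still ranks the positive candidate arbitrarily within its tie, so avoiding tied scores altogether is the safer fix.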
weihua916 commented 2 years ago

I see, interesting. We generally advise against assigning the same score to different candidate entities...
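
If a model can only produce discrete scores, one workaround (my own suggestion, not an official API) is to cast to float and add tiny random jitter so no two candidates tie; both backends then produce the same ranking, at the cost of measuring the rank under random tie-breaking rather than an optimistic one:

import numpy as np
from ogb.linkproppred import Evaluator

evaluator = Evaluator(name='ogbl-citation2')
rng = np.random.default_rng(0)

yp = rng.integers(2, size=(100,)).astype(np.float64)
yn = rng.integers(2, size=(100, 1000)).astype(np.float64)

# Jitter far smaller than the score gap (1.0) breaks ties without
# reordering genuinely different scores.
yp += rng.uniform(0.0, 1e-6, size=yp.shape)
yn += rng.uniform(0.0, 1e-6, size=yn.shape)

print(evaluator.eval({"y_pred_pos": yp, "y_pred_neg": yn})['mrr_list'].mean())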