snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Ratio of positive labels in test set is very different from training or validation set in proteins function dataset #262

Closed MengLiuPurdue closed 3 years ago

MengLiuPurdue commented 3 years ago

Hi authors,

May I ask why the label distribution of the test set is so different from the training and validation sets? For example, for label 0:

from ogb.nodeproppred import PygNodePropPredDataset

dataset = PygNodePropPredDataset('ogbn-proteins')
data = dataset[0]

split_idx = dataset.get_idx_split()
train_nodes = split_idx['train'].numpy()
valid_nodes = split_idx['valid'].numpy()
test_nodes = split_idx['test'].numpy()

# fraction of positive labels for task 0 in each split
print(data.y[train_nodes, 0].sum() / len(train_nodes))
print(data.y[valid_nodes, 0].sum() / len(valid_nodes))
print(data.y[test_nodes, 0].sum() / len(test_nodes))

The results are 0.6287, 0.6269, and 0.0847 respectively. Is this intentional?

Thanks!

weihua916 commented 3 years ago

Hi!

Good catch! This is because of the species split, although I was not expecting such a large difference in the positive label ratio. To give you more context, below is the positive label ratio for each species, which does vary significantly across species:

from ogb.nodeproppred import PygNodePropPredDataset
import torch

dataset = PygNodePropPredDataset('ogbn-proteins')
data = dataset[0]

unique_species = torch.unique(data.node_species)
for species_id in unique_species:
    print('species id: ', int(species_id))
    # nodes belonging to this species
    node_idx = torch.nonzero(data.node_species == species_id)[:, 0]
    # fraction of positive labels across all 112 binary tasks
    print(float(data.y[node_idx].sum() / (len(node_idx) * 112)))
    print()

Output

species id:  3702
0.0976378545165062

species id:  4932
0.1544351875782013

species id:  6239
0.06898193806409836

species id:  7227
0.12101421505212784

species id:  7955 (test species)
0.022200047969818115

species id:  9606
0.21192370355129242

species id:  10090 (validation species)
0.1685332953929901

species id:  511145
0.1294090896844864
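
To see the correspondence between the splits and the species directly, here is a minimal sketch that counts how many nodes of each species fall into each split (using the same get_idx_split() call as in the question); the validation and test splits should consist entirely of species 10090 and 7955, respectively:

import torch
from ogb.nodeproppred import PygNodePropPredDataset

dataset = PygNodePropPredDataset('ogbn-proteins')
data = dataset[0]

split_idx = dataset.get_idx_split()
for split_name, node_idx in split_idx.items():
    species, counts = torch.unique(data.node_species[node_idx], return_counts=True)
    # species id -> number of nodes of that species in this split
    print(split_name, {int(s): int(c) for s, c in zip(species.tolist(), counts.tolist())})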

weihua916 commented 3 years ago

The good news is that our evaluation metric is ROC-AUC, which should be pretty robust to the difference in positive label ratio.
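
As a reference point, here is a minimal sketch of how the score is computed with the standard OGB Evaluator; the predictions below are just random placeholders standing in for real model scores:

import torch
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator

dataset = PygNodePropPredDataset('ogbn-proteins')
data = dataset[0]
test_idx = dataset.get_idx_split()['test']

evaluator = Evaluator(name='ogbn-proteins')
y_pred = torch.rand(len(test_idx), 112)  # placeholder scores, one column per task
result = evaluator.eval({
    'y_true': data.y[test_idx],  # (num_test_nodes, 112) binary labels
    'y_pred': y_pred,            # (num_test_nodes, 112) continuous scores
})
print(result['rocauc'])  # average ROC-AUC over the prediction tasks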

MengLiuPurdue commented 3 years ago

> The good news is that our evaluation metric is ROC-AUC, which should be pretty robust to the difference in positive label ratio.

I see. Thanks!