Closed MengLiuPurdue closed 3 years ago
Hi!
This is a great observation. It is a consequence of the species split: I was not expecting such a huge difference in positive label ratios, though. To give you more context, below is the positive label ratio for each species, which does vary significantly across species:
from ogb.nodeproppred import PygNodePropPredDataset
import torch

dataset = PygNodePropPredDataset('ogbn-proteins')
data = dataset[0]
unique_species = torch.unique(data.node_species)
for species_id in unique_species:
    print('species id: ', int(species_id))
    node_idx = torch.nonzero(data.node_species == species_id)[:, 0]
    print(float(data.y[node_idx].sum() / (len(node_idx) * 112)))  # 112 binary tasks
    print()
Output
species id: 3702
0.0976378545165062
species id: 4932
0.1544351875782013
species id: 6239
0.06898193806409836
species id: 7227
0.12101421505212784
species id: 7955 (test species)
0.022200047969818115
species id: 9606
0.21192370355129242
species id: 10090 (validation species)
0.1685332953929901
species id: 511145
0.1294090896844864
The good news is that our evaluation metric is ROC-AUC, which should be pretty robust to the difference in positive label ratio.
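One way to see why ROC-AUC is insensitive to the positive label ratio: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative, so it depends only on the ranking of scores, not on how many positives there are. A minimal self-contained sketch (not from the OGB codebase; the pair-counting AUC below is for illustration only):

```python
import random

def roc_auc(labels, scores):
    # AUC = P(score of a random positive > score of a random negative),
    # with ties counted as 0.5 -- a pure ranking quantity.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)
pos_scores = [random.gauss(1.0, 1.0) for _ in range(50)]
neg_scores = [random.gauss(0.0, 1.0) for _ in range(50)]

# Balanced split: 50% positives.
auc_balanced = roc_auc([1] * 50 + [0] * 50, pos_scores + neg_scores)
# Heavily imbalanced split: replicate the negatives so positives are ~9%.
auc_skewed = roc_auc([1] * 50 + [0] * 500, pos_scores + neg_scores * 10)

print(auc_balanced, auc_skewed)  # the two AUC values coincide
```

Metrics like accuracy or average precision would shift substantially under the same change in prevalence, which is why a split-dependent positive ratio matters less under ROC-AUC.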
I see. Thanks!
Hi authors,
May I ask why the label distribution of the test set is so different? For example, for label 0:
split_idx = dataset.get_idx_split()
train_nodes = split_idx['train'].numpy()
valid_nodes = split_idx['valid'].numpy()
test_nodes = split_idx['test'].numpy()
print(data.y[train_nodes,0].sum()/len(train_nodes))
print(data.y[valid_nodes,0].sum()/len(valid_nodes))
print(data.y[test_nodes,0].sum()/len(test_nodes))
The results are 0.6287, 0.6269, and 0.0847, respectively. Is this intentional? Thanks!