snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

BIO KG Dataset has mislabels? #210

Closed MikeDoes closed 3 years ago

MikeDoes commented 3 years ago

It appears that the same index values have multiple different types.

For example, take head index value 0 and search the training set for every triple with that head index.

The head types of those triples have the following distribution: function 23, protein 2, disease 1, drug 1.

Could you confirm that this is indeed an error in the dataset?

See the Jupyter notebook here: https://colab.research.google.com/drive/1pUWrZVLve4Ohc3w3ZmsPYAIy55_T4NIC?usp=sharing
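
The check described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the code from the notebook: the `train` dictionary and its values are made up, and the field names `head` and `head_type` are an assumption about how the training split is laid out.

```python
from collections import Counter

# Toy stand-in for the training split (hypothetical values).
# The keys `head` and `head_type` are assumed field names.
train = {
    "head":      [0, 0, 0, 1, 0, 2],
    "head_type": ["function", "function", "protein", "drug", "drug", "disease"],
}

def head_type_counts(split, index):
    """Count the head types of all triples whose head index equals `index`."""
    return Counter(
        node_type
        for head, node_type in zip(split["head"], split["head_type"])
        if head == index
    )

counts = head_type_counts(train, 0)
print(counts)
```

Seeing more than one type for the same index is what prompted the question in this issue.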

weihua916 commented 3 years ago

Hi! This is not an error. We index each node type separately. In other words, the 0-th drug is a different entity from the 0-th function.

You can think of the pair (node_type, idx) as specifying a distinct entity in the biomedical KG.
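
As a rough sketch of that idea, the pair can be used directly as a key, or mapped to a single global id by giving each node type its own offset. The per-type node counts below are made up for illustration, and `build_offsets` and `to_global_id` are hypothetical helpers, not part of the ogb package.

```python
# Hypothetical per-type node counts (not the real dataset sizes).
num_nodes = {"disease": 3, "drug": 4, "function": 5, "protein": 2}

def build_offsets(num_nodes):
    """Assign each node type a starting offset, in sorted type order."""
    offsets, start = {}, 0
    for node_type in sorted(num_nodes):
        offsets[node_type] = start
        start += num_nodes[node_type]
    return offsets

def to_global_id(node_type, idx, offsets):
    """Map a (node_type, idx) pair to one id that is unique across all types."""
    return offsets[node_type] + idx

offsets = build_offsets(num_nodes)

# The same local index under different types refers to different entities.
entities = {("drug", 0), ("function", 0)}
global_ids = {to_global_id(t, i, offsets) for t, i in entities}
print(offsets, global_ids)
```

With this mapping, the 0-th drug and the 0-th function end up with different global ids, so counting types per raw index no longer mixes unrelated entities.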

MikeDoes commented 3 years ago

OK, thanks, now it's much clearer.