snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License
1.89k stars 398 forks

Question about ogbn-proteins dataset #104

Closed. ritatsousa closed this issue 3 years ago

ritatsousa commented 3 years ago

Hi, according to the description of the ogbn-proteins dataset, the 112 kinds of labels correspond to Gene Ontology (GO) functions. Why were these 112 labels selected from among all the functions defined in the GO? Thanks in advance!

weihua916 commented 3 years ago

Hi! This is a great question. We consider GO functions that are fine-grained (leaves of the GO hierarchy) and have enough positive labels to train models on. That's how we arrived at the 112 kinds of labels.

ritatsousa commented 3 years ago

Thanks for your prompt response!
However, isn't GO:0003674, a root of the GO hierarchy, one of the kinds of labels? I would also like to take the opportunity to ask how many positive labels you considered to be enough.

weihua916 commented 3 years ago

You are welcome! I double-checked my code, and you are right. Let me correct my previous response. The following is what we did:

We first get a list of GOs that are immediate parents of the leaf GOs. Out of those GOs, we extract the ones that appear more than 200 times in each of the train/valid/test splits. This gives us the 112 GOs in our dataset.
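
For readers who want to see the two steps concretely, here is a minimal Python sketch of the procedure described above. It is not the authors' code. The function name `select_go_labels`, the input layout (a `parents` map from each term to its immediate parents, an `annotations` map from each protein to its GO terms, and a `splits` map from split name to proteins), and the toy data are all assumptions made for illustration. How annotations are counted (for example, whether they are propagated to ancestor terms) is not stated in the thread, so the sketch simply counts the terms it is given.

```python
from collections import Counter


def select_go_labels(parents, annotations, splits, min_count=200):
    """Hypothetical helper: pick GO terms that are immediate parents of
    leaf terms and appear more than `min_count` times in every split."""
    # Every term mentioned either as a child or as a parent.
    all_terms = set(parents) | {p for ps in parents.values() for p in ps}
    # A leaf is a term that is nobody's parent.
    has_child = {p for ps in parents.values() for p in ps}
    leaves = all_terms - has_child

    # Step 1: immediate parents of the leaves.
    candidates = set()
    for leaf in leaves:
        candidates |= set(parents.get(leaf, ()))

    # Step 2: keep candidates seen more than `min_count` times in each split.
    selected = set(candidates)
    for proteins in splits.values():
        counts = Counter(t for p in proteins for t in annotations.get(p, ()))
        selected = {t for t in selected if counts[t] > min_count}
    return sorted(selected)


# Toy hierarchy (made-up names, not real GO identifiers).
parents = {
    "leaf_a": {"mid_x"},
    "leaf_b": {"mid_x", "mid_y"},
    "leaf_c": {"root"},  # in this toy data a leaf hangs directly off the root
    "mid_x": {"root"},
    "mid_y": {"root"},
}
annotations = {
    "p1": {"mid_x", "root"},
    "p2": {"mid_x", "root"},
    "p3": {"mid_x", "mid_y", "root"},
    "p4": {"mid_x", "root"},
    "p5": {"mid_x", "root"},
    "p6": {"mid_x", "root"},
}
splits = {"train": ["p1", "p2"], "valid": ["p3", "p4"], "test": ["p5", "p6"]}

# A tiny threshold is used so the toy data can pass it; the thread uses 200.
result = select_go_labels(parents, annotations, splits, min_count=1)
print(result)
```

In this toy data, `mid_y` is dropped because it never appears in the train or test split, while `root` is kept because it is the immediate parent of `leaf_c` and is frequent enough in every split.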