About node and label information of ogbn-proteins

snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning

https://ogb.stanford.edu

MIT License


About node and label information of ogbn-proteins #261

Status: Closed (MengLiuPurdue closed 2 years ago)

MengLiuPurdue commented 2 years ago

Hi authors,

Is there a way to find out the specific protein name of each node and the specific protein function name of each label?

Thanks!

weihua916 commented 2 years ago

Yes, you can!

Go to ${root}/ogbn_proteins/mapping, and read README.md there.
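A minimal sketch for inspecting that folder, assuming the mapping files are gzip-compressed CSVs with a header row and that `root` is the default `dataset` directory. `read_mapping` is a hypothetical helper written for this illustration, not part of ogb; the README.md in the folder remains the authoritative description of each file.

```python
import csv
import gzip
from pathlib import Path


def read_mapping(path):
    """Read a gzip-compressed CSV and return its rows as dicts keyed by the header."""
    with gzip.open(path, mode="rt", newline="") as handle:
        return list(csv.DictReader(handle))


# Assumption: the dataset was downloaded with the default root="dataset".
mapping_dir = Path("dataset") / "ogbn_proteins" / "mapping"

# Only runs if the dataset has already been downloaded to that location.
if mapping_dir.is_dir():
    for path in sorted(mapping_dir.glob("*.csv.gz")):
        rows = read_mapping(path)
        # Show the file name, the number of rows, and the first two rows.
        print(path.name, len(rows), rows[:2])
```

If the files turn out to have a different layout, adjust the reader to match what the README describes.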

MengLiuPurdue commented 2 years ago

Found it. Thanks!

Sutongtong233 commented 1 year ago

In mapping/README.md, it says the edge feature dimension is 8. However, this is a node classification task, and when I load the dataset, data.x has shape [N, 8]. Is there a mistake? Shouldn't it be a node feature rather than an edge feature?

weihua916 commented 1 year ago

It is correct. See below.

Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ogb.nodeproppred import PygNodePropPredDataset
>>> dataset = PygNodePropPredDataset(name = 'ogbn-proteins')
>>> dataset[0]
Data(edge_index=[2, 79122504], edge_attr=[79122504, 8], node_species=[132534, 1], y=[132534, 112])

My guess is that you apply this line to get data.x.
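The link behind "this line" did not survive here, so the exact line is not known. For illustration only, here is one way a data.x of shape [N, 8] can arise from a dataset that only ships edge_attr: averaging the features of each node's outgoing edges. This is a NumPy sketch on a toy graph, with a hypothetical helper name, and it is not necessarily what the linked line does.

```python
import numpy as np


def mean_edge_features_per_node(edge_index, edge_attr, num_nodes):
    """Average the features of all edges whose source is a given node."""
    src = edge_index[0]
    sums = np.zeros((num_nodes, edge_attr.shape[1]), dtype=np.float64)
    # Unbuffered scatter-add: repeated source indices accumulate correctly.
    np.add.at(sums, src, edge_attr)
    counts = np.bincount(src, minlength=num_nodes)
    # Nodes without outgoing edges keep a zero vector instead of dividing by 0.
    counts = np.maximum(counts, 1)
    return sums / counts[:, None]


# Toy graph: 3 nodes, 4 directed edges, 2-dimensional edge features.
edge_index = np.array([[0, 0, 1, 2],
                       [1, 2, 0, 0]])
edge_attr = np.array([[1.0, 0.0],
                      [3.0, 2.0],
                      [5.0, 5.0],
                      [0.0, 4.0]])

x = mean_edge_features_per_node(edge_index, edge_attr, num_nodes=3)
print(x.shape)  # one row per node, one column per edge feature
```

With 8-dimensional edge features, the same aggregation yields an [N, 8] node feature matrix, which would explain the shape reported above.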

Sutongtong233 commented 1 year ago

Thanks a lot! I have another question about protein datasets. I have seen another protein dataset called PPI, with 50-dimensional node features processed from C1, C3, and C7 of the GSEA database. I am confused about the data pre-processing and tried to contact the contributor of GraphSAGE, which first used this PPI dataset, but did not receive a reply. Do you have any information about this? Thanks in advance.