snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Ogbg_molpcba Scaffold Index #252

Closed: JiuhaiChen closed this issue 2 years ago

JiuhaiChen commented 2 years ago

Hi, I am working on the ogbg_molpcba dataset and noticed that the scaffold index can be downloaded from https://snap.stanford.edu/ogb/data/misc/ogbg_molpcba/. How did you obtain this scaffold index? Was it taken from the source dataset or generated by some software? If the latter, could you tell me how to generate it? Thanks!

weihua916 commented 2 years ago

Hi! The code is here.
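For readers without access to the linked code, here is a minimal sketch of a generic scaffold split. It is not the OGB implementation; it only assumes the common procedure of grouping molecules by scaffold and filling train/valid/test with the largest groups first. The names `scaffold_split_sketch` and `scaffold_fn`, the 80/10/10 default fractions, and the toy data are all illustrative assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def scaffold_split_sketch(
    smiles_list: Sequence[str],
    scaffold_fn: Callable[[str], str],
    frac_train: float = 0.8,   # assumed default, not taken from the thread
    frac_valid: float = 0.1,   # assumed default, not taken from the thread
) -> Tuple[List[int], List[int], List[int]]:
    """Group molecule indices by scaffold, then fill the splits greedily."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, smiles in enumerate(smiles_list):
        groups[scaffold_fn(smiles)].append(idx)

    # Largest scaffold groups first; ties broken by the smallest member index.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(smiles_list)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train_idx: List[int] = []
    valid_idx: List[int] = []
    test_idx: List[int] = []
    for group in ordered:
        if len(train_idx) + len(group) <= train_cutoff:
            train_idx.extend(group)
        elif len(train_idx) + len(valid_idx) + len(group) <= valid_cutoff:
            valid_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, valid_idx, test_idx


# Toy stand-in for a real scaffold function: the first character of the string.
# With RDKit installed, one option would be
# rdkit.Chem.Scaffolds.MurckoScaffold.MurckoScaffoldSmiles(smiles=s).
toy_smiles = ["Ca", "Cb", "Cc", "Cd", "Ce", "Cf", "Cg", "Ch", "Ni", "Oj"]
train_idx, valid_idx, test_idx = scaffold_split_sketch(toy_smiles, lambda s: s[0])
```

Because whole scaffold groups are assigned to a single split, no scaffold appears in more than one of train, valid, and test, which is the point of this kind of split.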

JiuhaiChen commented 2 years ago

Thanks for your reply. Is the smiles_string stored in mol.csv.gz under the ./mapping folder? If I want to generate the scaffold index from the SMILES strings, is there anything else I need to do? I tried opening mol.csv.gz and calling the scaffold_split function, but the input format seems to be wrong. I only added the following to the code you sent:

    if __name__ == '__main__':
        with gzip.open("mol.csv.gz", "rb") as f:
            data = f.read()
        train_idx, valid_index, test_idx = scaffold_split(list(data))

The error message:

    mol = Chem.MolFromSmiles(smiles)
    TypeError: No registered converter was able to produce a C++ rvalue of type std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type int

Thanks!

weihua916 commented 2 years ago

Yes, that's correct. You can read mol.csv.gz with:

import pandas as pd
df = pd.read_csv('mol.csv.gz')
smiles_list = df['smiles'].tolist()

More details can be found in mapping/README.md. Hope this helps!
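As background on the TypeError reported earlier in the thread: reading the gzip file in binary mode returns a bytes object, and calling list() on bytes yields one integer per byte, so each "SMILES" passed on was actually an int. The sketch below reproduces this with a tiny in-memory stand-in for mol.csv.gz, using only the standard library. The file contents are made up for illustration; only the smiles column name comes from the snippet above.

```python
import csv
import gzip
import io

# Hypothetical two-molecule stand-in for mol.csv.gz, built in memory.
raw_csv = "smiles\nCCO\nc1ccccc1\n"
gz_bytes = gzip.compress(raw_csv.encode("utf-8"))

# Binary mode returns bytes; list(bytes) gives one integer per byte.
with gzip.open(io.BytesIO(gz_bytes), "rb") as f:
    data = f.read()
as_ints = list(data)

# Text mode plus a CSV parser gives one SMILES string per row instead.
with gzip.open(io.BytesIO(gz_bytes), "rt", newline="") as f:
    smiles_list = [row["smiles"] for row in csv.DictReader(f)]

print(type(as_ints[0]).__name__)   # int
print(smiles_list)                 # ['CCO', 'c1ccccc1']
```

The pandas call above achieves the same result as the second half of this sketch, since it parses the CSV and returns the smiles column as a list of strings.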

JiuhaiChen commented 2 years ago

Thanks! For the ogbg_ppa dataset, is a species index included in the dataset, as it is for ogbg_proteins and ogbl_ppa?

weihua916 commented 2 years ago

Yes, the species index for ogbg-ppa should be in the corresponding mapping/ directory. See mapping/README.md for details.

mapping/ is most likely located at dataset/ogbg_ppa/mapping.

JiuhaiChen commented 2 years ago

Hi OGB team, for ogbg_proteins and ogbl_ppa, how do you encode the species index into the node features? Do you simply append the species index to each node feature? Since each graph belongs to only one species domain, do you encode the same species index into all node features within that graph? And for ogbg-ppa, ogbg-molhiv, and ogbg-molpcba, do you encode the species index into the node or edge features? Thanks!