[Suggestion] Including SMILES strings for the datasets

snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning

https://ogb.stanford.edu

MIT License

[Suggestion] Including SMILES strings for the datasets #325

Status: Closed (fmocking closed this issue 2 years ago)

fmocking commented 2 years ago

Hi,

I noticed that for some datasets it is not possible to modify the data based on the SMILES strings. It would be great to have a smiles2graph parameter in ogbg-molhiv (and the 10 smaller datasets), just as in PCQM4MDataset.

Thanks,

weihua916 commented 2 years ago

We have provided the SMILES strings in mapping/.

fmocking commented 2 years ago

Thank you for the prompt response. The problem with having them in mapping/ is that the samples carry no identifier, so pre_transform has no way to look up the corresponding entry in the mapping. Instead, the whole process function has to be modified so that the mapping entry is passed to pre_transform together with the data. Anyone who takes that route will have to re-apply the change to the process function by hand for every upcoming version.

So my suggestion is to make mapping information available to pre_transform natively.

weihua916 commented 2 years ago

We did not provide SMILES in our dataset object for two reasons: (1) to standardize the molecular graph representation for the graph learning community (not necessarily chemistry experts), and (2) to keep our package from depending on rdkit.

The i-th molecule in data_list is the i-th molecule in the file under mapping/.
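The index alignment described above can be used to recover SMILES strings without touching the process function. The following is only a minimal sketch under stated assumptions: the mapping file is assumed to be a gzipped CSV with a `smiles` column, and `pre_transform` is assumed to be called exactly once per graph, in dataset order. The helper names `read_smiles`, `pair_with_smiles`, and `SmilesAttacher` are hypothetical and not part of ogb.

```python
import csv
import gzip


def read_smiles(mapping_path, column="smiles"):
    """Read one column from a gzipped CSV mapping file, preserving row order.

    The column name "smiles" is an assumption about the file layout.
    """
    with gzip.open(mapping_path, mode="rt", newline="") as handle:
        return [row[column] for row in csv.DictReader(handle)]


def pair_with_smiles(data_list, smiles_list):
    """Pair the i-th graph with the i-th SMILES string (alignment by position)."""
    if len(data_list) != len(smiles_list):
        raise ValueError(
            f"length mismatch: {len(data_list)} graphs vs {len(smiles_list)} SMILES"
        )
    return list(zip(data_list, smiles_list))


class SmilesAttacher:
    """Hypothetical pre_transform that attaches SMILES by call order.

    Only valid if the transform is applied once per graph, in dataset order.
    """

    def __init__(self, smiles_list):
        self.smiles_list = smiles_list
        self.next_index = 0  # position of the next graph to be processed

    def __call__(self, data):
        data.smiles = self.smiles_list[self.next_index]
        self.next_index += 1
        return data
```

With something like `smiles = read_smiles(path_to_mapping_csv)`, where the path is whatever file sits under mapping/ for the dataset in question, `SmilesAttacher(smiles)` could be passed as `pre_transform`. Because it relies on call order, the counter approach breaks silently if graphs are filtered or reordered before the transform runs, so `pair_with_smiles` on the finished dataset is the safer option.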

weihua916 commented 2 years ago

For the PCQM4Mv2 dataset, our dataset object does provide the SMILES strings.

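```python
from ogb.lsc import PCQM4Mv2Dataset

ROOT = "dataset/"  # placeholder path (an assumption); point this at your dataset directory

# only_smiles=True makes the dataset return (SMILES, target) tuples
dataset = PCQM4Mv2Dataset(root=ROOT, only_smiles=True)

# get the i-th molecule and its target value (nan for test data)
i = 1234
print(dataset[i])  # ('CC(NCC[C@H]([C@@H]1CCC(=CC1)C)C)C', 6.811009678015001)
```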