fmocking closed this issue 2 years ago
We have provided the SMILES strings in mapping/.
Thank you for the prompt response. The problem with having them in mapping/ is that, since there is no identifier for each sample, there is no way to find the corresponding items in the mapping from within pre_transform. Instead, the whole process function needs to be modified so that the mapping element can be passed to pre_transform together with data. Anyone who uses this workaround will need to update the process function manually for upcoming versions.
So my suggestion is to make mapping information available to pre_transform natively.
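Until something like that exists, one possible stopgap is a stateful callable that counts its own invocations. This is only a sketch: it assumes pre_transform is called exactly once per sample, in dataset order, which should be checked against the process function of the installed version. IndexedPreTransform and the fn(data, smiles) signature are hypothetical names made up for illustration, not part of ogb.

```python
class IndexedPreTransform:
    """Hypothetical helper: hands the i-th SMILES string to the i-th call.

    Assumption: the dataset calls pre_transform once per sample, in order.
    """

    def __init__(self, smiles, fn):
        self.smiles = list(smiles)  # SMILES strings in dataset order
        self.fn = fn                # user function: fn(data, smiles) -> data
        self.calls = 0              # number of samples processed so far

    def __call__(self, data):
        if self.calls >= len(self.smiles):
            raise IndexError("pre_transform called more often than there are SMILES")
        result = self.fn(data, self.smiles[self.calls])
        self.calls += 1
        return result
```

An instance would be passed where a plain pre_transform function normally goes. If the call order ever changes, the pairing silently breaks, so this does not replace native support.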
The reason we did not provide SMILES in our dataset object is two-fold: (1) to standardize the molecular graph representation for the graph learning community (not necessarily chemistry experts), and (2) to keep our package independent of rdkit.
The i-th molecule in data_list is the i-th molecule in the file under mapping.
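The index correspondence can be used roughly as follows. This is a minimal sketch using only the standard library. It assumes the mapping file is a gzipped CSV with a column named smiles. Both the file path and the column name should be checked against the downloaded dataset. load_smiles and attach_smiles are illustrative helper names, not ogb functions.

```python
import csv
import gzip


def load_smiles(mapping_csv_gz, column="smiles"):
    """Read one column from a gzipped CSV, preserving row order."""
    with gzip.open(mapping_csv_gz, mode="rt", newline="") as handle:
        return [row[column] for row in csv.DictReader(handle)]


def attach_smiles(data_list, smiles):
    """Pair the i-th SMILES string with the i-th graph object."""
    if len(data_list) != len(smiles):
        raise ValueError("data_list and mapping file have different lengths")
    return list(zip(smiles, data_list))
```

The length check is there because a mismatch between the two would otherwise produce wrong pairs without any error.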
For the PCQM4Mv2 dataset, our dataset object does provide the SMILES strings.
from ogb.lsc import PCQM4Mv2Dataset
dataset = PCQM4Mv2Dataset(root = ROOT, only_smiles = True)
# get i-th molecule and its target value (nan for test data)
i = 1234
print(dataset[i]) # ('CC(NCC[C@H]([C@@H]1CCC(=CC1)C)C)C', 6.811009678015001)
Hi,
I noticed that in some datasets it is not possible to modify the data based on the SMILES. It would be great to have the smiles2graph parameter in ogbg-molhiv (and the 10 smaller datasets), just like in PCQM4MDataset.
Thanks,