snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

add reorder atom in mol.py #445

Closed v-shaoningli closed 1 year ago

v-shaoningli commented 1 year ago

Add a function that reorders atoms so that the atoms read from the SDF file correspond one-to-one with the atoms of the PCQM4Mv2 graphs. The code is originally from ViSNet. We can check the ordering with

from rdkit import Chem
from ogb.utils import ReorderAtoms

suppl = Chem.SDMolSupplier('pcqm4m-v2-train.sdf')
mol = suppl[0]
mol = ReorderAtoms(mol)
# Collect the atomic number of every atom in the reordered molecule.
atomic_number = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
print(atomic_number)
# >>> [6, 6, 6, 6, 8, 6, 6, 6, 6, 6, 8, 8, 6, 6, 6, 6, 7]

and compare it with the graph built from SMILES:

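from functools import partial

from ogb.lsc import PygPCQM4Mv2Dataset  # PyG variant, whose items expose `.x`
from ogb.utils import smiles2graph

ROOT = 'dataset'  # placeholder: directory where the dataset is stored

dataset = PygPCQM4Mv2Dataset(root=ROOT, smiles2graph=partial(smiles2graph, removeHs=True))
# Feature column 0 stores the atomic number minus 1.
print(dataset[0].x[:, 0] + 1)  # atomic number
# >>> [6, 6, 6, 6, 8, 6, 6, 6, 6, 6, 8, 8, 6, 6, 6, 6, 7]
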
weihua916 commented 1 year ago

Thanks for the PR!

Just to confirm: this does not change the identity of the molecule, right? It only changes the atom ordering.
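
To illustrate why a pure reordering leaves the molecule unchanged, here is a minimal sketch on a toy graph. It is not the code from this PR: `renumber` is a hypothetical helper, the molecule (the heavy atoms of acetamide) and the permutation are made up, and plain Python lists stand in for RDKit objects. The assumed convention is that `order[k]` is the old index of the atom that becomes new atom `k`.

```python
def renumber(atomic_numbers, bonds, order):
    """Relabel atoms so that new atom k is old atom order[k].

    Hypothetical helper for illustration only; not part of ogb or RDKit.
    """
    # Map each old index to its new index.
    new_index = {old: new for new, old in enumerate(order)}
    new_atoms = [atomic_numbers[old] for old in order]
    # Bonds are remapped with the same permutation, so connectivity is kept.
    new_bonds = {frozenset((new_index[a], new_index[b])) for a, b in bonds}
    return new_atoms, new_bonds


# Toy molecule: C-C(-O)-N, with atom 1 as the central carbon.
atoms = [6, 6, 8, 7]
bonds = {frozenset((0, 1)), frozenset((1, 2)), frozenset((1, 3))}
order = [3, 0, 2, 1]  # made-up permutation

new_atoms, new_bonds = renumber(atoms, bonds, order)

# Build the inverse permutation: inverse[old] = new.
inverse = [0] * len(order)
for new, old in enumerate(order):
    inverse[old] = new

# Applying the inverse permutation recovers the original labelling.
restored_atoms, restored_bonds = renumber(new_atoms, new_bonds, inverse)
print(new_atoms)
print(restored_atoms == atoms and restored_bonds == bonds)
```

Because the relabelling is invertible and the bonds follow the atoms, the element counts and the connectivity are the same before and after; only the indices differ.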

v-shaoningli commented 1 year ago

Right. We need the order computed inside the ReorderAtoms function to re-index the atom positions read from the .sdf file, so the function should return it as well. For instance:

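import torch

# `suppl` is the SDMolSupplier created in the first snippet.
i = 0
mol = suppl[i]
mol, order = ReorderAtoms(mol)  # also returns the atom permutation
order = torch.tensor(order).long()
N = mol.GetNumAtoms()
# In the SDF text, the atom block follows the 3 header lines and the counts line.
pos = suppl.GetItemText(i).split('\n')[4:4 + N]
# The first three fields of each atom line are the x, y, z coordinates.
pos = [[float(x) for x in line.split()[:3]] for line in pos]
pos = torch.tensor(pos)[order]  # positions in the reordered atom order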
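
The re-indexing step above can be tried without RDKit, torch, or the dataset. The following is a rough sketch under stated assumptions: the mol block is a hand-written toy in the V2000 layout (three header lines, one counts line, then one line per atom), the permutation is made up, and `read_positions` and `reorder` are hypothetical helpers rather than ogb functions.

```python
def read_positions(mol_block, num_atoms):
    """Read x, y, z from the atom block of a V2000 mol block.

    Assumes three header lines and one counts line precede the atom block,
    and that the coordinates are separated by whitespace.
    """
    lines = mol_block.split('\n')[4:4 + num_atoms]
    return [[float(v) for v in line.split()[:3]] for line in lines]


def reorder(rows, order):
    """Return rows permuted so that new row k is old row order[k]."""
    return [rows[old] for old in order]


# Hand-written toy mol block (hypothetical data, not taken from PCQM4Mv2).
MOL_BLOCK = '\n'.join([
    'toy molecule',
    '  hand-written',
    '',
    '  3  2  0  0  0  0  0  0  0  0999 V2000',
    '    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0',
    '    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0',
    '    2.2000    1.2000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0',
    '  1  2  1  0',
    '  2  3  1  0',
    'M  END',
])

positions = read_positions(MOL_BLOCK, 3)
order = [2, 0, 1]  # made-up permutation: new atom 0 is old atom 2, and so on
reordered = reorder(positions, order)
print(reordered)
```

Indexing a tensor with `order`, as in the snippet above, follows the same convention: row `k` of the result is row `order[k]` of the input. Splitting on whitespace is a simplification, since V2000 coordinates are fixed-width fields and can run together for large values.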

I will open a new PR to fix this.