How to correctly parse the SMILES of the PubChem dataset?

raimis commented 2 years ago

The SMILES of the PubChem dataset are generated with OpenFF-Toolkit (https://github.com/openmm/spice-dataset/blob/main/pubchem/createPubchem.py). So, Molecule from OpenFF-Toolkit should be able to read them correctly, but that isn't the case.

Get a SMILES:

import h5py

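# Open the file read-only and take the first SMILES entry of one record.
h5 = h5py.File('pubchem/pubchem-1-2500.hdf5', 'r')
smiles = h5['103914790']['smiles'][0]  # h5py returns this as bytes
print(smiles)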

b'[N:1]1=[C:2]2[N:3]([C:5]([H:17])([H:18])[C:4]1([H:15])[H:16])[C:12]1([H:30])[C:8]([H:23])([H:24])[C:13]3([H:31])[C:6]([H:19])([H:20])[C:11]2([H:29])[C:7]([H:21])([H:22])[C:14]([H:32])([C:9]1([H:25])[H:26])[C:10]3([H:27])[H:28]'

Parse the SMILES and print elements:

from openff.toolkit.topology import Molecule

mol = Molecule.from_smiles(smiles, hydrogens_are_explicit=True, allow_undefined_stereo=True)
print([atom.element.symbol for atom in mol.atoms])

['N', 'C', 'N', 'C', 'H', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'H']

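Although the SMILES contains explicit hydrogens and atom map indices, the atom order doesn't match: e.g. the atom with map index 5 in the SMILES is C, but the 5th atom in the molecule is H. The parsed molecule follows the order in which atoms are written in the string, not the map indices.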
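To make the two orderings concrete, here is a minimal sketch that uses only the standard library. It assumes every atom is written as [Element:index], which holds for the SMILES above but not for SMILES in general; the regular expression and the names atoms and symbol_by_map are illustrative, not part of OpenFF-Toolkit.

```python
import re

# Mapped SMILES from above, as text (h5py returns it as bytes).
smiles = (
    "[N:1]1=[C:2]2[N:3]([C:5]([H:17])([H:18])[C:4]1([H:15])[H:16])"
    "[C:12]1([H:30])[C:8]([H:23])([H:24])[C:13]3([H:31])[C:6]([H:19])([H:20])"
    "[C:11]2([H:29])[C:7]([H:21])([H:22])[C:14]([H:32])([C:9]1([H:25])[H:26])"
    "[C:10]3([H:27])[H:28]"
)

# Assumption: each atom has the form [Element:map_index]; ring-closure
# digits and bond symbols outside the brackets are ignored.
atoms = [(sym, int(idx)) for sym, idx in re.findall(r"\[([A-Z][a-z]?):(\d+)\]", smiles)]

# Look up the element by map index instead of by position in the string.
symbol_by_map = {idx: sym for sym, idx in atoms}

print(atoms[4])          # 5th atom as written in the string
print(symbol_by_map[5])  # atom whose map index is 5
```

The first print shows ('H', 17) and the second shows C, which is the mismatch described above: position 5 in the string and map index 5 refer to different atoms.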

pavankum commented 2 years ago

Hi @raimis, please use Molecule.from_mapped_smiles(), which retains the atom mapping: the atoms of the resulting molecule are ordered by their map indices.

mol = Molecule.from_mapped_smiles(smiles, allow_undefined_stereo=True)
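One way to sanity-check the resulting order without relying on the toolkit is to compare the element symbols of the parsed molecule against the symbols sorted by map index. The sketch below is only that, a sketch: symbols_by_map_index and order_matches are hypothetical helper names, the regular expression assumes every atom is written as [Element:index], and the methanol string is a made-up example rather than a dataset entry.

```python
import re

def symbols_by_map_index(mapped_smiles):
    """Return element symbols ordered by atom map index (1, 2, 3, ...)."""
    if isinstance(mapped_smiles, bytes):  # values read with h5py are bytes
        mapped_smiles = mapped_smiles.decode()
    # Assumption: every atom has the form [Element:map_index].
    pairs = re.findall(r"\[([A-Z][a-z]?):(\d+)\]", mapped_smiles)
    return [sym for sym, idx in sorted(pairs, key=lambda p: int(p[1]))]

def order_matches(mol_symbols, mapped_smiles):
    """True if the molecule's symbols follow the map-index order."""
    return list(mol_symbols) == symbols_by_map_index(mapped_smiles)

# Made-up mapped SMILES for methanol, written out of map order.
example = b"[H:3][C:1]([H:4])([H:5])[O:2][H:6]"
print(symbols_by_map_index(example))

# With a parsed molecule (requires openff-toolkit), the check would be:
# symbols = [atom.element.symbol for atom in mol.atoms]
# print(order_matches(symbols, smiles))
```

For the molecule parsed with from_mapped_smiles the check should return True; for the one parsed with from_smiles it should return False, since that one follows the order of appearance in the string.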

raimis commented 2 years ago

@pavankum thanks! I hadn't noticed that in the documentation.

pavankum commented 2 years ago

Thank you for the feedback, we will make sure to update the documentation.
