Preserving Atom Indexing with RDkit?

benf549 commented 5 months ago

Hi all, thanks so much for curating this dataset!

I'd like to use RDKit to load the SMILES string for each entry in the hdf5 file to get some basic atom-level features like formal-charge, hybridization, etc.. and I was wondering if anyone knows how to get RDKit to preserve the indices passed in through the smiles string or how you might go about converting the smiles string with atom indices to another format such that I can work backwards to the order of the features in SPICE once loaded with rdkit.

Pulling the smiles string from the arg entry:

mol = Chem.MolFromSmiles('[H:36][C@:16]([C:4](=[O:2])[N:7]([H:19])[C:12]([H:27])([H:28])[H:29])([C:15]([H:34])([H:35])[C:14]([H:32])([H:33])[C:13]([H:30])([H:31])[N:8]([H:20])[C:5](=[N+:10]([H:22])[H:23])[N:6]([H:17])[H:18])[N:9]([H:21])[C:3](=[O:1])[C:11]([H:24])([H:25])[H:26].[H:38][O:37][H:39].[H:41][O:40][H:42].[H:44][O:43][H:45].[H:47][O:46][H:48].[H:50][O:49][H:51].[H:53][O:52][H:54].[H:56][O:55][H:57].[H:59][O:58][H:60].[H:62][O:61][H:63].[H:65][O:64][H:66].[H:68][O:67][H:69].[H:71][O:70][H:72].[H:74][O:73][H:75].[H:77][O:76][H:78].[H:80][O:79][H:81].[H:83][O:82][H:84].[H:86][O:85][H:87].[H:89][O:88][H:90].[H:92][O:91][H:93].[H:95][O:94][H:96]' )

loads the molecule, but looping over the atoms results in a different atom order than the atomic numbers and of course rdkit drops all the hydrogens...

Order from HDF5 Entry: ['O', 'O', 'C', 'C', 'C', 'N', 'N', 'N', 'N', 'N', 'C', 'C', 'C', 'C', 'C', 'C', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H']
Order after RDkit load: ['C', 'C', 'O', 'N', 'C', 'C', 'C', 'C', 'N', 'C', 'N', 'N', 'N', 'C', 'O', 'C', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Any tips would be greatly appreciated!

EDIT: In case anyone runs into this in the future and doesn't want to install openff into a pre-existing environment, I ripped this out of the openff molecule parser from here:

print('Order from HDF5 Entry:', atoms_list)
rdmol = Chem.MolFromSmiles(smiles, sanitize=False)
assert rdmol is not None, "Unable to parse the SMILES string"

# strip the atom map from the molecule if it has one
# so we don't affect the sterochemistry tags
for atom in rdmol.GetAtoms():
    if atom.GetAtomMapNum() != 0:
        # set the map back to zero but hide the index in the atom prop data
        atom.SetProp("_map_idx", str(atom.GetAtomMapNum()))
        # set it back to zero
        atom.SetAtomMapNum(0)

# Chem.SanitizeMol calls updatePropertyCache so we don't need to call it ourselves
# https://www.rdkit.org/docs/cppapi/namespaceRDKit_1_1MolOps.html#a8d831787aaf2d65d9920c37b25b476f5
Chem.SanitizeMol(rdmol, Chem.SANITIZE_ALL ^ Chem.SANITIZE_ADJUSTHS ^ Chem.SANITIZE_SETAROMATICITY)
Chem.SetAromaticity(rdmol, Chem.AromaticityModel.AROMATICITY_MDL)

# Chem.MolFromSmiles adds bond directions (i.e. ENDDOWNRIGHT/ENDUPRIGHT), but
# doesn't set bond.GetStereo(). We need to call AssignStereochemistry for that.
Chem.AssignStereochemistry(rdmol)

rdkit_idx_to_spice_idx = {}
for atom_idx in range(rdmol.GetNumAtoms()):
    atom = rdmol.GetAtomWithIdx(atom_idx)
    assert atom.GetNumImplicitHs() == 0, "Expected no implicit hydrogens"
    rdkit_idx_to_spice_idx[atom_idx] = int(atom.GetProp("_map_idx"))

print('Order before remap', [x.GetSymbol() for x in rdmol.GetAtoms()])
print('Order after remap', [j[0] for j in sorted([(rdmol.GetAtomWithIdx(x[0]).GetSymbol(), x[1]) for x in rdkit_idx_to_spice_idx.items()], key=lambda x: x[1])])

Output:

Order from HDF5 Entry: ['O', 'O', 'C', 'C', 'C', 'N', 'N', 'N', 'N', 'N', 'C', 'C', 'C', 'C', 'C', 'C', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H']
Order before remap ['H', 'C', 'C', 'O', 'N', 'H', 'C', 'H', 'H', 'H', 'C', 'H', 'H', 'C', 'H', 'H', 'C', 'H', 'H', 'N', 'H', 'C', 'N', 'H', 'H', 'N', 'H', 'H', 'N', 'H', 'C', 'O', 'C', 'H', 'H', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H']
Order after remap ['O', 'O', 'C', 'C', 'C', 'N', 'N', 'N', 'N', 'N', 'C', 'C', 'C', 'C', 'C', 'C', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H', 'O', 'H', 'H']

peastman commented 5 months ago

The easiest way to do it is with OpenFF Toolkit.

from openff.toolkit import Molecule
ffmol = Molecule.from_mapped_smiles(smiles, allow_undefined_stereo=True)
rdmol = ffmol.to_rdkit()

benf549 commented 5 months ago

Awesome thanks!

openmm / spice-dataset

Preserving Atom Indexing with RDkit? #101