from rdkit import Chem

# `smiles` (the mapped SMILES string) and `atoms_list` (the element order)
# are assumed to have been read from the HDF5 entry already.
print('Order from HDF5 Entry:', atoms_list)

rdmol = Chem.MolFromSmiles(smiles, sanitize=False)
assert rdmol is not None, "Unable to parse the SMILES string"

# strip the atom map from the molecule if it has one
# so we don't affect the stereochemistry tags
for atom in rdmol.GetAtoms():
    if atom.GetAtomMapNum() != 0:
        # set the map back to zero but hide the index in the atom prop data
        atom.SetProp("_map_idx", str(atom.GetAtomMapNum()))
        # set it back to zero
        atom.SetAtomMapNum(0)

# Chem.SanitizeMol calls updatePropertyCache so we don't need to call it ourselves
# https://www.rdkit.org/docs/cppapi/namespaceRDKit_1_1MolOps.html#a8d831787aaf2d65d9920c37b25b476f5
Chem.SanitizeMol(rdmol, Chem.SANITIZE_ALL ^ Chem.SANITIZE_ADJUSTHS ^ Chem.SANITIZE_SETAROMATICITY)
Chem.SetAromaticity(rdmol, Chem.AromaticityModel.AROMATICITY_MDL)

# Chem.MolFromSmiles adds bond directions (i.e. ENDDOWNRIGHT/ENDUPRIGHT), but
# doesn't set bond.GetStereo(). We need to call AssignStereochemistry for that.
Chem.AssignStereochemistry(rdmol)

# map each RDKit atom index to the map number stored in the SMILES string
rdkit_idx_to_spice_idx = {}
for atom_idx in range(rdmol.GetNumAtoms()):
    atom = rdmol.GetAtomWithIdx(atom_idx)
    assert atom.GetNumImplicitHs() == 0, "Expected no implicit hydrogens"
    rdkit_idx_to_spice_idx[atom_idx] = int(atom.GetProp("_map_idx"))

print('Order before remap', [x.GetSymbol() for x in rdmol.GetAtoms()])

# sort the RDKit indices by map number to recover the SPICE order
spice_order = sorted(rdkit_idx_to_spice_idx, key=rdkit_idx_to_spice_idx.get)
print('Order after remap', [rdmol.GetAtomWithIdx(i).GetSymbol() for i in spice_order])
Hi all, thanks so much for curating this dataset!
I'd like to use RDKit to load the SMILES string for each entry in the HDF5 file to get some basic atom-level features like formal charge, hybridization, etc. Does anyone know how to get RDKit to preserve the indices passed in through the SMILES string, or how to convert the SMILES string with atom indices to another format, so that I can work backwards to the order of the features in SPICE once the molecule is loaded with RDKit?
Pulling the SMILES string from the arg entry and passing it to RDKit loads the molecule, but looping over the atoms gives a different atom order than the entry's atomic numbers, and of course RDKit drops all the hydrogens...
Any tips would be greatly appreciated!
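For what it's worth, one route that seems to avoid both problems is to parse with hydrogen removal switched off and then renumber the atoms by their map numbers. This is only a minimal sketch, assuming every atom carries a 1-based map number. The methanol SMILES is made up for illustration rather than taken from SPICE, and `load_in_map_order` is a hypothetical helper name. It relies on RDKit's default sanitization, so the aromaticity model and stereo handling may differ from what OpenFF does.

```python
from rdkit import Chem

# Hypothetical mapped SMILES for methanol (not a SPICE entry). The written
# order is C, H, H, H, O, H while the map order is C, O, H, H, H, H.
MAPPED_SMILES = "[C:1]([H:3])([H:4])([H:5])[O:2][H:6]"


def load_in_map_order(smiles):
    """Parse a mapped SMILES, keep its hydrogens, and order atoms by map number."""
    params = Chem.SmilesParserParams()
    params.removeHs = False  # keep the explicit, mapped hydrogens
    mol = Chem.MolFromSmiles(smiles, params)
    if mol is None:
        raise ValueError("Unable to parse the SMILES string")
    # new_order[i] is the current index of the atom that should end up at index i
    new_order = sorted(
        range(mol.GetNumAtoms()),
        key=lambda i: mol.GetAtomWithIdx(i).GetAtomMapNum(),
    )
    return Chem.RenumberAtoms(mol, new_order)


mol = load_in_map_order(MAPPED_SMILES)
symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
formal_charges = [atom.GetFormalCharge() for atom in mol.GetAtoms()]
hybridizations = [str(atom.GetHybridization()) for atom in mol.GetAtoms()]
print(symbols)
print(formal_charges)
print(hybridizations)
```

After renumbering, atom i in RDKit should line up with row i of the per-atom arrays, but it is worth comparing the symbols against the entry's atomic numbers before relying on that.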
EDIT: In case anyone runs into this in the future and doesn't want to install OpenFF into a pre-existing environment, I ripped the snippet above out of the OpenFF molecule parser. It prints the atom order from the HDF5 entry, then the RDKit atom order before and after the remap.
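Once `rdkit_idx_to_spice_idx` exists, any per-atom values computed in RDKit order can be rearranged to match the dataset. Here is a rough sketch of that last step, assuming the map numbers run from 1 to N with no gaps. `reorder_to_spice` is a hypothetical helper, and the toy mapping and symbols stand in for real data.

```python
def reorder_to_spice(values_in_rdkit_order, rdkit_idx_to_spice_idx):
    """Rearrange per-atom values so position i holds the atom with map number i + 1."""
    n = len(values_in_rdkit_order)
    # Assumption: map numbers are 1-based and contiguous
    if sorted(rdkit_idx_to_spice_idx.values()) != list(range(1, n + 1)):
        raise ValueError("Expected map numbers 1..N with no gaps or repeats")
    reordered = [None] * n
    for rdkit_idx, map_number in rdkit_idx_to_spice_idx.items():
        reordered[map_number - 1] = values_in_rdkit_order[rdkit_idx]
    return reordered


# Toy stand-in for the dictionary built by the snippet: RDKit index -> map number
toy_mapping = {0: 1, 1: 3, 2: 4, 3: 5, 4: 2, 5: 6}
toy_symbols = ["C", "H", "H", "H", "O", "H"]
reordered_symbols = reorder_to_spice(toy_symbols, toy_mapping)
print(reordered_symbols)
```

The same helper should work for formal charges, hybridization labels, or any other list indexed by RDKit atom index.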