openforcefield / cmiles

Generate canonical molecule identifiers for quantum chemistry database
https://cmiles.readthedocs.io
MIT License
23 stars 7 forks source link

Explicit valence issue with rdkit #60

Open wutobias opened 3 years ago

wutobias commented 3 years ago

Hi folks,

please have a look at the following code snippet:

import qcportal as ptl
import cmiles

qc_client  = ptl.FractalClient("https://api.qcarchive.molssi.org:443/")
ds         = qc_client.get_collection('OptimizationDataset', "FDA Optimization Dataset 1")
qcrecord   = ds.get_record("C[NH2+]C[C@@H](c1ccc(c(c1)O)O)O-0", 
                           specification="default")
qcmol      = qcrecord.get_final_molecule()
oemol      = cmiles.utils.load_molecule(qcmol.dict(), toolkit="openeye")
rdmol      = cmiles.utils.load_molecule(qcmol.dict(), toolkit="rdkit")

The oemol is returned successfully, however rdmol cannot be created due to valence issues with N atoms: RDKit ERROR: [13:24:10] Explicit valence for atom # 9 N, 4, is greater than permitted. I think this is fixable with the following code block inserted after line 44 in cmiles/_cmiles_rd.py (not elegant though, won't cover general valence issues):

    # Fix N connectivity issue
    for atom_idx in range(em.GetNumAtoms()):
        atom = em.GetAtomWithIdx(atom_idx)
        if atom.GetAtomicNum() == 7:
            total_bo = 0.
            for bond in atom.GetBonds():
                total_bo += bond.GetBondTypeAsDouble()
            if total_bo > 3.:
                atom.SetFormalCharge(1)

Regards, Tobias

wutobias commented 3 years ago

Found this PR that seems to target the same issue: https://github.com/openforcefield/cmiles/pull/37 Not sure about the details of this particular solution but it seems that the error message from rdkit is parsed in order to identify the offending N atom. This seems to be fragile, since rdkit might change the formatting of error messages in the future. The above solution explicitly counts the number of bonded neighbors, which should be more stable.

j-wags commented 3 years ago

Sorry for the delay on this -- CMILES is effectively deprecated, though we've done a really bad job of communicating it. Its functionality has been migrated to the actively-supported OpenFF toolkit and QCSubmit packages.

QCA molecules submitted by OpenFF now all have CMILES records attached to them, which explicitly contain the connection graph (including things like bond order and formal charge). So we shouldn't need to do any guessing about nitrogens.

We need to put out a migration guide for users for how to do all their favorite CMILES workflows using these two other tools. In the meantime, here's a little code snippet I put together to pull down initial and final molecules using the OpenFF toolkit. I think that there's a more streamlined pathway either already available in QCSubmit, or that will be available in the next release.

from openff.toolkit.topology import Molecule
import pint
from simtk.openmm import unit 
import numpy as np

import qcportal as ptl
from openff.toolkit.topology import Molecule

punit = pint.UnitRegistry()

# Some stuff I scapped together form other code bits. I don't understand how this works. 
client= ptl.FractalClient()
ds = client.get_collection('OptimizationDataset', "FDA Optimization Dataset 1")
entry = ds.get_entry('C[NH2+]C[C@@H](c1ccc(c(c1)O)O)O-0')
qcrecord   = ds.get_record("C[NH2+]C[C@@H](c1ccc(c(c1)O)O)O-0", 
                           specification="default")

initial_qcmol = client.query_molecules(entry.initial_molecule)[0]
final_qcmol = qcrecord.get_final_molecule()

# set conformer qcmol geometry
initial_geometry = unit.Quantity(
        np.array(initial_qcmol.geometry, float), unit.bohr
    )
final_geometry = unit.Quantity(
        np.array(final_qcmol.geometry, float), unit.bohr
    )

# Make a single mol with both conformers so we can visualize it in nglview
combined_offmol = Molecule.from_qcschema(entry)
combined_offmol.add_conformer(initial_geometry.in_units_of(unit.angstrom))
combined_offmol.add_conformer(final_geometry.in_units_of(unit.angstrom))

# Make OE and RDKit mols from this combined offmol
oemol = combined_offmol.to_openeye()
rdmol = combined_offmol.to_rdkit()

combined_offmol.visualize(backend='nglview')
wutobias commented 3 years ago

Thanks for looking into this. I am now changing my workflow towards Molecule.from_qcschema.

jchodera commented 3 years ago

Sorry for the delay on this -- CMILES is effectively deprecated, though we've done a really bad job of communicating it

@j-wags : Would it make sense to add a note at the top of the https://github.com/openforcefield/cmiles repo README to communicate this?

j-wags commented 3 years ago

Good call. I've just added a notice and some references to new API points to the top of the README.