openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
309 stars 90 forks source link

Molecule.from_mapped_smiles allows loading of invalid mapped SMILES with RDKit backend #1696

Open j-wags opened 1 year ago

j-wags commented 1 year ago

cc https://github.com/openforcefield/qca-dataset-submission/pull/207 cc https://github.com/openforcefield/qca-dataset-submission/issues/327 cc https://github.com/openforcefield/openff-qcsubmit/pull/228 cc https://github.com/openforcefield/openff-toolkit/issues/1696

To Reproduce

from openff.toolkit import Molecule, RDKitToolkitWrapper, OpenEyeToolkitWrapper
Molecule.from_mapped_smiles("[F:1][c:2]1[c:3]([H:32])[c:4]([H:33])[c:5]([H:34])[c:6]([F:7])[c:8]1[C:9]1=[N:12][N:13]2[C:14](=[C:15]([H:37])[N:16]=[C:17]2[N:18]([c:19]2[c:20]([H:39])[nH+:21][c:22]([H:40])[c:23]([H:41])[c:24]2[N:25]2[C:26]([H:42])([H:43])[C@:30]([NH+:31]([H:51])[H:52])([H:50])[C:29]([H:48])([H:49])[C:28]([H:46])([H:47])[C:27]2([H:44])[H:45])[H:38])[C:11]([H:36])=[C:10]1[H:35]",
                           toolkit_registry=RDKitToolkitWrapper())

returns a molecule with two nitrogen radicals, corresponding to the [NH+] atoms in the CMILES

Screen Shot 2023-08-15 at 7 36 50 PM

while

from openff.toolkit import Molecule, RDKitToolkitWrapper, OpenEyeToolkitWrapper
Molecule.from_mapped_smiles("[F:1][c:2]1[c:3]([H:32])[c:4]([H:33])[c:5]([H:34])[c:6]([F:7])[c:8]1[C:9]1=[N:12][N:13]2[C:14](=[C:15]([H:37])[N:16]=[C:17]2[N:18]([c:19]2[c:20]([H:39])[nH+:21][c:22]([H:40])[c:23]([H:41])[c:24]2[N:25]2[C:26]([H:42])([H:43])[C@:30]([NH+:31]([H:51])[H:52])([H:50])[C:29]([H:48])([H:49])[C:28]([H:46])([H:47])[C:27]2([H:44])[H:45])[H:38])[C:11]([H:36])=[C:10]1[H:35]",
                           toolkit_registry=OpenEyeToolkitWrapper())

raises an error

ValueError: 'hydrogens_are_explicit' was specified as True, but OpenEye Toolkit interpreted SMILES '[F:1][c:2]1[c:3]([H:32])[c:4]([H:33])[c:5]([H:34])[c:6]([F:7])[c:8]1[C:9]1=[N:12][N:13]2[C:14](=[C:15]([H:37])[N:16]=[C:17]2[N:18]([c:19]2[c:20]([H:39])[nH+:21][c:22]([H:40])[c:23]([H:41])[c:24]2[N:25]2[C:26]([H:42])([H:43])[C@:30]([NH+:31]([H:51])[H:52])([H:50])[C:29]([H:48])([H:49])[C:28]([H:46])([H:47])[C:27]2([H:44])[H:45])[H:38])[C:11]([H:36])=[C:10]1[H:35]' as having implicit hydrogen. If this SMILES is intended to express all explicit hydrogens in the molecule, then you should construct the desired molecule as an OEMol (where oechem.OEHasImplicitHydrogens(oemol) returns False), and then use Molecule.from_openeye() to create the desired OFFMol.

The CMILES is, IMO, misformed, so both pathways should raise an error.

Additional details

This molecule is from the industry benchmark set. You can find all of these with the following code:

from qcportal import FractalClient
from openff.toolkit import Molecule
client = FractalClient()

col = client.get_collection(collection_type="OptimizationDataset", name = "OpenFF Industry Benchmark Season 1 v1.1")
bad = [i for i,j in col.data.records.items() if "NH+" in j.attributes["canonical_isomeric_explicit_hydrogen_mapped_smiles"]]

Computing environment (please complete the following information): OFFTK 0.14.0