openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
311 stars 91 forks source link

Molecule.from_openeye() can silently transform the input in dangerous ways #1712

Open jchodera opened 1 year ago

jchodera commented 1 year ago

Describe the bug

Molecule.from_openeye() silently adds hydrogens if they are implicit in this line rather than raising an Exception. This is dangerous behavior, since the main utility of the {to|from}_openeye() methods is to transform the exact representation (and especially the exact atom numbering) between OpenFF and OpenEye toolkits. If the representation (and especially number of atoms and atom numberings) can silently change, this will cause major issues.

This is already partly enforced by the fact that Molecule.from_openeye() requires the user to specify allow_undefined_stereo=True if the incoming representation isn't full explicit (in stereochemistry). It does not make sense to allow the representation to not be fully explicit in otherways without a user-specified override (e.g. allow_implicit_hydrogens=True).

To Reproduce

>>> from openff.toolkit.topology import Molecule
>>> from openeye import oechem
>>> oemol = oechem.OEGraphMol()
>>> oechem.OESmilesToMol(oemol, 'CCO')
True
>>> oemol.NumAtoms()
3
>>> offmol = Molecule.from_openeye(oemol)
>>> offmol.n_atoms
9

Output See above.

Computing environment (please complete the following information):

Additional context This was discovered in preparing bug reports for #1710 and #1711

mattwthompson commented 1 year ago

Reminds one a bit of #566 / https://github.com/openforcefield/openff-toolkit/pull/762, no?

j-wags commented 1 year ago

Thankfully, this doesn't seem to be mangling the chemistry of any molecules. The number of hydrogens on each heavy atom is already specified before the molecule enters the OpenFF ecosystem:

from openff.toolkit.topology import Molecule
from openeye import oechem
oemol = oechem.OEGraphMol()
oechem.OESmilesToMol(oemol, 'CCO')

for oeatom in oemol.GetAtoms():
    print(oeatom.GetImplicitHCount())

3 2 1

The atom ordering/indices of the "existent" (non-implicit) atoms also seem to be preserved (though the OE docs don't explicitly guarantee this):

offmol = Molecule.from_openeye(oemol)
[atom.symbol for atom in offmol.atoms]

['C', 'C', 'O', 'H', 'H', 'H', 'H', 'H', 'H']

Making implicit hydrogens explicit is a major user convenience that I'd currently advocate for even if we didn't have existing behavior to consider. But we do have existing behavior to consider, and changing this to throwing an error would constitute a major behavior change and would break many user workflows. I'd be happy to resolve this with a PR that updates docstrings to say that implicit Hs become explicit. I don't think it's in scope for the OFF Toolkit to expand the Molecule class to support implicit Hs.

jchodera commented 1 year ago

Making implicit hydrogens explicit is a major user convenience that I'd currently advocate for even if we didn't have existing behavior to consider.

Yes, absolutely---but it shouldn't happen silently. As I recommended above, if the user wants to use this convenience---which will alter the atom ordering/identities---the additional burden of requiring the user specify a flag prevents mistakes and misuse.

changing this to throwing an error would constitute a major behavior change and would break many user workflows.

Is there any evidence of this? And even if so, if folks are relying on a dangerous, silent behavior that modifies atom ordering---which is critically important for us to preserve---then they should update the behavior to make the request to modify the molecule explicit.

Again, this would be fully consistent with the current requirement that the user explcitly request that handle molecules with undefined stereochemistry since they may be altered as well.

jchodera commented 1 year ago

The atom ordering/indices of the "existent" (non-implicit) atoms also seem to be preserved (though the OE docs don't explicitly guarantee this):

This is certainly not guaranteed to always be the case.