openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
303 stars 89 forks source link

Switch internal stereochemistry representation to CW/CCW and Cis/Trans #139

Open j-wags opened 5 years ago

j-wags commented 5 years ago

I think there is a problem with supporting CIP stereochemistry definitions (R/S stereocenters, cis/trans double bonds). CIP stereochemistry relies on the ranking of the "priority" of connected substituents. As long as we expose the "add_atom" and "add_bond" functions of Molecule, the "priority" of substituents of stereocenters could change.

I see a few ways to move forward:

One thought is that we could continue to use CIP stereo, however this would mean that any time a new atom/bond is added, all CIP tags for the molecule would have to be invalidated. A typical use case for our toolkit might be that someone methylates (or even just protonates) an existing ligand. In this case, our toolkit would have to remove all stereochemistry tags and expect the user to put it back, which is clearly undesired.

Another idea is that add_atom and add_bond could convert the molecule to RDKit/OE, then add the bond/atom, recompute CIP tags, and then convert back to an OFF mol. However this would be very cumbersome. I'd consider this a last resort.

I'm thinking that the best long-run solution is that:

We still have the problem of "what if the user makes something that wasn't imported as a stereocenter into a stereocenter?". For example, they import a molecule that's symmetric about a double bond, and then add atoms to make it non-symmetric. I'll need to keep thinking about this, but for now I'll implement checks in to_rdkit and to_openeye to raise assertions if there's undefined stereo.

j-wags commented 5 years ago

Happy to take input on this, but I'm mostly filing this as an issue so the design decision will be searchable later.

davidlmobley commented 5 years ago

Ah, OK. This is interesting.

It seems like this is only a (potential) problem which will get triggered when add_bond and add_atom are exposed. Should we avoid this for the initial implementation by simply not exposing these, then deal with it as you propose in the "long term solution"?

Might also benefit from input from @bannanc .

j-wags commented 5 years ago

Yes, the mutability presents a problem. We currently have most of our internal molecule representation in the FrozenMolecule class, which has private _add_atom and _add_bond functions. There's a more public Molecule class that inherits FrozenMolecule and exposes public add_atom and add_bond functions. These different levels let us distinguish between "trusted" vs. "dangerous" molecule modification operations. I'm on the fence about whether we'll expose the public Molecule add_X functions in the initial release.

However, it's important that we switch to the CW/CCW and cis/trans stereo specs now, since these tags will be used in molecule serialization, and changing this after the initial release will break compatibility.

davidlmobley commented 5 years ago

Agreed.

jchodera commented 5 years ago

I think there is significant risk in having stereochemistry definitions be implementation-specific (CW/CCW) rather than implementation-independent (R/S, E/Z), but I defer to @j-wags to figure out what will cause larger headaches in the long run.

I spoke to Greg Landrum (of RDKit) today, and he mentioned that we should look into the recently-added interchange format support RDKit added in the latest release: https://github.com/rdkit/rdkit/pull/1798

There, support was added for the CommonChem JSON standard, which is something we may want to consider supporting in our serialization format and/or internal representations.

j-wags commented 5 years ago

Hah. Looks like we have headaches in either case. Now I'm thinking it wouldn't be so bad to run all mutable molecule operations through toolkits 😛

I think (think!) rdkit.Chem.CHI_TETRAHEDRAL_{CW,CCW} and oechem.OEAtomStereo_{Right,Left}Handed are one-to-one mappings of each other, as long as the neighboring atoms list is ordered the same way in each case.

rdkit.Chem.BondStereo.STEREOCIS and oechem.OEBondStereo_Cis should be easy to line up.

I'm still thinking that non-CIP stereo is the way to go, but I do see the drawbacks.

The common interchange format is a good idea. The to-do list for SMIRNOFF 1.0 is getting sort of long, so it might be in range for a subsequent SMIRNOFF version. I'll start an issue for it.

bannanc commented 5 years ago

To me, it sounds like the obvious answer for now is that we should keep add_X protected for now. Then open it up later if we need to. However, I'm confused about:

the CW/CCW and cis/trans stereo specs now, since these tags will be used in molecule serialization, and changing this after the initial release will break compatibility.

I don't quite understand how changing something internally, even extending the publicly accessible methods would break compatibility. But, if it would then yes I guess we need to decide now.

I think I agree with John on using the R/S and E/Z since if we do open up the add_atom or add_bond the users will most likely expect to set the stereochemistry using those terms. @j-wags I'm happy to chat more tomorrow if you want to.

j-wags commented 5 years ago

It'd be good to chat tomorrow, @bannanc. I'll post the notes on here for everyone.

To clarify, the compatibility issue comes from two conflicting design goals: 1) We shouldn't get so deep in the weeds that we implement our own CIP substituent-priority algorithm 2) OFFMol serialization/deserialization should be independent of cheminformatics toolkits.

These two conflict if we initially use CIP stereochemistry, then our first users create files containing serialzied OFFMols with CIP style stereocenters, and then later we switch to CW/CCW style stereochem. If that happened, the CIP-style serialized files wouldn't be deserializeable into CW/CCW-style OFFMols (well, not without using a toolkit, which violates 2), or writing our own CIP priority algorithm, which violates 1) )

davidlmobley commented 5 years ago

It seems like the main argument to deal with this now is a worry about breaking backwards/forwards compatibility of serialized OFFMols, which I'm not convinced is a compelling argument as most of our users are MAINLY going to be interested in whether they can assign FF parameters to their molecules and systems of choice, and much less worried about whether our OFFMol serialization/deserialization is going to be stable over years.

BTW @j-wags you should be sure to post the image you put on Slack here for the record. Perhaps also porting in other relevant discussion from there too.

davidlmobley commented 5 years ago

The common interchange format is a good idea. The to-do list for SMIRNOFF 1.0 is getting sort of long, so it might be in range for a subsequent SMIRNOFF version. I'll start an issue for it.

Also I think we probably need a definitive target timeline for this; it was already long overdue before we got you on board, and we really need to wrap it soon. Definitely this should not be in 1.0 in my opinion.

j-wags commented 5 years ago

(Reposted from Slack, there's also some discussion about this topic there)

Alright, in the interest of time I'm happy to move forward with a CIP implementation. However, this will mean that molecule mutability (when we get there) will require a cheminformatics toolkit.

Just to make completely sure we're on the same page, here's a figure of what might happen if we rely internally on CIP stereochemistry when we get to the mutability stage. The molecule on the left is E (the five membered ring is trans to the Cl). If the user adds a 6 membered ring, and we only know that the double bond is "E" we get the molecule on the bottom right. This molecule is chemically valid (will pass stereochemistry checks and everything), however the 5-membered ring is now cis to the Cl (because the 6 membered ring has the higher CIP priority). The user almost certainly intended to make the molecule on the top right.

internal_cip_pitfall