Behavior of molecule state enumerators

trevorgokey commented 3 years ago

Is your feature request related to a problem? Please describe. When using the molecule enumeration methods (enumerate_tautomers, enumerate_stereoisomers, enumerate_protomers), it has come to attention that, during batch processing in e.g. QCSubmit, it is not clear that calling these methods will or will not have the input molecule in the return. This means that, if I have a molecule and I enumerate with a max_* of 5, it is not clear whether I now have 5 or 6 states of this molecule.

Describe the solution you'd like Either documentation/doc strings that explain the behavior (hopefully independent of TK wrapper), or an option kwarg to indicate whether the input molecule must be in the return list. This is more relevant when max_* < total_*, since I assume that if there is no max specified, then the return will have the input molecule since all enumerations were generated.

In general, it makes sense to me that 1) the return always includes the input molecule, and 2) if I specify a max_*, the input molecule is always in the return, even if the wrapper enumerator didn't explicitly generate it.

Describe alternatives you've considered Manually checking this downstream works. However, it would be desirable to not have to propagate TK differences downstream to e.g. QCSubmit. For example, if I call enumerate_stereoisomers, it would be great if I didn't have to consider which wrapper was used.

j-wags commented 3 years ago

@trevorgokey I think this is a really good request. To help clarify -- Would it be necessary for your purposes for the "original" molecule in the output to be fully identical to the starting mol? Or is having a molecule with an isomorphic graph sufficient? Specifically, would the "original" molecule in the output need the same:

Name
Atom ordering
Conformers
Properties dictionary
Partial charges/bond orders ?

trevorgokey commented 3 years ago

So I have been a bit careless, as @jthorton pointed out, the docstring indicates that the return does not have the included molecule. I only looked at the top-level description in the docstring and didn't think to read the object descriptions.

With that said, I have to disagree with the docstring, as I am actually seeing the input returned:

In RD:

In [3]: mol = molecule.Molecule.from_smiles("CCCC([OH])(CC)", allow_undefined_stereo=True)
Warning (not error because allow_undefined_stereo=True): 

In [4]: mol.enumerate_stereoisomers()
Out[4]: 
[Molecule with name '' and SMILES '[H][O][C@]([H])([C]([H])([H])[C]([H])([H])[H])[C]([H])([H])[C]([H])([H])[C]([H])([H])[H]',
 Molecule with name '' and SMILES '[H][O][C@@]([H])([C]([H])([H])[C]([H])([H])[H])[C]([H])([H])[C]([H])([H])[C]([H])([H])[H]']

In [5]: mol = molecule.Molecule.from_smiles("CCC[C@@H]([OH])(CC)")

In [6]: mol.enumerate_stereoisomers()
Out[6]: []

In OE:

In [2]: mol = molecule.Molecule.from_smiles("CCCC([OH])(CC)", allow_undefined_stereo=True)
Warning (not error because allow_undefined_stereo=True): OEMol has unspecified stereochemistry. oemol.GetTitle(): 
Problematic atoms are:
Atom atomic num: 6, name: , idx: 3, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 2, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 8, name: , idx: 4, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 5, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 1, name: , idx: 14, aromatic: False, chiral: False

In [3]: mol.enumerate_stereoisomers()
Out[3]: 
[Molecule with name '' and SMILES '[H][C@](C([H])([H])C([H])([H])[H])(C([H])([H])C([H])([H])C([H])([H])[H])O[H]',
 Molecule with name '' and SMILES '[H][C@@](C([H])([H])C([H])([H])[H])(C([H])([H])C([H])([H])C([H])([H])[H])O[H]']

In [4]: mol = molecule.Molecule.from_smiles("CCC[C@@H]([OH])(CC)")

In [5]: mol.enumerate_stereoisomers()
Out[5]: [Molecule with name '' and SMILES '[H][C@@](C([H])([H])C([H])([H])[H])(C([H])([H])C([H])([H])C([H])([H])[H])O[H]']

In [6]: mol.enumerate_stereoisomers(undefined_only=True)
Out[6]: []

Unfortunately there is a difference and neither really do what I want. I would first expect, with an exact SMILES, to have both isomers returned (undefined_only=False). Barring this, if we give an exact SMILES, I see that either 1 or no molecules are returned. I think I would prefer no molecules, since it would not immediately be clear if the C@@ or C@ version was returned (remember that undefined_only=False here).

Given this weirdness, as I think about it more, I like the idea of never including the input. This, for the most part, simplifies things and makes it much easier to reason with.

trevorgokey commented 3 years ago

@jwags Trying hard not to focus too much on the details, it makes sense to me that enumerate_stereo returns exactly what from_smiles would return if I gave it an exact SMILES. In fact, I would be happy enough if enumerate_stereo simply returned a list of SMILES, but the explicit conversion would be nice. I am sitting in the stage where I am trying to enter the toolkit, so my standpoint is starting from nothing, rather than somewhere elbow-deep.

With that said, to answer your other points. Since I am preferring more of a "return a SMILES" approach, it is perfectly acceptable to me that A) the ordering can be different, and that B) there is no reasonable person who would want to copy the conformers and/or partial charges over. I do think that protomer/tautomer enumeration should copy the correct bond orders over though.

trevorgokey commented 3 years ago

However! Returning 0 when there is an undefined stereocenter in the SMARTS input is a bad idea, so we should come up with a sane behavior here (or reflect existing behavior in the docs, etc). I don't count CCCC([OH])(CC) as either C@@ or C@ (I consider it a logical OR) , but calling generate_conformers on it (with allow_undefined_stereo=True on the from_smiles call) seems to work, which (to me) seems dubious. Luckily, both wrappers chose C@@ by default in the 3D structure, but unfortunately a to_smiles doesn't reflect this after conformers were generated.

j-wags commented 3 years ago

Ah, sorry to drop off this conversation.

I don't count CCCC([OH])(CC) as either C@@ or C@ (I consider it a logical OR) , but calling generate_conformers on it (with allow_undefined_stereo=True on the from_smiles call) seems to work, which (to me) seems dubious. Luckily, both wrappers chose C@@ by default in the 3D structure, but unfortunately a to_smiles doesn't reflect this after conformers were generated.

It's important to remember that coordinates are second-class citizens in OpenFF-land. We deal with graph molecules which MAY have coordinates attached, and not the other way around. Similar to my "tiers of molecule info" show-and-tell a few weeks ago, I think it's fine for all sorts of automagic to happen when we have a graph representation and we're "deriving" some acceptable coordinates from it. However, it should require user input to interpret stereochemistry from 3D coordinates.

Put another way: it's basically impossible to police that graph and 3D stereochemistry are in sync. We'd need to re-validate the stereochemistry every time a new conformer is added, or even when a single coordinate is changed. Not even OpenEye polices conformers to ensure they match the graph stereochemistry.

Attempt to summarize

To make sure I'm following and can turn this into a close-able issue, we've identified some issues/desired behaviors here:

For enumerate_stereoisomers:

When a molecule with no stereocenters is provided
- Always return an empty list
When a fully stereo-defined molecule is provided
- if undefined_only=True
  - return an empty list
- if undefined_only=False
  - Return lots of stereoisomers
When a partially stereo-defined molecule is provided
- if undefined_only=True
  - Return two or more stereoisomers
- if undefined_only=False
  - Return all possible stereoisomets
We shouldn't require the input molecule to be present in the output
The molecules present in the output should have the same atom order (I think we can reasonably guarantee this), element, formal charge, aromaticity, bonds, and bond orders as the input, but "optional" info like partial charges and conformers should be left out.

openforcefield / openff-toolkit

Behavior of molecule state enumerators #738

Attempt to summarize