Open j-wags opened 4 years ago
First, some potentially silly questions that would help my understanding
Is there a reason why Molecule
does not currently know its aromaticity model at its level? I can't wrap by head around why objects at higher (Topology
) and lower (Atom
/Bond
) have this information but it doesn't. My reading of your proposals is that each would require a molecule object knowing its aromaticity model in order to apply it on-the-fly to its own graph.
What is the relationship between aromaticity models and partial charges? (It's usually indirect or unrelated, I would think, but there are a lot of partial charge models out there.)
How important it is to us that users be able to build up molecules from scratch using the Molecule
API and/or modify existing molecules, as compared to reading from file/object and taking only what is there?
I think the first question that should be resolved is your "Important requirement" (paraphrasing):
The Molecule class either
- Knows its components' aromaticity by knowing its own aromaticity model or
- Does not know its aromaticity model, its constituents don't know their aromaticity model, and if anything requires aromaticity, that method
I am strongly in favor of 1, mostly on the basis that aromaticity seems to be important for a lot of infrastructure and science and I think that forbidding that from being stored in the object is going to limit us in significant ways. Downstream use cases can do whatever they want with that data (overwrite, ignore, etc.) but I think storing it in the molecule is much more natural than carrying it along the path of a workflow. Say in the case of serialization, storing the model alongside the molecule seems to offer no benefits compared to storing it in the molecule at a high level and also makes it easier to write out a molecule to file without a model, which we'd want to avoid.
My first impressions of the ideas you sketched out:
Separate classes
I'm not a fan of this idea; if we want one objects to be a particular "interpretation" of another, more stable object, I'd rather handle that by doing that "interpretation," basically the Demote idea. I'd also worry about how easily these classes can be come de-synced and the work done to force a state of agreement while enabling the user to fiddle with molecules.
Forbid in-place modification
I'm iffy on if, fundamentally, "[identical graph molecules] with different aromaticity models would literally be different objects," should be the target. We definitely want to avoid ambiguity about how different models with describe the same graph provide different views/interpretations, but this will lead to large amounts of duplication that I worry is unnecessary. Here is a case in which my lack of expertise on aromaticity models shows; in my head, if two models agree precisely on which rings/moieties in a molecule are aromatic, those molecules (the "real" molecules with models applied, not their simplified graph representation) can be treated identically (in every way except keeping track of the model used to generate it!).
Demote
This is also my preferred of these options. This seems to be the best fit for what I understand aromaticity models' purpose to be and also seems the most tractable to implement. A feature (unclear to me if good or bad) is that a user can't come in an insist some bond/atom is aromatic when it disagrees with the model - but is that something we even want to allow? More generally, should users be able to decide aromaticity for themselves in explicit disagreement with the behavior of known models? A similar feature (again, not clearly good or bad) is that this limits what the user can do to what aromaticity models are currently supported by the toolkit. I'd suggest that's a good thing in the sense that it respects the boundary of what is and isn't supported and abstracts that friction away to the toolkit wrappers, where it should be. But maybe somebody wants to be able to forcefully specify aromaticity, or store the results of a model that they know but the toolkit doesn't support. I'm not sure if that's a significant limitation or not.
The other questions you laid out:
Should we add the requirement the guarantee that
A molecule's interpretation at any point in its lifecycle must be a "state function" of its original data source?
This would seem nice, I'm not sure if critical. Right now, molecule readers/views and force fields disagreeing on aromaticity models is going to be a potential landmine, but that's mostly because of the big-picture issues you're raising here, and I'd expect them to be less of a headache if any of these proposals are implemented (and satisfy the wishlist you have). Say we run with something close to the Demote proposal - is this not something we get for free?
Should we ever let a graph molecule have a bond order of 1.5?
Seems to be clearly "no," either because graph molecules don't store their bond orders in 2/3 of your options (Separate classes/Demote) or because bringing a bond order to the table without a corresponding aromaticity model ( Forbid in-place modification) invalidates said data. Is this a gotcha related to what aromaticity models output?
How do we resolve issues where loading a file REQUIRES an aromaticity model?
I guess if the model is specified in the file, just do that. But I figure that's extremely rare and we're more often going to run into things like the SMILES issues highlighted in the linked thread. So then the issue is deciding what to do when a format requires (or really badly wants) an aromaticity model during a conversion, ideally without exploding in complexity with the different combinations of toolkits, models, and flavors of file formats. Passing aromaticity_model
to to_file
(to_{rdkit|openeye}
already has this) would be the first step, and also exposes the problem - if the Molecule
object in memory (with a known aromaticity model) can be trusted to round-trip to a format, it really becomes a question for the conversions between the wrapped toolkits.
(Continuing from #206)
Describe the problem
Problem 1: Our current data structures and functions handle aromaticity in an unsafe way. There are three places where we store aromaticity information, but don't rigorously enforce that they are consistent (largely my fault [1]) or have mechanisms to flag/resolve inconsistencies. These three places are:
Atom
orBond
'sis_aromatic
attribute.aromaticity_model
attributearomaticity_model
attributeProblem 2:
A
Molecule
currently does not know its own aromaticity model, just whether its atoms and bonds ARE aromatic. Currently we have different degrees of enforcement, usually assuming the molecule conforms toDEFAULT_AROMATICITY_MODEL
, but not all Molecule-creation pathways enforce this, and the public API allows invalidating this assumption. This means that the information in theAtom
/Bond.is_aromatic
attributes is missing essential context and can't safely be used for most purposes.Problem 3:
Currently, molecule equality comparisons don't check that all molecules involved have equivalent aromaticity models. It's possible that two molecules are identical with one particular aromaticity model, but not with another, since the aromaticity model is the bridge between a single molecule's multiple possible kekule structures. These molecule equality comparisons are core to our functionality, since they are how
Topology.add_molecule
knows whether a newly-added molecule is unique or not.Problem 4: It's taken me almost two years to understand the subtlety of the aromaticity issue, and where in the codebase we are or aren't making assumptions about the use of identical aromaticity models. This will spell organizational trouble as the volume of contributors and PRs increases, so I think we need at least approachable documentation of this, and at best a change to our object model that doesn't allow implementation of dangerous behavior.
So far this hasn't been an issue because we currently only support
OEAroModel_MDL
and its RDKit equivalent. But users constructing molecules atom-by-atom using the API [2] may have been unknowingly circumventing our "safe pathways" and getting weird results.@davidlmobley points out that aspects of this problem that affect parameterization will be resolved by WBO interpolation of parameters. This is good news, but resolving this still won't help the problems with our
Molecule
andTopology
classes.Solution specification
Important requirements:
Molecule
class eitheratom/bond.is_aromatic
flags, but if so, it MUST know its own aromaticity model, oris_aromatic
flags, but methods that depend on aromaticity perception MUST include an optional argument for which aromaticity model to useLess important requirements:
atom
/bond.is_aromatic
before a molecule is added to aTopology
, then the equivalent atom/bond after addition to the Topology MUST return the same result.is_aromatic
value of an atom, any Molecule-wide aromaticity model should be invalidated.Bad outcomes, which I'd like to avoid if possible.
Topology.add_molecule
MAY reject a Molecule, if the Molecule's existing aromaticity model may have irreparably changed its graph such that it would be misinterpreted by the Topology's aromaticity model.ForceField.create_openmm_system
MAY reject a Topology if it was constructed with a different aromaticity model, since the other aromaticity model could have irreparably changed the molecular graphs.Topology.add_molecule
is not able to trust if a Molecule says it has a particular aromaticity model, and must re-percieve its aromaticity.ForceField.create_openmm_system
is not able to trust a Topology that says it has a particular aromaticity model, and must re-percieve each molecule's aromaticity.Proposed solutions
I'd like to push for a change to the object model/API, with the goal of making dangerous/ambiguous behavior hard or impossible to implement.
I see a few options on this front:
is_aromatic
flags oraromaticity_model
values.is_aromatic
attributes ofAtom
s andBond
s altogether, and make aromaticity get computed on-the-fly when needed.The advantage of the Separate classes option is that it's "safe". The disadvantage is that we'll have to implement a complex "allowlist" to determine when instances of the two classes can safely interact, and we'll be frustratingly restrictive by default. It'll also be very user-unfriendly.
The advantage of the Forbid in-place modification option is that the risks of modifying the aromaticity model are inherently communicated by making the same molecule with different aromaticity models literally be different objects. We escape the obligation of policing when a molecule's "meaning" really changes. The downside is that we don't "automagically" handle anything -- All of the burden is shifted on to the user.
The advantage of the Demote option is that we could fully treat
is_aromatic
as computed-on-the-fly property (caching when appropriate). Each method requiring aromaticity info would have an optional kwarg for the aromaticity model to use. A disadvantage is that the==
operator would either become very strict (matching only precisely the same kekule structure, even if many are trivially possible) or it would have to quietly apply theDEFAULT_AROMATICITY_MODEL
. Use of anything other than the default model will require explicit calls to theis_isomorphic
method.I'm in favor of Demote at the moment, but would be interested to hear other approaches.
Unresolved questions
A molecule's interpretation at any point in its lifecycle must be a "state function" of its original data source
? This would be a very clear guiding principle for developers. Unpacking this statement, it means "If I load a molecule using aromaticity model A, then during parameterization, a FF reinterprets it using aromaticity model B, then the toolkit MUST provide the same result as if the molecule were initially loaded using model B (This would directly solve [3]). Of course, if the user manually modifies the molecule after loading it, then we do not need to uphold this guarantee.[1] at least I hadn't wrapped my head around this problem until recently, so my development since the 0.1.0 release may have removed the carefully-considered checks that were there. But storing this information in multiple places is bad, because they could fall out of sync. For example,
atom
/bond.is_aromatic
has a public setter But also, in some contexts (like a Topology), the molecule containing those atoms and bonds is part of a larger Topology, which has its ownaromaticity_model
. Since the aromaticity of an atom or bond is a deterministic result of applying an aromaticity model to a chemical graph, what would it mean to set theis_aromatic
flag of an atom to a different value? Which value should the ForceField trust when processing that Topology? [2] We're even finding that SMILES representations of graph molecules that don't specify their aromaticity model can be troublesome in corner cases -- #511 [3] For a current example: the individual molecules in the Topology are checked for uniqueness inTopology.add_molecule
and grouped if found to be redundant. If two molecules are found to be equivalent under one aromaticity model, they may still be found to be separate under another. But the ForceField only knows what the Topology knows, so if a Topology was populated using one aromaticity model, a FF with a different aromaticity model can not safely interpret that Topology.