openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
309 stars 90 forks source link

Safer and more consistent aromaticity handling #697

Open j-wags opened 4 years ago

j-wags commented 4 years ago

(Continuing from #206)

Describe the problem

Problem 1: Our current data structures and functions handle aromaticity in an unsafe way. There are three places where we store aromaticity information, but don't rigorously enforce that they are consistent (largely my fault [1]) or have mechanisms to flag/resolve inconsistencies. These three places are:

Problem 2:

A Molecule currently does not know its own aromaticity model, just whether its atoms and bonds ARE aromatic. Currently we have different degrees of enforcement, usually assuming the molecule conforms to DEFAULT_AROMATICITY_MODEL, but not all Molecule-creation pathways enforce this, and the public API allows invalidating this assumption. This means that the information in the Atom/Bond.is_aromatic attributes is missing essential context and can't safely be used for most purposes.

Problem 3:

Currently, molecule equality comparisons don't check that all molecules involved have equivalent aromaticity models. It's possible that two molecules are identical with one particular aromaticity model, but not with another, since the aromaticity model is the bridge between a single molecule's multiple possible kekule structures. These molecule equality comparisons are core to our functionality, since they are how Topology.add_molecule knows whether a newly-added molecule is unique or not.

Problem 4: It's taken me almost two years to understand the subtlety of the aromaticity issue, and where in the codebase we are or aren't making assumptions about the use of identical aromaticity models. This will spell organizational trouble as the volume of contributors and PRs increases, so I think we need at least approachable documentation of this, and at best a change to our object model that doesn't allow implementation of dangerous behavior.

So far this hasn't been an issue because we currently only support OEAroModel_MDL and its RDKit equivalent. But users constructing molecules atom-by-atom using the API [2] may have been unknowingly circumventing our "safe pathways" and getting weird results.

@davidlmobley points out that aspects of this problem that affect parameterization will be resolved by WBO interpolation of parameters. This is good news, but resolving this still won't help the problems with our Molecule and Topology classes.

Solution specification

Important requirements:

Less important requirements:

Bad outcomes, which I'd like to avoid if possible.

Proposed solutions

I'd like to push for a change to the object model/API, with the goal of making dangerous/ambiguous behavior hard or impossible to implement.

I see a few options on this front:

The advantage of the Separate classes option is that it's "safe". The disadvantage is that we'll have to implement a complex "allowlist" to determine when instances of the two classes can safely interact, and we'll be frustratingly restrictive by default. It'll also be very user-unfriendly.

The advantage of the Forbid in-place modification option is that the risks of modifying the aromaticity model are inherently communicated by making the same molecule with different aromaticity models literally be different objects. We escape the obligation of policing when a molecule's "meaning" really changes. The downside is that we don't "automagically" handle anything -- All of the burden is shifted on to the user.

The advantage of the Demote option is that we could fully treat is_aromatic as computed-on-the-fly property (caching when appropriate). Each method requiring aromaticity info would have an optional kwarg for the aromaticity model to use. A disadvantage is that the == operator would either become very strict (matching only precisely the same kekule structure, even if many are trivially possible) or it would have to quietly apply the DEFAULT_AROMATICITY_MODEL. Use of anything other than the default model will require explicit calls to the is_isomorphic method.

I'm in favor of Demote at the moment, but would be interested to hear other approaches.

Unresolved questions

[1] at least I hadn't wrapped my head around this problem until recently, so my development since the 0.1.0 release may have removed the carefully-considered checks that were there. But storing this information in multiple places is bad, because they could fall out of sync. For example, atom/bond.is_aromatic has a public setter But also, in some contexts (like a Topology), the molecule containing those atoms and bonds is part of a larger Topology, which has its own aromaticity_model. Since the aromaticity of an atom or bond is a deterministic result of applying an aromaticity model to a chemical graph, what would it mean to set the is_aromatic flag of an atom to a different value? Which value should the ForceField trust when processing that Topology? [2] We're even finding that SMILES representations of graph molecules that don't specify their aromaticity model can be troublesome in corner cases -- #511 [3] For a current example: the individual molecules in the Topology are checked for uniqueness in Topology.add_molecule and grouped if found to be redundant. If two molecules are found to be equivalent under one aromaticity model, they may still be found to be separate under another. But the ForceField only knows what the Topology knows, so if a Topology was populated using one aromaticity model, a FF with a different aromaticity model can not safely interpret that Topology.

mattwthompson commented 4 years ago

First, some potentially silly questions that would help my understanding

I think the first question that should be resolved is your "Important requirement" (paraphrasing):

The Molecule class either

  1. Knows its components' aromaticity by knowing its own aromaticity model or
  2. Does not know its aromaticity model, its constituents don't know their aromaticity model, and if anything requires aromaticity, that method

I am strongly in favor of 1, mostly on the basis that aromaticity seems to be important for a lot of infrastructure and science and I think that forbidding that from being stored in the object is going to limit us in significant ways. Downstream use cases can do whatever they want with that data (overwrite, ignore, etc.) but I think storing it in the molecule is much more natural than carrying it along the path of a workflow. Say in the case of serialization, storing the model alongside the molecule seems to offer no benefits compared to storing it in the molecule at a high level and also makes it easier to write out a molecule to file without a model, which we'd want to avoid.

My first impressions of the ideas you sketched out:

Separate classes

I'm not a fan of this idea; if we want one objects to be a particular "interpretation" of another, more stable object, I'd rather handle that by doing that "interpretation," basically the Demote idea. I'd also worry about how easily these classes can be come de-synced and the work done to force a state of agreement while enabling the user to fiddle with molecules.

Forbid in-place modification

I'm iffy on if, fundamentally, "[identical graph molecules] with different aromaticity models would literally be different objects," should be the target. We definitely want to avoid ambiguity about how different models with describe the same graph provide different views/interpretations, but this will lead to large amounts of duplication that I worry is unnecessary. Here is a case in which my lack of expertise on aromaticity models shows; in my head, if two models agree precisely on which rings/moieties in a molecule are aromatic, those molecules (the "real" molecules with models applied, not their simplified graph representation) can be treated identically (in every way except keeping track of the model used to generate it!).

Demote

This is also my preferred of these options. This seems to be the best fit for what I understand aromaticity models' purpose to be and also seems the most tractable to implement. A feature (unclear to me if good or bad) is that a user can't come in an insist some bond/atom is aromatic when it disagrees with the model - but is that something we even want to allow? More generally, should users be able to decide aromaticity for themselves in explicit disagreement with the behavior of known models? A similar feature (again, not clearly good or bad) is that this limits what the user can do to what aromaticity models are currently supported by the toolkit. I'd suggest that's a good thing in the sense that it respects the boundary of what is and isn't supported and abstracts that friction away to the toolkit wrappers, where it should be. But maybe somebody wants to be able to forcefully specify aromaticity, or store the results of a model that they know but the toolkit doesn't support. I'm not sure if that's a significant limitation or not.

The other questions you laid out:

Should we add the requirement the guarantee that A molecule's interpretation at any point in its lifecycle must be a "state function" of its original data source?

This would seem nice, I'm not sure if critical. Right now, molecule readers/views and force fields disagreeing on aromaticity models is going to be a potential landmine, but that's mostly because of the big-picture issues you're raising here, and I'd expect them to be less of a headache if any of these proposals are implemented (and satisfy the wishlist you have). Say we run with something close to the Demote proposal - is this not something we get for free?

Should we ever let a graph molecule have a bond order of 1.5?

Seems to be clearly "no," either because graph molecules don't store their bond orders in 2/3 of your options (Separate classes/Demote) or because bringing a bond order to the table without a corresponding aromaticity model ( Forbid in-place modification) invalidates said data. Is this a gotcha related to what aromaticity models output?

How do we resolve issues where loading a file REQUIRES an aromaticity model?

I guess if the model is specified in the file, just do that. But I figure that's extremely rare and we're more often going to run into things like the SMILES issues highlighted in the linked thread. So then the issue is deciding what to do when a format requires (or really badly wants) an aromaticity model during a conversion, ideally without exploding in complexity with the different combinations of toolkits, models, and flavors of file formats. Passing aromaticity_model to to_file (to_{rdkit|openeye} already has this) would be the first step, and also exposes the problem - if the Molecule object in memory (with a known aromaticity model) can be trusted to round-trip to a format, it really becomes a question for the conversions between the wrapped toolkits.