openforcefield / openff-interchange

A project (and object) for storing, manipulating, and converting molecular mechanics data.
https://docs.openforcefield.org/projects/interchange
MIT License
71 stars 23 forks source link

`add_smirnoff` method #436

Closed Yoshanuikabundi closed 2 years ago

Yoshanuikabundi commented 2 years ago

This is part of my crusade against API consumption of Topology :stuck_out_tongue:

Would it be crazy to refactor the parts of from_smirnoff that apply the force field out into an parametrize_smirnoff method that everyone can use? I'm imagining it would open up possibilities of adding more from methods that include topology information but not a forcefield, and then applying the force field in a separate step. For example:

interchange = Interchange.from_pdb( # Could be implemented through OpenMM's `PDBFile` and `Topology.from_openmm()`
    "cubic_nanometer_of_200mM_saline.pdb", 
    unique_molecules=[Molecule.from_smiles(s) for s in [
        "O",
        "[Na+]",
        "[Cl-]",
    ]],
)
assert all(interchange.positions, interchange.box, interchange.molecules) # All the info from the PDB in one step!
interchange.parametrize_smirnoff(ForceField("openff-2.0.0.offxml")) # Bonus points for `parametrize_foyer` as well

Alongside

interchange = Interchange.from_topology(some_offtk_top)
interchange.parametrize_smirnoff(ForceField("openff-2.0.0.offxml"))

Alongside

interchange = Interchange.from_molecules([...])
assert interchange.positions # Taken from conformer[0]
interchange.box = ...
interchange.parametrize_smirnoff(...)

And

interchange = Interchange()
interchange.topology = ...
interchange.box = ...
interchange.positions = ...
interchange.parameterize_smirnoff(...)

And

interchange = Interchange.from_mdtraj(...)
interchange.parameterize_smirnoff(...)

This would mean users could prepare their system however they like, bring the system into Interchange with the appropriate method, and then parameterize it.

An alternative is to keep things how they are now, where these sorts of construction methods are implemented on Topology and from_smirnoff plays the role of parameterizing. This means users have to keep Topology in their mental model.

An alternative method name would be add_smirnoff, add_foyer etc.

As a side note, does from_smirnoff store the entire force field in the Interchange or just the potential handlers that are used in the topology? If it stores the entire force field, you could also provide similar capabilities in the form of add_pdb, add_mdtraj, add_molecule etc and make the topology argument of from_smirnoff optional.

Yoshanuikabundi commented 2 years ago

This refactor would also make it really clear what arguments belong on the parameterize methods. At the moment, box is an argument to from_smirnoff, but positions isn't, despite both pieces of information potentially being present in the Topology. By breaking the method up, users can easily understand that a from_topology method takes all the information from the topology, and if they want to override something they need to use attribute assignment. No-one feels awkward about sometimes being able to do everything in one method call and sometimes not. There's a clear separation of concerns between topology information and force field information.

mattwthompson commented 2 years ago

Thanks for the thorough write-up and the creative idea(s)! I think this or something like this could be added - it's not crazy - but there are a few rough edges and complexity pits that we'll need to work around (most of which should be addressed whether or not a proposal like this goes through, in all fairness).

I'll have to stew on this a few days before I give this the feedback it deserves - feel free to drop a reminder ping if I haven't gotten to this by Tuesday (Australia time).

Yoshanuikabundi commented 2 years ago

"It's Tuesday Australia time" reminder ping :)

mattwthompson commented 2 years ago

I definitely see that this is where the rubber hits the road in your so-called crusade! I can't quite settle on a yes or no answer, and I don't even have that many questions for you to answer, but hopefully this wall of text can serve as a forward iteration. The examples you used were super instructive and I originally intended to respond to them one-by-one, but I think I covered everything elsewhere. I tried to be concise, but .... uhh, didn't pull that off.

First, some things I like. The verbage of .from_smirnoff makes total sense to me, but I can see how it obfuscates how it actually does several things, only one of which is strictly "SMIRNOFF" (the parameterization). foo = Interchange; foo.topology = my_topology; foo.box = my_box_vectors; foo.positions = my_positions faithfully processes most of the arguments of from_smirnoff but doesn't require any SMIRNOFF data.

Second, some parts of this design are features we intend to add as some point, just maybe with a different look. Topology.from_pdb is something I've been pushing for (and I recall @j-wags also sees it as important) but, you know, we need to get Molecule.from_pdb finished up first. Interchange.from_mdtraj or something like that I'd like to see happen, probably made more clear once the GROMACS and Amber importers are better. Also, from_smirnoff should probably take in a positions argument, optionally, similar to how it can take box vectors information there.

The danger of this design is that it opens the door to multi-parameterization and/or post-parameterization modification of the topology.

I think the best way to tie these concerns together is the example of combining a small molecule force field with a protein force field. Right now, as you're aware, that takes a bunch of steps and I can't easily see a way for any to be squished together:

protein = Molecule.from_pdb(...)
protein_force_field = ForceField(...)
ligand = Molecule.from_sdf(...)
small_molecule_force_field = ForceField(...)
# not needing to do `.to_topology()` is a nice improvement!
protein_interchange = Interchange.from_smirnoff(protein_force_field, [protein])
ligand_interchange = Interchange.from_smirnoff(small_molecule_force_field, [ligand])
complex = protein_interchange + ligand_interchange
# maybe repeat for water, ions, other components

None of these steps can really be skipped in the current design. This proposal would, I think, result in something like

protein_force_field = ForceField(...)
small_molecule_force_field = ForceField(...)
complex = Interchange()
complex.add_pdb("protein.pdb")
complex.add_sdf("ligand.sdf")
# Let's assume that positions and box vectors are automagically set by now
complex.parametrize_smirnoff(protein_force_field)
complex.parametrize_smirnoff(small_molecule_force_field)

(For the sake of argument, let's ignore the future wonder that is Rosemary and a line of self-consistent biopolymer + small molecule force fields - users will always want to i.e. combine Sage and ff14SB.) How would this work? I suppose parametrize_smirnoff would need to know which parts of the "topology" it's allowed to access? We don't want AM1-BCC to even start running on a protein, for example. This sort of thing would also happen with i.e. using MyFavoriteWaterModel instead of the TIP3P that ships with Sage.

This would mean users could prepare their system however they like, bring the system into Interchange with the appropriate method, and then parameterize it.

An alternative is to keep things how they are now, where these sorts of construction methods are implemented on Topology and from_smirnoff plays the role of parameterizing. This means users have to keep Topology in their mental model.

I think Topology serves a practical role for us here in that it provides a sort of buffer between toolkit stuff and what happens here. Molecule parsing, manipulation, etc. can happen to the user's content, but things need to be tied up into a single object before Interchange can use it. And from the other perspective, Interchange can trust that toolkit shenanigans will always have a clean interface coming out of it. I certainly see the value add of removing this Topology step from workflows altogether when considering the user's perspective ... all this being said, maybe we should just say "the toolkit does molecule stuff," rip out openff/toolkit/topology/topology.py, and revive it here under a different name(s) ... 🤔

As a side note, ...

It doesn't store any of the force field per se, it just stores the bits as applied to the topology it's given. In less glib terms - no, just a portion of it (unless you somehow manufactured a pile of molecules that happen to exercise every parameter in a SMIRNOFF force field - it would take a mad scientist to pull that off for our force field lines, I think).

Yoshanuikabundi commented 2 years ago

I think I'm coming around to Topology, but here are my thoughts:

Multi-parameterization - I agree, this is something we would have to guard against. I think a policy of "apply_smirnoff() raises an exception if any potential handlers are already defined" would be appropriate here - it's consistent with the OpenFF policy of protecting provenance. Interchange just shouldn't let you apply a force field incorrectly (unless you really work for it). If people do want to re-parameterize, we could add a clear_parameters method that removes all potential handlers?

Post-parameterization modification - We could just hash the topology when parameters are added and then refuse to output anything if that hash doesn't match. Or better yet, make the topology immutable once parameters have been added, if Pydantic makes that possible (maybe a validator that compares a hash?). Having the topology be accessible and mutable as part of the Interchange is dangerous in other ways as well - if people change its box vectors, nothing happens to the interchange. If people change conformer[0], nothing happens. If people swap out the topology for a completely different one with at least as many atoms, they will be VERY confused. We very much encourage people to work on the topology directly by not, for example, providing a to_openmm_topology method, so I think we should consider tightening this down in any case.

For alchemic mutation... With my original proposal it would probably have to look like this:

interchange = Interchange.from_topology(topology_a)
interchange.add_alchemic_topology(topology_b) # Would need this for every from method
interchange.apply_smirnoff(...)
for i in range(21):
    interchange.lambda = i * 0.05
    interchange.to_gromacs(f"intrcg_l{interchange.lambda}.top")

vs, for the current approach, something that looks like:

interchange = Interchange.from_smirnoff(
    force_field = ...,
    topology = topology_a,
    topology_b = topology_b,
    ...
)
for i in range(21):
    interchange.lambda = i * 0.05
    interchange.to_gromacs(f"intrcg_l{interchange.lambda}.top")

But an alternative would be to have no from methods, just add methods, that take an argument:

interchange = Interchange()
interchange.add_topology(topology_a)
interchange.add_topology(topology_b, alchemic_b_state=True)
interchange.apply_smirnoff(...)
for i in range(21):
    interchange.lambda = i * 0.05
    interchange.to_gromacs(f"intrcg_l{interchange.lambda}.top")

All the add methods could process the input in the same way, and then at the last second apply the new Topology to the appropriate attribute depending on the argument, so adding free energy support would be a small enough change.

I think the best way to tie these concerns together is the example of combining a small molecule force field with a protein force field.

I think I prefer the current way of doing this honestly. It forces the user to both (a) declare which parts of the system gets each force field and (b) cleanly separate the system into parts that are each parameterized with a single force field. In principle, Interchange could even raise an error if the two force fields used incompatible non bonded models (or whatever). So my proposal would look like this:

protein = Interchange()
protein.add_pdb(...)
protein.apply_smirnoff(protein_ff)

ligand = Interchange()
ligand.add_sdf(...)
ligand.apply_smirnoff(ligand_ff)

complex = protein + ligand
# maybe repeat for water, ions, other components

I think Topology serves a practical role for us here in that it provides a sort of buffer between toolkit stuff and what happens here. Molecule parsing, manipulation, etc. can happen to the user's content, but things need to be tied up into a single object before Interchange can use it. And from the other perspective, Interchange can trust that toolkit shenanigans will always have a clean interface coming out of it. I certainly see the value add of removing this Topology step from workflows altogether when considering the user's perspective ... all this being said, maybe we should just say "the toolkit does molecule stuff," rip out openff/toolkit/topology/topology.py, and revive it here under a different name(s)

Hmmm

I think that's a really good point. In strictly typed languages there's this idea of a "typestate", which is where the state of a program is encoded in its type. This is good because you can control what methods can be called in each state. In Python its common for a single type to encode multiple states, but Topology -> Interchange is (could be) a really clean typestate: an Interchange is a Topology with parameters. Topology has all the system construction methods, and Interchange has all the output methods, and the only way to get an Interchange is to consume both topology and parameter information. Once you have an Interchange, you can't change the Topology any more, but you gain the ability to export it in a million different formats.

I think to get the most out of this pattern though, we should kinda go all in on it. Positions will usually come in at the Topology stage, so Topology should get either a positions array or methods that let you access all of its conformer[0]s through a single array (Or both - constructing a Topology could create a big array and set all the molecule's conformer[0] to the appropriate view into it). Interchange shouldn't have its own box and positions attributes but should defer to the stored topology, if only to avoid user confusion. And we should absolutely do something to discourage/protect against mutation of the topology once the Interchange is created.

I do stand by the idea that folding Topology into Interchange would make it easier to teach and simpler for users to think about. Typestates are rare in Python, and it's reasonably common to build up a single object in stages. But having a clean separation of concerns between Topology and Interchange would also make this easier.

I guess the current situation is this:

Which I'm trying to teach as though it were this (where will the arrow for multi-component PDBs go?):

And I would like either this (fewer types, Interchange has many responsibilities):

Or, now that you mention it, this (Topology does system construction, Interchange just does parameterization, typestate pattern, extra Topology type):

mattwthompson commented 2 years ago

(Please send me a reminder ping if I haven't responded again by the end of your work week - Friday-ish Australia time, I think? - and I'll have something for you when you start your next work week!)

SimonBoothroyd commented 2 years ago

+1 for @mattwthompson's comments on this.

I'm not sure I follow the arguments for changing to add_topology and apply_smirnoff, and to me they look quite confusing (especially, what if you add topology then apply smirnoff then add a new topology...) and unnecessarily verbose. I find from_smirnoff a lot cleaner and clearer, and this is a very common design pattern.

In general the way I think about Interchange aligns with your last figure above, and really that

To me an Interchange object that is missing either a Topology or a ForceField is in an invalid state and doesn't make much sense.

I may be missing something here though?

I'd also strongly advocate for not having Interchange directly expose alchemical modification methods. The level of complexity this introduces is large, especially given that one of the goals of interchange is exporting to multiple packages which each handle such calculations very differently, and how the alchemical modification should be made will be quite method / code dependant.

Yoshanuikabundi commented 2 years ago

@SimonBoothroyd The opportunity is for Interchange to replace Topology, so that we have fewer public types. I think the definition of "topology" is a bit murky - in the Toolkit it's molecules and positions and box vectors, in GROMACS (and Amber?) it's the entire parameterized system but not positions or box vectors, in OpenMM it's atoms and bonds and box vectors, in MDTraj it's atoms and bonds... it's just one of those words that means a different thing to everyone who reads it, and its potential meanings seem to encompass any conceivable subset of values needed for an MD simulation. So with my documentation/teaching hat on, I always avoid using it because there's always a better, more specific word or phrase, except when I'm forced to by an API.

The model I'm proposing is to expand Interchange to encompass the roles of Topology to simplify the API for users. At the moment, with the current released API, from a user's perspective, Topology is redundant. There are no construction methods on Topology that don't also exist on Molecule, and Interchange.from_smirnoff can take a list of Molecule objects instead of a Topology. That changes with the biopolymers release, because Topology gets from_pdb, so we kinda have to have this conversation now. If you look at older examples, they create a Topology with from_molecules and just turn it straight into an Interchange. New API methods that construct systems of multiple molecules could live on Interchange instead of Topology and users can ignore the existence of Topology. The workflow then has 3 types instead of 4 (Molecule, ForceField, Interchange), which is actually a pretty huge win for simplicity on its own, even when you ignore the massive win of not using the word "topology" for anything.

I think this lines up with the original design of the Toolkit, just with the rename of Topology to Interchange. They both bring all the information together and then output it to another format (originally an OpenMM system) for simulation.

I think this is a big win for teachability and simplicity, but I accept that it has a cost in terms of encapsulation and scope for Interchange. Whichever way we go, I think we should do it conscientiously and make the resulting API as good as we can - which I think if we decide to keep both types means changes to Topology to make setting positions easier and make setting velocities possible, and changes to Interchange so box vectors and positions aren't duplicated in the topology within the interchange, so that we actually get the mental model implied by the last diagram.

I also think, no matter what we do, we should try to freeze the topology once its been parametrized. It will be so easy for users to change a topology thinking the parameters will update automatically. So add topology + apply smirnoff + add topology = exception(Parameters already applied). Having an all-encompassing type that you build up by calling several methods is also a common design pattern (see MatPlotLib, not that we should be taking API lessons from them...). I don't agree that it's verbose; it just moves the topology construction lines from Topology to Interchange. If anything, I would expect it to be less verbose; it saves the line that passes the topology to the interchange.

@mattwthompson Ping!

SimonBoothroyd commented 2 years ago

I think I follow what you're saying, but I I'm not sure that I agree and as a user it would get a -1 vote from me. Ultimately trying to shoehorn everything into a single object for the sake of one less type just leads to more complexity in the long run.

I agree that the choice of Topology as a name is a bit confusing given the multi-use problem, but also its not clear what a better name would be and so a long a we are self consistent and precise in our meaning and usage then that's probably fine.

At the moment, with the current released API, from a user's perspective, Topology is redundant

From a user's perspective, perhaps not so 😉 I use it to store how many of each molecule do I have before I start thinking about applying parameters. Whether that be multiple different force fields, force fields with slightly perturbed values, or as a way to generate inputs for tools and parameterisation engines such as LigParGen or antechamber that Interchange does not yet or may never support.

I'd recommend trying to think of the design of these classes in terms of the single-responsibility principle.

There is a very clear division of which object is responsible for what, and what the information content of each object is. Interchange for example never needs to think about how a Topology is constructed and is abstracted from this, leaving an object whose whole purpose is devoted to this to handle it.

By merging Topology and Interchange, you start to open the door to Interchange being a does all stores all kind of object and the complexity balloons.

I think this is a big win for teachability and simplicity

I think this is where we probably disagree. The conceptual workflow of

1) define what's in my box e.g.

```
topology = Topology.from_pdb("my-system.pdb")` 
# or 
topology = Topology.from_molecules([water] * 1000 + [ligand] + [tyk2])
```

2) apply the force field parameters

```
interchange_ff_x = Interchange.from_smirnoff(topology, ff_x)
interchange_ff_y = Interchange.from_smirnoff(topology, ff_y)
````

3) export

```
interchange_x.to_gromacs(...)
```

seems simple and logical enough.

Having an all-encompassing type that you build up by calling several methods is also a common design pattern

Context is important though. If you view an Interchange as a force field applied to a topology, there's no reason to build it piecewise when you have all of the constituent components ahead of time, especially when all of those components are required to put the object into a valid state. By making the user provide the pieces up front, you are implicitly saying the construction of the object is complete. This is in contrast to something like a graph where it's valid to have a set of empty axes in isolation, then piecewise populate the graph with extra series.

Topology to make setting positions easier and make setting velocities possible

I don't think the topology currently stores positions? Confusingly, it allows it's nested molecules to store conformers, but I don't think this is the same as positions. For example if a molecule stores two conformers, does that mean two copies of a molecule should be in the final system?

Yoshanuikabundi commented 2 years ago

I completely agree that what I'm asking for is an increase in implementation complexity, and that's an extremely important consideration, and it would be beyond reasonable to reject this PR on those grounds. I also agree that the distinction between Topology and Interchange makes complete sense, I'll even call it elegant. I'm just going to make sure this idea gets a fair hearing :smile:

I think this is where we probably disagree.

I guess what I keep coming back to is comparing the third and fourth diagrams I prepared earlier. The third seems simpler to me, and yet it conveys more information.

This is in contrast to something like a graph where it's valid to have a set of empty axes in isolation, then piecewise populate the graph with extra series.

Sure but it's no less valid to have an Interchange with no parameters. It's about as useful as a set of empty axes. You could simulate neutrinos with it pretty accurately. Pretty sure you can even write it to a .gro or .pdb file, and I think even Amber's coordinate format. And you can visualize it with nglview, which you can't do with a topology (yet?).

I don't think the topology currently stores positions?

There's an existing convention, which is explicitly used by Interchange.from_smirnoff and is used heavily in examples, that conformers[0] is the molecule's position, and multiple copies of a molecule in a topology are represented by separate molecule objects. Other conformers are ignored. It's not space efficient and I would have no problem with switching to the opposite convention but there it is.

Yoshanuikabundi commented 2 years ago

One other thing to consider - there are methods that are important for consumers of Interchange that are implemented on Topology. For example, to simulate with OpenMM you need a system (interchange.to_openmm()) and a topology (interchange.topology.to_openmm()), the easiest way to iterate over atoms is in the topology, checking atom connectivity and matching smirks codes are both easier with topology, etc. But there's other stuff that has to be done on the interchange and changing the topology has no effect, and worse reading from the topology may be wrong - box vectors, positions, bonds, constraints, etc. The user has to know when they should look at Interchange and when to look at Topology for a method, and they also have to know which methods are safe to call on a topology when it's in an interchange. These seem like pretty massive footguns. If we made topology less user facing this would be simpler - we'd just have to implement all these methods on interchange and users would know not to bother with the topology directly.

As a side note, I'm not proposing removing or deprecating the current way of doing things for people who prefer it. Many of the methods on interchange would just hand off to the equivalent topology method, which would preserve the separation of concerns for developers.

SimonBoothroyd commented 2 years ago

Sure but it's no less valid to have an Interchange with no parameters

Ah sorry, to be clearer, it is if you view the Interchange as being a representation of a force field applied to a topology (which I'd advocate we should be thinking of it as), rather than as a foot in both pre- and post- parameterisation worlds kind of object.

there are methods that are important for consumers of Interchange that are implemented on Topology. For example, to simulate with OpenMM you need a system (interchange.to_openmm()) and a topology (interchange.topology.to_openmm())

This then falls into the fun world of API crafting! An Interchange object does not necessarily need to (and probably shouldn't) expose it's inner topology directly, and can instead selectively expose what it is safe to like you mention last by having stub methods that simply call to the inner topology 🙂 This would be in addition to exposing things like virtual sites which shouldn't (and after 0.11 won't) exist on a topology without a force field having been applied.

For the to_openmm method you mention for example, interchange.to_openmm() should likely return a tuple as a cleaner way to get both back e.g. openmm_topology, openmm_system = interchange.to_openmm(). Internally it could for example call the inner topologies to_openmm method to get the base topology, add any v-sites, and then return this along with the system.

j-wags commented 2 years ago

Thanks for bringing this up today in our check in @Yoshanuikabundi, and sorry I haven't chimed in here. To summarize some of the points we discussed:

We can keep thinking about exactly how to handle the footguns around Interchange.topology. Like, a naive user running Interchange.topology.add_molecule would be a big problem. Maybe we should cast all of the molecules in a Topology into FrozenMolecules when we make an Interchange, to seal off add_atom and add_bond. But users SHOULD be free to access Interchange.topology.atoms and get a rich view of the underlying chemical graphs. So something like "a topology where you can change the positions but not the number of molecules or the connectivity" would be good. We have a few options for how to implement this, but I think a monkey-patched "view" class could work well:

class TopologyView(Topology):
    def __init(self, other: Topology):
        # Have self.positions be a REFERENCE to the real positions
        self.positions = other.positions
        ...

    def add_molecule(*args, **kwargs):
        return NotImplementedError
    def add_constraint():
        return NotImplementedError

class Interchange:
    @property
    def topology():
        return TopologyView(self._topology)
mattwthompson commented 2 years ago

Sincere thanks for the proposal and discussion @Yoshanuikabundi but I don't think this is the direction we're going to be taking.

Some of these ideas will be/have been implemented, such as Interchange.to_openmm_topology which exists now and could be renamed/rejiggered in some way.