Closed Yoshanuikabundi closed 2 years ago
This refactor would also make it really clear what arguments belong on the parameterize methods. At the moment, box
is an argument to from_smirnoff
, but positions
isn't, despite both pieces of information potentially being present in the Topology.
By breaking the method up, users can easily understand that a from_topology
method takes all the information from the topology, and if they want to override something they need to use attribute assignment. No-one feels awkward about sometimes being able to do everything in one method call and sometimes not. There's a clear separation of concerns between topology information and force field information.
Thanks for the thorough write-up and the creative idea(s)! I think this or something like this could be added - it's not crazy - but there are a few rough edges and complexity pits that we'll need to work around (most of which should be addressed whether or not a proposal like this goes through, in all fairness).
I'll have to stew on this a few days before I give this the feedback it deserves - feel free to drop a reminder ping if I haven't gotten to this by Tuesday (Australia time).
"It's Tuesday Australia time" reminder ping :)
I definitely see that this is where the rubber hits the road in your so-called crusade! I can't quite settle on a yes or no answer, and I don't even have that many questions for you to answer, but hopefully this wall of text can serve as a forward iteration. The examples you used were super instructive and I originally intended to respond to them one-by-one, but I think I covered everything elsewhere. I tried to be concise, but .... uhh, didn't pull that off.
First, some things I like. The verbage of .from_smirnoff
makes total sense to me, but I can see how it obfuscates how it actually does several things, only one of which is strictly "SMIRNOFF" (the parameterization). foo = Interchange; foo.topology = my_topology; foo.box = my_box_vectors; foo.positions = my_positions
faithfully processes most of the arguments of from_smirnoff
but doesn't require any SMIRNOFF data.
Second, some parts of this design are features we intend to add as some point, just maybe with a different look. Topology.from_pdb
is something I've been pushing for (and I recall @j-wags also sees it as important) but, you know, we need to get Molecule.from_pdb
finished up first. Interchange.from_mdtraj
or something like that I'd like to see happen, probably made more clear once the GROMACS and Amber importers are better. Also, from_smirnoff
should probably take in a positions
argument, optionally, similar to how it can take box vectors information there.
The danger of this design is that it opens the door to multi-parameterization and/or post-parameterization modification of the topology.
Multi-parameterization: Now, there's an assumption that parameterization will only happen once. I guess this is implicit in the process of Topology
-> Interchange
; re-parameterizing an Interchange
that already exists doesn't really make sense in the current design, and that's even without explicitly tracking a "has parameterization happened to me?" sort of flag. Something like, with the proposed API:
foo = Interchange()
foo.topology = topology
...
foo.apply_smirnoff(smirnoff99_frosst)
foo.apply_smirnoff(rosemary)
would probably happen. What's the behavior here - simply re-parameterize everything? Probably not dangerous, but a little weird, and not something a user would stumble into doing with the current one-shot sort of design.
Post-parameterization modification:
topology = Molecule.from_smiles("C=C").to_topology()
foo = Interchange.from_smirnoff(force_field, topology)
foo.topology.molecules[0].atoms[0].add_atom(1, 0, False)
foo.topology.molecules[0].atoms[1].add_atom(1, 0, False)
now you can see my parameters as applied to ethene are stored on my ethane topology ... and probably other bits are invalid since the hydrogens don't have any physics parameters. Now, an astute observer would notice this is possible with the current API, and they'd be right. But I think it's much less likely a user would do this by accident if we nudge them into completely preparing their topology before anything hits Interchange world. Certainly we could work in the logic to update state for many or all of the cases we can think of, but it's the sort of thing that just explodes in complexity in a way that makes me weary of even taking a small step in that direction. One endgame is supporting free energy methods (i.e. alchemical modifications and dual topologies) but exactly how to do that safely without just re-parameterizing everything is tricky (and maybe not worth it, I don't know). Again, the current design is probably going to have an uphill battle to support it; if we can demonstrate (via rhetoric or prototyping) that this proposal makes such features easier while accounting for added complexity, I'd be more in favor of it.
I think the best way to tie these concerns together is the example of combining a small molecule force field with a protein force field. Right now, as you're aware, that takes a bunch of steps and I can't easily see a way for any to be squished together:
protein = Molecule.from_pdb(...)
protein_force_field = ForceField(...)
ligand = Molecule.from_sdf(...)
small_molecule_force_field = ForceField(...)
# not needing to do `.to_topology()` is a nice improvement!
protein_interchange = Interchange.from_smirnoff(protein_force_field, [protein])
ligand_interchange = Interchange.from_smirnoff(small_molecule_force_field, [ligand])
complex = protein_interchange + ligand_interchange
# maybe repeat for water, ions, other components
None of these steps can really be skipped in the current design. This proposal would, I think, result in something like
protein_force_field = ForceField(...)
small_molecule_force_field = ForceField(...)
complex = Interchange()
complex.add_pdb("protein.pdb")
complex.add_sdf("ligand.sdf")
# Let's assume that positions and box vectors are automagically set by now
complex.parametrize_smirnoff(protein_force_field)
complex.parametrize_smirnoff(small_molecule_force_field)
(For the sake of argument, let's ignore the future wonder that is Rosemary and a line of self-consistent biopolymer + small molecule force fields - users will always want to i.e. combine Sage and ff14SB.) How would this work? I suppose parametrize_smirnoff
would need to know which parts of the "topology" it's allowed to access? We don't want AM1-BCC to even start running on a protein, for example. This sort of thing would also happen with i.e. using MyFavoriteWaterModel instead of the TIP3P that ships with Sage.
This would mean users could prepare their system however they like, bring the system into Interchange with the appropriate method, and then parameterize it.
An alternative is to keep things how they are now, where these sorts of construction methods are implemented on Topology and from_smirnoff plays the role of parameterizing. This means users have to keep Topology in their mental model.
I think Topology
serves a practical role for us here in that it provides a sort of buffer between toolkit stuff and what happens here. Molecule parsing, manipulation, etc. can happen to the user's content, but things need to be tied up into a single object before Interchange can use it. And from the other perspective, Interchange can trust that toolkit shenanigans will always have a clean interface coming out of it. I certainly see the value add of removing this Topology
step from workflows altogether when considering the user's perspective ... all this being said, maybe we should just say "the toolkit does molecule stuff," rip out openff/toolkit/topology/topology.py
, and revive it here under a different name(s) ... 🤔
As a side note, ...
It doesn't store any of the force field per se, it just stores the bits as applied to the topology it's given. In less glib terms - no, just a portion of it (unless you somehow manufactured a pile of molecules that happen to exercise every parameter in a SMIRNOFF force field - it would take a mad scientist to pull that off for our force field lines, I think).
I think I'm coming around to Topology, but here are my thoughts:
Multi-parameterization - I agree, this is something we would have to guard against. I think a policy of "apply_smirnoff()
raises an exception if any potential handlers are already defined" would be appropriate here - it's consistent with the OpenFF policy of protecting provenance. Interchange just shouldn't let you apply a force field incorrectly (unless you really work for it). If people do want to re-parameterize, we could add a clear_parameters
method that removes all potential handlers?
Post-parameterization modification - We could just hash the topology when parameters are added and then refuse to output anything if that hash doesn't match. Or better yet, make the topology immutable once parameters have been added, if Pydantic makes that possible (maybe a validator that compares a hash?). Having the topology be accessible and mutable as part of the Interchange is dangerous in other ways as well - if people change its box vectors, nothing happens to the interchange. If people change conformer[0], nothing happens. If people swap out the topology for a completely different one with at least as many atoms, they will be VERY confused. We very much encourage people to work on the topology directly by not, for example, providing a to_openmm_topology
method, so I think we should consider tightening this down in any case.
For alchemic mutation... With my original proposal it would probably have to look like this:
interchange = Interchange.from_topology(topology_a)
interchange.add_alchemic_topology(topology_b) # Would need this for every from method
interchange.apply_smirnoff(...)
for i in range(21):
interchange.lambda = i * 0.05
interchange.to_gromacs(f"intrcg_l{interchange.lambda}.top")
vs, for the current approach, something that looks like:
interchange = Interchange.from_smirnoff(
force_field = ...,
topology = topology_a,
topology_b = topology_b,
...
)
for i in range(21):
interchange.lambda = i * 0.05
interchange.to_gromacs(f"intrcg_l{interchange.lambda}.top")
But an alternative would be to have no from
methods, just add
methods, that take an argument:
interchange = Interchange()
interchange.add_topology(topology_a)
interchange.add_topology(topology_b, alchemic_b_state=True)
interchange.apply_smirnoff(...)
for i in range(21):
interchange.lambda = i * 0.05
interchange.to_gromacs(f"intrcg_l{interchange.lambda}.top")
All the add
methods could process the input in the same way, and then at the last second apply the new Topology
to the appropriate attribute depending on the argument, so adding free energy support would be a small enough change.
I think the best way to tie these concerns together is the example of combining a small molecule force field with a protein force field.
I think I prefer the current way of doing this honestly. It forces the user to both (a) declare which parts of the system gets each force field and (b) cleanly separate the system into parts that are each parameterized with a single force field. In principle, Interchange could even raise an error if the two force fields used incompatible non bonded models (or whatever). So my proposal would look like this:
protein = Interchange()
protein.add_pdb(...)
protein.apply_smirnoff(protein_ff)
ligand = Interchange()
ligand.add_sdf(...)
ligand.apply_smirnoff(ligand_ff)
complex = protein + ligand
# maybe repeat for water, ions, other components
I think Topology serves a practical role for us here in that it provides a sort of buffer between toolkit stuff and what happens here. Molecule parsing, manipulation, etc. can happen to the user's content, but things need to be tied up into a single object before Interchange can use it. And from the other perspective, Interchange can trust that toolkit shenanigans will always have a clean interface coming out of it. I certainly see the value add of removing this Topology step from workflows altogether when considering the user's perspective ... all this being said, maybe we should just say "the toolkit does molecule stuff," rip out openff/toolkit/topology/topology.py, and revive it here under a different name(s)
Hmmm
I think that's a really good point. In strictly typed languages there's this idea of a "typestate", which is where the state of a program is encoded in its type. This is good because you can control what methods can be called in each state. In Python its common for a single type to encode multiple states, but Topology -> Interchange is (could be) a really clean typestate: an Interchange is a Topology with parameters. Topology has all the system construction methods, and Interchange has all the output methods, and the only way to get an Interchange is to consume both topology and parameter information. Once you have an Interchange, you can't change the Topology any more, but you gain the ability to export it in a million different formats.
I think to get the most out of this pattern though, we should kinda go all in on it. Positions will usually come in at the Topology stage, so Topology should get either a positions array or methods that let you access all of its conformer[0]s through a single array (Or both - constructing a Topology could create a big array and set all the molecule's conformer[0] to the appropriate view into it). Interchange shouldn't have its own box and positions attributes but should defer to the stored topology, if only to avoid user confusion. And we should absolutely do something to discourage/protect against mutation of the topology once the Interchange is created.
I do stand by the idea that folding Topology into Interchange would make it easier to teach and simpler for users to think about. Typestates are rare in Python, and it's reasonably common to build up a single object in stages. But having a clean separation of concerns between Topology and Interchange would also make this easier.
I guess the current situation is this:
Which I'm trying to teach as though it were this (where will the arrow for multi-component PDBs go?):
And I would like either this (fewer types, Interchange has many responsibilities):
Or, now that you mention it, this (Topology does system construction, Interchange just does parameterization, typestate pattern, extra Topology type):
(Please send me a reminder ping if I haven't responded again by the end of your work week - Friday-ish Australia time, I think? - and I'll have something for you when you start your next work week!)
+1 for @mattwthompson's comments on this.
I'm not sure I follow the arguments for changing to add_topology
and apply_smirnoff
, and to me they look quite confusing (especially, what if you add topology then apply smirnoff then add a new topology...) and unnecessarily verbose. I find from_smirnoff
a lot cleaner and clearer, and this is a very common design pattern.
In general the way I think about Interchange
aligns with your last figure above, and really that
Molecule
object stores the graph representation of a molecule.Topology
object is a collection of molecules - namely, how many of copies of each molecule do I have in my simulation box.Interchange
object is a Topology
that has been parameterised with a given ForceField
. It for the most part should look like a Topology
only now tagged with assigned parameter information. To me an Interchange
object that is missing either a Topology
or a ForceField
is in an invalid state and doesn't make much sense.
I may be missing something here though?
I'd also strongly advocate for not having Interchange
directly expose alchemical modification methods. The level of complexity this introduces is large, especially given that one of the goals of interchange is exporting to multiple packages which each handle such calculations very differently, and how the alchemical modification should be made will be quite method / code dependant.
@SimonBoothroyd The opportunity is for Interchange
to replace Topology
, so that we have fewer public types. I think the definition of "topology" is a bit murky - in the Toolkit it's molecules and positions and box vectors, in GROMACS (and Amber?) it's the entire parameterized system but not positions or box vectors, in OpenMM it's atoms and bonds and box vectors, in MDTraj it's atoms and bonds... it's just one of those words that means a different thing to everyone who reads it, and its potential meanings seem to encompass any conceivable subset of values needed for an MD simulation. So with my documentation/teaching hat on, I always avoid using it because there's always a better, more specific word or phrase, except when I'm forced to by an API.
The model I'm proposing is to expand Interchange
to encompass the roles of Topology
to simplify the API for users. At the moment, with the current released API, from a user's perspective, Topology
is redundant. There are no construction methods on Topology
that don't also exist on Molecule
, and Interchange.from_smirnoff
can take a list of Molecule
objects instead of a Topology
. That changes with the biopolymers release, because Topology gets from_pdb
, so we kinda have to have this conversation now. If you look at older examples, they create a Topology with from_molecules
and just turn it straight into an Interchange. New API methods that construct systems of multiple molecules could live on Interchange
instead of Topology
and users can ignore the existence of Topology
. The workflow then has 3 types instead of 4 (Molecule
, ForceField
, Interchange
), which is actually a pretty huge win for simplicity on its own, even when you ignore the massive win of not using the word "topology" for anything.
I think this lines up with the original design of the Toolkit, just with the rename of Topology
to Interchange
. They both bring all the information together and then output it to another format (originally an OpenMM system) for simulation.
I think this is a big win for teachability and simplicity, but I accept that it has a cost in terms of encapsulation and scope for Interchange. Whichever way we go, I think we should do it conscientiously and make the resulting API as good as we can - which I think if we decide to keep both types means changes to Topology
to make setting positions easier and make setting velocities possible, and changes to Interchange
so box vectors and positions aren't duplicated in the topology within the interchange, so that we actually get the mental model implied by the last diagram.
I also think, no matter what we do, we should try to freeze the topology once its been parametrized. It will be so easy for users to change a topology thinking the parameters will update automatically. So add topology + apply smirnoff + add topology = exception(Parameters already applied). Having an all-encompassing type that you build up by calling several methods is also a common design pattern (see MatPlotLib, not that we should be taking API lessons from them...). I don't agree that it's verbose; it just moves the topology construction lines from Topology
to Interchange
. If anything, I would expect it to be less verbose; it saves the line that passes the topology to the interchange.
@mattwthompson Ping!
I think I follow what you're saying, but I I'm not sure that I agree and as a user it would get a -1 vote from me. Ultimately trying to shoehorn everything into a single object for the sake of one less type just leads to more complexity in the long run.
I agree that the choice of Topology
as a name is a bit confusing given the multi-use problem, but also its not clear what a better name would be and so a long a we are self consistent and precise in our meaning and usage then that's probably fine.
At the moment, with the current released API, from a user's perspective, Topology is redundant
From a user's perspective, perhaps not so 😉 I use it to store how many of each molecule do I have before I start thinking about applying parameters. Whether that be multiple different force fields, force fields with slightly perturbed values, or as a way to generate inputs for tools and parameterisation engines such as LigParGen or antechamber that Interchange does not yet or may never support.
I'd recommend trying to think of the design of these classes in terms of the single-responsibility principle.
Molecule
: the molecule object encapsulates the graph representation of a single molecule. It offers API points to parse that graph representation from multiple sources such as SMILES, MOL2, etc, as well as to export that representation to other formats. This parsing should in principle be limited to loading a single molecule, optionally in multiple conformers.
Topology
: an object that defines a collection of molecules each in a single conformer. It offers API points to load and store a collection of molecules from a multi-molecule file format such as PDB, a multi-molecule SMILES, or from a user constructed list, as well as to export that representation to other formats. On a Topology
, the from_xxx
method has different meaning than Molecule.from_xxx
as it is geared towards loading a collection of molecules, rather than a single one, with each molecule having a single conformer.
In short, the responsibility of Topology
is to store and manipulate a simulation box of molecules.
Interchange
: an object that stores which set of force field parameters have been applied to which elements in a topology, and offers API points to parse and export that information to different simulations.
There is a very clear division of which object is responsible for what, and what the information content of each object is. Interchange for example never needs to think about how a Topology
is constructed and is abstracted from this, leaving an object whose whole purpose is devoted to this to handle it.
By merging Topology
and Interchange
, you start to open the door to Interchange
being a does all stores all kind of object and the complexity balloons.
I think this is a big win for teachability and simplicity
I think this is where we probably disagree. The conceptual workflow of
1) define what's in my box e.g.
```
topology = Topology.from_pdb("my-system.pdb")`
# or
topology = Topology.from_molecules([water] * 1000 + [ligand] + [tyk2])
```
2) apply the force field parameters
```
interchange_ff_x = Interchange.from_smirnoff(topology, ff_x)
interchange_ff_y = Interchange.from_smirnoff(topology, ff_y)
````
3) export
```
interchange_x.to_gromacs(...)
```
seems simple and logical enough.
Having an all-encompassing type that you build up by calling several methods is also a common design pattern
Context is important though. If you view an Interchange
as a force field applied to a topology, there's no reason to build it piecewise when you have all of the constituent components ahead of time, especially when all of those components are required to put the object into a valid state. By making the user provide the pieces up front, you are implicitly saying the construction of the object is complete. This is in contrast to something like a graph where it's valid to have a set of empty axes in isolation, then piecewise populate the graph with extra series.
Topology to make setting positions easier and make setting velocities possible
I don't think the topology currently stores positions? Confusingly, it allows it's nested molecules to store conformers, but I don't think this is the same as positions. For example if a molecule stores two conformers, does that mean two copies of a molecule should be in the final system?
I completely agree that what I'm asking for is an increase in implementation complexity, and that's an extremely important consideration, and it would be beyond reasonable to reject this PR on those grounds. I also agree that the distinction between Topology and Interchange makes complete sense, I'll even call it elegant. I'm just going to make sure this idea gets a fair hearing :smile:
I think this is where we probably disagree.
I guess what I keep coming back to is comparing the third and fourth diagrams I prepared earlier. The third seems simpler to me, and yet it conveys more information.
This is in contrast to something like a graph where it's valid to have a set of empty axes in isolation, then piecewise populate the graph with extra series.
Sure but it's no less valid to have an Interchange with no parameters. It's about as useful as a set of empty axes. You could simulate neutrinos with it pretty accurately. Pretty sure you can even write it to a .gro or .pdb file, and I think even Amber's coordinate format. And you can visualize it with nglview, which you can't do with a topology (yet?).
I don't think the topology currently stores positions?
There's an existing convention, which is explicitly used by Interchange.from_smirnoff
and is used heavily in examples, that conformers[0]
is the molecule's position, and multiple copies of a molecule in a topology are represented by separate molecule objects. Other conformers are ignored. It's not space efficient and I would have no problem with switching to the opposite convention but there it is.
One other thing to consider - there are methods that are important for consumers of Interchange that are implemented on Topology. For example, to simulate with OpenMM you need a system (interchange.to_openmm()
) and a topology (interchange.topology.to_openmm()
), the easiest way to iterate over atoms is in the topology, checking atom connectivity and matching smirks codes are both easier with topology, etc. But there's other stuff that has to be done on the interchange and changing the topology has no effect, and worse reading from the topology may be wrong - box vectors, positions, bonds, constraints, etc. The user has to know when they should look at Interchange and when to look at Topology for a method, and they also have to know which methods are safe to call on a topology when it's in an interchange. These seem like pretty massive footguns. If we made topology less user facing this would be simpler - we'd just have to implement all these methods on interchange and users would know not to bother with the topology directly.
As a side note, I'm not proposing removing or deprecating the current way of doing things for people who prefer it. Many of the methods on interchange would just hand off to the equivalent topology method, which would preserve the separation of concerns for developers.
Sure but it's no less valid to have an Interchange with no parameters
Ah sorry, to be clearer, it is if you view the Interchange
as being a representation of a force field applied to a topology (which I'd advocate we should be thinking of it as), rather than as a foot in both pre- and post- parameterisation worlds kind of object.
there are methods that are important for consumers of Interchange that are implemented on Topology. For example, to simulate with OpenMM you need a system (interchange.to_openmm()) and a topology (interchange.topology.to_openmm())
This then falls into the fun world of API crafting! An Interchange
object does not necessarily need to (and probably shouldn't) expose it's inner topology directly, and can instead selectively expose what it is safe to like you mention last by having stub methods that simply call to the inner topology 🙂 This would be in addition to exposing things like virtual sites which shouldn't (and after 0.11 won't) exist on a topology without a force field having been applied.
For the to_openmm
method you mention for example, interchange.to_openmm()
should likely return a tuple as a cleaner way to get both back e.g. openmm_topology, openmm_system = interchange.to_openmm()
. Internally it could for example call the inner topologies to_openmm
method to get the base topology, add any v-sites, and then return this along with the system.
Thanks for bringing this up today in our check in @Yoshanuikabundi, and sorry I haven't chimed in here. To summarize some of the points we discussed:
Interchange.add_pdb
.. Is it bad chemistry, mislabeled residues, missing FF parameters? That would be really tough to debug)Topology.velocities
Topology.visualize
Topology.from_pdb
could be a good idea, but when I first tried to make it I got overwhelmed by the possible permutations of topologies containing mixes of polymers and small molecules, and the way to communicate partial loading successes to users. So maybe we save that for a later release, once we have the Molecule.from_pdb
behavior pinned down. Interchange.to_openmm_topology()
, and possibly other convenience methods too. (I realize now that this is a bad example because the above method will be the ONLY way to get an OpenMM Topology with vsites, so it's a necessity, rather than a convenience). But in general there are some "VIP" methods that we can offer at a higher level rather than making users root around in Interchange.topology.x_method
. Topology.add_particle
(yikes)We can keep thinking about exactly how to handle the footguns around Interchange.topology
. Like, a naive user running Interchange.topology.add_molecule
would be a big problem. Maybe we should cast all of the molecules in a Topology into FrozenMolecules when we make an Interchange, to seal off add_atom
and add_bond
. But users SHOULD be free to access Interchange.topology.atoms
and get a rich view of the underlying chemical graphs. So something like "a topology where you can change the positions but not the number of molecules or the connectivity" would be good. We have a few options for how to implement this, but I think a monkey-patched "view" class could work well:
class TopologyView(Topology):
def __init(self, other: Topology):
# Have self.positions be a REFERENCE to the real positions
self.positions = other.positions
...
def add_molecule(*args, **kwargs):
return NotImplementedError
def add_constraint():
return NotImplementedError
class Interchange:
@property
def topology():
return TopologyView(self._topology)
Sincere thanks for the proposal and discussion @Yoshanuikabundi but I don't think this is the direction we're going to be taking.
Some of these ideas will be/have been implemented, such as Interchange.to_openmm_topology
which exists now and could be renamed/rejiggered in some way.
This is part of my crusade against API consumption of
Topology
:stuck_out_tongue:Would it be crazy to refactor the parts of
from_smirnoff
that apply the force field out into anparametrize_smirnoff
method that everyone can use? I'm imagining it would open up possibilities of adding morefrom
methods that include topology information but not a forcefield, and then applying the force field in a separate step. For example:Alongside
Alongside
And
And
This would mean users could prepare their system however they like, bring the system into Interchange with the appropriate method, and then parameterize it.
An alternative is to keep things how they are now, where these sorts of construction methods are implemented on
Topology
andfrom_smirnoff
plays the role of parameterizing. This means users have to keepTopology
in their mental model.An alternative method name would be
add_smirnoff
,add_foyer
etc.As a side note, does
from_smirnoff
store the entire force field in the Interchange or just the potential handlers that are used in the topology? If it stores the entire force field, you could also provide similar capabilities in the form ofadd_pdb
,add_mdtraj
,add_molecule
etc and make thetopology
argument offrom_smirnoff
optional.