Yoshanuikabundi commented 2 years ago

TODOS:

[x] Rearrange and expand user guide (#428)
[ ] Polish examples

Explanation

Hey @mattwthompson I've just gone through the protein-ligand example from #328 - It looks really good! it's not too long and everything's well explained. I'll organise it with headings and might rephrase or re-organise things but my big notes are about the API.

First, I love that the + operator works with Interchange objects!!!

Secondly, I've talked to @j-wags about this a little, but Interchange seems to me to basically be Topology on steroids (plus an FF and positions). In this example in particular, the Topology is only really constructed to write a file and construct the Interchange. I wonder if allowing Interchange.from_smirnoff() to accept a list of Molecule objects as an alternative to a Topology could simplify the API?

Even if it uses Topology under the hood, my suspicion is that this would replace most usages of Topology by users. I think this would be a huge win for teaching, because Topology is a pretty nebulous term in MD world, and the type at the moment is really the only intermediate type. Without Topology, the main workflow for applying a force field takes Molecule objects, a ForceField object, a position array, and a boxvectors array, and combines them into an Interchange object. This is a beautiful workflow with no intermediate steps where each input corresponds exactly to a comp chem concept and half of them are numpy arrays that everyone has used a million times before.

So my suggestion here is:

 @classmethod
 def from_smirnoff(
     cls,
     force_field: ForceField,
-    topology: Topology,
+    topology: Union[Topology, List[Molecule]],
     box=None,
 ) -> "Interchange":

(And being able to specify positions in this method probably makes sense too, though it wouldn't come up in this example)

Would you accept PRs for this change?

Finally, my energies at the end were a bit strange:

OpenMM Energies:

Bond:               11279.70703125 kJ / mol
Angle:              5386.8837890625 kJ / mol
Torsion:            7612.54052734375 kJ / mol
Nonbonded:          None
vdW:                5095140.218750465 kJ / mol
Electrostatics:     -20698.04012266005 kJ / mol

AMBER Energies:

Bond:               2695.9208 kcalories_per_mole
Angle:              1287.4972 kcalories_per_mole
Torsion:            1819.4409 kcalories_per_mole
Nonbonded:          None
vdW:                5095170.807108 kjoule_per_mole
Electrostatics:     -80669.06598800002 kjoule_per_mole

If my maths is right these are within 30 kJ/mol and <1% of each other (which seems fine) except the electrostatics, which are out by a factor of 4. Do you know what that's about? If its expected we should probably make a note of it.

For the rest of the docs, I'm going to just look at re-arranging the stuff there so that the user guide starts by describing what Interchange is for: (a) Creating Interchange objects (b) writing input files for MD engines (c) tweaking force fields without having to re-apply them all the time (this is missing atm but Jeff said it was a big deal. Some notes on how this is done would be cool but I'll see how far I can get on my own). I'll try to organise the remaining info into a theory page like BespokeFit, and I'll make sure the API docs are using the new stuff in Sphinx 4.4.

mattwthompson commented 2 years ago

Would you accept PRs for this change?

I've thought about this and gone back and forth. At the moment I think I would - under the assumptions that, at least,

blindly calling Topology.from_molecules will work and the atom ordering and positions will line up
box vectors and positions are also provided, otherwise later exports will raise exceptions.

Below the line are some other thoughts I jotted down before caffeine. Nothing of those topics required for you to move forward on this. I think that that one-line diff supported with some test cases and updated documentation should be enough to put my concerns aside.

Do you know what [electrostatics differences] are about?

No clue - thanks for bringing it to our attention.

For the rest of the docs

Looks like a great plan to me. I look forward to seeing what you produce!

I'm not sure to what extent we want to move in the direction of allowing most system prep to happen in one step. In the protein-ligand example, it was both an easy starting point in that we didn't need to modify positions but still required some manipulating the different components through the toolkit. It would have been much simpler (5 or so fewer cells) if doing in vacuo but getting the waters prepared (and in other workflows ions, etc.) is still a hassle.

I want a Topology.from_pdb method that intelligently handles some known use cases such as a PDB with water and/or a docked ligand. I don't really think this changes the impact of your proposed change (from_smirnoff(..., topology=Topology.from_pdb("complex.pdb")), maybe somehow inferring positions) but it's relevant to these sorts of design questions.

Is this a stepping stone to accepting a list of files? We are guaranteed to have users that just want to throw a PDB file and a force field at the wall and get something back. This is also probably how Interchange will interface with a CLI tool we build at some point in the future. I see the value, especially to new users, in reducing the number of steps needed to go from molecule files to simulations. But I worry about adding so much automagic that we end up building a black box.

Most workflows with the current OpenFF stack and data will ask for PME electrostatics to be used but not specify box vectors; currently Interchange raises an exception but this will be disruptive to users who didn't consider specifying the box vectors (either currently by setting them on the topology now or not passing them through Interchange.from_smirnoff). I have a proposal open that I believe would smooth this over; until then this is a kludgy detail.

j-wags commented 2 years ago

I've thought about this and gone back and forth. At the moment I think I would - under the assumptions that, at least,
blindly calling Topology.from_molecules will work and the atom ordering and positions will line up
box vectors and positions are also provided, otherwise later exports will raise exceptions.

Yeah, I just talked with @Yoshanuikabundi about this in our call, and I don't see any obvious dangers to simply passing a list of molecules through to Topology.from_molecules, and I think that users will really appreciate the simplicity of this change.

Yoshanuikabundi commented 2 years ago

Whoops. Turns out I have strong feelings about this.

None of this is blocking or possibly even relevant, but I think API design is important and so here are some thoughts.

I'm not sure to what extent we want to move in the direction of allowing most system prep to happen in one step.

I think its reasonable to expect users to prepare their own systems. We should make it easy to add a box of appropriate shape and dimensions and to solvate stuff, but not do it automatically. I think there's a difference between doing the thinking for the user and just moving a step of busy work from outside a method to inside it. Downstream users can study, provide, and be responsible for simulation protocols using our tools without us blessing any particular approach.

I want a Topology.from_pdb method that intelligently handles some known use cases such as a PDB with water and/or a docked ligand

My take on this would be that Topology.from_pdb should just take the pdb and put it in the topology. Positions from the PDB go into each Molecule's first conformation (and possibly altlocs go into subsequent conformations), box vectors go in to the topology, etc. If you Topology.from_pdb("crystal_structure.pdb") you get a topology for the crystal. If you want to simulate the crystal and the PDB's unit cell isn't MD-happy, we provide a method to generate a simulatable cell by copying, rotating and translating the contents. If you want to simulate something other than the crystal, its up to you to call Topology.solvate() (or Interchange.solvate()). Topology/Interchange.solvate() could look like:

class Solvent:
    """Describes a solvent mixture (including densities), constructed via builder pattern"""

    @classmethod
    def saline(cls, concentration: Quantity) -> "Solvent":
        """Water at STP with `concentration` NaCl"""
        ...

    ...

def solvate(
    solute: Topology,
    solvent: Solvent,
    box: Union[Tuple[Quantity, str], ArrayQuantity],
    *,
    neutralize: bool = True,
    neutralize_ions: Optional[List[Molecule]] = None,
):
    """Place the solute in a periodic box filled with solvent

    Args:
        solute: The topology to solvate.
        solvent: The solvent to use.
        box: A 3x3 array with dimensions of distance defining the box vectors,
            or a tuple defining a buffer distance and box shape. The buffer 
            distance is the distance between periodic images of the volume swept 
            out by the tumbling solute. The box shape is either "Cubic" or 
            "RhombicDodecahedron".
        neutralize: If `True`, modify the solvent by adding or removing ions 
            until the output topology is neutrally charged.
        neutralize_ions: A list of ions that may be removed from or added to the
            solvent in order to neutralize it. If None, defaults to the most
            abundant ion in the solvent, or ions if there is a tie. Set to `[]` 
            to raise an error if the output is charged. Does nothing if 
            `neutralize == False`.

    Examples:
        Add 200 mM saline in a rhombic dodecahedral box with a buffer of 2.0 nm.
        As a default, this example also neutralizes the system by adding or
        removing sodium and chloride ions, which are equally abundant in a
        saline solvent.

        >>> solvate(
        ...     protein_ligand_topology,
        ...     Solvent.saline(0.2 * unit.Molar),
        ...     (2.0 * unit.nm, "RhombicDodecahedron"),
        ... )
    """
    ...

Solvent handles ions as well as solvent mixtures so once you've called solvate you have a topology with all the components ready to simulate. Could have a similar method for adding a lipid bilayer (which would be called before solvate, with the bilayer forming part of the solute), and neutralize could be its own method. This is similar to an API I've used for years to automate my own simulations and it's quite nice. I think it's a good balance between magic (nothing happens without you directly asking for it) and busy work (solvate does all the annoying work of calculating numbers of different solvent components).

EDIT: A nice way to think about this is that solvate is powerful, not magical. It can do a lot, but it's got no surprises. To push this principle really far, neutralize could be a separate method, it might be surprising to some people.

Is this a stepping stone to accepting a list of files? ... But I worry about adding so much automagic that we end up building a black box.

I agree this seems like too much magic. Any magic we add we have to be responsible for. I think the trick is powerful APIs that allow the user to do what they want easily, without doing it for them. If the first sentence in the docstring doesn't make clear to an expert in the field what a function does, maybe that function does too many things. It should be up to the user to decide what they want to do with a system.

currently Interchange raises an exception but this will be disruptive to users who didn't consider specifying the box vectors

I think this is good! I would much rather get an exception that says I forgot to define my periodicity than find out after I finish my simulation that my system had evaporated in the first nanosecond. The solvate design above assumes the user wants a periodic system (solvents are pretty useless otherwise) exactly so that it can warn the user if they've made a mistake. I think APIs should set clear expectations and insist that the user be explicit about what they mean, rather than try to guess.

mattwthompson commented 2 years ago

Thanks for putting so much thought and time into writing this up. I'm generally in alignment with and in favor of everything you've described.

I'm possibly less optimistic than you that a balance can be had between automagically saving users from busy work and enabling power users to do crazy things. It's hard for me to speak in broad terms, and I think we'll just have to hash it out case by case. One of the only non-negotiables from my perspective is that the high-level methods doing many things in one step much be flexible enough that users can rip them apart, modify them, etc., or even bypass them altogether. I don't think we're remotely close to that here and I want to ensure we don't take any steps in that direction. I think in general your solvation mock-up is the sort of thing I have in mind - for relatively vanilla use cases, it should save users some grief in preparing common systems, but users with more exotic system preparation idea are still fully capable of preparing their own topologies how they wish.

My take on this would be that Topology.from_pdb should ...

I agree with your perspective here. This particular feature is probably a little farther out on the roadmap but I hope it will smoothen some common user stories (again, while not prohibiting other routes in system preparation).

mattwthompson commented 2 years ago

I think this is resolved now - feel free to open new issues to advance discussions that stemmed from the original topics

openforcefield / openff-interchange

Docs updates #408

TODOS:

Explanation