openmm / openmm-cookbook

The OpenMM Cookbook and Tutorials
35 stars 10 forks source link

Add Enhanced sampling simulation cookbooks #21

Open sukritsingh opened 1 year ago

sukritsingh commented 1 year ago

Here's an initial draft of 4 cookbooks for running enhanced sampling methods (AMD, GAMD, MetaD, and Umbrella Sampling) using OpenMM. Each of them are located in a subdirectory inside cookbook/Enhanced sampling methods. A couple of sample log. files are included (which are used to set up hyper parameters for these simulations).

Any and all feedback is welcome! The notebook code itself runs but it may need cleanup/additional documentation.

(This PR is a re-opening after PR #16 ) to deal with merge conflicts.

This submission is response to a larger effort to add more tutorials to the OpenMM-cookbook (#12)

peastman commented 1 year ago

These aren't really cookbook entries as such. Take a look at the existing entries to get a sense for how they're written. A cookbook entry should be short, to the point, and completely self contained. It should start by clearly stating the problem, then present the minimal amount of code needed to solve it, while clearly explaining the code so the user can adapt it to their own problem.

These notebooks are more like tutorials based on their size and complexity. And they don't have much explanation, just large blocks of code.

sukritsingh commented 1 year ago

A cookbook entry should be short, to the point, and completely self contained. [...] And they don't have much explanation, just large blocks of code.

These notebooks are self-contained and run using the villin.pdb file in the main cookbook directory. It just requires a system/simulation object to be set up (and CVForces defined).

I was basing it off a discussion in another thread where you mentioned:

The cookbook is a reference. It contains small, self contained pieces of code for performing specific tasks. They can assume you already know what you want to do, and you're just trying to find out how to do it.

So I wrote it assuming someone would know what the theory behind any of these enhanced sampling methods are and just need the code to run them on any system of their choice. I figure that the "tutorials" would have more in depth discussion, but happy to add more documentation/problem statements if that would be preferred.

Admittedly these aren't super small notebooks, but I think they are still descriptively minimal examples? Partially because setting up and running customIntegrators can just take up a few cells (especially given that it starts from just the villin.pdb). Happy to also merge cells together to improve clarity if that's more inline with cookbooks.

Either way I definitely want to document these a lot better and flesh out the details more but I just wanted to get something out there that "runs" in case y'all had any thoughts. Appreciate this discussion!

These notebooks are more like tutorials based on their size and complexity.

I'm also happy to migrate these to the tutorials directly instead rather than having them as cookbooks, if that's the preference here as well!

sukritsingh commented 1 year ago

These comments are super helpful - thanks so much! In looking through the examples, I didn't appreciate that it didn't have to be a proper "robust" simulation code so I included everything (and it's from an older "chicken scratch" notebook).

I didn't go through the others, because very similar comments apply to them too.

Indeed - initially this was actually one bigger notebook with all four separate approaches applying to the same system.

Thanks again! I'll work on resolving the comments you made, and apply them to the other notebooks as well. I'll try to ensure that these are much more lightweight examples.

sukritsingh commented 1 year ago

@peastman Would you be willing to take a look at these cookbooks now and let me know if you think they've been sufficiently cleaned up? I've cleaned up all four notebooks to ensure it's just as minimal a simulation as possible.

For each notebook, I also tried to document how one could save outputs (without using reporters) in a way that I like to do it, but I wasn't sure it added too much complexity or not. It doesn't involve too much overhead and it's a nice way to only save what you need to track the sampling of your simulation or generate the free energy projections. Any and all comments are welcome!

sukritsingh commented 1 year ago

@sef43 also wanted to tag you in case you may have had any thoughts/edits here!

peastman commented 9 months ago

Sorry this PR got forgotten! I think it just got lost in everyone's inboxes. Let's get back to it and see if we can finish cleaning it up. What you have now is closer to what we're looking for, but I think we can improve it. If you read through the existing cookbook entries, you'll see they tend to follow a formula.

  1. State what problem we're going to solve.
  2. State what mechanism we're going to use to solve it.
  3. Demonstrate how to do it with the smallest amount of code possible.
  4. Explain any pieces of the code that may not be clear and point out how the reader may need to modify it to suite their own needs.

For comparison, look at your umbrella sampling entry again. Instead of explaining what problem we're solving, it goes straight into code. The only explanation is vague titles like, "Create system + CV Forces with harmonic restraints." When the reader finally gets to some text, they don't have the context needed to understand it. "Ultimately we want to track the value of these distances..." But you haven't told them what these distances are or why we would want to track them.

My best suggestion is really just to read through the existing cookbook entries. I think that will make it clear. They tend to have at least as much text as code, and they never present code without first explaining what the code is for.

sukritsingh commented 9 months ago

No worries @peastman - totally understand that things come up and/or get buried! Thanks for pointing out the additional notebooks and the guidance of the Four Parts.

I've tried out an "example" of the documenting on the Umbrella sampling notebook, since you brought it up - Could you take a look at it and let me know if this is enough/needs editing/additional fleshing out still? I tried adding more context without adding more math or anything and kept the code to the minimum. Hopefully this provides sufficient context? If not I can definitely keep expanding!

If it looks good here then I have a starting point to edit the other notebooks and can push changes to them accordingly - just want to make sure we converge on "one good notebook" first!

peastman commented 9 months ago

That's a lot better. Actually, what you have now is more of a tutorial than a cookbook recipe. But that's ok, let's just call it a tutorial instead. Umbrella sampling is a big enough topic to justify one.

I'm not sure the dist_measurer force is a good idea. It computes the distances on every step, even when you don't need them. You set the energy expression to "0" so it won't affect the simulation, but the CustomCVForce still computes them before evaluating its energy expression. Of course, just computing two distances should take negligible time. But this is a really inefficient way of doing it. The CustomCVForce maintains its own internal Context. On each step it first copies over the current state from the main Context to the internal one, then does a separate energy evaluation for each CV. That's significant overhead, even if the actual CV computation is negligible. So you add significant cost to every step, just to have the distances available when you want them.

One solution is to put the CustomCVForce into its own force group, then call setIntegrationForceGroups() on the Integrator to prevent it from including that group. That will prevent the CustomCVForce from being evaluated except when you ask for it.

But all of this may be unnecessary complexity. What about adding getPositions=True to your getState() call, then computing the distances in Python? The code should be simpler and more obvious. This also has overhead, of course, but possibly not very much. It depends how often you record the distances. In a real simulation would you actually do it every 10 steps, or would it probably be less frequent?

sukritsingh commented 9 months ago

But that's ok, let's just call it a tutorial instead.

I think there's already an Umbrella sampling tutorial that goes into a lot more of the theoretical details of both the running and the analysis. I can reference it more directly if that's preferred but figured that this would just act as a "minimal code" example (which is what I thought the cookbook was). What would be the best way to resolve this then? Should I remove the umbrella sampling as a cookbook or make any other changes to make it more cookbook-like?

Of course, just computing two distances should take negligible time. But this is a really inefficient way of doing it. The CustomCVForce maintains its own internal Context. On each step it first copies over the current state from the main Context to the internal one, then does a separate energy evaluation for each CV.

Yea I was trying to figure out a way to do it while importing as few libraries as possible (maintaining the
"minimal" example idea), but I agree it's not the most efficient way to do it. One neat thing it offers is a way for a user to track the results of the simulation as it runs(which was my original objective).

One solution is to put the CustomCVForce into its own force group, then call setIntegrationForceGroups() on the Integrator to prevent it from including that group. That will prevent the CustomCVForce from being evaluated except when you ask for it.

I could try adding this into the notebook to help reduce computational cost! Would it just be as simple as setIntegrationForceGroup(integrator)?

What about adding getPositions=True to your getState() call, then computing the distances in Python? The code should be simpler and more obvious. This also has overhead, of course, but possibly not very much. It depends how often you record the distances. In a real simulation would you actually do it every 10 steps, or would it probably be less frequent?

I think this is a great approach and was what I first thought of doing but a) I wasn't sure if it would be more "minimal" than just having it print out the distances live (totally defer to you on that) b) I think my philosophy here that I wanted to convey is that it's possible to track how much your simulation has sampled (in a relevant CV-space) as the simulation runs so you can know whether or not you've sampled your relevant states.

I think evaluating distances after the simulation as run (which would require a restart from the user) is just a different, but equally valid, approach to evaluating sampling quality.

I'm happy to include a separate code block to show how one would evaluate the distances after the simulation if you think that would be helpful!

peastman commented 9 months ago

I think there's already an Umbrella sampling tutorial

Good point. Do you think this one adds anything beyond what's in the existing tutorial? I generally think of the cookbook as reference material for answering very specific questions that wouldn't justify a whole tutorial. Things like, "How do I change the temperature during a simulation?" or, "How do I decompose the total energy into contributions from different interactions?" The user already knows exactly what they want to do. They just want to find out what code to write to do it.

I think evaluating distances after the simulation as run (which would require a restart from the user) is just a different, but equally valid, approach to evaluating sampling quality.

I didn't mean to move the calculation to after the simulation is run. It would still be computed live, just by a different method. You would replace

d1,d2 = dist_measurer.getCollectiveVariableValues(simulation.context)

with

positions = state.getPositions(asNumpy=True)
d1 = norm(positions[d1_atom1_ind]-positions[d1_atom2_ind])
d2 = norm(positions[d2_atom1_ind]-positions[d2_atom2_ind])
sukritsingh commented 9 months ago

Do you think this one adds anything beyond what's in the existing tutorial? I generally think of the cookbook as reference material for answering very specific questions that wouldn't justify a whole tutorial.

So I think mine is a lot more "straight to the point" as far as providing code to setup and run a single window for a user who knows Umbrella sampling but just wants the OpenMM syntax, if that makes sense? The other notebook has a lot more theoretical details and of running a full Umbrella sampling simulation and provides a lot of details on system setup and analysis. This notebook would be much more "Assuming you know what you want, here is the code to apply and run a single window."

I could link to specific codeblocks and/or include more details/example code on how to run using multiple windows?

Another option is I could remove this notebook but keep the additional other three notebooks (on accelerated MD, Gaussian accelerated MD, and Metadynamics)?

It would still be computed live, just by a different method.

Oh I see! Sorry I totally misinterpreted what you were meaning, but yes the way you described doing it is equally good! I hadn't done it this way because I avoided importing numpy in the notebook (and by norm I think you mean numpy.linalg.norm?) , but if you think that this way is "better" then I can do it this way instead!

peastman commented 9 months ago

Actually norm() is openmm.unit.norm, which is already imported. It knows how to deal with positions that have units.

@sef43 do you have any thoughts on this, since you wrote the existing tutorial?

sukritsingh commented 9 months ago

Actually norm() is openmm.unit.norm, which is already imported. It knows how to deal with positions that have units.

Oh! Phenomenal! I've gone ahead and updated the notebook to print it using norm and took out the whole dist_measurer thing entirely! This is a much simpler way of doing this for sure! Thanks for that pointer!

It depends how often you record the distances. In a real simulation would you actually do it every 10 steps, or would it probably be less frequent?

In a real simulation it would be much less frequent than recording it 10 steps, but the sample cookbook loop I have is only running a 100 step simulation anyways. I've made a point of noting that that the recording_frequency set here is super frequent and should be set at higher values for realistic systems.