openmm / openmm-ml

High level API for using machine learning models in OpenMM simulations
Other
75 stars 25 forks source link

`createMixedSystem` can yield unstable dynamics due to CVForce's `mmForce1` (HarmonicBondForce) #36

Closed dominicrufa closed 1 year ago

dominicrufa commented 1 year ago

I've been playing around with CustomCVForce-equipped openmm.System object returned by openmm_ml.mlpotential.MLPotential.createMixedSystem with ani2x and I'm seeing some nans in my simulations; I dug through the offending force objects that were contributing to unusually large forces (>1e5 kJ/nm*mol). I found that the ML-treated subsystem has atoms with some of these unusually large forces. specifically, I found that the mmForce1 (HarmonicBondForce) see here is contributing to the large forces when lambda_interpolate=0.88888888. since this is a harmonic bond, it suggests that the stiff spring constant is to blame for some reason?

given this, would it make sense to add another global lambda parameter that allows for the softening of these spring constants?

The serialized state of the nanned simulations is attached; however, I cannot attach the TorchForce.pt since it is too large for github xml.tar.gz

cc: @jchodera @peastman

peastman commented 1 year ago

It's not surprising if the MM force field and ML potential disagree significantly on bonded interactions. But the question is how you're getting to a state where the forces are so large. For any value of lambda_interpolate, you have a well defined potential. If you run a simulation in that potential, it should always remain in regions where the forces are reasonable (as long as your step size is appropriate to the potential).

Can you describe the simulation protocol that led you to a state where the forces are too large?

dominicrufa commented 1 year ago

It's not surprising if the MM force field and ML potential disagree significantly on bonded interactions. But the question is how you're getting to a state where the forces are so large. For any value of lambda_interpolate, you have a well defined potential. If you run a simulation in that potential, it should always remain in regions where the forces are reasonable (as long as your step size is appropriate to the potential).

@peastman , yes, i totally agree with all of those points. i think i jumped ahead in proposing a solution. I am running a LangevinMiddleIntegrator simulation with a timestep of 0.5fs and a collision rate of 1/ps at lambda=0.888888. after ~100ps, the simulation nans with the last (nanned) frame rendered. I know that this isn't a constraint problem since i am making assertions that constraints don't exist in the ML-treated region (just the small molecule), and the MM forcefield i am using is openff-2.0.0 w/ tip3p for solvent.

One final thing i noticed looking at the nanned frame is that the offending forces exist on the atoms between the longest hydrogen bond you can see in the image. Is it possible that the bond in the cv force-wrapped harmonicbondforce is not registering pbcs for some reason? I am not too familiar with the intricacies of what happes with customcvforces.

image

peastman commented 1 year ago

That certainly looks like a problem in wrapping. I wonder if it's actually wrapped internally, or if it just happens at the output stage? Can you try disabling wrapping on output, for example by specifying enforcePeriodicBox=False on the reporter?

From looking at the code, I suspect it's just output. Wrapping for computational purposes is determined by the ComputeForceInfo. The one for CustomCVForce correctly propagates information from the forces it contains, so the HarmonicBondForce should prevent wrapping. Wrapping on output uses a different mechanism determined by how the ForceImpl implements getBondedParticles(). CustomCVForceImpl should propagate information from the contained forces, but currently it doesn't. We should fix that.

dominicrufa commented 1 year ago

That certainly looks like a problem in wrapping. I wonder if it's actually wrapped internally, or if it just happens at the output stage? Can you try disabling wrapping on output, for example by specifying enforcePeriodicBox=False on the reporter?

i was actually using context.getState to retrieve positions/box vectors and mdtraj to report (passing unitcell_lengths and unitcell_angles) to the mdtraj.Trajectory object.

Wrapping on output uses a different mechanism determined by how the ForceImpl implements getBondedParticles(). CustomCVForceImpl should propagate information from the contained forces, but currently it doesn't. We should fix that.

are you saying that the long hbond in the image is just a red herring? i found it peculiar that the offending forces were only on the atoms across this bond.

dominicrufa commented 1 year ago

CustomCVForceImpl should propagate information from the contained forces, but currently it doesn't. We should fix that.

i missed this last bit. I'm a bit confused as to whether this is just an imaging problem or something more serious.

peastman commented 1 year ago

I'm a bit confused as to whether this is just an imaging problem or something more serious.

That's the first question to figure out. When you called getState(), what did you specify for enforcePeriodicBox?

jchodera commented 1 year ago

If the bug in CustomCVForceImpl is not causing the molecule to be appropriately imaged so that the CalcHarmonicBondForceKernel computed in the CustomCVForce sees the molecule as a whole, since your inner context HarmonicBondForce sets usesPeriodic=0, you will get a crazy large erroneous energy for that bond within the CustomCVForce. It sounds like this is exactly what is going on.

dominicrufa commented 1 year ago

I'm a bit confused as to whether this is just an imaging problem or something more serious.

That's the first question to figure out. When you called getState(), what did you specify for enforcePeriodicBox?

@peastman . sorry, enforcePeriodicBox is not specified, so it defaults to False. that is what i am using.

peastman commented 1 year ago

Ok, that sounds like it's the real problem, not just output. So we need to figure out why it's happening. Do you have a reproducible test case for this? I want to figure out whether it's identifying the molecules correctly, and if not, why not.

dominicrufa commented 1 year ago

@peastman , the simulation fails stochastically, but there is a tar.gz of the state, system, integrator of a failed timestep on my first comment. loading the openmm.Torch .pt might be a problem as i cannot post it to github (it's too big for github), though the ml force is not the problem as far as i am aware. i can try to email it to you if you need it.

dominicrufa commented 1 year ago

@peastman , I'm not sure if this is related at all to the issue, but it seems like even having usesPeriodicBoundaryConditions=True for the inner cv HarmonicBondForce results in large forces between the two atoms across the unit cell.

peastman commented 1 year ago

Loading your system I get this error.

Unknown type name '__torch__.torch.classes.NNPOpsANISymmetryFunctions.Holder':
Serialized   File "code/__torch__/NNPOps/SymmetryFunctions.py", line 7
  _is_full_backward_hook : Optional[bool]
  num_species : int
  holder : __torch__.torch.classes.NNPOpsANISymmetryFunctions.Holder
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  triu_index : Tensor
  def forward(self: __torch__.NNPOps.SymmetryFunctions.TorchANISymmetryFunctions,

Any idea what I'm doing wrong? I have the latest NNPOps installed. from NNPOps.SymmetryFunctions import TorchANISymmetryFunctions works correctly. Possibly your file requires a different version of PyTorch? I have 1.11.

dominicrufa commented 1 year ago

my import for nnpops was from NNPOps import OptimizedTorchANI., not the TorchAniSymmetryFunctions, though that shouldnt be the issue based on your traceback.

pytorch:

pytorch                   1.11.0          cuda112py39ha0cca9b_1    conda-forge
pytorch-gpu               1.11.0          cuda112py39h68407e5_1    conda-forge
peastman commented 1 year ago

It looks like the problem might be related to https://github.com/pytorch/TensorRT/discussions/642. I'll try adding the linker flags suggested there and see if it helps.

peastman commented 1 year ago

No success with that. For both NNPOps and OpenMM-Torch I modified CMAKE_SHARED_LINKER_FLAGS to specify --no-as-needed instead of --as-needed. It didn't help.

peastman commented 1 year ago

I found a workaround. If I add the line

from NNPOps.SymmetryFunctions import *

at the top of my script then it works. Just importing NNPOps is not enough. It looks like this may be something to do with classes not getting registered until the right module is loaded. @raimis is there a way we can make PyTorch automatically recognize and load the necessary module if the user hasn't done it already?

dominicrufa commented 1 year ago

I should clarify that you should be able to reproduce the large force errors even without the .pt file. when i remove the .pt from the same dir, there is no error thrown about not being able to make a torchforce. like i previously mentioned the large force comes from the HarmonicBondForce inside the customcv, not the TorchForce.

dominicrufa commented 1 year ago

I found a workaround.

sorry, missed this.

peastman commented 1 year ago

I assume the bond in question is between atoms 2 and 32? Do you have an earlier state from before the problem appeared? In the state you included, wrapping has already happened and the atoms are far apart. I need to reproduce how they came to be far apart in the first place.

I confirmed that it's internally treating atoms 0-32 as a single molecule, so it should never split them apart. What sorts of operations did you perform before the problem appeared? Did you just have the integrator take steps, or could it have come from some higher level operation? For example, did you ever query and reset the positions with setPositions()?

dominicrufa commented 1 year ago

I assume the bond in question is between atoms 2 and 32? Do you have an earlier state from before the problem appeared? In the state you included, wrapping has already happened and the atoms are far apart. I need to reproduce how they came to be far apart in the first place

correct, and it is attached (only positions and bvs) state.tar.gz

For example, did you ever query and reset the positions with setPositions()?

It looks like there is a setPositions I called before taking integration steps that i used here: https://github.com/choderalab/openmmtools/blob/cec8c3d6627d50e638daf79c759ec670a4fddff3/openmmtools/states.py#L2278

peastman commented 1 year ago

Where did those positions come from? Were they retrieved from another (or the same) context, or did they come from a file, or somewhere else?

So far I can't figure out any way wrapping could have occurred during normal integration. It's treating the atoms as a molecule and keeping them together. I do see a way they could have gotten wrapped by a call to getState() with enforcePeriodicBox=True. That is normally supposed to keep molecules together too, but CustomCVForceImpl doesn't implement getBondedParticles(). I'll get in a fix for that. But if the positions didn't come from a call to getState(enforcePeriodicBox=True), there must be something else going on and I need to keep looking.

dominicrufa commented 1 year ago

Where did those positions come from? Were they retrieved from another (or the same) context, or did they come from a file, or somewhere else?

they came from a checkpoint file from running replica exchange in openmmtools from before nanning, but when i got the state to report to you, I left pbcs=False.

I do see a way they could have gotten wrapped by a call to getState() with enforcePeriodicBox=True. That is normally supposed to keep molecules together too, but CustomCVForceImpl doesn't implement getBondedParticles(). I'll get in a fix for that. But if the positions didn't come from a call to getState(enforcePeriodicBox=True), there must be something else going on and I need to keep looking.

if this is true, i am convinced this is at least one culprit; in openmmtools, we wrap openmm.integrators into .apply moves and pass it positions/box_vectors to run a prespecified number of integrator steps. afterwards, the state is called here which will enforce pbcs. if the system is periodic. so yes, the positions did come from a getState(enforcePeriodicBox=True) call.

dominicrufa commented 1 year ago

@peastman , would you be able to extend the simulation with your fix to see if it nans (if tests pass)? previously, it should have failed every time one of the molecule's bonds passed the unit cell. I cannot build from source. alternatively, i can run it if i can conda install this fix from somewhere.

jchodera commented 1 year ago

@dominicrufa We've streamlined the OpenMM build instructions, so it should be possible to build/install OpenMM in your conda environment. Here are the current instructions.

If you're able to start from a fresh conda env, I've had luck with using a YAML file to install a working development environment: https://github.com/openmm/openmm/issues/3378#issuecomment-993034180

We're still working on refining the build instructions here https://github.com/openmm/openmm/pull/3127

jchodera commented 1 year ago

Alternatively, once https://github.com/openmm/openmm/pull/3711 is merged (since it contains a useful bugfix regardless!), I have managed to get the nightly dev builds working again.

peastman commented 1 year ago

would you be able to extend the simulation with your fix to see if it nans (if tests pass)?

The change I made won't have any effect on what happens while running a simulation. It will only affect calls to getState().

dominicrufa commented 1 year ago

@jchodera , @peastman:

openmm                    7.8              py39_cuda110_1    omnia-dev/label/cuda110

doesn't seem to have solved the problem. again, the offending forces seem to be the HarmonicBondForce _inside_the CustomCVForce. will dig into this a bit more today.

peastman commented 1 year ago

Are you still starting from the same positions as before? If you give it positions in which the atom has already gotten wrapped, you'll get huge forces. The change I made hopefully fixes the problem that caused the position to get wrapped in the first place, but it won't fix anything once the wrapping has taken place.

dominicrufa commented 1 year ago

Are you still starting from the same positions as before?

no, i ran a completely new simulation.

peastman commented 1 year ago

Can you provide files to reproduce it?

dominicrufa commented 1 year ago

@peastman , I can. the problem is that I am getting this result from a replica exchange simulation in openmmtools and the inner workings are a bit opaque. Let me try to generate a minimal working example of the failure for you.

jchodera commented 1 year ago

@dominicrufa : Can you quickly check which version you installed?

$ python -c "import openmm; print(openmm.version.version)"
7.7.0.dev-0644f05

I think the hash should be ae34183

dominicrufa commented 1 year ago

@jchodera , I must have installed the wrong version then. which of these options corresponds to the the hash you mentioned? https://anaconda.org/omnia-dev/openmm;

It appears i was using '7.7.0.dev-a39fa14'

dominicrufa commented 1 year ago

ah, i found it.

dominicrufa commented 1 year ago

@peastman , @jchodera , after running the correct version I still see nans (still the same force as mentioned before). I am going to run an openmm simulation and a sequence of mcmc moves (from openmmtools) separately to see if i can reproduce the error more minimally.

dominicrufa commented 1 year ago

@peastman, I might have found at least one of the culprits.

If create a context with the files i gave you earlier and an openmm.openmm.LangevinIntegrator(300., 1., 1e-3) or a openmm.openmm.LangevinMiddleIntegrator, integrator, you should find that the following code should run without nans for the full number of iterations:

for i in range(100):
  print(i)
   integrator.step(1000)
   intermed_state = context.getState(getPositions=True, enforcePeriodicBox=True)
   context.setPositions(intermed_state.getPositions())
   context.setPeriodicBoxVectors(*intermed_state.getPeriodicBoxVectors())

however, if you run the same code with a CustomIntegrator, like i had been doing under the hood with openmmtools.integrators.LangevinSplittingDynamics (which is the same as the LangevinMiddleIntegrator with a few more options), you'll find that the above code will crash out around ~20000 steps. I think this is another issue with omm8 not working well with CustomIntegrators.

This isn't a blocker for me, strictly speaking (unless i run into another issue down the line), but I am going to revert back to using a LangevinMiddleIntegrator for the time being.

unfortunately, the customcvforce is still not registering pbcs and nanning in my repex simulations, so something else is causing problems downstream.

peastman commented 1 year ago

The state file in that comment already has wrapped positions in it. There's a bond between atom 3 at position

        <Position x="2.5399420261383057" y="2.8996822834014893" z="1.27876615524292"/>

and atom 31 at position

        <Position x="2.3951101303100586" y=".019522514194250107" z="1.2324200868606567"/>
dominicrufa commented 1 year ago

@peastman , this is the file i used that causes the nan.

periodic_state.tar.gz

peastman commented 1 year ago

What System does that file go with? When I try to use it with the System you provided before, I get the error

openmm.OpenMMException: Called setParameter() with invalid parameter name: AndersenCollisionFrequency
peastman commented 1 year ago

I managed to reproduce it by combining the system and integrator from https://github.com/openmm/openmm-ml/issues/36#issue-1315205257, the state from https://github.com/openmm/openmm-ml/issues/36#issuecomment-1194771716, then overwriting just the positions with the ones from https://github.com/openmm/openmm-ml/issues/36#issuecomment-1197511617. Looking into it now.

peastman commented 1 year ago

I take it back. When I reproduced it, I was using an older version of the code without the fix. Using the latest revision, I've run it for several hundred thousand steps with no sign of problems.

jchodera commented 1 year ago

@peastman @raimis : @dominicrufa has come up with a minimal CustomIntegrator only test case that reproducibly NaNs with the OpenMM nightly dev builds. He'll post the test case to reproduce a bit later today.

peastman commented 1 year ago

If you can post the test case, that would be great.

dominicrufa commented 1 year ago

pack.tar.gz the system and state here should be right this time. you can repeat this experiment with the .pt you have access to.

peastman commented 1 year ago

I ran 500,000 steps with that system and state. No sign of any problems.

dominicrufa commented 1 year ago

@peastman, alright i suspected this may be a versioning problem. can you send me your conda list?

peastman commented 1 year ago

I compiled it from source, so conda won't show the version number. What output do you get from this?

import openmm
print(openmm.version.git_revision)
dominicrufa commented 1 year ago

@peastman, it is '7.7.0.dev-a39fa14' (openmm.version.version) . to be clear, LangevinMiddleIntegrator should be working, openmmtools.integrators.LangevinIntegrator should not.

peastman commented 1 year ago

That's the revision just before the fix was merged.

https://github.com/openmm/openmm/commits/master

dominicrufa commented 1 year ago

mamba install -c omnia-dev openmm installation gives me '7.7.0.dev-0644f05'. perhaps im not understanding the installation versioning.