Open peastman opened 1 year ago
Things I would try first:
If you can provide an example I am happy to take a look
Here's a bit more information.
#"Step","Potential Energy (kJ/mole)","Kinetic Energy (kJ/mole)","Temperature (K)","Speed (ns/day)"
...
25,-5374000.105669722,548358.342327144,482.35639886791785,0.623
26,-5373635.169209611,548826.6602447805,482.76834873142826,0.631
27,-5370600.139194695,546261.4701012508,480.51191204662575,0.654
28,24362204.57730076,578217.3027816577,508.62145134739,0.676
29,591276527108.6628,52792615009164.625,46438348242.45094,0.697
It might be related to the pytorch version. I have two environments. In the first environment, all packages are installed with conda. In that environment, the problem always happens. Here are the versions of the most relevant packages.
nnpops 0.4 cuda112py310hd4d1af5_0 conda-forge
openmm 8.0.0 py310h5728c26_0 conda-forge
openmm-ml 1.0 pyhd8ed1ab_0 conda-forge
openmm-torch 1.0 cuda112py310hbd91edb_0 conda-forge
pytorch 1.13.1 cuda112py310he33e0d6_200 conda-forge
torchani 2.2.2 cuda112py310haf08e2f_7 conda-forge
In the other environment, all the OpenMM related packages are installed from source. In that environment the problem usually does not happen. I have occasionally seen NaNs in it, but they're much less frequent. Here are the package versions.
pytorch 1.11.0 cuda112py39ha0cca9b_1 conda-forge
pytorch-gpu 1.11.0 cuda112py39h68407e5_1 conda-forge
torchani 2.2.2 cuda112py39h0dd23f4_5 conda-forge
I tried to downgrade pytorch to 1.11 in the first environment to see if that would fix the problem, but I'm getting version conflicts.
I don’t know if this is related to your problem, but when using createMixedSystem()
for switching a whole waterbox (375 atoms) I ran into errors with boxes blowing up (https://github.com/openmm/openmm-ml/issues/52). I think in my case it might be related to the CustomBondForce
(https://github.com/openmm/openmm-ml/blob/c3d8c28eb92bf5c4b16efb81ad7a44b707fc5907/openmmml/mlpotential.py#LL298C14-L298C14) not taking PBC into account for the atoms described only by ML. I’m not sure, but could this cause problems here for you too?
That doesn't sound related. In my case, the error only happens if we use TorchANI instead of NNPOps to implement the TorchForce.
Are you using NNPOps before or after this PR: https://github.com/openmm/NNPOps/pull/83 ? Before this the NNPops and TorchnANI implementations would have differing results. Otherwise I don't really understand why they would be so different. You could do something similar to what Raul did here to narrow down where the difference comes from: https://github.com/openmm/NNPOps/issues/82#issuecomment-1438254627
The problem isn't in NNPOps. It works correctly. The error happens when using torchani instead.
I'm making progress toward narrowing it down. Here is my current simplest code for reproducing it. It creates a mixed system, evaluates the forces, and then immediately evaluates the forces again. When using torchani 2.2.2 and pytorch 1.13.1, they come out different.
The error requires that only a small part of the system is modeled with ML. In this script I have 2000 atoms, with only 50 being ML. If I reduce it to 1000 atoms it works correctly.
The error requires there to be a NonbondedForce in the system, and for it to use a cutoff. It does not need to use periodic boundary conditions, though.
from openmm import *
from openmm.app import *
from openmm.unit import *
from openmmml import MLPotential
import numpy as np
potential = MLPotential('ani2x')
numParticles = 2000
topology = Topology()
chain = topology.addChain()
residue = topology.addResidue('UNK', chain)
system = System()
nb = NonbondedForce()
nb.setNonbondedMethod(NonbondedForce.CutoffNonPeriodic)
system.addForce(nb)
elements = np.random.choice([element.hydrogen, element.carbon, element.nitrogen, element.oxygen], numParticles)
for i in range(numParticles):
system.addParticle(1.0)
nb.addParticle(0, 1, 0)
topology.addAtom(f'{i}', elements[i], residue)
pos = np.random.random((numParticles, 3))
ml_atoms = list(range(numParticles-50, numParticles))
system2 = potential.createMixedSystem(topology, system, ml_atoms, implementation='torchani')
integrator = LangevinMiddleIntegrator(300*kelvin, 1/picosecond, 0.002*picoseconds)
context = Context(system2, integrator)
context.setPositions(pos)
f1 = context.getState(getForces=True).getForces(asNumpy=True)._value
f2 = context.getState(getForces=True).getForces(asNumpy=True)._value
for i in ml_atoms:
print(f1[i], f2[i])
This seems to be a bug in PyTorch 1.13. The JIT profile guided optimisation does something to the torchani aev_computer
functions that results in incorrect gradients on the position tensor. It does not happen with pytorch 1.12.1, it does not happen if the torchani aev_computer
is replaced by NNPOps TorchANISymmetryFunctions
. It only happens using CUDA device. It does not occur if the JIT profile guided optimisations are disabled. The below script, when run on a CUDA gpu, will demonstrate this.
from openmm import *
from openmm.app import *
from openmm.unit import *
from openmmml import MLPotential
import numpy as np
import torch
# make the test system
potential = MLPotential('ani2x')
numParticles = 2000
topology = Topology()
chain = topology.addChain()
residue = topology.addResidue('UNK', chain)
system = System()
nb = NonbondedForce()
nb.setNonbondedMethod(NonbondedForce.CutoffNonPeriodic)
system.addForce(nb)
elements = np.random.choice([element.hydrogen, element.carbon, element.nitrogen, element.oxygen], numParticles)
for i in range(numParticles):
system.addParticle(1.0)
nb.addParticle(0, 1, 0)
topology.addAtom(f'{i}', elements[i], residue)
pos = np.random.random((numParticles, 3))
ml_atoms = list(range(numParticles-50, numParticles))
system2 = potential.createMixedSystem(topology, system, ml_atoms, implementation='torchani')
# load the pytorch model
# CPU version for reference forces
model_cpu = torch.jit.load("animodel.pt", map_location="cpu")
pos_cpu = torch.tensor(pos, requires_grad=True, dtype=torch.float32, device="cpu")
e_cpu = model_cpu(pos_cpu)
e_cpu.backward()
f_cpu = -pos_cpu.grad
# load in a CUDA version of the model
model_cuda = torch.jit.load("animodel.pt",map_location="cuda")
# turn on JIT profile guided optimizations (These will be on by default I think)
torch._C._jit_set_profiling_executor(True)
torch._C._jit_set_profiling_mode(True)
# later calls (the optimised ones) will fail to compute correct forces
forces_cuda_jitopt = []
N=5 # num reps
for n in range(N):
pos_cuda = torch.tensor(pos, requires_grad=True, dtype=torch.float32, device="cuda")
e_cuda = model_cuda(pos_cuda)
e_cuda.backward()
f_cuda = -pos_cuda.grad
forces_cuda_jitopt.append(f_cuda.cpu().numpy())
# compare
print("compare cuda forces with JIT profile guided optimization enabled")
for n in range(N):
if np.allclose(f_cpu, forces_cuda_jitopt[n], rtol=1e-3):
print("n =",n, "forces are correct")
else:
print("n = ",n, "forces are wrong!")
# now do the same but disable JIT profile guided optimisations
# load in a CUDA version of the model
model_cuda = torch.jit.load("animodel.pt",map_location="cuda")
torch._C._jit_set_profiling_executor(False)
torch._C._jit_set_profiling_mode(False)
forces_cuda = []
N=5 # num reps
for n in range(N):
pos_cuda = torch.tensor(pos, requires_grad=True, dtype=torch.float32, device="cuda")
e_cuda = model_cuda(pos_cuda)
e_cuda.backward()
f_cuda = -pos_cuda.grad
forces_cuda.append(f_cuda.cpu().numpy())
# compare
print("compare cuda forces with JIT profile guided optimization disabled")
for n in range(N):
if np.allclose(f_cpu, forces_cuda[n], rtol=1e-3):
print("n =",n, "forces are correct")
else:
print("n = ",n, "forces are wrong!")
The output I get on a RTX3090 with Pytorch 1.13.1 and CUDA 11.7 is this:
compare cuda forces with JIT profile guided optimization enabled
n = 0 forces are correct
n = 1 forces are correct
n = 2 forces are wrong!
n = 3 forces are wrong!
n = 4 forces are wrong!
compare cuda forces with JIT profile guided optimization disabled
n = 0 forces are correct
n = 1 forces are correct
n = 2 forces are correct
n = 3 forces are correct
n = 4 forces are correct
This seems to be fixable for me by changing a **
operation to torch.float_power
: https://github.com/aiqm/torchani/commit/172b6fe85d3ab2acd3faa7a025b5aded22f2537c
@peastman do you get correct forces if you use my fork of torchani?
to install in an existing environment:
conda remove --force torchani
pip install git+https://github.com/sef43/torchani.git@patch_openmmml_issue50
Your fix works for me. Fantastic work tracking this down! Hopefully they'll release an update soon.
recommended workaround is turning off NVFuser https://github.com/aiqm/torchani/issues/628#issuecomment-1496550800
(I don't know why changing the ** to a float_power seems to fix it )
At this moment, I would recommend disabling NVFuser by running the following:
torch._C._jit_set_nvfuser_enabled(False)
This is the relevant pytorch issue: https://github.com/pytorch/pytorch/issues/84510
According to this comment https://github.com/pytorch/pytorch/issues/84510#issuecomment-1688637763 NVFuser is being replaced by NNC. These means in future PyTorch releases the default TorchScript settings will be to use NNC, but for the current pytorch 2.0 we will need to tell people to switch NVFuser to NNC if they want to use Torchani without NNPOps.
I'm running simulations of mixed ML/MM systems where part is computed with ANI-2x and part with Amber. As long as I specify
implementation='nnpops'
in the call tocreateMixedSystem()
it works well. But if I specifyimplementation='torchani'
, the simulation immediately blows up with NaN coordinates. I tried a few molecules and the result is the same for all of them.Does anyone have an idea what could be causing this? I can put together a test case to reproduce the problem, if that's helpful. My current system is too big to post here. Here are the relevant packages from my conda environment.