Closed: wiederm closed this issue 1 year ago
The relevant packages in my environment are
openmm 8.0.0 py310h5728c26_0 conda-forge
openmm-torch 1.0 cuda112py310hdb05021_0 conda-forge
openmmml 1.0 pypi_0 pypi
openmmtools 0.21.4 pyhd8ed1ab_0 conda-forge
nnpops 0.2 cuda112py310h85a0d14_4 conda-forge
@RaulPPelaez could you try to reproduce and verify the issue?
Out of curiosity, what are you using for visualization?
Could you try running with nnpops=0.3?
I am using nglview in combination with ipywidgets.
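For reference, a minimal viewing cell could look like the sketch below; the trajectory file name (tmp.h5, matching the reporter in the script further down) and the use of mdtraj for loading are assumptions on my part.
import mdtraj as md
import nglview as nv

traj = md.load("tmp.h5")      # HDF5 trajectory written by mdtraj's HDF5Reporter
view = nv.show_mdtraj(traj)   # renders as an ipywidgets widget in a Jupyter notebook
view.add_unitcell()           # draw the periodic box
view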
I have started simulations with nnpops=0.3 and will update in a few hours with the results.
Just looking at the first few ps of simulation time, I see qualitatively the exact same behavior in NpT and NVT with ani2x and nnpops=0.3:
Relevant packages:
openmm 8.0.0 py310h5728c26_0 conda-forge
openmm-torch 1.0 cuda112py310hbd91edb_0 conda-forge
openmmml 1.0 pypi_0 pypi
openmmtools 0.21.5 pyhd8ed1ab_0 conda-forge
nnpops 0.3 cuda112py310hd4d1af5_2 conda-forge
With the updated nnpops=0.3, simulations with torchani that were running fine previously now explode (with nnpops they run as described above). The following script runs fine with nnpops=0.2, but if I upgrade to nnpops=0.3 the temperature increases significantly and hydrogen-oxygen bonds are broken; a sketch for checking the O-H bond lengths from the written trajectory follows after the script. This is a minimal example to reproduce the issue:
from openmm import unit, LangevinIntegrator, Platform
from openmm.app import Simulation, StateDataReporter
from mdtraj.reporters import HDF5Reporter
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
from openmm.unit import Quantity
from sys import stdout
waterbox = WaterBox(15 * unit.angstrom, constrained=False)
potential = MLPotential('ani2x')
system = potential.createSystem(waterbox.topology, implementation='torchani', removeConstraints=True)
####################
# define simulation parameters
stepsize = Quantity(1., unit.femto * unit.seconds)
collision_rate = 1 / Quantity(1, unit.pico * unit.second)
temperature = Quantity(300, unit.kelvin)
integrator = LangevinIntegrator(
temperature, collision_rate, stepsize
)
platform = Platform.getPlatformByName('CUDA')
simulation = Simulation(
    waterbox.topology, system, integrator,
    platform=platform, platformProperties={'Precision': 'mixed'}
)
simulation.context.setPositions(waterbox.positions)
simulation.reporters.append(
    HDF5Reporter(
        "./tmp.h5",
        10,
    )
)
simulation.reporters.append(StateDataReporter(stdout, 10, step=True, potentialEnergy=True, temperature=True))
simulation.step(5_000)
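As mentioned above, one way to check whether O-H bonds are being broken is to look at the bond lengths in the tmp.h5 trajectory written by this script. This is a minimal sketch, assuming the topology stored by HDF5Reporter carries the water bonds:
import mdtraj as md

traj = md.load("tmp.h5")
# indices of all O-H bonds recorded in the topology
oh_pairs = [(b.atom1.index, b.atom2.index) for b in traj.topology.bonds
            if {b.atom1.element.symbol, b.atom2.element.symbol} == {"O", "H"}]
d = md.compute_distances(traj, oh_pairs)  # in nm, shape (n_frames, n_bonds)
print("max O-H distance per frame (nm):", d.max(axis=1))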
I am able to reproduce this. We believe it might be a problem with periodic boundary conditions in nnpops.
I compared the norms of the forces provided by both implementations using this code:
from openmm.app import Simulation
from openmm import unit, LangevinIntegrator, Platform
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
import numpy as np

box_edge = 15 * unit.angstrom
testsystem = WaterBox(box_edge, cutoff=7 * unit.angstrom)
potential = MLPotential("ani2x")
platform = Platform.getPlatformByName("CUDA")
prop = dict(CudaPrecision="mixed")
forces = {}
positions = []
for s in ("nnpops", "torchani"):
    system = potential.createSystem(testsystem.topology, implementation=s)
    integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 0 * unit.picoseconds)
    simulation = Simulation(testsystem.topology, system, integrator, platform, prop)
    simulation.context.setPositions(testsystem.positions)
    forces[s] = simulation.context.getState(getForces=True).getForces().value_in_unit(unit.kilojoules / unit.mole / unit.nanometer)
    positions = simulation.context.getState(getPositions=True, enforcePeriodicBox=True).getPositions().value_in_unit(unit.nanometer)
fnorms_nnpops = np.linalg.norm(forces["nnpops"], axis=1)
fnorms_torchani = np.linalg.norm(forces["torchani"], axis=1)
with open("fnorms.dat", 'w') as f:
    for p, i, j in zip(positions, fnorms_nnpops, fnorms_torchani):
        f.write(f"{p.x} {p.y} {p.z} {i} {j}\n")
error = np.abs((fnorms_nnpops - fnorms_torchani) / fnorms_torchani)
Then I take all the particles with height near the center of the domain (z ∈ [0.55, 0.95] nm) and plot a heatmap of the relative error in the force norm. This could be consistent with a bug in the PBC handling.
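The plot itself is not reproduced here; a minimal sketch of how it could be generated from the fnorms.dat file written above might look like the following (the column layout follows the script, while the scatter-style rendering and color map are my own choices):
import numpy as np
import matplotlib.pyplot as plt

# columns: x y z |F|_nnpops |F|_torchani (nm and kJ/mol/nm, as written above)
x, y, z, f_nnpops, f_torchani = np.loadtxt("fnorms.dat", unpack=True)
error = np.abs((f_nnpops - f_torchani) / f_torchani)

# keep only particles in a slab around the center of the box
mask = (z > 0.55) & (z < 0.95)
sc = plt.scatter(x[mask], y[mask], c=error[mask], cmap="viridis")
plt.colorbar(sc, label="relative error in |F|")
plt.xlabel("x (nm)")
plt.ylabel("y (nm)")
plt.savefig("fnorm_error.png")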
I turned off the optimized components one by one here: https://github.com/openmm/NNPOps/blob/3c96f5be36e950543131eaa004533df819930f66/src/pytorch/OptimizedTorchANI.py#L41-L45. The error arises when using the aev_computer; replacing that line with
self.aev_computer = model.aev_computer
results in relative errors around machine precision, as one would expect.
Thanks for tracking this down! I will give it a try!
Many of the CUDA tests in NNPOps fail for me. @peastman @raimis, could you confirm?
Test project /shared/raul/NNPOps/build
Start 1: TestCpuANISymmetryFunctions
1/13 Test #1: TestCpuANISymmetryFunctions ...... Passed 1.50 sec
Start 2: TestCpuCFConv
2/13 Test #2: TestCpuCFConv .................... Passed 0.48 sec
Start 3: TestCudaANISymmetryFunctions
3/13 Test #3: TestCudaANISymmetryFunctions .....Subprocess aborted***Exception: 1.80 sec
Start 4: TestCudaCFConv
4/13 Test #4: TestCudaCFConv ...................Subprocess aborted***Exception: 1.82 sec
Start 5: TestBatchedNN
5/13 Test #5: TestBatchedNN ....................***Failed 203.79 sec
Start 6: TestCFConv
6/13 Test #6: TestCFConv ....................... Passed 5.28 sec
Start 7: TestCFConvNeighbors
7/13 Test #7: TestCFConvNeighbors .............. Passed 3.38 sec
Start 8: TestEnergyShifter
8/13 Test #8: TestEnergyShifter ................ Passed 108.41 sec
Start 9: TestOptimizedTorchANI
9/13 Test #9: TestOptimizedTorchANI ............***Failed 216.59 sec
Start 10: TestSpeciesConverter
10/13 Test #10: TestSpeciesConverter ............. Passed 111.40 sec
Start 11: TestSymmetryFunctions
11/13 Test #11: TestSymmetryFunctions ............***Failed 129.94 sec
Start 12: TestNeighbors
12/13 Test #12: TestNeighbors ....................***Exception: SegFault 5.57 sec
Start 13: TestGetNeighborPairs
13/13 Test #13: TestGetNeighborPairs ............. Passed 2.47 sec
54% tests passed, 6 tests failed out of 13
Total Test time (real) = 792.56 sec
The following tests FAILED:
3 - TestCudaANISymmetryFunctions (Subprocess aborted)
4 - TestCudaCFConv (Subprocess aborted)
5 - TestBatchedNN (Failed)
9 - TestOptimizedTorchANI (Failed)
11 - TestSymmetryFunctions (Failed)
12 - TestNeighbors (SEGFAULT)
Mine mostly pass:
Start 1: TestCpuANISymmetryFunctions
1/13 Test #1: TestCpuANISymmetryFunctions ...... Passed 2.22 sec
Start 2: TestCpuCFConv
2/13 Test #2: TestCpuCFConv .................... Passed 0.87 sec
Start 3: TestCudaANISymmetryFunctions
3/13 Test #3: TestCudaANISymmetryFunctions ..... Passed 2.65 sec
Start 4: TestCudaCFConv
4/13 Test #4: TestCudaCFConv ................... Passed 3.03 sec
Start 5: TestBatchedNN
5/13 Test #5: TestBatchedNN ....................***Failed 155.24 sec
Start 6: TestCFConv
6/13 Test #6: TestCFConv .......................***Failed 4.81 sec
Start 7: TestCFConvNeighbors
7/13 Test #7: TestCFConvNeighbors .............. Passed 4.92 sec
Start 8: TestEnergyShifter
8/13 Test #8: TestEnergyShifter ................ Passed 97.31 sec
Start 9: TestOptimizedTorchANI
9/13 Test #9: TestOptimizedTorchANI ............ Passed 163.42 sec
Start 10: TestSpeciesConverter
10/13 Test #10: TestSpeciesConverter ............. Passed 105.67 sec
Start 11: TestSymmetryFunctions
11/13 Test #11: TestSymmetryFunctions ............ Passed 120.57 sec
Start 12: TestNeighbors
12/13 Test #12: TestNeighbors .................... Passed 9.12 sec
Start 13: TestGetNeighborPairs
13/13 Test #13: TestGetNeighborPairs ............. Passed 2.79 sec
85% tests passed, 2 tests failed out of 13
Total Test time (real) = 672.64 sec
The following tests FAILED:
5 - TestBatchedNN (Failed)
6 - TestCFConv (Failed)
The failures are just a couple of floating-point assertion errors that might be stochastic.
This was using a build environment created with NNPOps/environment.yaml, but with cudatoolkit=11.7 to match my installed CUDA version.
I think these lines in SymmetryFunctions.cpp
if (device.is_cpu()) {
    impl = std::make_shared<CpuANISymmetryFunctions>(numAtoms, numSpecies, Rcr, Rca, false, atomSpecies_, radialFunctions, angularFunctions, true); // <-- periodic flag hard-coded to false
#ifdef ENABLE_CUDA
} else if (device.is_cuda()) {
    // PyTorch allow to chose GPU with "torch.device", but it doesn't set as the default one.
    CHECK_CUDA_RESULT(cudaSetDevice(device.index()));
    impl = std::make_shared<CudaANISymmetryFunctions>(numAtoms, numSpecies, Rcr, Rca, false, atomSpecies_, radialFunctions, angularFunctions, true); // <-- periodic flag hard-coded to false
#endif
} else
mean that the periodic flag is hard-coded and always set to false, so the templated <PERIODIC> displacement/distance calculations in CpuANISymmetryFunctions.cpp / CudaANISymmetryFunctions.cu are never actually used. Or am I wrongly understanding how the .cpp/.cu code is called from Python?
I believe you're correct about that. If we replace false with cellPtr != nullptr, that should pass the correct value for whether to use periodic boundary conditions.
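A direct sanity check for this, independent of the force comparison below, could be the following sketch. It assumes a simulation object set up as in the scripts above, with atoms ordered O, H, H per water as in the WaterBox test system: with correct periodic handling, translating a single water by one full box vector must leave the energy unchanged, because the minimum-image convention maps it back onto the same configuration.
import numpy as np
from openmm import unit

state = simulation.context.getState(getPositions=True, getEnergy=True)
e0 = state.getPotentialEnergy()
pos = state.getPositions(asNumpy=True).value_in_unit(unit.nanometer)
a = np.array(state.getPeriodicBoxVectors()[0].value_in_unit(unit.nanometer))

pos[0:3] += a  # shift the first water (O, H, H) by the first lattice vector
simulation.context.setPositions(pos * unit.nanometer)
e1 = simulation.context.getState(getEnergy=True).getPotentialEnergy()
print(e0, e1)  # should agree to within numerical precision if PBC is handled correctly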
Amazing catch @sef43! @peastman's suggestion fixes the issue. The following test now passes:
from openmm.app import Simulation
from openmm import unit, LangevinIntegrator, Platform
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
import numpy as np

box_edge = 15 * unit.angstrom
testsystem = WaterBox(box_edge, cutoff=7 * unit.angstrom)
potential = MLPotential("ani2x")
platform = Platform.getPlatformByName("CPU")
prop = dict(CudaPrecision="mixed")
forces = {}
positions = []
for s in ("nnpops", "torchani"):
    system = potential.createSystem(testsystem.topology, implementation=s)
    print(f"Implementation {s}")
    file = open(f"{s}.dat", 'w')
    integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 0 * unit.picoseconds)
    simulation = Simulation(testsystem.topology, system, integrator, platform, prop)
    simulation.context.setPositions(testsystem.positions)
    forces[s] = simulation.context.getState(getForces=True).getForces().value_in_unit(unit.kilojoules / unit.mole / unit.nanometer)
    positions = simulation.context.getState(getPositions=True, enforcePeriodicBox=True).getPositions().value_in_unit(unit.nanometer)
fnorms_nnpops = np.linalg.norm(forces["nnpops"], axis=1)
fnorms_torchani = np.linalg.norm(forces["torchani"], axis=1)
error = np.abs((fnorms_nnpops - fnorms_torchani) / fnorms_torchani)
print(f"Maximum error: {np.max(error)}")
print(f"Mean error: {np.mean(error)}")
print(f"Std error: {np.std(error)}")
Printing:
Maximum error: 1.1337166912735413e-05
Mean error: 1.2383139611804328e-06
Std error: 1.5981482745315712e-06
This is great, thank you!
Hi,
I have been getting surprising results running waterbox simulations with the torchani vs. the nnpops implementation of ani2x. I used the openmmtools waterbox test system with an edge length of 20 Å and a 1 fs timestep, and simulated for 1 ns using a Langevin integrator with a 1/ps collision rate. The system was set up with potential.createSystem().
When I run simulations in NpT with the torchani implementation at 300 K, everything looks relatively normal (the density is a bit too high, and the rdf has some surprising signal, though). When I perform the same simulation with the nnpops implementation, the simulation box has shrunk (the initial box size is the yellow outlined square). Also, note the difference in the y-axis for the potential energy.
In NVT I observe vacuum bubbles with nnpops:
https://user-images.githubusercontent.com/31651017/218335894-0254ed80-e51f-4189-9bfc-ae94637cfd85.mp4
compared to the same simulation with torchani:
https://user-images.githubusercontent.com/31651017/218335817-3e911757-d19d-4f71-b922-8b9de913237e.mp4
I attached a minimal example to reproduce the simulations.
min_example.py.zip
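For anyone reproducing this, a minimal sketch of how the density and the O-O rdf could be computed from such a trajectory with mdtraj is given below; the trajectory file name and the oxygen selection string are assumptions, not taken from the attached example.
import mdtraj as md

traj = md.load("tmp.h5")  # any trajectory written by HDF5Reporter

# mass density in g/cm^3, assuming a pure water box (18.015 g/mol per molecule)
n_waters = traj.topology.n_residues
mass_g = n_waters * 18.015 / 6.02214076e23
density = mass_g / (traj.unitcell_volumes * 1e-21)  # nm^3 -> cm^3
print("mean density (g/cm^3):", density.mean())

# oxygen-oxygen radial distribution function
oo_pairs = traj.topology.select_pairs("name O", "name O")
r, g_r = md.compute_rdf(traj, oo_pairs, r_range=(0.0, 0.7))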