Surprising differences between nnpops and torchani

wiederm commented 1 year ago

Hi,

I have been getting surprising results running waterbox simulations with the torchani vs the nnpops implementation of ani2x. I used the openmmtools waterbox testsystem with an edge length of 20 A, and a 1 fs timestep, and simulated for 1 ns using a Langevin integrator with 1/ps collision rate. The system was set up with potential.createSystem().

When I run simulations in NpT with the torchani implementation at 300 K everything looks relatively normal (density is a bit too high, and the rdf has some surprising signal though): When I perform the same simulation with the nnpops implementation I see this: and the simulation box has shrunk (the initial box size is the yellow outlined square). Also, note the difference in the y-axis for the potential energy.

In NVT I observe vacuum bubbles with nnpos https://user-images.githubusercontent.com/31651017/218335894-0254ed80-e51f-4189-9bfc-ae94637cfd85.mp4

compared to the same simulation with torchani https://user-images.githubusercontent.com/31651017/218335817-3e911757-d19d-4f71-b922-8b9de913237e.mp4

I attached a minimal example to reproduce the simulations.

min_example.py.zip

wiederm commented 1 year ago

The relevant packages in my environment are

openmm                    8.0.0           py310h5728c26_0    conda-forge
openmm-torch              1.0             cuda112py310hdb05021_0    conda-forge
openmmml                  1.0                      pypi_0    pypi
openmmtools               0.21.4             pyhd8ed1ab_0    conda-forge
nnpops                    0.2             cuda112py310h85a0d14_4    conda-forge

raimis commented 1 year ago

@RaulPPelaez could you try to reproduce and verify the issue?

RaulPPelaez commented 1 year ago

Out of curiosity, what are you using for visualization?

RaulPPelaez commented 1 year ago

Could you try running with nnpops=0.3?

wiederm commented 1 year ago

I am using nglview in combination with ipywidgets.

I have started simulations with nnpops=0.3, will update in a few hours with the results.

wiederm commented 1 year ago

Just looking at the first few ps simulation time, I see qualitatively the exact same behavior in NpT and NVT with ani2x with nnpops=0.3:

Relevant packages:

openmm                    8.0.0           py310h5728c26_0    conda-forge
openmm-torch              1.0             cuda112py310hbd91edb_0    conda-forge
openmmml                  1.0                      pypi_0    pypi
openmmtools               0.21.5             pyhd8ed1ab_0    conda-forge
nnpops                    0.3             cuda112py310hd4d1af5_2    conda-forge

wiederm commented 1 year ago

with the updated nnpos=0.3, simulations explode with torchani that were running fine previously (with nnpops they run as described above). The following script runs fine with nnpops=0.2, but if I upgrade to nnpops=0.3 temperature increases significantly and hydrogen-oxygen bonds are broken.

This is a minimum example to reproduce the issue:

from openmm import unit, LangevinIntegrator, Platform
from openmm.app import Simulation, StateDataReporter
from mdtraj.reporters import HDF5Reporter
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
from openmm.unit import Quantity
from sys import stdout

waterbox = WaterBox(15 * unit.angstrom, constrained=False)
potential = MLPotential('ani2x')
system = potential.createSystem(waterbox.topology, implementation='torchani', removeConstraints=True)
####################
# define simulation parameters
stepsize = Quantity(1., unit.femto * unit.seconds)
collision_rate = 1 / Quantity(1, unit.pico * unit.second)
temperature = Quantity(300, unit.kelvin)

integrator = LangevinIntegrator(
    temperature, collision_rate, stepsize
)

platform = Platform.getPlatformByName('CUDA')
simulation = Simulation(waterbox.topology, system, integrator,platform=platform, platformProperties={'Precision':'mixed'})
simulation.context.setPositions(waterbox.positions)

simulation.reporters.append(
    HDF5Reporter(
        "./tmp.h5",
        10,
    )
)
simulation.reporters.append(StateDataReporter(stdout, 10, step=True, potentialEnergy=True, temperature=True))
simulation.step(5_000)

RaulPPelaez commented 1 year ago

I am able to reproduce this. We believe it might be a problem with periodic boundary conditions in nnpops.

RaulPPelaez commented 1 year ago

I compared the norms of the forces provided by both implementations using this code:

from openmm.app import Simulation
from openmm import unit, LangevinIntegrator, Platform
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
import numpy as np

box_edge = 15 * unit.angstrom
testsystem = WaterBox(box_edge, cutoff=7 * unit.angstrom)
potential = MLPotential("ani2x")
platform = Platform.getPlatformByName("CUDA")
prop = dict(CudaPrecision="mixed")
forces={}
positions=[]
for s in ("nnpops","torchani"):
    system = potential.createSystem(
        testsystem.topology, implementation=s
    )
    integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 0 * unit.picoseconds)
    simulation=Simulation(testsystem.topology, system, integrator, platform, prop)
    simulation.context.setPositions(testsystem.positions)
    forces[s] = simulation.context.getState(getForces=True).getForces().value_in_unit(unit.kilojoules/unit.mole/unit.nanometer)
    positions = simulation.context.getState(getPositions=True, enforcePeriodicBox=True).getPositions().value_in_unit(unit.nanometer)

fnorms_nnpops=np.linalg.norm(forces["nnpops"],axis=1)
fnorms_torchani=np.linalg.norm(forces["torchani"],axis=1)
with open("fnorms.dat", 'w') as f:
    for p,i,j in zip(positions, fnorms_nnpops, fnorms_torchani):
        f.write(f"{p.x} {p.y} {p.z} {i} {j}\n")

error = np.abs((fnorms_nnpops - fnorms_torchani)/fnorms_torchani)

Then, I take all the particles with height near the center of the domain (z\in [0.55:0.95] nm) and plot a heatmap of the relative error in the force norm:

This could be consistent with a bug in the PBC

RaulPPelaez commented 1 year ago

I turned off one by one each of the optimized components here: https://github.com/openmm/NNPOps/blob/3c96f5be36e950543131eaa004533df819930f66/src/pytorch/OptimizedTorchANI.py#L41-L45 The error arises when using the aev_computer, replacing the line by:

self.aev_computer = model.aev_computer

Results in relative error around machine precision as one would expect.

wiederm commented 1 year ago

Thanks for tracking this down! I will give it a try!

RaulPPelaez commented 1 year ago

Many of the CUDA tests in NNPOps fail for me, @peastman @raimis, could you confirm?

Test project /shared/raul/NNPOps/build
      Start  1: TestCpuANISymmetryFunctions
 1/13 Test  #1: TestCpuANISymmetryFunctions ......   Passed    1.50 sec
      Start  2: TestCpuCFConv
 2/13 Test  #2: TestCpuCFConv ....................   Passed    0.48 sec
      Start  3: TestCudaANISymmetryFunctions
 3/13 Test  #3: TestCudaANISymmetryFunctions .....Subprocess aborted***Exception:   1.80 sec
      Start  4: TestCudaCFConv
 4/13 Test  #4: TestCudaCFConv ...................Subprocess aborted***Exception:   1.82 sec
      Start  5: TestBatchedNN

 5/13 Test  #5: TestBatchedNN ....................***Failed  203.79 sec
      Start  6: TestCFConv
 6/13 Test  #6: TestCFConv .......................   Passed    5.28 sec
      Start  7: TestCFConvNeighbors
 7/13 Test  #7: TestCFConvNeighbors ..............   Passed    3.38 sec
      Start  8: TestEnergyShifter
 8/13 Test  #8: TestEnergyShifter ................   Passed  108.41 sec
      Start  9: TestOptimizedTorchANI
 9/13 Test  #9: TestOptimizedTorchANI ............***Failed  216.59 sec
      Start 10: TestSpeciesConverter
10/13 Test #10: TestSpeciesConverter .............   Passed  111.40 sec
      Start 11: TestSymmetryFunctions
11/13 Test #11: TestSymmetryFunctions ............***Failed  129.94 sec
      Start 12: TestNeighbors
12/13 Test #12: TestNeighbors ....................***Exception: SegFault  5.57 sec
      Start 13: TestGetNeighborPairs
13/13 Test #13: TestGetNeighborPairs .............   Passed    2.47 sec

54% tests passed, 6 tests failed out of 13

Total Test time (real) = 792.56 sec

The following tests FAILED:
          3 - TestCudaANISymmetryFunctions (Subprocess aborted)
          4 - TestCudaCFConv (Subprocess aborted)
          5 - TestBatchedNN (Failed)
          9 - TestOptimizedTorchANI (Failed)
         11 - TestSymmetryFunctions (Failed)
         12 - TestNeighbors (SEGFAULT)

sef43 commented 1 year ago

Mine mostly pass:

      Start  1: TestCpuANISymmetryFunctions
 1/13 Test  #1: TestCpuANISymmetryFunctions ......   Passed    2.22 sec
      Start  2: TestCpuCFConv
 2/13 Test  #2: TestCpuCFConv ....................   Passed    0.87 sec
      Start  3: TestCudaANISymmetryFunctions
 3/13 Test  #3: TestCudaANISymmetryFunctions .....   Passed    2.65 sec
      Start  4: TestCudaCFConv
 4/13 Test  #4: TestCudaCFConv ...................   Passed    3.03 sec
      Start  5: TestBatchedNN
 5/13 Test  #5: TestBatchedNN ....................***Failed  155.24 sec
      Start  6: TestCFConv
 6/13 Test  #6: TestCFConv .......................***Failed    4.81 sec
      Start  7: TestCFConvNeighbors
 7/13 Test  #7: TestCFConvNeighbors ..............   Passed    4.92 sec
      Start  8: TestEnergyShifter
 8/13 Test  #8: TestEnergyShifter ................   Passed   97.31 sec
      Start  9: TestOptimizedTorchANI
 9/13 Test  #9: TestOptimizedTorchANI ............   Passed  163.42 sec
      Start 10: TestSpeciesConverter
10/13 Test #10: TestSpeciesConverter .............   Passed  105.67 sec
      Start 11: TestSymmetryFunctions
11/13 Test #11: TestSymmetryFunctions ............   Passed  120.57 sec
      Start 12: TestNeighbors
12/13 Test #12: TestNeighbors ....................   Passed    9.12 sec
      Start 13: TestGetNeighborPairs
13/13 Test #13: TestGetNeighborPairs .............   Passed    2.79 sec

85% tests passed, 2 tests failed out of 13

Total Test time (real) = 672.64 sec

The following tests FAILED:
      5 - TestBatchedNN (Failed)
      6 - TestCFConv (Failed)

The failures are just a couple of floating point assertion errors that might be stochastic

This was using a build environment created with NNPOPs/environment.yaml but with cudatoolkit=11.7 to match my installed cuda

sef43 commented 1 year ago

I think these lines in SymmetryFunctions.cpp

            if (device.is_cpu()) {
                impl = std::make_shared<CpuANISymmetryFunctions>(numAtoms, numSpecies, Rcr, Rca, false, atomSpecies_, radialFunctions, angularFunctions, true);
                                                                                                 ^^^^^

#ifdef ENABLE_CUDA
            } else if (device.is_cuda()) {
                // PyTorch allow to chose GPU with "torch.device", but it doesn't set as the default one.
                CHECK_CUDA_RESULT(cudaSetDevice(device.index()));
                impl = std::make_shared<CudaANISymmetryFunctions>(numAtoms, numSpecies, Rcr, Rca, false, atomSpecies_, radialFunctions, angularFunctions, true);
                                                                                                 ^^^^^
#endif
            } else

mean that the periodic flag is hard coded and always set to false and the templated <PERIODIC> displacement/distance calculations in CpuANISymmetryFunctions.cpp / CudaANISymmetryFunctions.cu are never actually used. Or am I wrongly understanding how the .cpp/.cu code is called from the python?

peastman commented 1 year ago

I believe you're correct about that. If we replace false with cellPtr != nullptr, that should pass the correct value for whether to use periodic boundary conditions.

RaulPPelaez commented 1 year ago

Amazing catch @sef43 ! @peastman's suggestion fixes the issue. The following test now passes:

from openmm.app import Simulation
from openmm import unit, LangevinIntegrator, Platform
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox
import numpy as np

box_edge = 15 * unit.angstrom
testsystem = WaterBox(box_edge, cutoff=7 * unit.angstrom)
potential = MLPotential("ani2x")
platform = Platform.getPlatformByName("CPU")
prop = dict(CudaPrecision="mixed")
forces={}
positions=[]
for s in ("nnpops","torchani"):
    system = potential.createSystem(
        testsystem.topology, implementation=s
    )
    print(f"Implementation {s}")
    file=open(f"{s}.dat", 'w')
    integrator = LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 0 * unit.picoseconds)
    simulation=Simulation(testsystem.topology, system, integrator, platform, prop)
    simulation.context.setPositions(testsystem.positions)
    forces[s] = simulation.context.getState(getForces=True).getForces().value_in_unit(unit.kilojoules/unit.mole/unit.nanometer)
    positions = simulation.context.getState(getPositions=True, enforcePeriodicBox=True).getPositions().value_in_unit(unit.nanometer)

fnorms_nnpops=np.linalg.norm(forces["nnpops"],axis=1)
fnorms_torchani=np.linalg.norm(forces["torchani"],axis=1)

error = np.abs((fnorms_nnpops - fnorms_torchani)/fnorms_torchani)
print(f"Maximum error: {np.max(error)}")
print(f"Mean error: {np.mean(error)}")
print(f"Std error: {np.std(error)}")

Printing:

Maximum error: 1.1337166912735413e-05
Mean error: 1.2383139611804328e-06
Std error: 1.5981482745315712e-06

wiederm commented 1 year ago

this is great, thank you!

openmm / NNPOps

Surprising differences between nnpops and torchani #82