Closed: Yangxinsix closed this issue 1 year ago.
Which tutorial are you running? Can you provide the exact code that leads to the error?
from openmmtorch import TorchForce
import sys
import openmm
from openmm import LangevinMiddleIntegrator
from openmm.app import Simulation, StateDataReporter, Topology, Modeller
from openmm import unit
from openmm.app.element import Element
import torch
from torch import nn
from PaiNN.data import NeighborList
from PaiNN.model import PainnModel
from ase.io import read, write
import numpy as np
# create simulation system
atoms = read('work/dataset/corrected_ads_images.traj', 100)
pos = atoms.get_positions() / 10
box_vectors = atoms.get_cell() / 10
elements = atoms.get_chemical_symbols()
# Create a topology object
topology = Topology()
# Add atoms to the topology
chain = topology.addChain()
res = topology.addResidue("mace_system", chain)
for i, (element, position) in enumerate(zip(elements, pos)):
    e = Element.getBySymbol(element)
    topology.addAtom(str(i), e, res)
# if there is a periodic box specified add it to the Topology
if np.all(atoms.pbc):
    topology.setPeriodicBoxVectors(vectors=box_vectors)
# Create a modeller object
modeller = Modeller(topology, pos)
# Create a system object
system = openmm.System()
if topology.getPeriodicBoxVectors() is not None:
    system.setDefaultPeriodicBoxVectors(*topology.getPeriodicBoxVectors())
for atom in topology.atoms():
    if atom.element is None:
        system.addParticle(0)
    else:
        system.addParticle(atom.element.mass)
# Wrapper model for simulation
class PainnOpenmm(nn.Module):
    def __init__(self, elements: torch.Tensor, model: PainnModel) -> None:
        super().__init__()
        self.neigh_list = NeighborList(model.cutoff)
        self.model = model
        self.register_buffer('elems', elements)

    def forward(self, positions: torch.Tensor, cell: torch.Tensor):
        print(f'Device of positions: {positions.device}')
        pairs, pair_diff, pair_dist = self.neigh_list(positions, cell)
        input_dict = {
            'pairs': pairs,
            'n_diff': pair_diff,
            'n_dist': pair_dist,
            'num_atoms': torch.tensor([positions.shape[0]], dtype=pairs.dtype, device=pairs.device),
            'num_pairs': torch.tensor([pairs.shape[0]], dtype=pairs.dtype, device=pairs.device),
            'elems': self.elems,
        }
        output = self.model(input_dict)
        return (output['energy'], output['forces'])
# load trained model
state_dict = torch.load('/work3/xinyang/work/models/ads_images/128_node_3_layer.pth')
model = PainnModel(
    num_interactions=state_dict['num_layer'],
    hidden_state_size=state_dict['node_size'],
    cutoff=state_dict['cutoff'],
    normalization=False,
)
model.load_state_dict(state_dict['model'])
# model deploy
elems = torch.from_numpy(atoms.get_atomic_numbers())
positions = torch.from_numpy(atoms.get_positions()).float()
cell = torch.from_numpy(atoms.cell[:]).float()
openmm_ff = PainnOpenmm(elements=elems, model=model)
openmm_ff.cuda()
torch.jit.script(openmm_ff).save('deployed_model')
# load force field
force = TorchForce('deployed_model')
force.setUsesPeriodicBoundaryConditions(True)
force.setOutputsForces(True)
system.addForce(force)
# set up initial parameters
temperature = 298.15 * unit.kelvin
frictionCoeff = 1 / unit.picosecond
timeStep = 1 * unit.femtosecond
integrator = LangevinMiddleIntegrator(temperature, frictionCoeff, timeStep)
# setup simulations
simulation = Simulation(topology, system, integrator)
simulation.context.setPositions(modeller.getPositions())
reporter = StateDataReporter(file=sys.stdout, reportInterval=1, step=True, time=True, potentialEnergy=True, temperature=True)
simulation.reporters.append(reporter)
All of the above code ran successfully. I also tested MD simulation with the same model via ASE, and it runs absolutely fine for more than 10 million steps, with no OOM problem even on my own laptop's GPU with 4 GB of memory.
The following two lines trigger the CUDA out-of-memory error:
state = simulation.context.getState(getEnergy=True)
simulation.step(100)
I also tried running the above code on a Tesla A100 GPU with 40 GB of memory. It gives the same error:
RuntimeError: CUDA out of memory. Tried to allocate 4.08 GiB (GPU 0; 39.43 GiB total capacity; 34.11 GiB already allocated; 724.31 MiB free; 37.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
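For reference, the max_split_size_mb knob suggested by that error message is a fragmentation workaround applied via an environment variable before launching Python (the value 128 here is an illustrative choice, not a recommendation from the thread):

```shell
# Allocator workaround from the PyTorch OOM message: cap the size of
# cached CUDA memory blocks that may be split, to reduce fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

This only helps when reserved memory far exceeds allocated memory; it cannot fix a genuinely oversized allocation.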
So I'm fairly sure there is a memory leak somewhere.
One possible reason for your error could be units: ASE positions are in Angstroms, while OpenMM positions are in nanometers. I notice you do pos = atoms.get_positions() / 10 in the setup; I assume this is to turn ASE Angstroms into OpenMM nanometers? In the forward method you do no unit conversions. The positions OpenMM passes into forward will be in nanometers. Is that what the model expects, or should they be converted to Angstroms? Also check the energy and force units; you may need to put conversions in the forward method. OpenMM uses kJ/mol for energy. What does the model you are using use?
(edit: I initially wrote kcal/mol incorrectly; here are the OpenMM units: http://docs.openmm.org/latest/userguide/theory/01_introduction.html#units)
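A minimal sketch of the conversions involved, assuming the model was trained in ASE units (Angstrom for length, eV for energy; verify this against your own training setup). The function names and constants below are illustrative, not part of the plugin's API:

```python
import torch

# Assumed model units: Angstrom and eV (typical for ASE-trained potentials).
# OpenMM works in nanometers and kJ/mol.
NM_TO_ANGSTROM = 10.0
EV_TO_KJ_PER_MOL = 96.485  # 1 eV = 96.485 kJ/mol

def to_model_units(positions_nm: torch.Tensor) -> torch.Tensor:
    """OpenMM hands forward() positions in nm; convert to Angstrom for the model."""
    return positions_nm * NM_TO_ANGSTROM

def to_openmm_units(energy_ev: torch.Tensor, forces_ev_per_ang: torch.Tensor):
    """Convert model outputs back: energy eV -> kJ/mol, forces eV/A -> kJ/mol/nm."""
    energy = energy_ev * EV_TO_KJ_PER_MOL
    forces = forces_ev_per_ang * EV_TO_KJ_PER_MOL * NM_TO_ANGSTROM
    return energy, forces
```

In a wrapper like PainnOpenmm, these conversions would be applied at the top and bottom of forward, around the unchanged model call.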
Thanks a lot for your explanation! The error was indeed due to units: with positions in nm, the constructed neighbor list becomes much larger, so the model requests far more memory.
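The blow-up can be illustrated without the plugin: applying an Angstrom-scale cutoff to coordinates expressed in nanometers puts essentially every atom inside every other atom's cutoff sphere, so the pair list grows quadratically. A toy brute-force count (all sizes hypothetical):

```python
import torch

def count_pairs(positions: torch.Tensor, cutoff: float) -> int:
    """Brute-force directed neighbor-pair count within a cutoff (no PBC)."""
    dist = torch.cdist(positions, positions)
    mask = (dist < cutoff) & ~torch.eye(len(positions), dtype=torch.bool)
    return int(mask.sum())

n = 200
pos_angstrom = torch.rand(n, 3) * 20.0   # atoms in a 20 Angstrom box
pos_nm = pos_angstrom / 10.0             # the same structure expressed in nm
cutoff = 5.0                             # model cutoff, defined in Angstrom

print(count_pairs(pos_angstrom, cutoff))  # only genuinely nearby atoms
print(count_pairs(pos_nm, cutoff))        # n*(n-1) = 39800: every atom pairs with every other
```

In nm, the whole 2 nm box fits inside the 5-unit cutoff, so the pair list (and the memory the model requests for it) scales as O(N^2) instead of O(N).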
I'm trying to follow the NNPOps tutorial, but it fails at the second step: it installs nothing and always crashes my Colab session.
Then I tried to follow this tutorial on my own laptop to create a PyTorch force field myself. Fortunately, the installation finally worked, but the simulation fails with a CUDA out of memory error without running even a single step. I tried an extremely small model with only ~100 parameters and tested whether a computational graph was accumulating by running it many times (more than 100), but it still gives me this error.
I'm not sure whether there is a memory leak in the plugin or whether the memory required by openmm is simply too large. Could you help me check that? Thanks a lot.
This is the error information:
This is my conda environment: