CUDA out of memory #99

Yangxinsix commented 1 year ago

I'm trying to follow the NNPOps tutorial, but it fails at the second step: nothing gets installed and my Colab session always crashes.

I then followed the tutorial on my own laptop to create a PyTorch force field myself. The installation eventually worked, but the simulation fails with a CUDA out-of-memory error before running even a single step.

I tried an extremely small model with only ~100 parameters, and I checked for an accumulating computational graph by running the model more than 100 times. I still get this error.

I'm not sure whether there is a memory leak in the plugin or whether OpenMM simply requires too much memory. Could you help me check?

Thanks a lot.

This is the error information:

Traceback of TorchScript, serialized code (most recent call last):
  File "code/", line 25, in forward
    input_dict = {"pairs": _2, "n_diff": _3, "n_dist": _4, "num_atoms": _5, "num_pairs": _6, "elems": elems}
    model = self.model
    output = (model).forward(input_dict, True, )
              ~~~~~~~~~~~~~~ <--- HERE
    _7 = (torch.detach(output["energy"]), torch.detach(output["forces"]))
    return _7
  File "code/__torch__/PaiNN/", line 53, in forward
    _11 = getattr(update_layers, "1")
    _21 = getattr(update_layers, "2")
    _8 = (_00).forward(node_scalar, node_vector, edge0, edge_diff, edge_dist, )
          ~~~~~~~~~~~~ <--- HERE
    def forward(self, input: Tensor) -> Tensor:
        return F.linear(input, self.weight, self.bias)
               ~~~~~~~~ <--- HERE
RuntimeError: CUDA out of memory. Tried to allocate 4.96 GiB (GPU 0; 4.00 GiB total capacity; 903.65 MiB already allocated; 1.89 GiB free; 1.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This is my conda environment:

peastman commented 1 year ago

Which tutorial are you running? Can you provide the exact code that leads to the error?

Yangxinsix commented 1 year ago
from openmmtorch import TorchForce
import sys
import openmm
from openmm import LangevinMiddleIntegrator
from openmm.app import Simulation, StateDataReporter, Topology, Modeller
from openmm import unit
from openmm.app import Element

import torch
from torch import nn
from import NeighborList
from PaiNN.model import PainnModel
from ase.io import read, write
import numpy as np

# create simulation system
atoms = read('work/dataset/corrected_ads_images.traj', 100)

pos = atoms.get_positions() / 10
box_vectors = atoms.get_cell() / 10
elements = atoms.get_chemical_symbols()

# Create a topology object
topology = Topology()

# Add atoms to the topology
chain = topology.addChain()
res = topology.addResidue("mace_system", chain)
for i, (element, position) in enumerate(zip(elements, pos)):
    e = Element.getBySymbol(element)
    topology.addAtom(str(i), e, res)
# if there is a periodic box specified add it to the Topology
if np.all(atoms.pbc):

# Create a modeller object
modeller = Modeller(topology, pos)

# Create a system object
system = openmm.System()
if topology.getPeriodicBoxVectors() is not None:
for atom in topology.atoms():
    if atom.element is None:

# Wrapper model for simulation
class PainnOpenmm(nn.Module):
    def __init__(self, elements: torch.Tensor, model: PainnModel) -> None:
        super().__init__()
        self.neigh_list = NeighborList(model.cutoff)
        self.model = model
        self.register_buffer('elems', elements)

    def forward(self, positions: torch.Tensor, cell: torch.Tensor):
        print(f'Device of positions: {positions.device}')
        pairs, pair_diff, pair_dist = self.neigh_list(positions, cell)
        input_dict = {
            'pairs': pairs,
            'n_diff': pair_diff,
            'n_dist': pair_dist,
            'num_atoms': torch.tensor([positions.shape[0]], dtype=pairs.dtype, device=pairs.device),
            'num_pairs': torch.tensor([pairs.shape[0]], dtype=pairs.dtype, device=pairs.device),
            'elems': self.elems,
        }
        output = self.model(input_dict)

        return (output['energy'], output['forces'])

# load trained model
state_dict = torch.load('/work3/xinyang/work/models/ads_images/128_node_3_layer.pth')
model = PainnModel(

# model deploy
elems = torch.from_numpy(atoms.get_atomic_numbers())
positions = torch.from_numpy(atoms.get_positions()).float()
cell = torch.from_numpy(atoms.cell[:]).float()

openmm_ff = PainnOpenmm(elements=elems, model=model)

# load force field
force = TorchForce('deployed_model')

# set up initial parameters
temperature = 298.15 * unit.kelvin
frictionCoeff = 1 / unit.picosecond
timeStep = 1 * unit.femtosecond
integrator = LangevinMiddleIntegrator(temperature, frictionCoeff, timeStep)

# setup simulations
simulation = Simulation(topology, system, integrator)
reporter = StateDataReporter(file=sys.stdout, reportInterval=1, step=True, time=True, potentialEnergy=True, temperature=True)

All of the code above ran successfully. I also tested an MD simulation with this model via ASE, and it runs fine for more than 10 million steps. No OOM error appears, even on my own laptop with a 4 GB GPU.

The following two lines raise the CUDA out-of-memory error:

state = simulation.context.getState(getEnergy=True)

I also ran the code above on a Tesla A100 GPU with 40 GB of memory, and it gives the same error:

RuntimeError: CUDA out of memory. Tried to allocate 4.08 GiB (GPU 0; 39.43 GiB total capacity; 34.11 GiB already allocated; 724.31 MiB free; 37.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

So I'm quite sure there is a memory leak somewhere.

sef43 commented 1 year ago

One possible reason for your error is units: I think ASE uses Angstroms for positions, while OpenMM uses nanometers. I notice you do pos = atoms.get_positions() / 10 in the setup; I assume this converts ASE Angstroms into OpenMM nanometers? The forward method does no unit conversion, so the positions OpenMM passes into forward will be in nanometers. Is that what the model expects, or should they be converted to Angstroms? Also check the energy and force units; you may need to add conversions in the forward method. OpenMM uses kJ/mol for energy. Which units does your model use?
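
A minimal sketch of the conversions described above, assuming the model follows ASE conventions (Angstrom, eV, eV/Angstrom). Those model units are an assumption and are not confirmed in this thread. The helper names openmm_to_model and model_to_openmm are hypothetical, and the eV to kJ/mol factor is rounded.

```python
# Conversion factors between OpenMM units and ASE-style units.
# ASSUMPTION: the model works in Angstrom / eV / eV per Angstrom (ASE defaults).
NM_TO_ANGSTROM = 10.0
EV_TO_KJ_PER_MOL = 96.4853  # 1 eV per particle expressed in kJ/mol (rounded)


def openmm_to_model(positions_nm, cell_nm):
    """Convert OpenMM lengths (nm) to the lengths the model expects (Angstrom)."""
    return positions_nm * NM_TO_ANGSTROM, cell_nm * NM_TO_ANGSTROM


def model_to_openmm(energy_ev, forces_ev_per_angstrom):
    """Convert model outputs to OpenMM units: kJ/mol and kJ/(mol nm)."""
    energy = energy_ev * EV_TO_KJ_PER_MOL
    # eV/Angstrom -> kJ/(mol nm): one factor for energy, one for 1/length.
    forces = forces_ev_per_angstrom * EV_TO_KJ_PER_MOL * NM_TO_ANGSTROM
    return energy, forces


# Scalars stand in for tensors here; the functions only multiply, so the
# same code applies element-wise to torch tensors.
pos_a, cell_a = openmm_to_model(0.15, 1.2)
energy_kj, force_kj_nm = model_to_openmm(-1.0, 0.5)
print(pos_a, cell_a, energy_kj, force_kj_nm)
```

In a wrapper like the one posted earlier, the first helper would be applied to positions and cell at the top of forward, and the second to the model outputs just before returning them.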

(edit: I initially incorrectly wrote kcal/mol, here are OpenMM units:

Yangxinsix commented 1 year ago

Thanks a lot for your explanation! The error was indeed due to units: with positions in nm, the constructed neighbor list became much larger, so the model requested much more memory.
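
For anyone hitting the same problem, here is a rough sketch of why the neighbor list blows up. It is a brute-force periodic pair count in plain Python, not the NeighborList class used above, and the lattice, box size, and cutoff value of 5.0 are made-up illustration values. The same cutoff number is applied once to coordinates in Angstrom and once to the same coordinates in nm.

```python
import itertools
import math


def count_pairs(positions, box_length, cutoff):
    """Count directed neighbor pairs within `cutoff` in a cubic periodic box."""
    # Enough periodic images to cover the cutoff sphere in every direction.
    n_images = math.ceil(cutoff / box_length) + 1
    shifts = [
        (sx * box_length, sy * box_length, sz * box_length)
        for sx, sy, sz in itertools.product(range(-n_images, n_images + 1), repeat=3)
    ]
    cutoff_sq = cutoff * cutoff
    count = 0
    for xi, yi, zi in positions:
        for xj, yj, zj in positions:
            for sx, sy, sz in shifts:
                dx, dy, dz = xj - xi + sx, yj - yi + sy, zj - zi + sz
                d_sq = dx * dx + dy * dy + dz * dz
                # Skip the zero-distance self pair; keep periodic self-images.
                if 0.0 < d_sq < cutoff_sq:
                    count += 1
    return count


# Hypothetical test system: 8 atoms on a simple cubic lattice, 3 Angstrom apart.
spacing = 3.0
positions_angstrom = [
    (i * spacing, j * spacing, k * spacing)
    for i, j, k in itertools.product(range(2), repeat=3)
]
box_angstrom = 2 * spacing
cutoff = 5.0  # meant as Angstrom (illustrative value)

pairs_angstrom = count_pairs(positions_angstrom, box_angstrom, cutoff)

# Same system in nm, but the cutoff number is left unchanged (the mistake).
positions_nm = [(x / 10, y / 10, z / 10) for x, y, z in positions_angstrom]
pairs_nm = count_pairs(positions_nm, box_angstrom / 10, cutoff)

print(pairs_angstrom, pairs_nm, pairs_nm / pairs_angstrom)
```

Shrinking all lengths by 10 while keeping the cutoff fixed makes the cutoff sphere cover roughly 1000 times more periodic images, so every per-pair tensor in the model grows by about the same factor.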