openmm / NNPOps

High-performance operations for neural network potentials

Minimization fails due to a device change #112

Closed: raimis closed this issue 9 months ago

raimis commented 11 months ago

Create a system with two hydrogen atoms. Artificially scale up the energy (and forces) of ANI-2x and minimize the system.

import torch as pt
from NNPOps import OptimizedTorchANI
from openmm import Context, LocalEnergyMinimizer, Platform, System, VerletIntegrator
from openmmtorch import TorchForce
from torchani.models import ANI2x

scale = 1e10       # artificially large energy scale, so the forces become huge
platform = "CUDA"

class Model(pt.nn.Module):
    def __init__(self, scale):
        super().__init__()
        self.scale = scale
        self.species = pt.tensor([1, 1]).unsqueeze(0)
        self.model = ANI2x(periodic_table_index=True)
        self.model = OptimizedTorchANI(self.model, self.species)
    def forward(self, positions):
        positions = positions.unsqueeze(0).to(pt.float32)
        return self.scale * self.model.forward((self.species, positions))[1]

force = TorchForce(pt.jit.script(Model(scale)))

system = System()
system.addForce(force)
for _ in range(2):
    system.addParticle(1)

platform = Platform.getPlatformByName(platform)
context = Context(system, VerletIntegrator(1), platform)

context.setPositions([[0, 0, 0], [1, 0, 0]])
LocalEnergyMinimizer.minimize(context)  # this call triggers the error below

The script crashes with a peculiar error:

RuntimeError: The device of "positions" has changed
/scratch/users/raimis/opt/mambaforge/envs/nnpops/lib/python3.11/site-packages/torchani/__init__.py:55: UserWarning: Dependency not satisfied, torchani.ase will not be available
  warnings.warn("Dependency not satisfied, torchani.ase will not be available")
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
/scratch/users/raimis/opt/mambaforge/envs/nnpops/lib/python3.11/site-packages/torchani/resources/
Traceback (most recent call last):
  File "/home/raimis/nnpops.git/debug.py", line 34, in <module>
    LocalEnergyMinimizer.minimize(context)
  File "/scratch/users/raimis/opt/mambaforge/envs/nnpops/lib/python3.11/site-packages/openmm/openmm.py", line 17208, in minimize
    return _openmm.LocalEnergyMinimizer_minimize(context, tolerance, maxIterations)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__.py", line 15, in forward
    model = self.model
    species = self.species
    _0 = (model).forward((species, positions0), None, None, )
          ~~~~~~~~~~~~~~ <--- HERE
    return torch.mul((_0)[1], scale)
  File "code/__torch__/NNPOps/OptimizedTorchANI.py", line 17, in forward
    species_coordinates0 = (species_converter).forward(species_coordinates, None, None, )
    aev_computer = self.aev_computer
    species_aevs = (aev_computer).forward(species_coordinates0, cell, pbc, )
                    ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    neural_networks = self.neural_networks
    species_energies = (neural_networks).forward(species_aevs, )
  File "code/__torch__/NNPOps/SymmetryFunctions.py", line 37, in forward
    cell0 = cell
    holder = self.holder
    _4 = ops.NNPOpsANISymmetryFunctions.operation(holder, torch.select(positions, 0, 0), cell0)
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    radial, angular, = _4
    features = torch.unsqueeze(torch.cat([radial, angular], 1), 0)

Traceback of TorchScript, original code (most recent call last):
  File "/home/raimis/nnpops.git/debug.py", line 21, in forward
    def forward(self, positions):
        positions = positions.unsqueeze(0).to(pt.float32)
        return self.scale * self.model.forward((self.species, positions))[1]
                            ~~~~~~~~~~~~~~~~~~ <--- HERE
  File "/scratch/users/raimis/opt/mambaforge/envs/nnpops/lib/python3.11/site-packages/NNPOps/OptimizedTorchANI.py", line 52, in forward
    species_coordinates = self.species_converter(species_coordinates)
    species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
                   ~~~~~~~~~~~~~~~~~ <--- HERE
    species_energies = self.neural_networks(species_aevs)
    species_energies = self.energy_shifter(species_energies)
  File "/scratch/users/raimis/opt/mambaforge/envs/nnpops/lib/python3.11/site-packages/NNPOps/SymmetryFunctions.py", line 120, in forward
    raise ValueError('Only fully periodic systems are supported, i.e. pbc = [True, True, True]')
    radial, angular = operation(self.holder, positions[0], cell)
                      ~~~~~~~~~ <--- HERE
    features = torch.cat((radial, angular), dim=1).unsqueeze(0)
RuntimeError: The device of "positions" has changed

If the energy scale is reduced, or the platform is changed to CPU, the error disappears:

scale = 1e9
platform = "CUDA"

scale = 1e10
platform = "CPU"

raimis commented 11 months ago

The minimizer switches to the CPU when the forces are too large for the GPU:

https://github.com/openmm/openmm/blob/065e34ab56259d56dae81655f14fcfcbdb077788/openmmapi/src/LocalEnergyMinimizer.cpp#L112-L124
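
A toy Python sketch of that fallback (illustration only; the real logic is the C++ linked above, and FORCE_THRESHOLD is a made-up value): when the forces coming back from the fast platform are too large, the energy is re-evaluated on the CPU, which is the moment a TorchForce model suddenly receives CPU tensors.

import torch

FORCE_THRESHOLD = 1e6  # made-up number, only for the sketch

def compute(positions: torch.Tensor) -> torch.Tensor:
    # Stand-in for an energy/force evaluation on whatever device "positions" lives on
    return positions * positions.square().sum()

def minimizer_step(positions_gpu: torch.Tensor) -> torch.Tensor:
    forces = compute(positions_gpu)            # fast platform (e.g. CUDA)
    if forces.abs().max() > FORCE_THRESHOLD:
        # Forces too large for the fast platform: re-evaluate on the CPU.
        # With a TorchForce, this is where the model sees a device change.
        forces = compute(positions_gpu.cpu())
    return forces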

raimis commented 11 months ago

Meanwhile, the symmetry functions operation sets up a device once and reuses it:

https://github.com/openmm/NNPOps/blob/d15cb9196e283b6b55f88a93d85232458f64fa18/src/pytorch/SymmetryFunctions.cpp#L104-L143
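
A self-contained Python sketch of that caching pattern (for illustration only; the real implementation is the linked C++): the device is recorded on the first call, and any later call on a different device fails with exactly the message from the traceback above.

import torch

class CachedDeviceOp:
    def __init__(self):
        self.device = None  # set lazily on the first call

    def __call__(self, positions: torch.Tensor) -> torch.Tensor:
        if self.device is None:
            self.device = positions.device   # implementation built for this device
        elif positions.device != self.device:
            raise RuntimeError('The device of "positions" has changed')
        return positions.sum()               # stand-in for the real computation

op = CachedDeviceOp()
op(torch.zeros(3, device="cuda"))  # first call sets up on CUDA (requires a CUDA device)
op(torch.zeros(3))                 # later CPU call -> RuntimeError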

RaulPPelaez commented 11 months ago

A quick solution would be to keep a map with both implementations (initialized on demand) and simply choose the correct one depending on the device of the positions; see the sketch below. Let's see if doing this in SymmetryFunctions is enough to get rid of the issue.
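
A minimal sketch of that map idea in plain Python (the actual fix would live in the C++ of SymmetryFunctions; CachedDeviceOp is the toy class from the previous sketch):

import torch
from typing import Dict

class PerDeviceOp:
    def __init__(self):
        # One implementation per device, created on demand
        self.impls: Dict[torch.device, CachedDeviceOp] = {}

    def __call__(self, positions: torch.Tensor) -> torch.Tensor:
        impl = self.impls.get(positions.device)
        if impl is None:
            impl = CachedDeviceOp()          # initialize lazily for this device
            self.impls[positions.device] = impl
        return impl(positions)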

RaulPPelaez commented 11 months ago

This works, but it just makes the same kind of error pop up a bit later on. Let me evaluate how hard it would be to apply the same thing to the rest of OptimizedTorchANI... In the meantime you can use this as a quick workaround:

    # Drop-in replacement for NNPOps.OptimizedTorchANI.forward: move every input
    # to CUDA before dispatching, so the symmetry functions always see the device
    # they were set up on.
    def forward(self, species_coordinates: Tuple[Tensor, Tensor],
                cell: Optional[Tensor] = None,
                pbc: Optional[Tensor] = None) -> SpeciesEnergies:
        # Send everything to CUDA
        species_coordinates = (species_coordinates[0].cuda(), species_coordinates[1].cuda())
        if cell is not None:
            cell = cell.cuda()
        if pbc is not None:
            pbc = pbc.cuda()
        species_coordinates = self.species_converter(species_coordinates)
        species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
        species_energies = self.neural_networks(species_aevs)
        species_energies = self.energy_shifter(species_energies)

        return species_energies
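
One way to wire this into the reproducer script (my assumption, not something stated in this thread) is to define the patched method as a module-level function and monkey-patch it onto the class before the model is TorchScript-compiled:

# Hypothetical usage of the patched forward above: assumes it was defined at
# module level (same body, same signature) under the name "forward".
from NNPOps import OptimizedTorchANI

OptimizedTorchANI.forward = forward                  # patch before scripting
force = TorchForce(pt.jit.script(Model(scale)))      # script after patching
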
RaulPPelaez commented 11 months ago

The minimizer switches to the CPU when the forces are too large for the GPU:

https://github.com/openmm/openmm/blob/065e34ab56259d56dae81655f14fcfcbdb077788/openmmapi/src/LocalEnergyMinimizer.cpp#L112-L124

While I understand that the problem here is an unexpected device migration, and I also see the rationale behind this recomputation on the CPU, I do not see how the lines you shared actually change the device of the positions. What I gather from them is that the energy is recomputed on the CPU side, but without changing anything else. Could you share some more context?

raimis commented 11 months ago

I have no idea where the transfer happens.

@peastman should know.

RaulPPelaez commented 11 months ago

Let's move the discussion to #113.

RaulPPelaez commented 9 months ago

I think this can be closed, since the issue was solved in OpenMM-Torch.