openmm / NNPOps

High-performance operations for neural network potentials

CUDA error: no kernel image #110

Closed. JSLJ23 closed this issue 11 months ago

JSLJ23 commented 11 months ago

CUDA error: no kernel image

Hi NNPOps developers, I was trying to run the NNPOps example, but with alanine dipeptide as the test system, and I am running into CUDA RuntimeErrors indicating that no kernel image is available. I am not sure how to go about debugging this, so I was hoping to get some help with it.

Best regards, Joshua

Full error:

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Environment:

mamba list torch
>>> # Name                    Version                   Build  Channel
>>> openmm-torch              1.1             cuda112py310h43efcb7_0    conda-forge
>>> pytorch                   2.0.0           cuda112py310he33e0d6_200    conda-forge
>>> pytorch-gpu               2.0.0           cuda112py310h9871d0b_200    conda-forge
>>> torchani                  2.2.3           cuda112py310h73eb55c_2    conda-forge

mamba list cuda
>>> # Name                    Version                   Build  Channel
>>> cuda-version              11.2                 hb11dac2_2    conda-forge
>>> cudatoolkit               11.2.2              hc23eb0c_12    conda-forge

mamba list nnpops
>>> # Name                    Version                   Build  Channel
>>> nnpops                    0.6             cuda112py310h427b095_0    conda-forge

Code I am trying to run:

  1. Imports:

    import openmmtools
    import torch
    from NNPOps import OptimizedTorchANI
    from torchani.models import ANI2x
  2. Ensure CUDA is available

    print(torch.cuda.is_available())
    >>> True
    device = torch.device("cuda")
  3. Use alanine dipeptide as a test system

    
    # Get the system of alanine dipeptide
    ala2 = openmmtools.testsystems.AlanineDipeptideVacuum(constraints=None)

    # Remove MM forces
    while ala2.system.getNumForces() > 0:
        ala2.system.removeForce(0)

    # The system should not contain any additional forces or constraints
    assert ala2.system.getNumConstraints() == 0
    assert ala2.system.getNumForces() == 0

  4. Force and energy evaluation with `NNPOps OptimizedTorchANI`

    species = torch.tensor([[atom.element.atomic_number for atom in ala2.topology.atoms()]], device=device)
    positions = torch.tensor([ala2.positions.tolist()], dtype=torch.float32, requires_grad=True, device=device)

    # Alternatively, all the optimizations can be applied with OptimizedTorchANI
    nnp = ANI2x(periodic_table_index=True).to(device)
    nnp = OptimizedTorchANI(nnp, species).to(device)

    # Compute energy and forces again
    energy = nnp((species, positions)).energies
    positions.grad.zero_()
    energy.backward()
    forces = -positions.grad.clone()

    print(energy, forces)


## Full error stacktrace:

RuntimeError                              Traceback (most recent call last)
Cell In[5], line 9
      6 nnp = OptimizedTorchANI(nnp, species).to(device)
      8 # Compute energy and forces again
----> 9 energy = nnp((species, positions)).energies
     10 positions.grad.zero_()
     11 energy.backward()

File ~/mambaforge/envs/nnff_py310/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/mambaforge/envs/nnff_py310/lib/python3.10/site-packages/NNPOps/OptimizedTorchANI.py:52, in OptimizedTorchANI.forward(self, species_coordinates, cell, pbc)
     47 def forward(self, species_coordinates: Tuple[Tensor, Tensor],
     48             cell: Optional[Tensor] = None,
     49             pbc: Optional[Tensor] = None) -> SpeciesEnergies:
     51     species_coordinates = self.species_converter(species_coordinates)
---> 52     species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
     53     species_energies = self.neural_networks(species_aevs)
     54     species_energies = self.energy_shifter(species_energies)

File ~/mambaforge/envs/nnff_py310/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/mambaforge/envs/nnff_py310/lib/python3.10/site-packages/NNPOps/SymmetryFunctions.py:121, in TorchANISymmetryFunctions.forward(self, species_positions, cell, pbc)
    118     raise ValueError('Only fully periodic systems are supported, i.e. pbc = [True, True, True]')
    120 radial, angular = operation(self.holder, positions[0], cell)
--> 121 features = torch.cat((radial, angular), dim=1).unsqueeze(0)
    123 return species, features

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

JSLJ23 commented 11 months ago

I have also tried this in a separate environment with CUDA 11.8 and still ran into the same issue...

mamba list cuda
>>> # Name                    Version                   Build  Channel
>>> cuda-version              11.8                 h70ddcb2_2    conda-forge
>>> cudatoolkit               11.8.0              h4ba93d1_12    conda-forge
RaulPPelaez commented 11 months ago

What is the output of nvidia-smi? BTW, this is not the first time we have encountered this error; see the related discussion here: https://github.com/openmm/openmm-torch/pull/106. Not sure we ever really got closure on that...
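
As a quick cross-check (just a sketch using standard torch calls, nothing NNPOps-specific), you can also print the GPU's compute capability and the CUDA architectures your PyTorch build supports from Python:

import torch

# Print the GPU model, its compute capability (e.g. (7, 5) means sm_75),
# and the CUDA architectures the installed PyTorch build was compiled for.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
print(torch.cuda.get_arch_list())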

RaulPPelaez commented 11 months ago

For additional context, as far as I could gather, the offending kernel came from NNPOpsANISymmetryFunctions in that case. Perhaps we can check this by turning off the optimized symmetry functions here: https://github.com/openmm/NNPOps/blob/d15cb9196e283b6b55f88a93d85232458f64fa18/src/pytorch/OptimizedTorchANI.py#L43. The problem is that the error was rather elusive last time, so it was hard to debug.
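
For reference, the change being suggested amounts to swapping the optimized symmetry functions for the stock TorchANI AEV computer in the OptimizedTorchANI constructor; roughly:

# Sketch of the suggested one-line change in NNPOps/OptimizedTorchANI.py (__init__):
# instead of
#   self.aev_computer = TorchANISymmetryFunctions(model.species_converter, model.aev_computer, atomicNumbers)
# use the plain TorchANI AEV computer
self.aev_computer = model.aev_computer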

JSLJ23 commented 11 months ago
Tue Aug 15 14:32:02 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   45C    P0              26W /  90W |      6MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       950      G   /usr/lib/Xorg                                 4MiB |
+---------------------------------------------------------------------------------------+
RaulPPelaez commented 11 months ago

Your script runs without issue on a machine with a 4090, both using the nnpops 0.6 conda package and the manually compiled master.

However, running on a 2080Ti like yours reproduces the error.

Replacing the offending component as suggested (self.aev_computer = model.aev_computer) does indeed fix the error.

RaulPPelaez commented 11 months ago

It seems like the compiler is not considering the sm_75 arch. That, combined with some kernel in SymmetryFunctions that I guess requires arch-specific directives, produces the issue.

cuobjdump /home/raul/mambaforge/envs/nnpops06/lib/python3.11/site-packages/NNPOps/libNNPOpsPyTorch.so | grep sm_ | sort | uniq
arch = sm_35
arch = sm_50
arch = sm_80
arch = sm_86

Furthermore, the reason it works on the CI is that, by default, torch sets the architectures to the ones native to the system. The issue then lies in the conda feedstock not choosing the correct archs.

RaulPPelaez commented 11 months ago

The bug requires a new build of the feedstock. Hopefully https://github.com/conda-forge/nnpops-feedstock/pull/26 does the trick.

JSLJ23 commented 11 months ago
class NNP(torch.nn.Module):

  def __init__(self, atomic_numbers):

    super().__init__()
    # Store the atomic numbers
    self.atomic_numbers = torch.tensor(atomic_numbers, device=device).unsqueeze(0)
    # Create an ANI-2x model
    self.ani2x = ANI2x(periodic_table_index=True).to(device)
    # Accelerate the model
    self.model = OptimizedTorchANI(self.ani2x, self.atomic_numbers).to(device)
    # AEV computer fix
    self.aev_computer = self.ani2x.aev_computer
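    # Note: this only adds an attribute on this wrapper; self.model (the OptimizedTorchANI
    # built above) still uses its own TorchANISymmetryFunctions AEV computer internally.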

  def forward(self, positions):
    # Prepare the positions
    positions = positions.unsqueeze(0).float() * 10 # nm --> Å
    # Run ANI-2x
    result = self.model((self.atomic_numbers, positions))
    # Get the potential energy
    energy = result.energies[0] * 2625.5 # Hartree --> kJ/mol
    return energy

# Create an instance of the model
nnp = NNP(atomic_numbers)

I've tried modifying the NNP object to take the ANI2x model's AEV computer, but it still gives the CUDA error. May I know exactly how you are doing self.aev_computer = model.aev_computer?

RaulPPelaez commented 11 months ago

Sorry, I should have been clearer: I changed the definition in the constructor of OptimizedTorchANI.py directly. I think the moment you do ".to(device)" the error will pop up. I changed this line there:

self.aev_computer = TorchANISymmetryFunctions(model.species_converter, model.aev_computer, atomicNumbers)

to this:

self.aev_computer = model.aev_computer

You can get this fix right now by defining your own OptimizedTorchANI, which is a short class currently defined as:

import torch
from torch import Tensor
from typing import Optional, Tuple

from NNPOps.BatchedNN import TorchANIBatchedNN
from NNPOps.EnergyShifter import TorchANIEnergyShifter, SpeciesEnergies
from NNPOps.SpeciesConverter import TorchANISpeciesConverter
from NNPOps.SymmetryFunctions import TorchANISymmetryFunctions

class OptimizedTorchANI(torch.nn.Module):

    from torchani.models import BuiltinModel # https://github.com/openmm/NNPOps/issues/44

    def __init__(self, model: BuiltinModel, atomicNumbers: Tensor) -> None:

        super().__init__()

        # Optimize the components of an ANI model
        self.species_converter = TorchANISpeciesConverter(model.species_converter, atomicNumbers)
        self.aev_computer = TorchANISymmetryFunctions(model.species_converter, model.aev_computer, atomicNumbers)
        self.neural_networks = TorchANIBatchedNN(model.species_converter, model.neural_networks, atomicNumbers)
        self.energy_shifter = TorchANIEnergyShifter(model.species_converter, model.energy_shifter, atomicNumbers)

    def forward(self, species_coordinates: Tuple[Tensor, Tensor],
                cell: Optional[Tensor] = None,
                pbc: Optional[Tensor] = None) -> SpeciesEnergies:

        species_coordinates = self.species_converter(species_coordinates)
        species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
        species_energies = self.neural_networks(species_aevs)
        species_energies = self.energy_shifter(species_energies)

        return species_energies

If I am not missing something, you should be able to just copy paste this into your script:

import torch
from torch import Tensor
from typing import Optional, Tuple

from NNPOps.BatchedNN import TorchANIBatchedNN
from NNPOps.EnergyShifter import TorchANIEnergyShifter, SpeciesEnergies
from NNPOps.SpeciesConverter import TorchANISpeciesConverter
from NNPOps.SymmetryFunctions import TorchANISymmetryFunctions

class OptimizedTorchANI(torch.nn.Module):

    from torchani.models import BuiltinModel # https://github.com/openmm/NNPOps/issues/44

    def __init__(self, model: BuiltinModel, atomicNumbers: Tensor) -> None:

        super().__init__()

        # Optimize the components of an ANI model
        self.species_converter = TorchANISpeciesConverter(model.species_converter, atomicNumbers)
        self.aev_computer = model.aev_computer
        self.neural_networks = TorchANIBatchedNN(model.species_converter, model.neural_networks, atomicNumbers)
        self.energy_shifter = TorchANIEnergyShifter(model.species_converter, model.energy_shifter, atomicNumbers)

    def forward(self, species_coordinates: Tuple[Tensor, Tensor],
                cell: Optional[Tensor] = None,
                pbc: Optional[Tensor] = None) -> SpeciesEnergies:

        species_coordinates = self.species_converter(species_coordinates)
        species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
        species_energies = self.neural_networks(species_aevs)
        species_energies = self.energy_shifter(species_energies)

        return species_energies

and remove the NNPOps import (`from NNPOps import OptimizedTorchANI`).
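
With this class in place, usage stays the same as in your original script; a minimal sketch (assuming device, species and positions are defined as before, and ANI2x is imported from torchani.models):

nnp = ANI2x(periodic_table_index=True).to(device)
nnp = OptimizedTorchANI(nnp, species).to(device)  # the class defined above, not the NNPOps one

energy = nnp((species, positions)).energies
energy.backward()
forces = -positions.grad.clone()
print(energy, forces)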

JSLJ23 commented 11 months ago

Ok, I managed to get hold of an RTX 4090 system to test NNPOps and, like you mentioned, it works seamlessly there.

Wed Aug 16 17:40:57 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         On | 00000000:01:00.0 Off |                  Off |
|  0%   37C    P3               51W / 450W|   7269MiB / 24564MiB |     12%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2726      G   /usr/lib/xorg/Xorg                         1144MiB |
|    0   N/A  N/A      2867      G   /usr/bin/gnome-shell                        557MiB |
|    0   N/A  N/A      4362      G   ...sion,SpareRendererForSitePerProcess      193MiB |
|    0   N/A  N/A      6695      G   ...5017253,17452613814888696451,262144      392MiB |
|    0   N/A  N/A      8268      C   ...mbaforge/envs/nnff_py310/bin/python     2186MiB |
|    0   N/A  N/A      8864      C   ...mbaforge/envs/nnff_py310/bin/python     2470MiB |
+---------------------------------------------------------------------------------------+
JSLJ23 commented 11 months ago

However, I am noticing something very strange when running the custom OptimizedTorchANI (with self.aev_computer = model.aev_computer) versus the default OptimizedTorchANI imported from NNPOps: the custom one seems to cause the system to explode, with extremely high temperatures.

Custom OptimizedTorchANI

#"Step","Time (ps)","Potential Energy (kJ/mole)","Temperature (K)"
100,0.10000000000000007,-1299286.6337109958,4157.272221653436
200,0.20000000000000015,-1298719.9741632198,9269.906210046393
300,0.3000000000000002,-1297707.8114830707,11445.388596274423
400,0.4000000000000003,-1298399.7325442587,12468.574661346423
500,0.5000000000000003,-1298875.3397286986,12172.953122353025
600,0.6000000000000004,-1299012.1397098755,10125.893963085939
700,0.7000000000000005,-1299297.581734464,9565.762065101066
800,0.8000000000000006,-1298897.1822553729,6906.9651615600405
900,0.9000000000000007,-1299240.9848396038,7325.153176464823
1000,1.0000000000000007,-1299094.3014499997,5827.495844342631

Default OptimizedTorchANI

#"Step","Time (ps)","Potential Energy (kJ/mole)","Temperature (K)"
100,0.10000000000000007,-1301536.311651804,29.84507966482548
200,0.20000000000000015,-1301530.3516933029,51.569098455429895
300,0.3000000000000002,-1301530.2664442887,83.21092035049655
400,0.4000000000000003,-1301522.8738798206,104.69981264325087
500,0.5000000000000003,-1301516.4535609935,115.24051176337566
600,0.6000000000000004,-1301508.6040790235,157.23755800264192
700,0.7000000000000005,-1301503.9827139233,157.7469660590436
800,0.8000000000000006,-1301514.2675243174,205.74598795558802
900,0.9000000000000007,-1301497.1392407422,163.9890396519284
1000,1.0000000000000007,-1301499.3586102133,201.60413670511966

This is from the alanine dipeptide test system. Any idea why this might be happening? I've submitted this as an issue on the OpenMM-Torch repo as well.

RaulPPelaez commented 11 months ago

Well, my solution ignores the species converter; I guess one cannot do that so happily... I am confused though: do both the default and the custom versions explode, just in different ways?

Is this also the case if you use the original ANI2x? EDIT: I see the openmm-torch issue shows the same thing happening with the original ANI2x.

RaulPPelaez commented 11 months ago

You could do as the example in the readme does:

# Construct ANI-2x and replace its operations with the optimized ones
nnp = torchani.models.ANI2x(periodic_table_index=True).to(device)
nnp.species_converter = TorchANISpeciesConverter(nnp.species_converter, species).to(device)
#nnp.aev_computer = TorchANISymmetryFunctions(nnp.species_converter, nnp.aev_computer, species).to(device)
nnp.neural_networks = TorchANIBatchedNN(nnp.species_converter, nnp.neural_networks, species).to(device)
nnp.energy_shifter = TorchANIEnergyShifter(nnp.species_converter, nnp.energy_shifter, species).to(device)

This should not be necessary once the new build drops, hopefully later today. Then, if the original error is solved, we can move the discussion to the issue you opened in openmm-torch.

JSLJ23 commented 11 months ago

The second one, with the default OptimizedTorchANI, actually works okay: the trajectory of the dipeptide stays intact and the reported temperatures are in the ballpark of the ones reported in the Colab notebook example. It takes a while to warm up but does eventually reach 290 K or so. And yes, the same thing happens when I use the original ANI2x natively, without OptimizedTorchANI.

JSLJ23 commented 11 months ago

Yes, the original issue was solved, so I'll close this.

RaulPPelaez commented 11 months ago

The new build is out; you should be able to run on a 2080.

JSLJ23 commented 11 months ago

Ok, just downloaded and tested it, and it works perfectly! Thank you!