lab-cosmo / sphericart

Multi-language library for the calculation of spherical harmonics in Cartesian coordinates
https://sphericart.readthedocs.io/en/latest/
MIT License

CPU support with CUDA build #114

Closed sirmarcel closed 5 months ago

sirmarcel commented 7 months ago

Hi! I've built sphericart with CUDA support (with pip install .[torch] in a clone of this repo). Now, whenever I run tests on a compute node without GPU, the CPU fallback appears to not be working, and the program crashes with

CUDA error at /home/langer/software/sphericart/sphericart-torch/sphericart/src/sphericart_cuda.cu:42 - no CUDA-capable device is detected

Apparently, sphericart is trying to go through the GPU code path despite all involved tensors being on the cpu device. This is in contrast with the docs, which state "Depending on the device the tensor is stored on, [...], the calculations will be performed [...] using the CPU or CUDA implementation."

It'd be nice to fix this, as this makes it rather annoying to debug when a GPU is not readily available.

Cheers!

frostedoyster commented 7 months ago

I checked and managed to reproduce the bug.

nickjbrowning commented 7 months ago

I’ll try to get this fixed within the next few hours, will keep you updated.

nickjbrowning commented 7 months ago

@sirmarcel Could you check that this PR resolves this particular issue for you?

https://github.com/lab-cosmo/sphericart/pull/118

sirmarcel commented 7 months ago

Yep, I will check tomorrow. Thanks for the fast turnaround!

sirmarcel commented 7 months ago

Okay, this is a weird one. If I directly import the calculator from sphericart, this fix works.

In my own class, which wraps it, I get strange behaviour. My tests fail with stuff like:

double free or corruption (out)
Aborted (core dumped)

Then I tried to write a minimal example:

import torch
from torch.nn import Module

from sphericart.torch import SphericalHarmonics as Sph

class SphericalHarmonics(Module):

    def __init__(self, max_angular):
        """Initialise SphericalHarmonics.

        Args:
            max_angular (int): The maximum spherical harmonic order ``l`` to compute.
        """
        super().__init__()

        self.max_angular = max_angular

        self.m_per_l = [2 * l + 1 for l in range(max_angular + 1)]

        self.sph = Sph(l_max=self.max_angular, normalized=True)

    def forward(self, R):
        """Compute spherical harmonics for input vectors.

        Args:
            R (Tensor): Input vectors of shape ``[pair, 3]``.

        Returns:
            Spherical harmonics of shape
                ``[pair, sum([2*l+1 for l in range(max_angular+1)])]``.

        """
        # R: [pair, 3 (x,y,z)]
        # todo: make special case for "mps" backend (need to move to `cpu`)
        return self.sph.compute(R)  # -> [pair, (l=0 m=0) (l=1 m=-1) (l=1 m=0) ...]

og = Sph(3)

sph = SphericalHarmonics(3)

a = torch.rand(10, 3)

print(f"og out: {og.compute(a)}")
print(f"wrapped out: {sph(a)}")
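(Side note: the output width in the docstring, the sum of 2*l+1 over l up to max_angular, simplifies to (max_angular+1)**2, so for max_angular=3 both calculators return 16 components per vector. A quick check:)

```python
# The number of spherical harmonic components up to max_angular:
# sum over l of (2*l + 1) telescopes to (max_angular + 1)**2.
def n_harmonics(max_angular):
    return sum(2 * l + 1 for l in range(max_angular + 1))

print(n_harmonics(3))       # 16
print((3 + 1) ** 2)         # 16
```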

This, fascinatingly, prints the correct output but then fails:

og out: ... (correct-looking)
wrapped out: ... (correct-looking)
CUDA error at /home/langer/software/sphericart/sphericart-torch/sphericart/src/sphericart_cuda.cu:84 - no CUDA-capable device is detected

I'd guess something weird happens if you store a reference to the calculator inside some other Python construct? No idea.

EDIT:

This also happens on GPU (i.e. moving a to cuda above). I get errors looking like:

free(): double free detected in tcache 2
Aborted (core dumped)

nickjbrowning commented 7 months ago

OK, I'll take another look at this tomorrow and do a deeper dive on what's happening. Thanks for the detailed bug report!

sirmarcel commented 7 months ago

Thanks @nickjbrowning !

nickjbrowning commented 7 months ago

Hey @sirmarcel

I've just made another commit to #118 that should fix this issue. There was some constructor/destructor juggling going on: we instantiate the spherical harmonics (CUDA) class with a default constructor, and the destructor was creating some oddities.

I've now changed this to be pointer-based which allows much cleaner memory management and should fix the double-free that was occurring.
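(A rough Python analogy of the double-free described above, not the actual C++ code: if two owner objects end up responsible for the same underlying resource, each will release it on destruction, and the second release is the "double free".)

```python
# Illustrative analogy only: two owners sharing one raw resource each
# "free" it in their cleanup, so the resource is released twice.
freed = []

class Resource:
    def __init__(self, rid):
        self.rid = rid

class Owner:
    def __init__(self, res):
        self.res = res

    def close(self):
        # A real destructor freeing an already-freed resource would crash;
        # here we just record each release to make the double free visible.
        freed.append(self.res.rid)

res = Resource(1)
a = Owner(res)
b = Owner(res)   # accidental second owner of the same resource
a.close()
b.close()
print(freed)     # [1, 1] -> the same resource was released twice
```

Holding the resource behind a single owning pointer (as the commit does on the C++ side) makes ownership unambiguous and avoids this.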

I've run your code sample and it works both for export CUDA_VISIBLE_DEVICES=0 and export CUDA_VISIBLE_DEVICES=.

sirmarcel commented 6 months ago

Hi @nickjbrowning , sorry for the delay -- I've got testing this on the todo list, but so far haven't had any luck getting an interactive node on our GPU cluster. I'll update you as soon as I get it built + tested. Thanks so much!

sirmarcel commented 6 months ago

Hi -- built it and ran my test suite. Looks good. No crashes!