Closed sirmarcel closed 5 months ago
I checked and I managed to reproduce the bug
I’ll try to get this fixed within the next few hours, will keep you updated.
@sirmarcel Could you check that this PR resolves this particular issue for you?
Yep, I will check tomorrow. Thanks for the fast turnaround!
Okay, this is a weird one. If I directly import the calculator from sphericart
, this fix works.
In my own class, which wraps it, I get strange behaviour. My tests fail with stuff like:
double free or corruption (out)
Aborted (core dumped)
Then I tried to write a minimal example:
import torch
from torch.nn import Module
from sphericart.torch import SphericalHarmonics as Sph
class SphericalHarmonics(Module):
def __init__(self, max_angular):
"""Initialise SphericalHarmonics.
Args:
max_angular (int): The maximum spherical harmonic order ``l`` to compute.
"""
super().__init__()
self.max_angular = max_angular
self.m_per_l = [2 * l + 1 for l in range(max_angular + 1)]
self.sph = Sph(l_max=self.max_angular, normalized=True)
def forward(self, R):
"""Compute spherical harmonics for input vectors.
Args:
R (Tensor): Input vectors of shape ``[pair, 3]``.
Returns:
Spherical harmonics of shape
``[pair, sum([2*l+1 for i in range(max_angular+1)])]``.
"""
# R: [pair, 3 (x,y,z)]
# todo: make special case for "mps" backend (need to move to `cpu`)
return self.sph.compute(R) # -> [pair, (l=0 m=0) (l=1 m=-1) (l=1 m=0) ...]
og = Sph(3)
sph = SphericalHarmonics(3)
a = torch.rand(10, 3)
print(f"og out: {og.compute(a)}")
print(f"wrapped out: {sph(a)}")
This, fascinatingly, prints the correct output but then fails:
og out: ... (correct-looking)
wrapped out: ... (correct-looking)
CUDA error at /home/langer/software/sphericart/sphericart-torch/sphericart/src/sphericart_cuda.cu:84 - no CUDA-capable device is detected
I'd guess something weird happens if you store a reference to the calculator inside some other python construct? No idea.
EDIT:
This also happens on GPU (i.e. moving a
to cuda
above). I get errors looking like:
free(): double free detected in tcache 2
Aborted (core dumped)
OK I'll take another look at this tomorrow and do a deeper dive on whats happening. Thanks for the detailed bug report!
Thanks @nickjbrowning !
Hey @sirmarcel
I've just made another commit to #118 that should fix this issue. There was some constructor/destructor juggling that was happening previously since we instantiate the spherical harmonics (cuda) class with a default constructor, and the destructor was creating some oddities.
I've now changed this to be pointer-based which allows much cleaner memory management and should fix the double-free that was occurring.
I've ran your code sample and it works both for export CUDA_VISIBLE_DEVICES=0
and export CUDA_VISIBLE_DEVICES=
.
Hi @nickjbrowning , sorry for the delay -- I've got testing this on the todo list, but so far haven't had any luck getting an interactive node on our GPU cluster. I'll update you as soon as I get it built + tested. Thanks so much!
Hi -- built it and ran my test suite. Looks good. No crashes!
Hi! I've built
sphericart
with CUDA support (withpip install .[torch]
in a clone of this repo). Now, whenever I run tests on a compute node without GPU, the CPU fallback appears to not be working, and the program crashes withApparently,
sphericart
is trying to go through the GPU code path despite all involved tensors being on thecpu
device. This in in contrast with the docs, which state "Depending on the device the tensor is stored on, [...], the calculations will be performed [...] using the CPU or CUDA implementation."It'd be nice to fix this, as this makes it rather annoying to debug when a GPU is not readily available.
Cheers!