rusty1s / pytorch_scatter

PyTorch Extension Library of Optimized Scatter Operations
https://pytorch-scatter.readthedocs.io
MIT License

Segfault in scatter_min with ROCm #420

Open jychoi-hpc opened 4 months ago

jychoi-hpc commented 4 months ago

I am trying to run pytorch_scatter with ROCm but keep getting a segfault. I installed the PyTorch ROCm version (stable 2.2) with pip and then built pytorch_scatter from source on the master branch (the last commit is c095c62). However, I get a segfault with the following case:

import torch
from torch_scatter import scatter_min

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

src = torch.Tensor([[-2, 0, -1, -4, -3], [0, -2, -1, -3, -4]]).to(device)
index = torch.tensor([[ 4, 5,  4,  2,  3], [0,  0,  2,  2,  1]]).to(device)
out = src.new_zeros((2, 6)).to(device)

out, argmin = scatter_min(src, index, out=out)

print(out)
print(argmin)

It produced a core file, which shows the following traces:

#0  0x00007f8f0c02a22c in std::map<std::string, ReductionType, std::less<std::string>, std::allocator<std::pair<std::string const, ReductionType> > >::at(std::string const&) const ()
   from /lustre/orion/cph161/world-shared/jyc/frontier/sw/anaconda3/2022.10/envs/py38-rocm571/lib/python3.8/site-packages/torch_scatter-2.1.2-py3.8-linux-x86_64.egg/torch_scatter/_scatter_cuda.so
#1  0x00007f8f0c01768b in scatter_cuda(at::Tensor, at::Tensor, long, std::optional<at::Tensor>, std::optional<long>, std::string) ()
   from /lustre/orion/cph161/world-shared/jyc/frontier/sw/anaconda3/2022.10/envs/py38-rocm571/lib/python3.8/site-packages/torch_scatter-2.1.2-py3.8-linux-x86_64.egg/torch_scatter/_scatter_cuda.so
#2  0x00007f8f0c02f951 in scatter_fw (src=..., index=..., dim=1, optional_out=..., dim_size=..., reduce=...) at csrc/scatter_hip.cpp:42
#3  0x00007f8f0c044039 in ScatterMin::forward (ctx=ctx@entry=0x6ab3b08, src=..., index=..., dim=<optimized out>, optional_out=..., dim_size=...) at csrc/scatter_hip.cpp:175
#4  0x00007f8f0c044dca in torch::autograd::Function<ScatterMin>::apply<ScatterMin, at::Tensor&, at::Tensor&, long&, std::optional<at::Tensor>&, std::optional<long>&> ()
    at /lustre/orion/world-shared/cph161/jyc/frontier/sw/anaconda3/2022.10/envs/py38-rocm571/lib/python3.8/site-packages/torch/include/torch/csrc/autograd/custom_function.h:305
#5  0x00007f8f0c0311b5 in scatter_min (src=..., index=..., dim=<optimized out>, optional_out=..., dim_size=...) at csrc/scatter_hip.cpp:261

I appreciate any advice in advance.
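
(A minimal variation of the reproducer above, offered as a sketch rather than part of the original report: it runs the identical call on the CPU first, which can help confirm that only the HIP path is affected.)

import torch
from torch_scatter import scatter_min

src = torch.tensor([[-2., 0., -1., -4., -3.], [0., -2., -1., -3., -4.]])
index = torch.tensor([[4, 5, 4, 2, 3], [0, 0, 2, 2, 1]])

# CPU path: goes through the same Python wrapper but not the HIP extension.
out_cpu, argmin_cpu = scatter_min(src, index, out=src.new_zeros(2, 6))
print("cpu:", out_cpu, argmin_cpu)

# ROCm path: torch.cuda.is_available() is also True on ROCm builds.
if torch.cuda.is_available():
    dev = torch.device("cuda")
    out_gpu, argmin_gpu = scatter_min(src.to(dev), index.to(dev),
                                      out=src.new_zeros(2, 6).to(dev))
    print("gpu:", out_gpu.cpu(), argmin_gpu.cpu())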

rusty1s commented 4 months ago

@Looong01 Do you see similar issues when installing torch-scatter on ROCm? Do you know what might cause this?

Looong01 commented 4 months ago

> I am trying to run pytorch_scatter with ROCm but keep getting a segfault. [...]

Sorry, I receive no errors when I run this code. My devices are a Radeon RX 7900 XTX and an RX 6700 XT, and the code runs smoothly on both of them. My env is Python 3.10 and ROCm 6.0.2. This is my screenshot: [image attached]

Looong01 commented 4 months ago
  1. What is the error it showed in bash?
  2. What is the type of your GPU?
  3. Maybe you could update your ROCm version and then test it again.

P.S. I also tested it on Python 3.8 and got no errors.

jychoi-hpc commented 4 months ago
  1. It's just a segfault.
    $ python test.py 
    scatter_min: torch.Size([2, 5])
    Segmentation fault (core dumped)
  2. AMD MI250X (gfx90a)
  3. Will try. Thank you for the advice.

jychoi-hpc commented 4 months ago

I have one question. I am trying to do some simple debugging. Can you give me some advice on which source file I should look at and where to add some debugging output, based on the traces from the core dump?

Looong01 commented 4 months ago

> I have one question. I am trying to do some simple debugging. Can you give me some advice on which source file I should look at and where to add some debugging output, based on the traces from the core dump?

Actually, I cannot really debug this kind of problem, because torch_scatter consists of CUDA & C++ code, and a "core dump" is usually the kind of error where almost anything could be the cause (in my experience). I think it is definitely due to the CUDA/C++ module. So the only suggestions I can give you are: 1. try to reinstall a brand new OS, or 2. use Docker to get a brand new OS environment.
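
(As an aside, a sketch rather than something from the thread: one Python-level narrowing step that needs no C++ changes is to call the registered op directly, with every argument spelled out, exactly as the torch_scatter/scatter.py wrapper does. If that still segfaults, the problem lives inside the extension's scatter_min op rather than in the Python wrapper.)

import torch
from torch_scatter import scatter_min  # importing loads the torch_scatter op library

device = torch.device("cuda")
src = torch.tensor([[-2., 0., -1., -4., -3.], [0., -2., -1., -3., -4.]], device=device)
index = torch.tensor([[4, 5, 4, 2, 3], [0, 0, 2, 2, 1]], device=device)
out = src.new_zeros(2, 6)

# Same call the Python wrapper forwards to (dim=-1, optional dim_size left as None).
result, argmin = torch.ops.torch_scatter.scatter_min(src, index, -1, out, None)
print(result.cpu())
print(argmin.cpu())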

jychoi-hpc commented 4 months ago

Thank you for the advice. Unfortunately, I cannot install a new OS. If I find any clue, I will post it here.

ashwinma commented 4 months ago

I am trying something similar with ROCm 6. I am getting errors for scatter_min and scatter_max -- but scatter_mean and scatter_sum work fine!

I installed PT for ROCm 6 like below

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0
>>> from torch_scatter import scatter_min
>>> scatter_min(src, index)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch_scatter/scatter.py", line 65, in scatter_min
    return torch.ops.torch_scatter.scatter_min(src, index, dim, out, dim_size)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch/_ops.py", line 825, in __call__
    return self_._op(*args, **(kwargs or {}))
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 699051.85 GiB. GPU
>>> from torch_scatter import scatter_max
>>> scatter_max(src, index)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch_scatter/scatter.py", line 72, in scatter_max
    return torch.ops.torch_scatter.scatter_max(src, index, dim, out, dim_size)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch/_ops.py", line 825, in __call__
    return self_._op(*args, **(kwargs or {}))
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 699051.66 GiB. GPU
>>> from torch_scatter import scatter_sum
>>> out, argmin = scatter_sum(src, index)
>>> scatter_sum(src, index)
tensor([[ 0.,  0., -4., -3., -3.,  0.],
        [-2., -4., -4.,  0.,  0.,  0.]], device='cuda:0')
>>> from torch_scatter import scatter_mean
>>> scatter_mean(src, index)
tensor([[ 0.0000,  0.0000, -4.0000, -3.0000, -1.5000,  0.0000],
        [-1.0000, -4.0000, -2.0000,  0.0000,  0.0000,  0.0000]],
       device='cuda:0')
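
(As a cross-check, a sketch rather than something tried in the thread: PyTorch's built-in scatter_reduce with reduce="amin" performs a min reduction without going through the torch_scatter extension, so comparing the two can indicate whether the fault sits in the extension's HIP build or lower in the ROCm stack.)

import torch

src = torch.tensor([[-2., 0., -1., -4., -3.], [0., -2., -1., -3., -4.]], device="cuda")
index = torch.tensor([[4, 5, 4, 2, 3], [0, 0, 2, 2, 1]], device="cuda")
out = torch.zeros(2, 6, device="cuda")

# Built-in min reduction along dim=1; include_self=True keeps the zeros in
# `out` in the reduction, similar to passing a pre-filled `out` to scatter_min.
ref = out.scatter_reduce(1, index, src, reduce="amin", include_self=True)
print(ref.cpu())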

ashwinma commented 4 months ago

I must note that this issue is not present when I use PT2.0.1+rocm5.3. It only manifests when we try PT2.2+ROCm5.7 or 2.3.0.dev20240219+rocm6.0.
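
(Since the failure seems to track the torch/ROCm combination, a short sketch like the following prints the relevant versions in one place when comparing setups; torch.version.hip is populated on ROCm builds.)

import torch
import torch_scatter

print("torch        :", torch.__version__)
print("hip          :", torch.version.hip)  # None on CUDA-only builds
print("torch_scatter:", torch_scatter.__version__)
if torch.cuda.is_available():
    print("device       :", torch.cuda.get_device_name(0))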

Looong01 commented 4 months ago

> I must note that this issue is not present when I use PT2.0.1+rocm5.3. It only manifests when we try PT2.2+ROCm5.7 or 2.3.0.dev20240219+rocm6.0.

Did you try the wheels I compiled?

ashwinma commented 4 months ago

> I must note that this issue is not present when I use PT2.0.1+rocm5.3. It only manifests when we try PT2.2+ROCm5.7 or 2.3.0.dev20240219+rocm6.0.

> Did you try the wheels I compiled?

Yes, I did. I tried to just import torch_scatter, but it gave me the GLIBC error below:

>>> import torch
>>> from torch_scatter import scatter_min
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/site-packages/torch_scatter/__init__.py", line 16, in <module>
    torch.ops.load_library(spec.origin)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/site-packages/torch/_ops.py", line 933, in load_library
    ctypes.CDLL(path)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /lustre/orion/ven114/proj-shared/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/site-packages/torch_scatter/_version_cuda.so)

Looong01 commented 4 months ago

Well, you need to update your glibc version, maybe with g++-12. You can see another discussion in this repo where I answered how to deal with this problem.
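
(One way to check which glibc the interpreter is actually linked against, without root, is a small ctypes sketch; gnu_get_libc_version is a standard glibc entry point.)

import ctypes
import ctypes.util

# Print the glibc version of the running process, to compare against the
# GLIBC_2.32 requirement in the OSError above.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.gnu_get_libc_version.restype = ctypes.c_char_p
print(libc.gnu_get_libc_version().decode())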

ashwinma commented 4 months ago

The GCC/G++ version is indeed 12

> gcc --version
gcc (GCC) 12.2.0 20220819 (HPE)

The OS is SUSE Linux Enterprise Server 15 SP4

On which OS have you built your wheels?

Looong01 commented 4 months ago

> On which OS have you built your wheels?

https://github.com/Looong01/pyg-rocm-build/issues/3

ashwinma commented 4 months ago

@Looong01 unfortunately, I do not have root access and do not have the privileges to upgrade the OS on this cluster. Can you suggest any alternatives?