Open jychoi-hpc opened 4 months ago
@Looong01 Do you see similar issues when installing torch-scatter on ROCM? Do you know what might cause this?
I am trying to run pytorch_scatter with ROCm but keep getting a segfault. I installed the ROCm build of PyTorch (stable 2.2) with pip and then built pytorch_scatter from source on the master branch (the last commit is c095c62). However, I get a segfault with the following case:
import torch
from torch_scatter import scatter_min

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
src = torch.Tensor([[-2, 0, -1, -4, -3], [0, -2, -1, -3, -4]]).to(device)
index = torch.tensor([[4, 5, 4, 2, 3], [0, 0, 2, 2, 1]]).to(device)
out = src.new_zeros((2, 6)).to(device)
out, argmin = scatter_min(src, index, out=out)
print(out)
print(argmin)
It gave a core file and it shows the following traces:
#0 0x00007f8f0c02a22c in std::map<std::string, ReductionType, std::less<std::string>, std::allocator<std::pair<std::string const, ReductionType> > >::at(std::string const&) const () from /lustre/orion/cph161/world-shared/jyc/frontier/sw/anaconda3/2022.10/envs/py38-rocm571/lib/python3.8/site-packages/torch_scatter-2.1.2-py3.8-linux-x86_64.egg/torch_scatter/_scatter_cuda.so
#1 0x00007f8f0c01768b in scatter_cuda(at::Tensor, at::Tensor, long, std::optional<at::Tensor>, std::optional<long>, std::string) () from /lustre/orion/cph161/world-shared/jyc/frontier/sw/anaconda3/2022.10/envs/py38-rocm571/lib/python3.8/site-packages/torch_scatter-2.1.2-py3.8-linux-x86_64.egg/torch_scatter/_scatter_cuda.so
#2 0x00007f8f0c02f951 in scatter_fw (src=..., index=..., dim=1, optional_out=..., dim_size=..., reduce=...) at csrc/scatter_hip.cpp:42
#3 0x00007f8f0c044039 in ScatterMin::forward (ctx=ctx@entry=0x6ab3b08, src=..., index=..., dim=<optimized out>, optional_out=..., dim_size=...) at csrc/scatter_hip.cpp:175
#4 0x00007f8f0c044dca in torch::autograd::Function<ScatterMin>::apply<ScatterMin, at::Tensor&, at::Tensor&, long&, std::optional<at::Tensor>&, std::optional<long>&> () at /lustre/orion/world-shared/cph161/jyc/frontier/sw/anaconda3/2022.10/envs/py38-rocm571/lib/python3.8/site-packages/torch/include/torch/csrc/autograd/custom_function.h:305
#5 0x00007f8f0c0311b5 in scatter_min (src=..., index=..., dim=<optimized out>, optional_out=..., dim_size=...) at csrc/scatter_hip.cpp:261
I appreciate any advice in advance.
Sorry, I receive no errors when I run this code.
My devices are a Radeon RX 7900XTX and an RX6700XT, and the code runs smoothly on both of them.
My environment is Python 3.10 and ROCm 6.0.2.
This is my screenshot:
P.S. I also tested it on Python 3.8 and got no errors.
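Since the same script behaves differently across the setups in this thread, posting exact versions next to each result makes the comparison easier. The following is a minimal sketch, not part of torch_scatter: `environment_report` is a hypothetical helper name, and it assumes only that `torch.version.hip` and `torch_scatter.__version__` are present when those packages import (missing pieces are reported rather than raised).

```python
import importlib
import platform


def environment_report():
    """Collect the versions that matter for comparing ROCm setups."""
    report = {"python": platform.python_version()}

    try:
        torch = importlib.import_module("torch")
        report["torch"] = torch.__version__
        # torch.version.hip is None on non-ROCm builds.
        report["hip"] = getattr(torch.version, "hip", None)
        report["cuda"] = getattr(torch.version, "cuda", None)
    except Exception as exc:  # not installed, or failed to load
        report["torch"] = f"unavailable ({exc.__class__.__name__})"

    try:
        torch_scatter = importlib.import_module("torch_scatter")
        report["torch_scatter"] = getattr(torch_scatter, "__version__", "unknown")
    except Exception as exc:  # e.g. an OSError from a shared-library mismatch
        report["torch_scatter"] = f"unavailable ({exc.__class__.__name__})"

    return report


report = environment_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

Pasting this output together with the failing or passing result shows at a glance which PyTorch/ROCm combination each report refers to.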
$ python test.py
scatter_min: torch.Size([2, 5])
Segmentation fault (core dumped)
I have one question. I am trying to do some simple debugging. Based on the traces from the core dump, can you advise which source file I should look at and add debugging output to?
Actually, I cannot really understand or debug this kind of problem, because torch_scatter consists of CUDA and C++ code. In my experience, a core dump is the kind of error that can have almost any cause. I think it is definitely due to the CUDA/C++ module. So the only suggestions I can give you are: 1. try reinstalling a brand-new OS, or 2. use Docker to get a brand-new OS environment.
Thank you for the advice. Unfortunately, I cannot install a new OS. If I find any clue, I will post here.
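On the question of which source file to look at: the frames in the core-dump trace that carry file and line information can be pulled out mechanically. Below is a minimal sketch, not part of torch_scatter. `source_frames` is a hypothetical helper, and `SAMPLE` is the trace from the original report with the library path abbreviated as `...`.

```python
import re

# A gdb frame looks like: "#2 0x... in scatter_fw (args) at csrc/scatter_hip.cpp:42"
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-f]+\s+in\s+(.+?)\s*\(")
LOCATION_RE = re.compile(r"\bat\s+(\S+):(\d+)")


def source_frames(backtrace):
    """Return (frame, function, file, line) for frames that have debug info."""
    frames = []
    # Split in front of each frame marker so flattened one-line traces work too.
    for chunk in re.split(r"(?=#\d+\s+0x)", backtrace):
        head = FRAME_RE.match(chunk.strip())
        location = LOCATION_RE.search(chunk)
        if head and location:
            frames.append(
                (int(head.group(1)), head.group(2),
                 location.group(1), int(location.group(2)))
            )
    return frames


SAMPLE = (
    "#0 0x00007f8f0c02a22c in std::map<std::string, ReductionType, "
    "std::less<std::string>, std::allocator<std::pair<std::string const, "
    "ReductionType> > >::at(std::string const&) const () "
    "from .../torch_scatter/_scatter_cuda.so "
    "#2 0x00007f8f0c02f951 in scatter_fw (src=..., index=..., dim=1, "
    "optional_out=..., dim_size=..., reduce=...) at csrc/scatter_hip.cpp:42 "
    "#3 0x00007f8f0c044039 in ScatterMin::forward (ctx=ctx@entry=0x6ab3b08, "
    "src=..., index=..., dim=<optimized out>, optional_out=..., dim_size=...) "
    "at csrc/scatter_hip.cpp:175 "
    "#5 0x00007f8f0c0311b5 in scatter_min (src=..., index=..., "
    "dim=<optimized out>, optional_out=..., dim_size=...) "
    "at csrc/scatter_hip.cpp:261"
)

frames = source_frames(SAMPLE)
for number, function, path, line in frames:
    print(f"#{number} {function} -> {path}:{line}")
```

For this trace the located frames all point into csrc/scatter_hip.cpp (lines 42, 175 and 261), and the innermost frame is a `std::map<std::string, ReductionType>::at` lookup, so the `reduce` string handed from `scatter_fw` to `scatter_cuda` looks like a reasonable first thing to print. Running the reproducer with `python -X faulthandler test.py` additionally prints the Python-side stack when the segfault hits.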
I am trying something similar with ROCm 6. I am getting errors for scatter_min and scatter_max, but scatter_mean and scatter_sum work fine!
I installed PyTorch for ROCm 6 as follows:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0
>>> from torch_scatter import scatter_min
>>> scatter_min(src, index)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch_scatter/scatter.py", line 65, in scatter_min
    return torch.ops.torch_scatter.scatter_min(src, index, dim, out, dim_size)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch/_ops.py", line 825, in __call__
    return self_._op(*args, **(kwargs or {}))
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 699051.85 GiB. GPU
>>> from torch_scatter import scatter_max
>>> scatter_max(src, index)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch_scatter/scatter.py", line 72, in scatter_max
    return torch.ops.torch_scatter.scatter_max(src, index, dim, out, dim_size)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch/_ops.py", line 825, in __call__
    return self_._op(*args, **(kwargs or {}))
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 699051.66 GiB. GPU
>>> from torch_scatter import scatter_sum
>>> out, argmin = scatter_sum(src, index)
>>> scatter_sum(src, index)
tensor([[ 0., 0., -4., -3., -3., 0.],
[-2., -4., -4., 0., 0., 0.]], device='cuda:0')
>>> from torch_scatter import scatter_mean
>>> scatter_mean(src, index)
tensor([[ 0.0000, 0.0000, -4.0000, -3.0000, -1.5000, 0.0000],
[-1.0000, -4.0000, -2.0000, 0.0000, 0.0000, 0.0000]],
device='cuda:0')
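For comparison with the sum and mean results above, here is a plain-Python reference for the values a scatter-min along the last dimension should produce on the inputs from the original report. It is a sketch under stated assumptions: `scatter_min_reference` is a hypothetical helper, it starts from a zero-initialised `out` as in the original script, and it models the output values only, not the argmin tensor.

```python
def scatter_min_reference(src, index, out):
    """Row-wise scatter-min along the last dimension on nested lists.

    Values only: the argmin output of torch_scatter is not modelled here.
    """
    result = [row[:] for row in out]  # leave the caller's `out` untouched
    for r, (src_row, index_row) in enumerate(zip(src, index)):
        for value, target in zip(src_row, index_row):
            result[r][target] = min(result[r][target], value)
    return result


# Inputs copied from the original report.
src = [[-2, 0, -1, -4, -3], [0, -2, -1, -3, -4]]
index = [[4, 5, 4, 2, 3], [0, 0, 2, 2, 1]]
out = [[0] * 6 for _ in range(2)]

expected = scatter_min_reference(src, index, out)
print(expected)
# [[0, 0, -4, -3, -2, 0], [-2, -4, -3, 0, 0, 0]]
```

A result tensor of this size is tiny, so a request for roughly 699051 GiB suggests that a garbage size is being computed somewhere on the min/max path rather than a genuine memory shortage.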
I must note that this issue does not occur when I use PT 2.0.1+rocm5.3. It only shows up with PT 2.2+ROCm 5.7 or 2.3.0.dev20240219+rocm6.0.
Did you try the wheels I compiled?
Yes, I did. I only tried to import torch_scatter, but it gave me the GLIBC error below:
>>> import torch
>>> from torch_scatter import scatter_min
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/site-packages/torch_scatter/__init__.py", line 16, in <module>
    torch.ops.load_library(spec.origin)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/site-packages/torch/_ops.py", line 933, in load_library
    ctypes.CDLL(path)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /lustre/orion/ven114/proj-shared/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/site-packages/torch_scatter/_version_cuda.so)
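The OSError above says the prebuilt library needs the `GLIBC_2.32` symbol version, which the system's `/lib64/libc.so.6` does not provide. Whether a given machine meets that requirement can be checked from Python before installing a wheel. This is a minimal sketch using only the standard library; `parse_version` and `glibc_satisfies` are hypothetical helper names.

```python
import platform


def parse_version(text):
    """Turn '2.32' into (2, 32) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def glibc_satisfies(required, found=None):
    """Return True if the glibc version `found` is at least `required`.

    When `found` is omitted, the running interpreter's libc is inspected.
    """
    if found is None:
        lib, found = platform.libc_ver()
        if lib != "glibc" or not found:
            return False  # not glibc, or the version could not be detected
    return parse_version(found) >= parse_version(required)


print("system libc:", platform.libc_ver())
print("meets GLIBC_2.32:", glibc_satisfies("2.32"))
```

If this prints `False`, a wheel built against a newer glibc will fail to load in the same way, and building torch_scatter from source on the target machine avoids that particular mismatch.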
Well, you need to update your glibc version. Maybe g++-12. See the other discussion in this repo, where I explain how to deal with this problem.
The GCC/G++ version is indeed 12
> gcc --version
gcc (GCC) 12.2.0 20220819 (HPE)
The OS is SUSE Linux Enterprise Server 15 SP4
On which OS have you built your wheels?
@Looong01 unfortunately, I do not have root access and do not have the privileges to upgrade the OS on this cluster. Can you suggest any alternatives?