rusty1s / pytorch_scatter

PyTorch Extension Library of Optimized Scatter Operations
https://pytorch-scatter.readthedocs.io
MIT License

RuntimeError: CUDA error: an illegal memory access was encountered (for max and softmax) #431

Closed · thoglu closed this issue 2 months ago

thoglu commented 3 months ago

Hi, I am getting an error for max and softmax:

RuntimeError: CUDA error: an illegal memory access was encountered

System: PyTorch 2.2.0, torch_scatter 2.1.2, CUDA 12.1

The following code reproduces the error:

```python
from torch_scatter import scatter_max, scatter_softmax
import torch

used_device = "cuda:1"

vals = torch.randn(20).to(device=used_device)
index = torch.arange(20).to(device=used_device)

res = scatter_softmax(vals, index)
```

EDIT: The script works with "cuda:0", but not with "cuda:n" where n > 0.
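For reference, here is a minimal sketch (my addition, not from the original report) that runs the same repro on every visible GPU of a multi-GPU machine, to surface which device indices trigger the illegal memory access:

```python
# Hypothetical helper: loop over all visible GPUs and report which ones fail.
# Assumes a machine with more than one CUDA device. Note that once an illegal
# memory access occurs, the CUDA context may be left unusable, so for clean
# results it is safer to test one device per process.
import torch
from torch_scatter import scatter_softmax

for i in range(torch.cuda.device_count()):
    device = f"cuda:{i}"
    vals = torch.randn(20, device=device)
    index = torch.arange(20, device=device)
    try:
        scatter_softmax(vals, index)
        torch.cuda.synchronize(device)  # force the kernel to actually execute
        print(f"{device}: OK")
    except RuntimeError as err:
        print(f"{device}: FAILED ({err})")
```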

rusty1s commented 3 months ago

Works fine for me. Not totally sure what goes wrong. How did you install torch-scatter?

thoglu commented 3 months ago

Yeah, I am guessing there is some incompatibility, although I am getting this error on two different systems with different GPUs, and independently of whether I use PyTorch 2.2.1 or 2.2.0.

I installed it using `pip install torch-scatter -f https://data.pyg.org/whl/torch-2.2.0+cu121.html`.

It does not depend on pytorch-geometric or any of the other sparse libraries, right?

rusty1s commented 3 months ago

> It does not depend on pytorch-geometric or any of the other sparse libraries, right?

Yeah, it does not.

What's your OS and what does nvidia-smi return?

thoglu commented 3 months ago

OK, I think I found the issue.

On both systems (one is Ubuntu, the other RHEL 7) there are different GPUs. The script above is not exactly what I ran: I actually used "cuda:1", "cuda:2", etc., i.e. not the first GPU on the system (I have edited the original post accordingly). With "cuda:0" the script works on both systems (taking the first GPU), but any GPU other than the first fails. Any idea why that would be?

nvidia-smi (on one of the machines) reports: NVIDIA-SMI 530.30.02, Driver Version: 530.30.02, CUDA Version: 12.1

rusty1s commented 3 months ago

I am not totally sure. Last time I checked, torch-scatter worked fine on multiple GPUs. I would need to grab a multi-GPU system again to confirm.

thoglu commented 3 months ago

Apparently it also does not affect scatter_mean etc.; it is mostly max, and thereby other functions like logsumexp (which uses max internally).
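As a rough way to narrow this down, here is a sketch (my addition, assuming a second GPU "cuda:1" exists) that runs several reductions on the same non-default device; since scatter_softmax and scatter_logsumexp reduce via max internally, they would be expected to fail together with scatter_max:

```python
# Hypothetical comparison of reductions on a non-default GPU (assumes "cuda:1").
# Once an illegal memory access occurs the CUDA context may be corrupted, so
# ideally run each function in a fresh process.
import torch
from torch_scatter import scatter_mean, scatter_max, scatter_softmax, scatter_logsumexp

device = "cuda:1"
vals = torch.randn(20, device=device)
index = torch.arange(20, device=device)

for name, fn in [
    ("scatter_mean", lambda: scatter_mean(vals, index)),
    ("scatter_max", lambda: scatter_max(vals, index)),
    ("scatter_softmax", lambda: scatter_softmax(vals, index)),
    ("scatter_logsumexp", lambda: scatter_logsumexp(vals, index)),
]:
    try:
        fn()
        torch.cuda.synchronize(device)
        print(f"{name}: OK")
    except RuntimeError as err:
        print(f"{name}: FAILED ({err})")
```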

rusty1s commented 2 months ago

I tried to reproduce this on a multi-GPU instance but failed to do so. It works fine on my end on PyTorch 2.2.0. Can you do me a favor and try to install from source to see if this issue is still present?

```bash
pip uninstall torch-scatter
pip install torch-scatter
```

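After the reinstall, a quick sanity check (my addition, not part of the original instructions) to confirm which build is actually being imported before rerunning the repro:

```python
# Confirm the PyTorch / CUDA / torch-scatter versions that are actually in use.
import torch
import torch_scatter

print(torch.__version__, torch.version.cuda)  # PyTorch version and its CUDA toolkit
print(torch_scatter.__version__)              # torch-scatter build that got installed
```
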
rusty1s commented 2 months ago

#9173 suggests that it only occurs on CUDA 12. Let me try to reproduce with this setting as well.

rusty1s commented 2 months ago

Seems to be related to https://github.com/pytorch/pytorch/blob/e3ac61587aa368c613ef01df1f328a396b64cd5d/c10/cuda/CUDAFunctions.cpp#L193.

rusty1s commented 2 months ago

Fix is here: https://github.com/rusty1s/pytorch_scatter/pull/436
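Until that fix lands in a release, a possible interim workaround (my own suggestion, based on the assumption that the failure comes from the kernels launching against the wrong current device; untested) is to make the target GPU the current device before calling into torch-scatter:

```python
# Hypothetical workaround: pin the current CUDA device to the tensors' device
# before calling torch-scatter, so its kernels launch on the intended GPU.
# This assumes the root cause is a missing device guard (see PR #436).
import torch
from torch_scatter import scatter_softmax

device = torch.device("cuda:1")
vals = torch.randn(20, device=device)
index = torch.arange(20, device=device)

with torch.cuda.device(device):  # temporarily switch the current device
    res = scatter_softmax(vals, index)
```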