pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

It works on cpu but not on cuda #1376

Closed. hyhy01 closed this issue 2 years ago

hyhy01 commented 4 years ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

import torch
import torch.nn as nn
import torch_geometric.nn as pyg_nn


class GraphCNN(nn.Module):
    def __init__(self):
        super(GraphCNN, self).__init__()
        self.conv1 = pyg_nn.GraphUNet(in_channels=4, hidden_channels=2, out_channels=2, depth=2)

    def forward(self):
        # A single node with four features and one self-loop edge.
        x = torch.tensor([[48496., 1256., 1404., 10245.]], device='cuda:0', dtype=torch.float)  # [N, F]
        edge_index = torch.tensor([[0],
                                   [0]], device='cuda:0', dtype=torch.int64)  # [2, E]
        print(x)
        out = self.conv1(x=x, edge_index=edge_index)  # [N, D]
        return out

def main():
    my_net = GraphCNN()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    #device = "cpu"
    my_net = my_net.to(device)
    my_net.train()
    for epoch in range(500):
        output = my_net()

if __name__ == '__main__':
    main()

Error:

tensor([[48496.,  1256.,  1404., 10245.]], device='cuda:0')
tensor([[48496.,  1256.,  1404., 10245.]], device='cuda:0')
tensor([[48496.,  1256.,  1404., 10245.]], device='cuda:0')
/opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/home/hy/py/mycode/unet.py", line 38, in <module>
    main()
  File "/home/hy/py/mycode/unet.py", line 35, in main
    output = my_net()
  File "/home/hy/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hy/py/mycode/unet.py", line 24, in forward
    out = self.conv1(x=x, edge_index=edge_index)  # [N, D]
  File "/home/hy/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hy/anaconda3/envs/py36/lib/python3.6/site-packages/torch_geometric/nn/models/graph_unet.py", line 85, in forward
    x, edge_index, edge_weight, batch)
  File "/home/hy/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hy/anaconda3/envs/py36/lib/python3.6/site-packages/torch_geometric/nn/pool/topk_pool.py", line 158, in forward
    num_nodes=score.size(0))
  File "/home/hy/anaconda3/envs/py36/lib/python3.6/site-packages/torch_geometric/nn/pool/topk_pool.py", line 60, in filter_adj
    row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered

Process finished with exit code 1
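
A general debugging aid for device-side asserts like the one above (a sketch, not part of the original report): forcing synchronous kernel launches makes the traceback point at the operation that actually triggered the assert.

# Sketch: set this before torch initializes CUDA so kernel launches become
# synchronous and the failing op is reported at its real call site.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import afterwards; the rest of the script stays unchanged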

Expected behavior

Same behavior between CPU and CUDA.

Environment

Additional context

mattiasmar commented 3 years ago

Any insights into this problem?

rusty1s commented 3 years ago

Sorry, I missed this issue. This error is usually related to how cuSPARSE handles sparse-sparse matrix multiplication. In particular, sparse-sparse matrix multiplication does not work if edge_index holds duplicate entries.
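
For reference, a minimal way to test whether edge_index holds duplicate entries (a sketch for illustration; has_duplicate_edges is not a library function, and num_nodes is assumed known, e.g. data.num_nodes):

import torch

def has_duplicate_edges(edge_index: torch.Tensor, num_nodes: int) -> bool:
    # Encode each (row, col) pair as a single integer and compare counts.
    key = edge_index[0] * num_nodes + edge_index[1]
    return int(key.numel()) != int(torch.unique(key).numel())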

mattiasmar commented 3 years ago

I tried to apply coalesce() to the edge_index:

edge_index.coalesce()
y = self.gunet(x,edge_index)

but it gave the error: RuntimeError: Could not run 'aten::coalesce' with arguments from the 'CUDA' backend. 'aten::coalesce' is only available for these backends: [SparseCPU, SparseCUDA, Autograd, Profiler, Tracer].

Any suggestions on how to test for and remove duplicate entries?

rusty1s commented 3 years ago

You need to use the coalesce function provided by torch-sparse: torch_sparse.coalesce

mattiasmar commented 3 years ago

Thanks. Could you elaborate a little on the m and n (sparse size) arguments of coalesce? The documentation says:

m (int) - The first dimension of corresponding dense matrix.
n (int) - The second dimension of corresponding dense matrix.

What are these dimensions? What do they represent?

rusty1s commented 3 years ago

Those are just the dimensions of the corresponding sparse matrix, i.e., one more than the maximum values in edge_index[0] and edge_index[1]. In your case, both are the number of nodes in your graph:

edge_index, _ = torch_sparse.coalesce(edge_index, None, num_nodes, num_nodes)
mattiasmar commented 3 years ago

I get an assertion error for that:

  File "/usr/local/lib/python3.6/dist-packages/torch_sparse/storage.py", line 79, in __init__
    assert value.size(0) == col.size(0)

code:

from torch_sparse import coalesce
# type(state) == torch_geometric.data.data.Data
# state == Data(edge_attr=[785, 12], edge_index=[2, 7577], label=[785], x=[785, 12])
n = int(torch.max(state.edge_index))  # 784
index, value = coalesce(state.edge_index, state.x, m=n, n=n)
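
(The assertion fires because value must hold one entry per edge, i.e. edge_index.size(1) rows, which x does not. A minimal sketch under that assumption, passing no per-edge values and using the node count rather than the maximum index:)

from torch_sparse import coalesce

# Sketch only: with no per-edge values, pass None for `value`;
# the sparse sizes are the node count, i.e. max index + 1.
num_nodes = int(state.edge_index.max()) + 1  # 785, not 784
edge_index, _ = coalesce(state.edge_index, None, m=num_nodes, n=num_nodes)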
mattiasmar commented 3 years ago

Sorry, if I leave out the value argument the code snippet above runs fine; however, I still get the same runtime error as initially:

  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 59, in filter_adj
    row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
mattiasmar commented 3 years ago

It could be related to batches. The call to gunet that fails uses a batch of N graphs.

mattiasmar commented 3 years ago

Error message when gunet is passed a batch (with edge_index processed by torch_sparse.coalesce):

#state== Batch(batch=[3925], edge_attr=[3925, 12], edge_index=[2, 37885], label=[5], x=[3925, 12])
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [139,0,0], thread: [103,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [139,0,0], thread: [71,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [139,0,0], thread: [7,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [139,0,0], thread: [39,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [68,0,0], thread: [71,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

...

  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/models/graph_unet.py", line 85, in forward
    x, edge_index, edge_weight, batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 157, in forward
    num_nodes=score.size(0))
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 59, in filter_adj
    row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
mattiasmar commented 3 years ago

I tried applying coalesce to the graphs before adding them to the batch, and reached this state:

#state == > Batch(batch=[3925], edge_attr=[3925, 12], edge_index=[2, 37885], label=[5], x=[3925, 12])
#state.is_coalesced() ==>  True

Yet, gunet fails:

x = self.gunet(state.x, state.edge_index)

Error:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [101,0,0], thread: [100,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [30,0,0], thread: [100,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [66,0,0], thread: [34,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [103,0,0], thread: [87,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [137,0,0], thread: [90,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [65,0,0], thread: [126,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-18-4af83572aa32>", line 1, in <module>
    x = self.gunet(state.x, state.edge_index)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/models/graph_unet.py", line 85, in forward
    x, edge_index, edge_weight, batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 157, in forward
    num_nodes=score.size(0))
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 59, in filter_adj
    row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
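
One thing worth ruling out (an assumption, not a confirmed cause): when x and edge_index come from a Batch, GraphUNet.forward also accepts the batch assignment vector, which the internal pooling layers use to keep the graphs separated. A minimal sketch:

# Sketch only: forward the batch vector along with x and edge_index.
x = self.gunet(state.x, state.edge_index, batch=state.batch)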
rusty1s commented 3 years ago

Do you have a small example to reproduce? Which dataset are you using?

mattiasmar commented 3 years ago

That is an excellent question, probably better than you intended... When I save the Batch object to disk using torch.save(state, "/workdisk/state") and then load it in a separate Python script using

gunet = GraphUNet(12, 64, 64, 2)
state = torch.load("/workdisk/state")
gunet.cuda()
state.to("cuda")
x = gunet(state.x, state.edge_index)

I don't see a crash anymore.

mattiasmar commented 3 years ago

...however, if I call gunet twice in a row, the crash appears. Attached are a test script and the associated "state" Batch object.

gunet = GraphUNet(12, 64, 64, 2)
state = torch.load("/workdisk/state")
x = gunet(state.x, state.edge_index)
x = gunet(state.x, state.edge_index)

Error:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [30,0,0], thread: [100,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [30,0,0], thread: [8,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [30,0,0], thread: [44,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [30,0,0], thread: [80,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/workdisk/test_pyg_batch.py", line 21, in <module>
    x = gunet(state.x, state.edge_index)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/models/graph_unet.py", line 85, in forward
    x, edge_index, edge_weight, batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 157, in forward
    num_nodes=score.size(0))
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 59, in filter_adj
    row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered

test_pytorch_graph_batch2.zip

rusty1s commented 3 years ago

Hmm, that works for me. You might need to ensure that the torch-sparse CUDA kernels work as expected, e.g., by running the torch-sparse test suite.

mattiasmar commented 3 years ago

The test suite (python3 setup.py test) failed with this error:

Installed /tmp/pytorch_sparse/.eggs/coverage-5.5-py3.6-linux-x86_64.egg
running egg_info
creating torch_sparse.egg-info
writing torch_sparse.egg-info/PKG-INFO
writing dependency_links to torch_sparse.egg-info/dependency_links.txt
writing requirements to torch_sparse.egg-info/requires.txt
writing top-level names to torch_sparse.egg-info/top_level.txt
writing manifest file 'torch_sparse.egg-info/SOURCES.txt'
reading manifest file 'torch_sparse.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'torch_sparse.egg-info/SOURCES.txt'
running build_ext
building 'torch_sparse._diag_cpu' extension
creating build
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/tmp
creating build/temp.linux-x86_64-3.6/tmp/pytorch_sparse
creating build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc
creating build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc/cpu
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/tmp/pytorch_sparse/csrc -I/usr/local/lib/python3.6/dist-packages/torch/include -I/usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.6/dist-packages/torch/include/TH -I/usr/local/lib/python3.6/dist-packages/torch/include/THC -I/usr/include/python3.6m -c /tmp/pytorch_sparse/csrc/diag.cpp -o build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc/diag.o -O2 -DAT_PARALLEL_OPENMP -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_diag_cpu -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/tmp/pytorch_sparse/csrc -I/usr/local/lib/python3.6/dist-packages/torch/include -I/usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.6/dist-packages/torch/include/TH -I/usr/local/lib/python3.6/dist-packages/torch/include/THC -I/usr/include/python3.6m -c /tmp/pytorch_sparse/csrc/cpu/diag_cpu.cpp -o build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc/cpu/diag_cpu.o -O2 -DAT_PARALLEL_OPENMP -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_diag_cpu -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/torch_sparse
x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc/diag.o build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc/cpu/diag_cpu.o -L/usr/local/lib/python3.6/dist-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-3.6/torch_sparse/_diag_cpu.so -s
building 'torch_sparse._diag_cuda' extension
creating build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc/cuda
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DWITH_CUDA -I/tmp/pytorch_sparse/csrc -I/usr/local/lib/python3.6/dist-packages/torch/include -I/usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.6/dist-packages/torch/include/TH -I/usr/local/lib/python3.6/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.6m -c /tmp/pytorch_sparse/csrc/diag.cpp -o build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc/diag.o -O2 -DAT_PARALLEL_OPENMP -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_diag_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DWITH_CUDA -I/tmp/pytorch_sparse/csrc -I/usr/local/lib/python3.6/dist-packages/torch/include -I/usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.6/dist-packages/torch/include/TH -I/usr/local/lib/python3.6/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.6m -c /tmp/pytorch_sparse/csrc/cpu/diag_cpu.cpp -o build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc/cpu/diag_cpu.o -O2 -DAT_PARALLEL_OPENMP -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_diag_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
/usr/local/cuda/bin/nvcc -DWITH_CUDA -I/tmp/pytorch_sparse/csrc -I/usr/local/lib/python3.6/dist-packages/torch/include -I/usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.6/dist-packages/torch/include/TH -I/usr/local/lib/python3.6/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.6m -c /tmp/pytorch_sparse/csrc/cuda/diag_cuda.cu -o build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc/cuda/diag_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -arch=sm_35 --expt-relaxed-constexpr -O2 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_diag_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
In file included from /tmp/pytorch_sparse/csrc/cuda/diag_cuda.cu:3:0:
/usr/local/lib/python3.6/dist-packages/torch/include/ATen/cuda/CUDAContext.h:7:10: fatal error: cublas_v2.h: No such file or directory
 #include <cublas_v2.h>
          ^~~~~~~~~~~~~
compilation terminated.
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1

mattiasmar commented 3 years ago

Calling find /usr/local/ -name cublas_v2.h returns: /usr/local/cuda-10.2/targets/x86_64-linux/include/cublas_v2.h

rusty1s commented 3 years ago

I see. The test suite tries to install the package from source, which fails. How about you try the following script to see whether sparse-sparse matrix multiplication works on your end:

import torch
from torch_sparse import SparseTensor

x = SparseTensor.from_dense(torch.randn(10, 10, device='cuda'))
out = x @ x
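
If that runs, a slightly extended check (a sketch, not part of the original suggestion) is to compare the CUDA product against a CPU reference:

# Sketch: the same multiplication on CPU and GPU should agree within tolerance.
dense = torch.randn(10, 10)
cpu = SparseTensor.from_dense(dense)
gpu = SparseTensor.from_dense(dense.cuda())
assert torch.allclose((cpu @ cpu).to_dense(),
                      (gpu @ gpu).to_dense().cpu(), atol=1e-4)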
mattiasmar commented 3 years ago

I get: test_torchsparse.txt

root@bdtj8:/tmp/pytorch_sparse# /usr/bin/python3 test_torchsparse.py 
Traceback (most recent call last):
  File "test_torchsparse.py", line 4, in <module>
    from torch_sparse import SparseTensor
  File "/tmp/pytorch_sparse/torch_sparse/__init__.py", line 15, in <module>
    f'{library}_{suffix}', [osp.dirname(__file__)]).origin)
AttributeError: 'NoneType' object has no attribute 'origin'
rusty1s commented 3 years ago

Yes, this might happen due to your previously failed installation. Try to uninstall that one and run again:

pip uninstall torch-sparse
mattiasmar commented 3 years ago

After the uninstall, the installation gives these error messages:

(screenshots of the installation error messages attached)

I also tried running in a fresh Google Colab notebook: (screenshot attached)

rusty1s commented 3 years ago

Please install on Colab via:

!pip install -q torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install -q torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install -q torch-geometric
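
The wheel index has to match the installed torch and CUDA versions; a quick way to check them (a sketch):

import torch

# Pick the wheel index that matches these two values, e.g. torch 1.8.0
# with CUDA 10.1 corresponds to .../whl/torch-1.8.0+cu101.html.
print(torch.__version__)
print(torch.version.cuda)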