Any insights into this problem?
Sorry, I missed this issue. This error is usually related to how cuSPARSE handles sparse-sparse matrix multiplication. In particular, sparse matrix multiplication does not work if edge_index holds duplicate entries.
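As a quick aside, a minimal way to test for such duplicates (a sketch, assuming edge_index is the usual [2, num_edges] LongTensor) is to count unique columns:
```python
import torch

def has_duplicate_edges(edge_index: torch.Tensor) -> bool:
    # torch.unique over dim=1 keeps one copy of each (source, target) column;
    # fewer unique columns than edges means edge_index holds duplicates.
    return torch.unique(edge_index, dim=1).size(1) < edge_index.size(1)
```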
I tried to apply coalesce() on the edge_index:
```python
edge_index.coalesce()
y = self.gunet(x, edge_index)
```
but it gave this error:
RuntimeError: Could not run 'aten::coalesce' with arguments from the 'CUDA' backend. 'aten::coalesce' is only available for these backends: [SparseCPU, SparseCUDA, Autograd, Profiler, Tracer].
Any suggestions on how to test for and remove duplicate entries?
You need to use the coalesce function provided by torch-sparse: torch_sparse.coalesce. (Tensor.coalesce is only defined for torch.sparse tensors, while edge_index is a plain dense LongTensor, hence the backend error above.)
Thanks. Could you elaborate a little on the m and n (sparse size) arguments of coalesce? The documentation says:
m (int) - The first dimension of corresponding dense matrix.
n (int) - The second dimension of corresponding dense matrix.
What are these dimensions? What do they represent?
Those are just the dimensions of your sparse matrix, i.e. one more than the maximum index appearing in edge_index[0] and edge_index[1]. In your case, both are the number of nodes in your graph:
```python
edge_index, _ = torch_sparse.coalesce(edge_index, None, num_nodes, num_nodes)
```
I get an assertion error for that:
File "/usr/local/lib/python3.6/dist-packages/torch_sparse/storage.py", line 79, in __init__
assert value.size(0) == col.size(0)
Code:
```python
import torch
from torch_sparse import coalesce

# type(state) == torch_geometric.data.data.Data
# state == Data(edge_attr=[785, 12], edge_index=[2, 7577], label=[785], x=[785, 12])
n = int(torch.max(state.edge_index))  # 784
index, value = coalesce(state.edge_index, state.x, m=n, n=n)
```
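For context on the assertion: coalesce expects value to have one row per edge, i.e. value.size(0) must equal edge_index.size(1) (7577 here), which neither x nor edge_attr satisfies in this Data object, so coalescing the indices alone is the safer call. A sketch, also assuming the matrix dimensions should be the number of nodes (max index + 1) rather than the max index itself:
```python
from torch_sparse import coalesce

num_nodes = state.x.size(0)  # 785, i.e. max node index + 1, not int(edge_index.max())
# Coalesce only the indices; value=None sidesteps the per-edge shape requirement.
index, _ = coalesce(state.edge_index, None, m=num_nodes, n=num_nodes)
```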
Sorry, if I leave out the value argument, the code snippet above runs fine; however, I still get the same runtime error as initially:
File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 59, in filter_adj
row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
It could be related to batches: the call to gunet that fails uses a batch of N graphs.
Error message when gunet is passed a batch (with edge_index processed by torch_sparse.coalesce):
# state == Batch(batch=[3925], edge_attr=[3925, 12], edge_index=[2, 37885], label=[5], x=[3925, 12])
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [139,0,0], thread: [103,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [139,0,0], thread: [71,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [139,0,0], thread: [7,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [139,0,0], thread: [39,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [68,0,0], thread: [71,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/models/graph_unet.py", line 85, in forward
x, edge_index, edge_weight, batch)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 157, in forward
num_nodes=score.size(0))
File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 59, in filter_adj
row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
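One thing worth checking here, as an aside not raised in the thread: GraphUNet.forward also accepts the batch assignment vector, and when x and edge_index come from a Batch of several graphs it is usually passed explicitly so the internal top-k pooling knows which graph each node belongs to. A sketch:
```python
# Hypothetical call including the batch vector from the Batch object.
x = self.gunet(state.x, state.edge_index, batch=state.batch)
```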
I tried applying coalesce to the individual graphs before adding them to the batch (sketched after the traceback below) and reached this state:
```python
# state == Batch(batch=[3925], edge_attr=[3925, 12], edge_index=[2, 37885], label=[5], x=[3925, 12])
# state.is_coalesced() == True
```
Yet, gunet still fails:
```python
x = self.gunet(state.x, state.edge_index)
```
Error:
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [101,0,0], thread: [100,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [30,0,0], thread: [100,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [66,0,0], thread: [34,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [103,0,0], thread: [87,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [137,0,0], thread: [90,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [65,0,0], thread: [126,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-18-4af83572aa32>", line 1, in <module>
x = self.gunet(state.x, state.edge_index)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/models/graph_unet.py", line 85, in forward
x, edge_index, edge_weight, batch)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 157, in forward
num_nodes=score.size(0))
File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 59, in filter_adj
row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
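For reference, a sketch of the per-graph preprocessing described above, assuming the individual graphs live in a Python list of Data objects called data_list (a name used here only for illustration):
```python
from torch_geometric.data import Batch
from torch_sparse import coalesce

for data in data_list:
    n = data.x.size(0)
    # Deduplicate each graph's edge_index before batching.
    data.edge_index, _ = coalesce(data.edge_index, None, m=n, n=n)

state = Batch.from_data_list(data_list)
```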
Do you have a small example to reproduce? Which dataset are you using?
That is an excellent question, probably better than what you intended...
When I save the Batch object to disk using torch.save(state, "/workdisk/state") and then load it in a separate Python script using
```python
gunet = GraphUNet(12, 64, 64, 2)
state = torch.load("/workdisk/state")
gunet.cuda()
state.to("cuda")
x = gunet(state.x, state.edge_index)
```
I don't see a crash anymore.
...however, if I call gunet twice in a row, the crash appears. Attached is a test script and the associated "state" Batch object.
```python
gunet = GraphUNet(12, 64, 64, 2)
state = torch.load("/workdisk/state")
x = gunet(state.x, state.edge_index)
x = gunet(state.x, state.edge_index)
```
Error:
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [30,0,0], thread: [100,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [30,0,0], thread: [8,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [30,0,0], thread: [44,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [30,0,0], thread: [80,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "/workdisk/test_pyg_batch.py", line 21, in <module>
x = gunet(state.x, state.edge_index)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/models/graph_unet.py", line 85, in forward
x, edge_index, edge_weight, batch)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 157, in forward
num_nodes=score.size(0))
File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/pool/topk_pool.py", line 59, in filter_adj
row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
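A general CUDA debugging note, not specific to GraphUNet: device-side asserts are reported asynchronously, so the error can surface on a later call than the one that actually triggered it. Re-running with launch blocking enabled usually pins the failure to the right call; a sketch, reusing the saved state file from above:
```python
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before CUDA is initialized

import torch
from torch_geometric.nn import GraphUNet

gunet = GraphUNet(12, 64, 64, 2).cuda()
state = torch.load("/workdisk/state").to("cuda")
x = gunet(state.x, state.edge_index)  # the assert should now point at the failing op
```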
Mh, that works for me. You might need to ensure that the torch-sparse CUDA kernels work as expected, e.g., by running the torch-sparse test suite.
The test suite (python3 setup.py test) failed with this error:
```
Installed /tmp/pytorch_sparse/.eggs/coverage-5.5-py3.6-linux-x86_64.egg
running egg_info
...
writing manifest file 'torch_sparse.egg-info/SOURCES.txt'
running build_ext
building 'torch_sparse._diag_cpu' extension
x86_64-linux-gnu-gcc ... -c /tmp/pytorch_sparse/csrc/diag.cpp -o build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc/diag.o ...
x86_64-linux-gnu-gcc ... -c /tmp/pytorch_sparse/csrc/cpu/diag_cpu.cpp -o build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc/cpu/diag_cpu.o ...
x86_64-linux-gnu-g++ ... -o build/lib.linux-x86_64-3.6/torch_sparse/_diag_cpu.so -s
building 'torch_sparse._diag_cuda' extension
/usr/local/cuda/bin/nvcc -DWITH_CUDA ... -c /tmp/pytorch_sparse/csrc/cuda/diag_cuda.cu -o build/temp.linux-x86_64-3.6/tmp/pytorch_sparse/csrc/cuda/diag_cuda.o ... -arch=sm_35 ...
In file included from /tmp/pytorch_sparse/csrc/cuda/diag_cuda.cu:3:0:
/usr/local/lib/python3.6/dist-packages/torch/include/ATen/cuda/CUDAContext.h:7:10: fatal error: cublas_v2.h: No such file or directory
compilation terminated.
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1
```
Running find /usr/local/ -name cublas_v2.h returns:
/usr/local/cuda-10.2/targets/x86_64-linux/include/cublas_v2.h
I see. The test suite tries to install the package from source, which fails. How about you try the following script to see whether sparse-sparse matrix multiplication works on your end:
```python
import torch
from torch_sparse import SparseTensor

x = SparseTensor.from_dense(torch.randn(10, 10, device='cuda'))
out = x @ x
```
I get: test_torchsparse.txt
root@bdtj8:/tmp/pytorch_sparse# /usr/bin/python3 test_torchsparse.py
Traceback (most recent call last):
File "test_torchsparse.py", line 4, in <module>
from torch_sparse import SparseTensor
File "/tmp/pytorch_sparse/torch_sparse/__init__.py", line 15, in <module>
f'{library}_{suffix}', [osp.dirname(__file__)]).origin)
AttributeError: 'NoneType' object has no attribute 'origin'
Yes, this might happen due to your previously failed installation. Try to uninstall that one and run again:
pip uninstall torch-sparse
After the uninstall, the installation gives these error messages:
I also tried running in a fresh Google Colab notebook:
Please install on Colab via:
!pip install -q torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install -q torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install -q torch-geometric
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
error
Expected behavior
same behavior between cpu and cuda