Open RX28666 opened 8 months ago
Does import torch_sparse
work for you?
Thanks for your reply. Unfortunately, it raises an error:
lib/python3.10/site-packages/torch_sparse/_version_cuda.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev
The same happens with torch_scatter, even though I updated it.
@RX28666 What's the error raised by
data.adj_t = data.adj_t.set_diag()
It looks like you have multiple torch-sparse
versions installed (one from conda and one from pip), which might explain this issue.
Hello,
Thanks for your reply. It returns 'Tensor' object has no attribute 'set_diag'. This did not happen previously, when data.adj_t was a SparseTensor. It can be worked around with data.adj_t = gcn_norm(data.adj_t, add_self_loops=True); however, I am concerned about the impact (for example on complexity) if the adjacency matrix is not a SparseTensor.
Hello,
Thanks for your reply. I ran conda list
to show all packages in my current environment. It returns:
tokenizers 0.13.3 pypi_0 pypi
torch 2.2.1 pypi_0 pypi
torch-cluster 1.6.3 pypi_0 pypi
torch-geometric 2.5.0 pypi_0 pypi
torch-scatter 2.1.2 pypi_0 pypi
torch-sparse 0.6.18 pypi_0 pypi
torch-spline-conv 1.2.2+pt22cu121 pypi_0 pypi
torchaudio 2.2.1 pypi_0 pypi
torchvision 0.17.1 pypi_0 pypi
tornado 6.1 py310h5764c6d_3 conda-forge
tqdm 4.66.1 pypi_0 pypi
traitlets 5.14.0 pyhd8ed1ab_0 conda-forge
transformers 4.33.2 pypi_0 pypi
triton 2.2.0 pypi_0 pypi
typing-extensions 4.9.0 pypi_0 pypi
typing-inspect 0.9.0 pypi_0 pypi
tzdata 2023.3 pypi_0 pypi
urllib3 2.0.4 pypi_0 pypi
wandb 0.16.0 pypi_0 pypi
wcwidth 0.2.12 pyhd8ed1ab_0 conda-forge
wheel 0.38.4 py310h06a4308_0
xxhash 3.4.1 pypi_0 pypi
xz 5.4.2 h5eee18b_0
yarl 1.9.2 pypi_0 pypi
zeromq 4.3.4 h2531618_0
zlib 1.2.13 h5eee18b_0
After I ran pip uninstall torch-sparse
, torch-sparse
no longer appears in the package list, and importing it fails with No module named 'torch_sparse'
. I am not sure I understood you correctly, and I would appreciate any further suggestions.
Hello,
I repeatedly ran pip uninstall torch-sparse
and pip uninstall torch-scatter
until it returned:
WARNING: Ignoring invalid distribution -orch (/anaconda3/envs/LLMGNN/lib/python3.10/site-packages)
WARNING: Skipping torch-sparse as it is not installed.
Then I ran pip install torch-sparse
, but the error persists:
/anaconda3/envs/LLMGNN/lib/python3.10/site-packages/torch_sparse/_version_cuda.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev
FYI, my PyTorch version is 2.2.1+cu121
. Could you please help with this?
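The missing symbol is a mangled C++ name. Demangled, it reads c10::RegisterOperators::~RegisterOperators(), a destructor that lives in PyTorch's c10 core library, which is why this error usually means the extension's .so was compiled against a different PyTorch build than the one installed. The usual tool for this is c++filt; the sketch below is a deliberately tiny pure-Python demangler (the name `demangle_simple` is hypothetical) that only covers nested names with an optional constructor or destructor and no parameters.

```python
def demangle_simple(symbol: str) -> str:
    """Demangle a very small subset of Itanium C++ ABI names.

    Supported shape: _ZN <len><name>... [C1|C2|D0|D1|D2] E v
    (a nested name, optionally ending in a constructor/destructor,
    taking no parameters). Anything else raises ValueError.
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("only nested names (_ZN...) are supported")
    pos, parts = 3, []
    while pos < len(symbol) and symbol[pos] != "E":
        if symbol[pos].isdigit():
            # Length-prefixed identifier, e.g. "3c10" -> "c10".
            end = pos
            while symbol[end].isdigit():
                end += 1
            length = int(symbol[pos:end])
            parts.append(symbol[end:end + length])
            pos = end + length
        elif symbol[pos:pos + 2] in ("C1", "C2") and parts:
            parts.append(parts[-1])        # constructor: same name as the class
            pos += 2
        elif symbol[pos:pos + 2] in ("D0", "D1", "D2") and parts:
            parts.append("~" + parts[-1])  # destructor
            pos += 2
        else:
            raise ValueError(f"unsupported construct at offset {pos}")
    if symbol[pos + 1:] != "v":
        raise ValueError("only parameterless functions are supported")
    return "::".join(parts) + "()"


if __name__ == "__main__":
    print(demangle_simple("_ZN3c1017RegisterOperatorsD1Ev"))
```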
Can you show me the installation log when running
pip install --verbose torch-sparse
Do you use the -f
option to install from wheels?
Hello,
This is the log:
Using pip 23.2.1 from /anaconda3/envs/LLMGNN/lib/python3.10/site-packages/pip (python 3.10)
WARNING: Ignoring invalid distribution -orch (/anaconda3/envs/LLMGNN/lib/python3.10/site-packages)
Collecting torch-sparse
Using cached torch_sparse-0.6.18-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: scipy in ./anaconda3/envs/LLMGNN/lib/python3.10/site-packages (from torch-sparse) (1.11.1)
Requirement already satisfied: numpy<1.28.0,>=1.21.6 in ./anaconda3/envs/LLMGNN/lib/python3.10/site-packages (from scipy->torch-sparse) (1.25.2)
WARNING: Ignoring invalid distribution -orch (/anaconda3/envs/LLMGNN/lib/python3.10/site-packages)
Installing collected packages: torch-sparse
Successfully installed torch-sparse-0.6.18
I double-checked: after repeatedly running pip uninstall torch-sparse
, the package no longer appears in pip list
. I tried both of the following installation commands:
pip install torch-sparse
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.2.1+cu121.html
but they return the same error.
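The -f URL above encodes the PyTorch version and CUDA tag. A minimal sketch of deriving it from a version string such as the one reported by torch.__version__ follows. The helper name `pyg_wheel_index` is hypothetical, and it is an assumption that the URL pattern shown in this thread holds for other versions and that builds without a local version tag map to "cpu".

```python
def pyg_wheel_index(torch_version: str) -> str:
    """Build a wheel index URL in the pattern used in this thread.

    torch_version is a string such as "2.2.1+cu121" (what torch.__version__
    reports for a CUDA build) or "2.2.1" (assumed here to mean a CPU build).
    """
    base, _, local = torch_version.partition("+")
    cuda_tag = local or "cpu"
    return f"https://data.pyg.org/whl/torch-{base}+{cuda_tag}.html"


if __name__ == "__main__":
    url = pyg_wheel_index("2.2.1+cu121")
    # Command form taken from the thread; only the URL is computed here.
    print(f"pip install torch-scatter torch-sparse -f {url}")
```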
Can you redo with
pip install --verbose --no-cache torch-sparse
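The verbose log earlier in the thread contains the line "Using cached torch_sparse-0.6.18-cp310-cp310-linux_x86_64.whl". One plausible reading, which is an assumption and not something confirmed in the thread, is that pip reused a wheel that had been built locally against an older PyTorch. The sketch below scans a saved pip log for such lines; `cached_wheels` is a hypothetical helper name.

```python
import re

# Matches pip's "Using cached <file>.whl" lines, with or without indentation.
CACHED_WHEEL = re.compile(r"^\s*Using cached (\S+\.whl)", re.MULTILINE)


def cached_wheels(log_text: str) -> list:
    """Return the wheel file names that pip reports as taken from its cache."""
    return CACHED_WHEEL.findall(log_text)


if __name__ == "__main__":
    sample = (
        "Collecting torch-sparse\n"
        "  Using cached torch_sparse-0.6.18-cp310-cp310-linux_x86_64.whl\n"
        "Installing collected packages: torch-sparse\n"
    )
    print(cached_wheels(sample))
```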
Thanks for your reply! It works. However, I now hit another error without having modified any part of my code:
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Then I used:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
It returns:
RuntimeError Traceback (most recent call last)
Input [In [18]] in <cell line: 3>()
[7]model.reset_parameters()
[8]for epoch in range(args.epochs):
----> [9] loss = train(data)
Input [In [16]], in train(data)
[19]y = batch1.y[:batch1.batch_size][train].to(device)
[20]
---> [21]out = model(x1, adj_t1, id1, batch1.batch_size, args.K_train, args.alpha)[:batch1.batch_size][train]
[22] loss = F.nll_loss(out, y)
File /lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
[1509] return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
[1510] else:
-> [1511] return self._call_impl(*args, **kwargs)
File /lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
[1515]# If we don't have any hooks, we want to skip the rest of the logic in
[1516] # this function, and just call forward.
[1517] if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
[1518] or _global_backward_pre_hooks or _global_backward_hooks
[1519] or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520] return forward_call(*args, **kwargs)
[1522] try:
[1523] result = None
Input [In [14]] in Net.forward(self, x, adj, id, size, K, alpha)
[37] z = x.clone()
[38]for i in range(K-1):
---> [39] z = (1 - alpha) * (adj @ z) + alpha * x
File /lib/python3.10/site-packages/torch_sparse/matmul.py:171, in <lambda>(self, other)
[167]SparseTensor.spspmm = lambda self, other, reduce="sum": spspmm(
[168] self, other, reduce)
[169]SparseTensor.matmul = lambda self, other, reduce="sum": matmul(
[170] self, other, reduce)
--> [171]SparseTensor.__matmul__ = lambda self, other: matmul(self, other, 'sum')
File /lib/python3.10/site-packages/torch_sparse/matmul.py:160, in matmul(src, other, reduce)
[142] """Matrix product of a sparse tensor with either another sparse tensor or a
[143] dense tensor. The sparse tensor represents an adjacency matrix and is
[144] stored as a list of edges. This method multiplies elements along the rows
(...)
[157] :rtype: (:class:`Tensor`)
[158] """
[159] if isinstance(other, torch.Tensor):
--> [160] return spmm(src, other, reduce)
[161] elif isinstance(other, SparseTensor):
[162] return spspmm(src, other, reduce)
File /lib/python3.10/site-packages/torch_sparse/matmul.py:83, in spmm(src, other, reduce)
[79] def spmm(src: SparseTensor,
[80] other: torch.Tensor,
[81] reduce: str = "sum") -> torch.Tensor:
[82] if reduce == 'sum' or reduce == 'add':
---> [83] return spmm_sum(src, other)
[84] elif reduce == 'mean':
[85] return spmm_mean(src, other)
File /lib/python3.10/site-packages/torch_sparse/matmul.py:24, in spmm_sum(src, other)
[22] if other.requires_grad:
[23] row = src.storage.row()
---> [24] csr2csc = src.storage.csr2csc()
[25] colptr = src.storage.colptr()
[27] return torch.ops.torch_sparse.spmm_sum(row, rowptr, col, value, colptr,
[28] csr2csc, other)
File /lib/python3.10/site-packages/torch_sparse/storage.py:412, in SparseStorage.csr2csc(self)
[409] if csr2csc is not None:
[410] return csr2csc
--> [412] idx = self._sparse_sizes[0] * self._col + self.row()
[413] max_value = self._sparse_sizes[0] * self._sparse_sizes[1]
[414] _, csr2csc = index_sort(idx, max_value)
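The last frame of the traceback builds a sort key, sparse_sizes[0] * col + row, and sorts by it. Sorting by that key orders the non-zero entries column by column, so the result is the permutation that maps CSR order to CSC order. The pure-Python sketch below illustrates the idea on a small matrix. It is not torch_sparse's implementation (which runs index_sort on tensors), and `csr2csc_permutation` is a hypothetical name.

```python
def csr2csc_permutation(row, col, num_rows):
    """Return the permutation that reorders CSR-ordered entries into CSC order.

    row and col hold the coordinates of the non-zeros in row-major (CSR) order.
    The key num_rows * col + row is the same expression as in the traceback:
    it is smaller for earlier columns, and within a column for earlier rows.
    """
    keys = [num_rows * c + r for r, c in zip(row, col)]
    return sorted(range(len(keys)), key=keys.__getitem__)


if __name__ == "__main__":
    # Non-zeros of a 3x3 matrix at (0,1), (0,2), (1,0), (2,1), in CSR order.
    row = [0, 0, 1, 2]
    col = [1, 2, 0, 1]
    perm = csr2csc_permutation(row, col, num_rows=3)
    print(perm)                                # [2, 0, 3, 1]
    print([(row[i], col[i]) for i in perm])    # column-major order
```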
I also tried running my other scripts; the same error occurs in torch-sparse, for example:
csr2csc = src.storage.csr2csc()
[25] colptr = src.storage.colptr()
---> [27] return torch.ops.torch_sparse.spmm_sum(row, rowptr, col, value, colptr,
[28] csr2csc, other)
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I am wondering whether this is caused by the version change, as it never happened before the update. Thanks.
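A note on the CUDA_LAUNCH_BLOCKING setting used above: the variable is read when CUDA is initialized in the process, so it needs to be set before the first CUDA call. In a notebook that has already run GPU code, this means restarting the kernel first. The sketch below sets it at the very top, before importing torch; the helper `cuda_already_initialized` is a hypothetical name, and the try/except only keeps the sketch runnable where PyTorch is absent.

```python
import os

# Set this before any CUDA work happens in the process; setting it later
# has no effect on a CUDA context that already exists.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch  # imported only after the variable is in place
except ImportError:
    torch = None  # lets the sketch run on machines without PyTorch


def cuda_already_initialized() -> bool:
    """True if PyTorch has already created a CUDA context in this process."""
    return bool(torch is not None and torch.cuda.is_initialized())


if __name__ == "__main__":
    print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
    print("CUDA already initialized:", cuda_already_initialized())
```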
Hello Matthias,
The following script can reproduce the bug:
import argparse
import os.path as osp
from typing import Tuple
import numpy as np
import time
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from torch.nn import Linear
import scipy.sparse as sp
import torch_geometric
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid
from torch_geometric.logging import init_wandb, log
from torch_geometric.utils import to_undirected
from torch_geometric.loader import DataLoader
from torch_geometric.loader import NeighborLoader
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator
from torch_geometric.nn import GCNConv
from torch_geometric.nn.conv.gcn_conv import gcn_norm
def index2mask(idx: Tensor, size: int) -> Tensor:
    mask = torch.zeros(size, dtype=torch.bool, device=idx.device)
    mask[idx] = True
    return mask

def gen_masks(y: Tensor, train_per_class: int = 20, val_per_class: int = 30,
              num_splits: int = 20) -> Tuple[Tensor, Tensor, Tensor]:
    num_classes = int(y.max()) + 1
    train_mask = torch.zeros(y.size(0), num_splits, dtype=torch.bool)
    val_mask = torch.zeros(y.size(0), num_splits, dtype=torch.bool)
    for c in range(num_classes):
        idx = (y == c).nonzero(as_tuple=False).view(-1)
        perm = torch.stack(
            [torch.randperm(idx.size(0)) for _ in range(num_splits)], dim=1)
        idx = idx[perm]
        train_idx = idx[:train_per_class]
        train_mask.scatter_(0, train_idx, True)
        val_idx = idx[train_per_class:train_per_class + val_per_class]
        val_mask.scatter_(0, val_idx, True)
    test_mask = ~(train_mask | val_mask)
    return train_mask, val_mask, test_mask

def get_arxiv():
    root = '/tmp/datasets'
    dataset = PygNodePropPredDataset('ogbn-arxiv', f'{root}/OGB',
                                     pre_transform=T.ToSparseTensor())
    data = dataset[0]
    data.adj_t = data.adj_t.to_symmetric()
    data.node_year = None
    data.y = data.y.view(-1)
    split_idx = dataset.get_idx_split()
    data.train_mask = index2mask(split_idx['train'], data.num_nodes)
    data.val_mask = index2mask(split_idx['valid'], data.num_nodes)
    data.test_mask = index2mask(split_idx['test'], data.num_nodes)
    return data, dataset.num_features, dataset.num_classes

data, in_channels, out_channels = get_arxiv()
dataset = PygNodePropPredDataset(name='ogbn-arxiv')
device = torch.device("cuda:2" if torch.cuda.is_available() else "cpu")
data.adj_t = data.adj_t.set_diag()
data.adj_t = gcn_norm(data.adj_t, add_self_loops=False)
data.n_id = torch.arange(data.num_nodes)

parser = argparse.ArgumentParser()
parser.add_argument('--runs', type=int, default=1)
parser.add_argument('--epochs', type=int, default=2000)
parser.add_argument('--lr', type=float, default=0.01)
parser.add_argument('--weight_decay', type=float, default=0)
parser.add_argument('--early_stopping', type=int, default=0)
parser.add_argument('--hidden', type=int, default=256)
parser.add_argument('--num_layers', type=int, default=3)
parser.add_argument('--dropout', type=float, default=0.5)
parser.add_argument('--normalize_features', action='store_true')
args = parser.parse_args(args=[])

class Net(nn.Module):
    def __init__(self, num_features, hidden_channels, num_classes, num_layers, num_nodes, **kwargs):
        super(Net, self).__init__()
        self.convs = torch.nn.ModuleList()
        self.convs.append(GCNConv(num_features, hidden_channels, normalize=False))
        for _ in range(num_layers - 2):
            self.convs.append(GCNConv(hidden_channels, hidden_channels, normalize=False))
        self.convs.append(GCNConv(hidden_channels, num_classes, normalize=False))
        self.num_classes = num_classes
        self.num_nodes = num_nodes
        self.hidden_channels = hidden_channels

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()

    def forward(self):
        data_z = self.convs[0](data.x.to(device), data.adj_t.to(device))
        return data_z

model = Net(data.x.shape[1], args.hidden, dataset.num_classes, args.num_layers, data.num_nodes)
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)

def train(data):
    model.train()
    neigh_out = model()
    return None

acc = []
best = 0
for j in range(args.runs):
    tr = []
    val_accs = []
    test_accs = []
    for epoch in range(args.epochs):
        loss = train(data)
It returns:
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I am wondering if I missed anything here. Thanks.
This runs fine for me, but I had to set cuda:0
since I only have a single GPU. What happens if you do the same?
I found that my code runs fine only on cuda:0
; it always reports the same error on any other GPU. Thanks for pointing that out. The error seems to appear only after several epochs of training; the first few epochs run without problems.
Do you have any idea what might cause this? It only started after I updated the package. Thanks.
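Not something suggested in this thread, but a common workaround when code only behaves on cuda:0 is to expose a single physical GPU to the process, so that this GPU becomes cuda:0. Whether this avoids the crash here is an assumption; it does not explain the root cause. The helper name `pin_process_to_gpu` is hypothetical, and it has to run before CUDA is initialized (ideally before importing torch).

```python
import os


def pin_process_to_gpu(physical_index: int) -> str:
    """Expose only one physical GPU to this process.

    After this call (and before CUDA is initialized), the chosen GPU is the
    only visible device, so it is addressed as "cuda:0".
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(physical_index)
    return "cuda:0"


if __name__ == "__main__":
    device_name = pin_process_to_gpu(2)  # physical GPU 2, as in the script above
    print(device_name, "->", os.environ["CUDA_VISIBLE_DEVICES"])
    # Later: import torch; device = torch.device(device_name)
```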
I am not entirely sure. Can you show me/upload the input of
torch.ops.torch_sparse.spmm_sum(row, rowptr, col, value, colptr, csr2csc, other)
for which this crashes?
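Before sharing those inputs, it can help to rule out malformed index arrays, since out-of-range indices are a frequent source of illegal memory accesses in sparse kernels. Whether that is the cause here is unknown. The sketch below checks CSR index arrays given as plain Python lists (for example obtained via .cpu().tolist()); `check_csr` is a hypothetical helper, not a torch_sparse function.

```python
def check_csr(rowptr, col, num_rows, num_cols):
    """Return a list of problems found in CSR index arrays (empty if none)."""
    problems = []
    if len(rowptr) != num_rows + 1:
        problems.append(f"rowptr has {len(rowptr)} entries, expected {num_rows + 1}")
    if rowptr and rowptr[0] != 0:
        problems.append("rowptr does not start at 0")
    if any(b < a for a, b in zip(rowptr, rowptr[1:])):
        problems.append("rowptr is not non-decreasing")
    if rowptr and rowptr[-1] != len(col):
        problems.append(f"rowptr ends at {rowptr[-1]}, but col has {len(col)} entries")
    bad = [c for c in col if not 0 <= c < num_cols]
    if bad:
        problems.append(f"{len(bad)} column indices outside [0, {num_cols})")
    return problems


if __name__ == "__main__":
    # 3x3 matrix with non-zeros at (0,1), (0,2), (1,0), (2,1).
    print(check_csr([0, 2, 3, 4], [1, 2, 0, 1], num_rows=3, num_cols=3))
    # Same matrix with one column index out of range.
    print(check_csr([0, 2, 3, 4], [1, 2, 0, 3], num_rows=3, num_cols=3))
```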
🐛 Describe the bug
Hello,
I updated PyTorch to 2.2.1+cu121 using pip, and also updated PyG with
pip install torch_geometric
. Then I noticed a warning when importing the dataset:
/lib/python3.10/site-packages/torch_geometric/utils/sparse.py:268 : Sparse CSR tensor support is in beta state, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at ../aten/src/ATen/SparseCsrTensorImpl.cpp:53.)
adj = torch.sparse_csr_tensor(
This is the code related to this:
This never happened before the update. Now my data.adj_t is a plain Tensor instead of a SparseTensor, and calls such as:
data.adj_t = data.adj_t.set_diag()
raise errors. I am wondering how I can fix this.
Besides, after updating PyG, importing returned
lib/python3.10/site-packages/torch_cluster/_version_cuda.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev.
I fixed that by updating torch_cluster. I am not sure whether this is related. Any help would be appreciated.
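The change from SparseTensor to a plain Tensor appears consistent with PyG falling back to torch.sparse_csr_tensor when torch_sparse cannot be imported, which the warning from utils/sparse.py points to. Treat that reading as an assumption. A quick way to see whether each extension actually loads is sketched below, using only the standard library; `try_import` is a hypothetical helper name.

```python
import importlib


def try_import(module_name: str):
    """Return (True, None) if the module imports, else (False, 'Type: message').

    A broad except is used on purpose: a compiled extension built against a
    different PyTorch can fail with OSError (undefined symbol) and not only
    with ImportError.
    """
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, None


if __name__ == "__main__":
    for name in ("torch", "torch_scatter", "torch_sparse", "torch_cluster"):
        ok, error = try_import(name)
        print(f"{name}: {'ok' if ok else error}")
```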
Versions
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.5
Libc version: glibc-2.35
Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA RTX A6000
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.25.2
[pip3] pytorch-warmup==0.1.1
[pip3] torch==2.2.1
[pip3] torch-cluster==1.6.3
[pip3] torch_geometric==2.5.0
[pip3] torch-scatter==2.1.2
[pip3] torch-sparse==0.6.17
[pip3] torch_spline_conv==1.2.2+pt22cu121
[pip3] torchaudio==2.2.1
[pip3] torchvision==0.17.1
[pip3] triton==2.2.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.8.0 h6a678d5_0
[conda] mkl 2023.1.0 h213fc3f_46343
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.6 py310h1128e8f_1
[conda] mkl_random 1.2.2 py310h1128e8f_1
[conda] numpy 1.26.0 pypi_0 pypi
[conda] numpy-base 1.25.2 py310hb5e798b_0
[conda] pytorch-mutex 1.0 cpu pytorch
[conda] pytorch-warmup 0.1.1 pypi_0 pypi
[conda] torch 2.0.1 pypi_0 pypi
[conda] torch-cluster 1.6.3 pypi_0 pypi
[conda] torch-geometric 2.5.0 pypi_0 pypi
[conda] torch-scatter 2.1.2 pypi_0 pypi
[conda] torch-sparse 0.6.17 pypi_0 pypi
[conda] torch-spline-conv 1.2.2+pt22cu121 pypi_0 pypi
[conda] torchaudio 2.2.1 pypi_0 pypi
[conda] torchvision 0.17.1 pypi_0 pypi
[conda] triton 2.2.0 pypi_0 pypi