pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

/lib/python3.10/site-packages/torch_geometric/utils/sparse.py:268: Sparse CSR tensor support is in beta state #8967

Open RX28666 opened 8 months ago

RX28666 commented 8 months ago

🐛 Describe the bug

Hello,

I updated my PyTorch to 2.2.1+cu121 using pip and also updated PyG via pip install torch_geometric. Afterwards, a warning appeared when I imported the dataset:

/lib/python3.10/site-packages/torch_geometric/utils/sparse.py:268 : Sparse CSR tensor support is in beta state, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at ../aten/src/ATen/SparseCsrTensorImpl.cpp:53.) adj = torch.sparse_csr_tensor(

This is the code related to this:

import os.path as osp

import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset

def get_products():
    # resolve ../data/products relative to this file
    root = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'products')
    dataset = PygNodePropPredDataset('ogbn-products', root)
    data = dataset[0]
    data = T.ToSparseTensor()(data)  # converts edge_index into adj_t
    data.y = data.y.view(-1)
    split_idx = dataset.get_idx_split()
    # index2mask converts split indices into boolean masks (helper defined later in the thread)
    data.train_mask = index2mask(split_idx['train'], data.num_nodes)
    data.val_mask = index2mask(split_idx['valid'], data.num_nodes)
    data.test_mask = index2mask(split_idx['test'], data.num_nodes)
    return data, dataset.num_features, dataset.num_classes

data, in_channels, out_channels = get_products()

This never happened before the update. Now data.adj_t becomes a plain tensor instead of a SparseTensor, and calls like data.adj_t = data.adj_t.set_diag() raise errors. I am wondering how I can fix it.
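If it helps to narrow this down, this is a quick check of whether PyG can actually see torch_sparse (a sketch; WITH_TORCH_SPARSE is assumed to be the internal flag, and when it is False, T.ToSparseTensor falls back to a plain torch.Tensor in sparse CSR layout, which would match the warning above):

import torch_geometric.typing as pyg_typing

# False here would explain adj_t being a plain CSR tensor without .set_diag()
print(pyg_typing.WITH_TORCH_SPARSE)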

Besides, after updating PyG, importing torch_cluster raised lib/python3.10/site-packages/torch_cluster/_version_cuda.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev. I fixed it by updating torch_cluster; I am not sure whether this is related.

Any help would be appreciated.

Versions

PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.5
Libc version: glibc-2.35

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA RTX A6000

Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.25.2
[pip3] pytorch-warmup==0.1.1
[pip3] torch==2.2.1
[pip3] torch-cluster==1.6.3
[pip3] torch_geometric==2.5.0
[pip3] torch-scatter==2.1.2
[pip3] torch-sparse==0.6.17
[pip3] torch_spline_conv==1.2.2+pt22cu121
[pip3] torchaudio==2.2.1
[pip3] torchvision==0.17.1
[pip3] triton==2.2.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.8.0 h6a678d5_0
[conda] mkl 2023.1.0 h213fc3f_46343
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.6 py310h1128e8f_1
[conda] mkl_random 1.2.2 py310h1128e8f_1
[conda] numpy 1.26.0 pypi_0 pypi
[conda] numpy-base 1.25.2 py310hb5e798b_0
[conda] pytorch-mutex 1.0 cpu pytorch
[conda] pytorch-warmup 0.1.1 pypi_0 pypi
[conda] torch 2.0.1 pypi_0 pypi
[conda] torch-cluster 1.6.3 pypi_0 pypi
[conda] torch-geometric 2.5.0 pypi_0 pypi
[conda] torch-scatter 2.1.2 pypi_0 pypi
[conda] torch-sparse 0.6.17 pypi_0 pypi
[conda] torch-spline-conv 1.2.2+pt22cu121 pypi_0 pypi
[conda] torchaudio 2.2.1 pypi_0 pypi
[conda] torchvision 0.17.1 pypi_0 pypi
[conda] triton 2.2.0 pypi_0 pypi

rusty1s commented 8 months ago

Does import torch_sparse work for you?

RX28666 commented 8 months ago

Thanks for your reply. Unfortunately, it raises lib/python3.10/site-packages/torch_sparse/_version_cuda.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev, the same error as torch_scatter, even though I updated it.
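A minimal check to separate an ABI mismatch from a missing package (a sketch; the assumption is that the installed wheels were built against a different torch):

import torch
print(torch.__version__, torch.version.cuda)  # expected: 2.2.1 and 12.1 here

# an "undefined symbol" raised by this import usually means the extension
# was compiled against a different torch version than the one installed
import torch_sparse
print(torch_sparse.__version__)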

Kh4L commented 8 months ago

@RX28666 What's the error raised by

data.adj_t = data.adj_t.set_diag()

rusty1s commented 8 months ago

It looks like you have multiple torch-sparse versions installed (one from conda and one from pip), which might explain this issue.
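One way to see which copy Python actually resolves, without loading the broken extension (a sketch):

import importlib.util

spec = importlib.util.find_spec('torch_sparse')
print(spec.origin if spec else None)  # path reveals whether the pip or the conda copy wins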

RX28666 commented 8 months ago

@RX28666 What's the error raised by

data.adj_t = data.adj_t.set_diag()

Hello,

Thanks for your reply. It returns 'Tensor' object has no attribute 'set_diag'; this did not happen previously, when data.adj_t was a SparseTensor. It can be worked around with data.adj_t = gcn_norm(data.adj_t, add_self_loops=True), but I am concerned about side effects, e.g. the complexity, if the adjacency matrix is not a SparseTensor.
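A SparseTensor can also be rebuilt from the CSR tensor to get set_diag back (a sketch, assuming data.adj_t is a torch.Tensor in sparse CSR layout and torch_sparse imports correctly):

from torch_sparse import SparseTensor

csr = data.adj_t  # torch.Tensor with layout=torch.sparse_csr (assumption)
adj = SparseTensor(rowptr=csr.crow_indices(), col=csr.col_indices(),
                   value=csr.values(), sparse_sizes=tuple(csr.size()[:2]))
data.adj_t = adj.set_diag()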

RX28666 commented 8 months ago

It looks like you have multiple torch-sparse versions installed (one from conda and one from pip), which might explain this issue.

Hello,

Thanks for your reply. I ran conda list to show all packages in my current environment; it returns:

tokenizers                0.13.3                   pypi_0    pypi
torch                     2.2.1                    pypi_0    pypi
torch-cluster             1.6.3                    pypi_0    pypi
torch-geometric           2.5.0                    pypi_0    pypi
torch-scatter             2.1.2                    pypi_0    pypi
torch-sparse              0.6.18                   pypi_0    pypi
torch-spline-conv         1.2.2+pt22cu121          pypi_0    pypi
torchaudio                2.2.1                    pypi_0    pypi
torchvision               0.17.1                   pypi_0    pypi
tornado                   6.1             py310h5764c6d_3    conda-forge
tqdm                      4.66.1                   pypi_0    pypi
traitlets                 5.14.0             pyhd8ed1ab_0    conda-forge
transformers              4.33.2                   pypi_0    pypi
triton                    2.2.0                    pypi_0    pypi
typing-extensions         4.9.0                    pypi_0    pypi
typing-inspect            0.9.0                    pypi_0    pypi
tzdata                    2023.3                   pypi_0    pypi
urllib3                   2.0.4                    pypi_0    pypi
wandb                     0.16.0                   pypi_0    pypi
wcwidth                   0.2.12             pyhd8ed1ab_0    conda-forge
wheel                     0.38.4          py310h06a4308_0  
xxhash                    3.4.1                    pypi_0    pypi
xz                        5.4.2                h5eee18b_0  
yarl                      1.9.2                    pypi_0    pypi
zeromq                    4.3.4                h2531618_0  
zlib                      1.2.13               h5eee18b_0  

After I ran pip uninstall torch-sparse, torch-sparse no longer shows up in the package list, and importing it returns No module named 'torch_sparse'. I am wondering whether I understood you correctly, and I would appreciate any further suggestions.

RX28666 commented 8 months ago

Hello,

I repeatedly ran pip uninstall torch-sparse and pip uninstall torch-scatter until pip returned WARNING: Ignoring invalid distribution -orch (/anaconda3/envs/LLMGNN/lib/python3.10/site-packages) and WARNING: Skipping torch-sparse as it is not installed. Then, after pip install torch-sparse, the error persists: /anaconda3/envs/LLMGNN/lib/python3.10/site-packages/torch_sparse/_version_cuda.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev. FYI, my PyTorch version is 2.2.1+cu121. Could you please help with this?

rusty1s commented 8 months ago

Can you show me the installation log when running

pip install --verbose torch-sparse

Do you use the -f option to install from wheels?

RX28666 commented 8 months ago

Can you show me the installation log when running

pip install --verbose torch-sparse

Do you use the -f option to install from wheels?

Hello,

This is the log:

Using pip 23.2.1 from /anaconda3/envs/LLMGNN/lib/python3.10/site-packages/pip (python 3.10)
WARNING: Ignoring invalid distribution -orch (/anaconda3/envs/LLMGNN/lib/python3.10/site-packages)
Collecting torch-sparse
  Using cached torch_sparse-0.6.18-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: scipy in ./anaconda3/envs/LLMGNN/lib/python3.10/site-packages (from torch-sparse) (1.11.1)
Requirement already satisfied: numpy<1.28.0,>=1.21.6 in ./anaconda3/envs/LLMGNN/lib/python3.10/site-packages (from scipy->torch-sparse) (1.25.2)
WARNING: Ignoring invalid distribution -orch (/anaconda3/envs/LLMGNN/lib/python3.10/site-packages)
Installing collected packages: torch-sparse
Successfully installed torch-sparse-0.6.18

I double-checked: after repeatedly running pip uninstall torch-sparse, no torch-sparse package remains in pip list.

I tried both installation commands, pip install torch-sparse and pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.2.1+cu121.html,

but they return the same error.

rusty1s commented 8 months ago

Can you redo with

pip install --verbose --no-cache torch-sparse

RX28666 commented 8 months ago

Thanks for your reply! It works. However, without modifying any part of my code, I now hit another error:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
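CUDA_LAUNCH_BLOCKING only takes effect if it is set before the CUDA context is created, so the assignment has to run before the first CUDA operation; a minimal sketch of the ordering:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"  # read when the CUDA context is initialized

import torch
x = torch.ones(1, device='cuda')  # context is created here, now with blocking launches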

Then I used:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

It returns:

RuntimeError                              Traceback (most recent call last)
Input In [18], in <cell line: 3>()
      7 model.reset_parameters()
      8 for epoch in range(args.epochs):
----> 9     loss = train(data)

Input In [16], in train(data)
     19 y = batch1.y[:batch1.batch_size][train].to(device)
     20
---> 21 out = model(x1, adj_t1, id1, batch1.batch_size, args.K_train, args.alpha)[:batch1.batch_size][train]
     22 loss = F.nll_loss(out, y)

File /lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File /lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

Input In [14], in Net.forward(self, x, adj, id, size, K, alpha)
     37 z = x.clone()
     38 for i in range(K-1):
---> 39     z = (1 - alpha) * (adj @ z) + alpha * x

File /lib/python3.10/site-packages/torch_sparse/matmul.py:171, in <lambda>(self, other)
    167 SparseTensor.spspmm = lambda self, other, reduce="sum": spspmm(
    168     self, other, reduce)
    169 SparseTensor.matmul = lambda self, other, reduce="sum": matmul(
    170     self, other, reduce)
--> 171 SparseTensor.__matmul__ = lambda self, other: matmul(self, other, 'sum')

File /lib/python3.10/site-packages/torch_sparse/matmul.py:160, in matmul(src, other, reduce)
    142 """Matrix product of a sparse tensor with either another sparse tensor or a
    143 dense tensor. The sparse tensor represents an adjacency matrix and is
    144 stored as a list of edges. This method multiplies elements along the rows
    (...)
    157 :rtype: (:class:`Tensor`)
    158 """
    159 if isinstance(other, torch.Tensor):
--> 160     return spmm(src, other, reduce)
    161 elif isinstance(other, SparseTensor):
    162     return spspmm(src, other, reduce)

File /lib/python3.10/site-packages/torch_sparse/matmul.py:83, in spmm(src, other, reduce)
     79 def spmm(src: SparseTensor,
     80          other: torch.Tensor,
     81          reduce: str = "sum") -> torch.Tensor:
     82     if reduce == 'sum' or reduce == 'add':
---> 83         return spmm_sum(src, other)
     84     elif reduce == 'mean':
     85         return spmm_mean(src, other)

File /lib/python3.10/site-packages/torch_sparse/matmul.py:24, in spmm_sum(src, other)
     22 if other.requires_grad:
     23     row = src.storage.row()
---> 24 csr2csc = src.storage.csr2csc()
     25 colptr = src.storage.colptr()
     27 return torch.ops.torch_sparse.spmm_sum(row, rowptr, col, value, colptr,
     28                                        csr2csc, other)

File /lib/python3.10/site-packages/torch_sparse/storage.py:412, in SparseStorage.csr2csc(self)
    409 if csr2csc is not None:
    410     return csr2csc
--> 412 idx = self._sparse_sizes[0] * self._col + self.row()
    413 max_value = self._sparse_sizes[0] * self._sparse_sizes[1]
    414 _, csr2csc = index_sort(idx, max_value)

I also tried running my other code scripts; the same torch-sparse error occurred, e.g.:

     24 csr2csc = src.storage.csr2csc()
     25 colptr = src.storage.colptr()
---> 27 return torch.ops.torch_sparse.spmm_sum(row, rowptr, col, value, colptr,
     28                                        csr2csc, other)
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I am wondering whether this is a version issue, since it never happened before the update. Thanks.

RX28666 commented 8 months ago

Hello Matthias,

The following script can reproduce the bug:

import argparse
import os.path as osp
from typing import Tuple
import numpy as np
import time
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from torch.nn import Linear
import scipy.sparse as sp
import torch_geometric
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid
from torch_geometric.logging import init_wandb, log
from torch_geometric.utils import to_undirected
from torch_geometric.loader import DataLoader
from torch_geometric.loader import NeighborLoader
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator
from torch_geometric.nn import GCNConv
from torch_geometric.nn.conv.gcn_conv import gcn_norm

def index2mask(idx: Tensor, size: int) -> Tensor:
    mask = torch.zeros(size, dtype=torch.bool, device=idx.device)
    mask[idx] = True
    return mask

def gen_masks(y: Tensor, train_per_class: int = 20, val_per_class: int = 30,
              num_splits: int = 20) -> Tuple[Tensor, Tensor, Tensor]:
    num_classes = int(y.max()) + 1

    train_mask = torch.zeros(y.size(0), num_splits, dtype=torch.bool)
    val_mask = torch.zeros(y.size(0), num_splits, dtype=torch.bool)

    for c in range(num_classes):
        idx = (y == c).nonzero(as_tuple=False).view(-1)
        perm = torch.stack(
            [torch.randperm(idx.size(0)) for _ in range(num_splits)], dim=1)
        idx = idx[perm]

        train_idx = idx[:train_per_class]
        train_mask.scatter_(0, train_idx, True)
        val_idx = idx[train_per_class:train_per_class + val_per_class]
        val_mask.scatter_(0, val_idx, True)

    test_mask = ~(train_mask | val_mask)

    return train_mask, val_mask, test_mask

def get_arxiv():
    root='/tmp/datasets'
    dataset = PygNodePropPredDataset('ogbn-arxiv', f'{root}/OGB',
                                     pre_transform=T.ToSparseTensor())
    data = dataset[0]
    data.adj_t = data.adj_t.to_symmetric()
    data.node_year = None
    data.y = data.y.view(-1)
    split_idx = dataset.get_idx_split()
    data.train_mask = index2mask(split_idx['train'], data.num_nodes)
    data.val_mask = index2mask(split_idx['valid'], data.num_nodes)
    data.test_mask = index2mask(split_idx['test'], data.num_nodes)
    return data, dataset.num_features, dataset.num_classes

data, in_channels, out_channels = get_arxiv()
dataset = PygNodePropPredDataset(name='ogbn-arxiv')
device = torch.device("cuda:2" if torch.cuda.is_available() else "cpu")

data.adj_t = data.adj_t.set_diag()
data.adj_t = gcn_norm(data.adj_t, add_self_loops=False)
data.n_id = torch.arange(data.num_nodes)

parser = argparse.ArgumentParser()
parser.add_argument('--runs', type=int, default=1)
parser.add_argument('--epochs', type=int, default=2000)
parser.add_argument('--lr', type=float, default=0.01)
parser.add_argument('--weight_decay', type=float, default=0)
parser.add_argument('--early_stopping', type=int, default=0)
parser.add_argument('--hidden', type=int, default=256)
parser.add_argument('--num_layers', type=int, default=3)
parser.add_argument('--dropout', type=float, default=0.5)
parser.add_argument('--normalize_features', action='store_true')
args = parser.parse_args(args=[])

class Net(nn.Module):
    def __init__(self, num_features, hidden_channels, num_classes, num_layers, num_nodes, **kwargs):
        super(Net, self).__init__()

        self.convs = torch.nn.ModuleList()
        self.convs.append(GCNConv(num_features, hidden_channels, normalize=False))
        for _ in range(num_layers - 2):
            self.convs.append(GCNConv(hidden_channels, hidden_channels, normalize=False))
        self.convs.append(GCNConv(hidden_channels, num_classes, normalize=False))
        self.num_classes = num_classes
        self.num_nodes = num_nodes
        self.hidden_channels = hidden_channels

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()

    def forward(self):
        data_z = self.convs[0](data.x.to(device), data.adj_t.to(device))
        return data_z

model = Net(data.x.shape[1], args.hidden, dataset.num_classes, args.num_layers, data.num_nodes)
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)

def train(data):
    model.train()

    neigh_out = model()

    return None

acc = []
best = 0
for j in range(args.runs):
    tr = []
    val_accs = []
    test_accs = []
    for epoch in range(args.epochs):
        loss = train(data)

It returns:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I am wondering if I missed anything here. Thanks.

rusty1s commented 8 months ago

This runs fine for me, but I had to set cuda:0 since I only have a single GPU. What happens if you do the same?

RX28666 commented 8 months ago

This runs fine for me, but I had to set cuda:0 since I only have a single GPU. What happens if you do the same?

I found my code runs fine only on cuda:0; it always reports the same error on whatever other GPU I use. Thanks for pointing that out. It also seems the crash only happens after training for several epochs; the first few epochs run without errors.
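For context, the script only moves tensors with .to(device) and never pins the current device; a sketch of pinning it before any kernels run (an assumption that this is related, not a confirmed fix):

import torch

device = torch.device('cuda:2' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    torch.cuda.set_device(device)  # later kernels and allocations default to cuda:2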

Do you have any idea what causes this? It only started happening after I updated the packages. Thanks.

rusty1s commented 8 months ago

I am not entirely sure. Can you show me/upload the input of

torch.ops.torch_sparse.spmm_sum(row, rowptr, col, value, colptr, csr2csc, other)

for which this crashes?
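E.g. by saving them right before the failing call in torch_sparse/matmul.py (a sketch; the variable names follow the traceback above, so this snippet only runs when pasted at that exact spot):

# sketch: paste directly above the torch.ops.torch_sparse.spmm_sum(...) call
import torch
torch.save(dict(row=row, rowptr=rowptr, col=col, value=value,
                colptr=colptr, csr2csc=csr2csc, other=other),
           '/tmp/spmm_sum_inputs.pt')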