Results of `spmm` over `bool` tensors on a GPU are different from CPU

migalkin commented 2 years ago

There seems to be something weird happening when executing spmm over Boolean tensors on a GPU. We have a use case where we could save a lot of memory and compute by keeping both values and matrix as bool tensors, but executing the same operation over those tensors on a CPU and on a GPU leads to surprisingly different results.

I could reproduce it even with the adjacency matrix from the README:

>>> import torch
>>> from torch_sparse import spmm
>>> index = torch.tensor([[0,0,1,2,2],[0,2,1,0,1]])
>>> value = torch.tensor([True, True, True, True, True],dtype=torch.bool)
>>> matrix = torch.tensor([[True, False],[True, True], [False, True]],dtype=torch.bool)
>>> spmm(index, value, 3, 3, matrix)
tensor([[True, True],
        [True, True],
        [True, True]])
>>> spmm(index.cuda(), value.cuda(), 3, 3, matrix.cuda())
tensor([[ True, False],
        [ True,  True],
        [ True, False]], device='cuda:0')

spmm results are obviously different 🤔
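For reference, the expected result can be checked by hand. The following is a minimal, dependency-free sketch of a Boolean sparse-dense product over the same COO data, assuming the intended semantics are a logical OR over ANDs (`out[r][j] = OR_k A[r][k] AND matrix[k][j]`). The helper name `bool_spmm_reference` is hypothetical and is not part of torch_sparse.

```python
def bool_spmm_reference(index, value, m, n, matrix):
    """Boolean sparse-dense product over COO data (hypothetical helper).

    index  -- pair of lists (row indices, column indices) of the sparse matrix
    value  -- list of bools, one per stored entry
    m, n   -- shape of the sparse matrix
    matrix -- dense n x k matrix as a list of lists of bools
    """
    rows, cols = index
    assert len(rows) == len(cols) == len(value)
    assert len(matrix) == n
    width = len(matrix[0]) if matrix else 0
    out = [[False] * width for _ in range(m)]
    for r, c, v in zip(rows, cols, value):
        if not v:
            continue  # a stored False entry contributes nothing
        for j in range(width):
            # OR-accumulate row c of the dense matrix into output row r
            out[r][j] = out[r][j] or matrix[c][j]
    return out


# Same adjacency data as in the report above
index = [[0, 0, 1, 2, 2], [0, 2, 1, 0, 1]]
value = [True] * 5
matrix = [[True, False], [True, True], [False, True]]

expected = bool_spmm_reference(index, value, 3, 3, matrix)
print(expected)  # [[True, True], [True, True], [True, True]]
```

Under these semantics the all-True CPU output is the expected one, so the GPU output is the one that deviates.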

The only workaround I found so far is to convert values and matrix to float or int and then convert the resulting tensor to bool through a comparison, but that effectively removes all the benefits of doing spmm over binary tensors 🤔

>>> spmm(index.cuda(), value.float().cuda(), 3, 3, matrix.float().cuda()) > 0.0
tensor([[True, True],
        [True, True],
        [True, True]], device='cuda:0')
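The cast-and-compare workaround can be wrapped in a small helper. This is only a sketch: the name `bool_spmm_via_float` is hypothetical, and it uses PyTorch's own `torch.sparse_coo_tensor` and `torch.sparse.mm` in place of `torch_sparse.spmm` so that it runs with plain PyTorch. It has the same drawback as described above, since the float32 copies take four bytes per element instead of one.

```python
import torch


def bool_spmm_via_float(index, value, m, n, matrix):
    """Boolean spmm emulated through float32 (hypothetical helper).

    Casts the sparse values and the dense matrix to float32, multiplies,
    and thresholds the result back to bool.
    """
    # Build an m x n sparse COO matrix from the casted values
    a = torch.sparse_coo_tensor(index, value.to(torch.float32), (m, n))
    # Sparse @ dense gives a dense tensor of path counts
    out = torch.sparse.mm(a, matrix.to(torch.float32))
    # Any positive count means at least one True AND True pair
    return out > 0


index = torch.tensor([[0, 0, 1, 2, 2], [0, 2, 1, 0, 1]])
value = torch.ones(5, dtype=torch.bool)
matrix = torch.tensor([[True, False], [True, True], [False, True]])

result = bool_spmm_via_float(index, value, 3, 3, matrix)
print(result)
```

Moving `index`, `value` and `matrix` to the same device before the call should keep the computation on that device.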

My environment is as follows, with CUDA 10.2:

torch               1.11.0
torch-cluster       1.6.0
torch-geometric     2.0.4
torch-scatter       2.0.9
torch-sparse        0.6.13
torch-spline-conv   1.2.1

rusty1s commented 2 years ago

Sorry for the late reply. It looks like this line and this line cannot properly handle bool. I'm not sure what I can do about this, since both are delegated to PyTorch.

Importantly, I recommend using SparseTensor for these operations:

from torch_sparse import SparseTensor

adj = SparseTensor.from_edge_index(index, value)
out = adj @ matrix

Sadly, it is not implemented for bool either:

RuntimeError: "_" not implemented for 'Bool'

due to the way the PyTorch dispatcher works, see here. We would need to add explicit support for this :(

migalkin commented 2 years ago

Wow, turns out the source of the problem is deeper than I expected and leads to the core of PyTorch! 👀

Okay, then maybe let's just add a note to the docstring of the spmm function about the expected data types of value and matrix? E.g.,

value (:class:`Tensor`): The value tensor of sparse matrix. Expected dtypes are float/double/integer. Does not work for bool, half-precision and complex number dtypes

rusty1s commented 2 years ago

Sounds fair. Added the doc-string, see here.

migalkin commented 2 years ago

Thank you! I'll close the issue for now, but feel free to bring it up again if the PyTorch internals change.

rusty1s / pytorch_sparse #243