Do you have a small example to reproduce what you are referring to? That would be helpful.
Sure, absolutely.
First, we import the dependencies and prepare the model definition.
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"
import torch
import torch.nn as nn
from torch_sparse import SparseTensor
from torch_sparse import matmul
print(torch.cuda.device_count())
class SparseModel(nn.Module):
    def __init__(self, M, N):
        super().__init__()
        value = torch.ones(int(M * N * 0.01))
        row = torch.randint(low=0, high=M, size=(len(value),))
        col = torch.randint(low=0, high=N, size=(len(value),))
        self.W = nn.Parameter(SparseTensor(row=row, col=col, value=value, sparse_sizes=(M, N)).cuda(), requires_grad=True)

    def forward(self, x):
        print('self.W device =', self.W.storage._value.device, 'input x device =', x.device)
        return matmul(self.W, x)
Then, for a single GPU:
M, N, L = 10_000, 20_000, 30
sp_model = SparseModel(M, N)
x = torch.rand(N, L).cuda()
result = sp_model(x)
print(result)
The output is
self.W device = cuda:0 input x device = cuda:0
tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0',
grad_fn=<CppNode<SPMMSum>>)
For nn.DataParallel:
sp_model = nn.DataParallel(sp_model)
result = sp_model(x)
print(result)
The output is
self.W device = cuda:0 input x device = cuda:0
self.W device = cuda:0 input x device = cuda:1
tensor([[0.2518, 0.4229, 0.9151, ..., 0.6010, 0.4724, 0.5314],
[0.7368, 0.5397, 0.3843, ..., 0.8365, 0.6803, 0.0126],
[0.9484, 0.2409, 0.9839, ..., 0.7077, 0.0872, 0.6429],
...,
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]],
device='cuda:0', grad_fn=<GatherBackward>)
This is a minimal example. The results are incorrect in both cases, probably because the first dimension of the input is not a batch dimension. In practice we use a small trick to do batched spmm and get the expected results on a single GPU, so that part is not a problem for us.
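For illustration, one way to emulate a batched spmm with torch_sparse.matmul is to fold the batch into the column dimension, multiply once, and unfold (this is just a sketch and not necessarily the exact trick we use; batched_spmm is an illustrative name):

import torch
from torch_sparse import SparseTensor, matmul

def batched_spmm(W: SparseTensor, x: torch.Tensor) -> torch.Tensor:
    # W: sparse (M, N); x: dense (B, N, L) -> returns dense (B, M, L)
    B, N, L = x.shape
    x2d = x.permute(1, 0, 2).reshape(N, B * L)   # fold batch into columns: (N, B*L)
    out2d = matmul(W, x2d)                        # one spmm: (M, B*L)
    M = out2d.size(0)
    return out2d.reshape(M, B, L).permute(1, 0, 2)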
But the device placement does cause a problem for us. The example above shows that the device id of the parameter self.W is incorrect: it should be identical to x.device. The parameter should be replicated to each of the target devices, but it remains on the master GPU cuda:0.
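One possible workaround (my own assumption, not an official fix) is to store the COO components as ordinary parameters/buffers, which nn.DataParallel does replicate, and rebuild the SparseTensor inside forward so that it lands on the same device as the replicated tensors. A minimal sketch (ReplicableSparseModel is just an illustrative name):

import torch
import torch.nn as nn
from torch_sparse import SparseTensor, matmul

class ReplicableSparseModel(nn.Module):
    def __init__(self, M, N):
        super().__init__()
        self.M, self.N = M, N
        nnz = int(M * N * 0.01)
        # Plain tensors and parameters are replicated by nn.DataParallel.
        self.register_buffer('row', torch.randint(low=0, high=M, size=(nnz,)))
        self.register_buffer('col', torch.randint(low=0, high=N, size=(nnz,)))
        self.value = nn.Parameter(torch.ones(nnz))

    def forward(self, x):
        # row/col/value already live on x.device after replication,
        # so the rebuilt SparseTensor matches the input device.
        W = SparseTensor(row=self.row, col=self.col, value=self.value,
                         sparse_sizes=(self.M, self.N))
        return matmul(W, x)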
This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?
A torch_sparse.SparseTensor wrapped in nn.DataParallel will not be automatically replicated to each of the target devices -- it remains on the master GPU. I tested on an RTX A6000 with PyTorch 2.0.1 and CUDA 11.7, which should be an up-to-date environment, but neither PyTorch nor CUDA throws an error. The SparseTensor is only involved in a torch_sparse.matmul operation during the forward pass. The program runs normally and produces results about 1e10x larger than expected on the non-master devices. I think even if torch_sparse.SparseTensor not being replicated under nn.DataParallel is expected behavior, at least some kind of warning should be raised.
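For example, a minimal sketch of the kind of check I have in mind (check_spmm_devices is just an illustrative name, and it reuses the private storage._value attribute from the example above):

import warnings

def check_spmm_devices(W, x):
    # W: torch_sparse.SparseTensor, x: dense torch.Tensor
    w_device = W.storage._value.device
    if w_device != x.device:
        warnings.warn(
            f'SparseTensor is on {w_device} but the dense input is on {x.device}; '
            f'nn.DataParallel does not replicate torch_sparse.SparseTensor parameters.')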