rusty1s / pytorch_sparse

PyTorch Extension Library of Optimized Autograd Sparse Matrix Operations
MIT License

Bad behavior when using SparseTensor in nn.DataParallel #340

Closed yaohuicai closed 7 months ago

yaohuicai commented 1 year ago

When a torch_sparse.SparseTensor is used inside a module wrapped in nn.DataParallel, it is not automatically replicated to each of the target devices; it remains on the master GPU.

I tested on an RTX A6000 with PyTorch 2.0.1 and CUDA 11.7, which should be an up-to-date environment. Neither PyTorch nor CUDA throws an error. The SparseTensor is only involved in a torch_sparse.matmul operation during the forward pass. The program runs normally but produces a result about 1e10 times larger than expected on the non-master devices.

Even if it is expected behavior that a torch_sparse.SparseTensor wrapped in nn.DataParallel is not replicated, I think at least some kind of warning should be raised.
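To illustrate the kind of warning meant here, a minimal sketch of a device check follows. The helper name warn_on_device_mismatch is hypothetical and is not part of torch_sparse or PyTorch. It only compares the string forms of two devices (for example str(x.device)), so it runs without a GPU.

```python
import warnings


def warn_on_device_mismatch(weight_device, input_device):
    """Hypothetical helper: warn if the two devices differ, return True if they match."""
    same = str(weight_device) == str(input_device)
    if not same:
        warnings.warn(
            f"weight is on {weight_device} but input is on {input_device}; "
            "the weight was probably not replicated to this device",
            RuntimeWarning,
            stacklevel=2,
        )
    return same


# Matching devices: no warning is emitted.
print(warn_on_device_mismatch("cuda:0", "cuda:0"))
```

In a forward pass, such a check could be called with the weight's device and x.device just before the matmul.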

rusty1s commented 1 year ago

Do you have a small example to reproduce what you are referring to? That would be helpful.

yaohuicai commented 1 year ago

Sure, absolutely.

First, we import the dependencies and define the model.

import os
import torch
import torch.nn as nn
from torch_sparse import SparseTensor
from torch_sparse import matmul

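class SparseModel(nn.Module):
  def __init__(self, M, N):
    super().__init__()  # must run before parameters are assigned on an nn.Module
    # Random sparse M x N matrix with about 1% non-zero entries, all equal to 1.
    value = torch.ones(int(M * N * 0.01))
    row = torch.randint(low=0, high=M, size=(len(value),))
    col = torch.randint(low=0, high=N, size=(len(value),))
    self.W = nn.Parameter(SparseTensor(row=row, col=col, value=value, sparse_sizes=(M, N)).cuda(), requires_grad=True)

  def forward(self, x):
    # Report where the weight and the input live on each replica.
    print('self.W device =', self.W.device, 'input x device =', x.device)
    return matmul(self.W, x)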

Then for single GPU:

M, N, L = 10_000, 20_000, 30
sp_model = SparseModel(M, N)
x = torch.rand(N, L).cuda()
result = sp_model(x)

The output is

self.W device = cuda:0 input x device = cuda:0
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0',

For nn.DataParallel:

sp_model = nn.DataParallel(sp_model)
result = sp_model(x)

The output is

self.W device = cuda:0 input x device = cuda:0
self.W device = cuda:0 input x device = cuda:1
tensor([[0.2518, 0.4229, 0.9151,  ..., 0.6010, 0.4724, 0.5314],
        [0.7368, 0.5397, 0.3843,  ..., 0.8365, 0.6803, 0.0126],
        [0.9484, 0.2409, 0.9839,  ..., 0.7077, 0.0872, 0.6429],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]],
       device='cuda:0', grad_fn=<GatherBackward>)

This is a minimal example. The results are incorrect in both cases, probably because the first dimension of the input is not the batch dimension. In practice we use a small trick to do batched spmm and get the expected results on a single GPU, so that part is not a problem for us.
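The shape arithmetic behind the batch-dimension remark can be sketched in plain Python. This assumes nn.DataParallel splits the input along dim 0 into near-equal chunks, the way torch.chunk does. The helper chunk_sizes is hypothetical and only mimics that rule.

```python
import math


def chunk_sizes(length, num_chunks):
    """Sizes obtained by splitting `length` rows into chunks of ceil(length / num_chunks)."""
    step = math.ceil(length / num_chunks)
    sizes = []
    remaining = length
    while remaining > 0:
        take = min(step, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


M, N, L = 10_000, 20_000, 30
num_gpus = 2  # assumption: two visible GPUs, as in the output above

for rank, rows in enumerate(chunk_sizes(N, num_gpus)):
    # W has shape (M, N), so matmul(W, x) needs x to have exactly N rows.
    print(f"replica {rank}: x chunk shape = ({rows}, {L}), rows match N = {rows == N}")
```

Under that assumption each replica receives only half of the N rows of x, so the inner dimensions of W and x no longer agree.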

The device placement, however, does cause a problem for us. The example above shows that the device of the parameter self.W is wrong on the second replica: it should be identical to x.device. self.W should be replicated to each of the target devices, but it remains on the master GPU, cuda:0.
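One possible workaround, sketched under assumptions: nn.DataParallel replicates registered parameters and buffers, so the sparse weight can be stored as ordinary dense tensors (indices and values) and rebuilt inside forward. The class name BufferedSparseModel and the density argument are hypothetical. To stay self-contained, the sketch uses torch.sparse_coo_tensor and torch.sparse.mm instead of torch_sparse, and it has not been verified on the multi-GPU setup described above.

```python
import torch
import torch.nn as nn


class BufferedSparseModel(nn.Module):
    """Hypothetical variant that keeps the sparse weight as dense index/value tensors."""

    def __init__(self, M, N, density=0.01):
        super().__init__()
        nnz = int(M * N * density)
        row = torch.randint(low=0, high=M, size=(nnz,))
        col = torch.randint(low=0, high=N, size=(nnz,))
        self.shape = (M, N)
        # Indices are fixed, so they are registered as a buffer.
        self.register_buffer("index", torch.stack([row, col]))
        # Values are kept as an ordinary dense parameter.
        self.value = nn.Parameter(torch.ones(nnz))

    def forward(self, x):
        # Rebuild the sparse matrix from tensors held by this module instance.
        W = torch.sparse_coo_tensor(self.index, self.value, self.shape)
        # Merge duplicate (row, col) pairs that randint may have produced.
        W = W.coalesce()
        return torch.sparse.mm(W, x)


# Small CPU example; on GPUs the model would be moved with .cuda() first.
model = BufferedSparseModel(4, 5, density=0.5)
x = torch.rand(5, 3)
print(model(x).shape)
```

The same idea should carry over to torch_sparse by building the SparseTensor from the row, col and value tensors inside forward, but that variant is untested here.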

github-actions[bot] commented 7 months ago

This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?