mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
https://torchsparse.mit.edu
MIT License

[BUG] <title>Cuda out of memory #290

Closed JunyuanDeng closed 6 months ago

JunyuanDeng commented 9 months ago

Is there an existing issue for this?

Current Behavior

I wrote the following encoder:

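import torch.nn as nn
import torchsparse.nn as spnn

# ResidualBlock is the poster's own sparse residual module (definition not shown).
class encoder(nn.Module):
    def __init__(self):
        super().__init__()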
        self.conv1 = nn.Sequential(
            spnn.Conv3d(3, 32, 5, stride=2, padding=2, bias=False),
            spnn.BatchNorm(32),
            spnn.ReLU(),
            ResidualBlock(32, 32),
            ResidualBlock(32, 32)
        )
        self.conv2 = nn.Sequential(
            spnn.Conv3d(32, 48, 3, stride=2, padding=1, bias=False),
            spnn.BatchNorm(48),
            spnn.ReLU(),
            ResidualBlock(48, 48),
            ResidualBlock(48, 48))
        self.conv3 = nn.Sequential(
            spnn.Conv3d(48, 64, 3, stride=[1, 2, 2], padding=1, bias=False),
            spnn.BatchNorm(64),
            spnn.ReLU(),
            ResidualBlock(64, 64),
        )

For the input, I wrote a dense-to-sparse function:

import torch
import torchsparse

def densetosparse(mask, img, bounds):
    # mask : [B,T,W,H] occupancy mask
    # img  : [B,3,T,W,H] dense features
    # bounds is not used in this function

    # One row (b, t, w, h) per active voxel -> coord : [N,4]
    coord = torch.argwhere(mask).type(torch.int32)
    # Gather the 3 channels at every active voxel -> features : [N,3]
    features = img[coord[:, 0], :, coord[:, 1], coord[:, 2], coord[:, 3]]
    sp_inputs = torchsparse.SparseTensor(features, coord.contiguous(), spatial_range=(6, 16, 1024, 1024))
    return sp_inputs
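
The indexing used in densetosparse can be checked on a small CPU tensor without TorchSparse. The following is only a sketch: the helper name dense_to_coords_feats and the toy shapes are made up for illustration, and it indexes with the int64 result of argwhere before casting to int32, which is a small deviation from the code above.

```python
import torch

def dense_to_coords_feats(mask: torch.Tensor, img: torch.Tensor):
    """Hypothetical helper mirroring the gather step of densetosparse.

    mask: [B, T, W, H] boolean, img: [B, C, T, W, H].
    Returns coords [N, 4] (int32) and feats [N, C].
    """
    idx = torch.argwhere(mask)          # [N, 4], int64, rows are (b, t, w, h)
    b, t, w, h = idx.unbind(dim=1)
    # Advanced indices split by a slice: the indexed dimension comes first,
    # so the result is [N, C] rather than [C, N].
    feats = img[b, :, t, w, h]
    coords = idx.to(torch.int32).contiguous()
    return coords, feats

# Toy input: 2 samples, 3 channels, a 3x4x5 grid, two active voxels.
mask = torch.zeros(2, 3, 4, 5, dtype=torch.bool)
mask[0, 1, 2, 3] = True
mask[1, 0, 0, 4] = True
img = torch.arange(2 * 3 * 3 * 4 * 5, dtype=torch.float32).reshape(2, 3, 3, 4, 5)

coords, feats = dense_to_coords_feats(mask, img)
print(coords.shape, feats.shape)
```

The number of rows N equals the number of True entries in the mask, so it can change from one iteration to the next even when the dense shapes stay fixed.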

The first iteration ran fine, but the second iteration failed with an out-of-memory error:

  File "/mnt/local_disk/djy/Forward4D_query_gaussian_sicong_pure3d_0125/models/image_upsample/UpsampleImage.py", line 176, in forward
    x_conv1 = self.conv1(inputs)
  File "/home/shaper/miniconda3/envs/forward4d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/shaper/miniconda3/envs/forward4d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/shaper/miniconda3/envs/forward4d/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/shaper/miniconda3/envs/forward4d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/shaper/miniconda3/envs/forward4d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/shaper/miniconda3/envs/forward4d/lib/python3.10/site-packages/torchsparse-2.1.0-py3.10-linux-x86_64.egg/torchsparse/nn/modules/conv.py", line 98, in forward
    return F.conv3d(
  File "/home/shaper/miniconda3/envs/forward4d/lib/python3.10/site-packages/torchsparse-2.1.0-py3.10-linux-x86_64.egg/torchsparse/nn/functional/conv/conv.py", line 92, in conv3d
    kmap = F.build_kernel_map(
  File "/home/shaper/miniconda3/envs/forward4d/lib/python3.10/site-packages/torchsparse-2.1.0-py3.10-linux-x86_64.egg/torchsparse/nn/functional/conv/kmap/build_kmap.py", line 85, in build_kernel_map
    kmap = build_kmap_implicit_GEMM_hashmap_on_the_fly(
  File "/home/shaper/miniconda3/envs/forward4d/lib/python3.10/site-packages/torchsparse-2.1.0-py3.10-linux-x86_64.egg/torchsparse/nn/functional/conv/kmap/func/hashmap_on_the_fly.py", line 72, in build_kmap_implicit_GEMM_hashmap_on_the_fly
    out = func(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.79 GiB. GPU 0 has a total capacty of 47.54 GiB of which 6.17 GiB is free. Including non-PyTorch memory, this process has 41.36 GiB memory in use. Of the allocated memory 24.22 GiB is allocated by PyTorch, and 16.28 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
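
The figures in the error message can be cross-checked with a little arithmetic. The sketch below only restates numbers copied from the message; the variable names are made up, and the last check is a rough indicator rather than a diagnosis, since cached memory may be split into blocks that are each too small for the request.

```python
# Figures copied from the error message, all in GiB.
free = 6.17                   # free on the GPU
process_in_use = 41.36        # used by this process, incl. non-PyTorch memory
allocated = 24.22             # held by live PyTorch tensors
reserved_unallocated = 16.28  # cached by PyTorch but not backing any tensor
requested = 7.79              # size of the failed allocation

# Memory PyTorch holds in total (live tensors plus cache).
reserved_by_pytorch = round(allocated + reserved_unallocated, 2)
# What remains of the process footprint (CUDA context, other libraries).
non_pytorch = round(process_in_use - reserved_by_pytorch, 2)
# How far the request exceeds the memory that is actually free.
shortfall = round(requested - free, 2)
# The cache is larger than the request, which is consistent with the
# fragmentation hint in the message, but does not prove it.
cache_exceeds_request = reserved_unallocated >= requested

print(reserved_by_pytorch, non_pytorch, shortfall, cache_exceeds_request)
```

Since the cached-but-unallocated pool (16.28 GiB) is larger than the 7.79 GiB request, the allocator setting named in the message may be worth trying, for example by exporting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 before the process makes its first CUDA allocation. The value 128 is only an illustrative starting point.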

Expected Behavior

No out-of-memory error.

Environment

- GCC:
- NVCC:
- PyTorch:
- PyTorch CUDA:
- TorchSparse:

Anything else?

No response

ys-2020 commented 8 months ago

Thank you for your interest in TorchSparse. Can you provide more information about your workload? For example, what are the input resolution and batch size?
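
One possible way to collect these numbers is to summarize the coordinate tensor just before it is wrapped in a SparseTensor. This is a minimal sketch: describe_workload is a hypothetical helper, and it assumes coordinates are laid out as (batch, t, w, h) rows, as in the densetosparse function from the report.

```python
import torch

def describe_workload(coords: torch.Tensor) -> dict:
    """Summarize an [N, 4] integer coordinate tensor with rows (batch, t, w, h).

    Hypothetical helper; the column layout is an assumption.
    """
    if coords.numel() == 0:
        return {"num_voxels": 0, "batch_size": 0,
                "voxels_per_sample": [], "extent": []}
    batch_idx = coords[:, 0].long()
    batch_size = int(batch_idx.max().item()) + 1
    # Number of active voxels contributed by each sample in the batch.
    per_sample = torch.bincount(batch_idx, minlength=batch_size).tolist()
    # Largest occupied index + 1 along each spatial axis.
    extent = (coords[:, 1:].max(dim=0).values + 1).tolist()
    return {"num_voxels": coords.shape[0], "batch_size": batch_size,
            "voxels_per_sample": per_sample, "extent": extent}

coords = torch.tensor([[0, 1, 2, 3],
                       [0, 0, 0, 0],
                       [1, 5, 7, 9]], dtype=torch.int32)
stats = describe_workload(coords)
print(stats)
```

Logging this once per iteration would show whether the second iteration receives noticeably more active voxels than the first.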

ZzTodd22 commented 8 months ago

If you suspect the problem is in your dense-to-sparse function, consider trying mine. It has worked well in my program and does not consume additional memory. In your case, you can multiply x by the mask before calling the function. Alternatively, you could try clearing the CUDA cache after each training iteration.

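import torch
from torchsparse import SparseTensor

def from_dense(x: torch.Tensor):
    """Create a sparse tensor from a channel-last dense tensor via to_sparse.
    x must be a BTHWC tensor (channel last).
    """
    # Treat every dimension except the trailing channel dimension as sparse.
    sparse_data = x.to_sparse(x.ndim - 1)
    spatial_shape = sparse_data.shape[:-1]
    # indices() is [ndim-1, N]; transpose to one coordinate row per voxel.
    sparse_indices = sparse_data.indices().transpose(1, 0).contiguous().int()

    # values() is [N, C]: the channel vector of each non-zero voxel.
    sparse_feature = sparse_data.values()

    return SparseTensor(feats=sparse_feature.cuda(), coords=sparse_indices.cuda(), spatial_range=spatial_shape)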

torch.cuda.empty_cache()
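
The cleanup step could be wrapped in a small helper that also reports memory usage, which makes it easier to see whether anything is actually released between iterations. This is a sketch under assumptions: release_cached_memory is a hypothetical name, and the helper returns zeros when no GPU is present. Note that torch.cuda.empty_cache() only returns cached, unused blocks to the driver; tensors that are still referenced stay allocated.

```python
import gc
import torch

def release_cached_memory() -> dict:
    """Drop unreachable Python objects, then release PyTorch's unused CUDA cache.

    Hypothetical helper; returns allocated/reserved memory in GiB afterwards.
    """
    gc.collect()  # free tensors that are only kept alive by reference cycles
    if not torch.cuda.is_available():
        return {"allocated_gib": 0.0, "reserved_gib": 0.0}
    torch.cuda.empty_cache()
    gib = 1024 ** 3
    return {
        "allocated_gib": torch.cuda.memory_allocated() / gib,
        "reserved_gib": torch.cuda.memory_reserved() / gib,
    }

mem = release_cached_memory()
print(mem)
```

Calling it at the end of each training iteration, after the loss and intermediate outputs have gone out of scope, gives a per-iteration trace. If allocated_gib keeps growing, something is still holding references and clearing the cache alone is unlikely to help.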