traveller59 / spconv

Spatial Sparse Convolution Library
Apache License 2.0

Floating point exception (core dumped) #725

Open Kitsunetic opened 2 weeks ago

Kitsunetic commented 2 weeks ago

I always get a floating point exception when using SubMConv3d.

Here is my test code:

import torch as th
from spconv.pytorch import SubMConv3d, SparseConvTensor

# 1000 random voxel coordinates in (batch, z, y, x) layout; batch index fixed to 0
xyz = th.randint(0, 32, (1000, 4), dtype=th.int64, device='cuda')
xyz[:, 0] = 0
feat = th.randn(1000, 32, device='cuda', dtype=th.float32)
# features, indices, spatial_shape=(32, 32, 32), batch_size=1
sp = SparseConvTensor(feat, xyz, (32, 32, 32), 1, 1, 1)

conv = SubMConv3d(32, 64, 3).cuda()  # in_channels=32, out_channels=64, kernel_size=3
conv(sp)

>>> Floating point exception (core dumped)

I'm using PyTorch 2.3.0 with CUDA 11.8, and spconv-cu118==2.3.6. Is there something wrong in my code, or does anyone have a clue?

I have tested with A5000 and RTX 2080Ti GPUs, but the result was always the same.
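Since the process dies with a signal rather than a Python exception, enabling faulthandler before running the snippet at least shows which Python call is active when the SIGFPE arrives (a debugging aid only, not a fix):

import faulthandler
faulthandler.enable()  # dumps the Python traceback on fatal signals (SIGFPE, SIGSEGV, ...) instead of exiting silently

# ... then run the reproduction snippet above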

shim94kr commented 1 week ago

I'm experiencing the exact same issue.

I've found that it works fine with kernel_size=1, but consistently crashes with kernel_size=3 or any other size.
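A minimal way to see this, reusing sp and the imports from the snippet above:

# only kernel_size=1 completes here; any larger kernel hits the SIGFPE
for k in [1, 2, 3]:
    conv = SubMConv3d(32, 64, k).cuda()
    out = conv(sp)
    print(k, out.features.shape)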

@Kitsunetic Have you fixed this issue?

Kitsunetic commented 1 week ago

No, I'm still figuring out the solution.

shim94kr commented 1 week ago

I found that downgrading PyTorch to version 2.2.2 resolves the issue.

Kitsunetic commented 1 week ago

Which CUDA version did you use?

shim94kr commented 1 week ago

I use CUDA 12.1, and I installed spconv-cu120.
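For a pip-based environment, the equivalent pin should be roughly the following (a sketch; I set mine up through conda, so I haven't tested these exact commands):

pip install torch==2.2.2 torchvision==0.17.2 --index-url https://download.pytorch.org/whl/cu121
pip install spconv-cu120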

Kitsunetic commented 3 days ago

Unfortunately, I'm still getting the same issue on a retrial with the nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 Docker image, PyTorch 2.2.2, and CUDA 12.1. I have tested with both Ubuntu 22.04 and 20.04. Could you give me more detail about your environment?
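For reference, I start the container roughly like this (standard Docker flags, nothing project-specific):

docker run --gpus all -it --rm nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04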

shim94kr commented 3 days ago

I set up the environment using the following .yaml file with conda env create -f ***.yaml. It differs from the one referenced in Issue #317, particularly in the torch and torchvision configurations.

name: pointcept
channels:
  - pyg
  - pytorch
  - nvidia/label/cuda-12.1.1
  - nvidia
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - pip
  - cuda
  - conda-forge::cudnn
  - gcc=12.1
  - gxx=12.1
  - pytorch=2.2.2
  - torchvision=0.17.2
  - pytorch-cuda=12.1
  - ninja
  - google-sparsehash
  - h5py
  - pyyaml
  - tensorboard
  - tensorboardx
  - yapf
  - addict
  - einops
  - scipy
  - plyfile
  - termcolor
  - timm
  - ftfy
  - regex
  - tqdm
  - matplotlib
  - black
  - open3d
  - pytorch-cluster
  - pytorch-scatter
  - pytorch-sparse
  - pip:
    - torch_geometric
#    - spconv-cu120
    - git+https://github.com/octree-nn/ocnn-pytorch.git
    - git+https://github.com/openai/CLIP.git
    - git+https://github.com/Dao-AILab/flash-attention.git
    - ./libs/pointops
    - ./libs/pointgroup_ops

After this setup, I installed the following additional components:

cd libs/pointops
python setup.py install 
cd ../..

pip install spconv-cu120
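To double-check that the intended versions are actually active afterwards (stdlib only, no spconv-specific API assumed):

import torch
from importlib.metadata import version

print("torch:", torch.__version__, "| built for CUDA", torch.version.cuda)
print("spconv:", version("spconv-cu120"))
print("GPU:", torch.cuda.get_device_name(0))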

Kitsunetic commented 3 days ago

Thank you for sharing. However, I'm still getting the same error even with an environment built from the provided yaml file. I suspect this is not purely a dependency problem; the wider environment, such as the OS, may be involved as well. So I'm still trying to find the cause. Anyway, thank you again for sharing! If you find another clue, please let me know!