mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
https://torchsparse.mit.edu
MIT License

[BUG] <problems encountered when reproducing artifact evaluation> #287

Closed hua0x522 closed 9 months ago

hua0x522 commented 10 months ago

Is there an existing issue for this?

Current Behavior

I ran into problems while reproducing the artifact evaluation (AE) of TorchSparse++. I downloaded the code from https://zenodo.org/records/8311889 and used the preprocessed datasets provided by the authors.

(spconv) wxz@gpu4:~/torchsparse/torchsparse-artifact-micro-main/artifact-p1/evaluation$ CUDA_LAUNCH_BLOCKING=1 python evaluate.py 
[Warning] The current device does not support fp16. Set precision to fp32
Traceback (most recent call last):                                                                                                                  
  File "evaluate.py", line 301, in <module>                                                                                                         
    main()                                                                                                                                          
  File "/home/wxz/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "evaluate.py", line 220, in main
    _ = model(inputs)
  File "/home/wxz/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wxz/torchsparse/torchsparse-artifact-micro-main/artifact-p1/evaluation/core/models/segmentation_models/minkunet.py", line 104, in forward
    x3 = self.stage3(x2)
  File "/home/wxz/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wxz/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/wxz/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wxz/torchsparse/torchsparse-artifact-micro-main/artifact-p1/evaluation/core/models/modules/layers_3d.py", line 125, in forward
    x = self.relu(self.net(x) + self.downsample(x))
  File "/home/wxz/torchsparse/torchsparse/tensor.py", line 109, in __add__
    feats=self.feats + other.feats,
RuntimeError: CUDA error: invalid configuration argument

My GPU is a Tesla V100-PCIE-32GB (GPU 0, UUID: GPU-b57016fe-8dca-4290-b860-a09e19c8fb30).

Before running into this problem, I first hit this one:

(spconv) wxz@gpu4:~/torchsparse/torchsparse-artifact-micro-main/artifact-p1/evaluation$ python evaluate.py 
[Warning] The current device does not support fp16. Set precision to fp32
Traceback (most recent call last):                                                                                                                  
  File "evaluate.py", line 301, in <module>                                                                                                         
    main()                                                                                                                                          
  File "/home/wxz/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "evaluate.py", line 220, in main
    _ = model(inputs)
  File "/home/wxz/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wxz/torchsparse/torchsparse-artifact-micro-main/artifact-p1/evaluation/core/models/segmentation_models/minkunet.py", line 101, in forward
    x0 = self.stem(x)
  File "/home/wxz/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wxz/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/wxz/miniconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wxz/torchsparse/torchsparse/nn/modules/conv.py", line 98, in forward
    return F.conv3d(
  File "/home/wxz/torchsparse/torchsparse/nn/functional/conv/conv.py", line 47, in conv3d
    dataflow = config.dataflow
AttributeError: 'dict' object has no attribute 'dataflow'

I tried to fix it by simply ignoring the config that is passed in, in torchsparse/nn/functional/conv/conv.py:

# torchsparse/nn/functional/conv/conv.py: line 37
    config = None  # my change: discard whatever config the caller passed in
    if config is None:
        config = F.conv_config.get_global_conv_config()
        if config is None:
            config = F.conv_config.get_default_conv_config(
                conv_mode=conv_mode, training=training
            )

    # TODO: Deal with kernel volume > 32. (Split mask or unsort)

    dataflow = config.dataflow
    kmap_mode = config.kmap_mode
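
For readers who hit the same AttributeError, here is a minimal sketch of the failure mode. It is not TorchSparse code: the keys and values are hypothetical and only mirror the attribute names in the traceback. It shows that attribute access fails on a plain dict and works on an object that exposes the same fields as attributes, which usually means the caller and the library disagree about the config type.

```python
from types import SimpleNamespace

# Hypothetical config values for illustration only; not taken from TorchSparse.
dict_config = {"dataflow": "example-dataflow", "kmap_mode": "example-mode"}


def read_dataflow(config):
    """Mimic the failing line: attribute access on the config object."""
    return config.dataflow


# A plain dict has keys, not attributes, so this raises AttributeError.
try:
    read_dataflow(dict_config)
    message = None
except AttributeError as err:
    message = str(err)

# The same fields exposed as attributes are read without error.
attr_config = SimpleNamespace(**dict_config)
dataflow = read_dataflow(attr_config)

print(message)
print(dataflow)
```

Overwriting the config with None, as in the snippet above, avoids the error but also throws away the settings the benchmark script intended to use.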

Expected Behavior

No response

Environment

- GCC: 9.3.0
- NVCC: 11.3
- PyTorch: 1.10.0+cu113
- PyTorch CUDA: 11.3
- TorchSparse: 2.1.0

Anything else?

No response

ys-2020 commented 9 months ago

Hi @hua0x522, thank you for your interest in TorchSparse! Did you build the Docker container for the artifact evaluation? It looks like you are running it in your local environment. The problem is that you have installed TorchSparse v2.1.0 but are running the benchmark code for TorchSparse v2.0 (the artifact-p1 folder).

To run the benchmark code for v2.1.0, switch to the artifact-p2 folder and revert your change to torchsparse/nn/functional/conv/conv.py. Additionally, I strongly recommend following the README.md in artifact-p2 and building the Docker container for benchmark evaluation.
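
A quick way to catch this kind of mismatch before launching a benchmark is to look up the installed version and map it to the matching folder. The sketch below is not part of the artifact. The helper names are made up, the version-to-folder mapping is taken only from this reply, and it assumes the package is registered under the distribution name `torchsparse`.

```python
import re
from importlib import metadata

# Mapping taken from the reply above: artifact-p1 targets v2.0, artifact-p2 targets v2.1.0.
ARTIFACT_FOLDERS = {(2, 0): "artifact-p1", (2, 1): "artifact-p2"}


def artifact_folder_for(version):
    """Return the artifact folder that matches a 'major.minor...' version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    key = (int(match.group(1)), int(match.group(2)))
    if key not in ARTIFACT_FOLDERS:
        raise ValueError(f"no artifact folder known for TorchSparse {version}")
    return ARTIFACT_FOLDERS[key]


def installed_torchsparse_version():
    """Return the installed version string, or None if the package is not found.

    Assumes the distribution name is 'torchsparse'.
    """
    try:
        return metadata.version("torchsparse")
    except metadata.PackageNotFoundError:
        return None
```

With TorchSparse 2.1.0 installed, as in the environment reported above, `artifact_folder_for(installed_torchsparse_version())` would return `"artifact-p2"`.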

Finally, the GPU you are using might be a bit too old (it does not support fp16 arithmetic), which means you may not be able to reproduce the figures in our paper with it.

Thank you.

hua0x522 commented 9 months ago

Thank you for your patient explanation. Now I can correctly execute the AE code in artifact-p2.

hua0x522 commented 9 months ago

By the way, may I ask why the batch size in the evaluation of TorchSparse++ is set to 1 or 2 rather than a larger value such as 4, 8, or 16?

Fengzexu commented 4 months ago

Hello author, I found that in the evaluation the MinkUNet model outputs of SpConv and TorchSparse++ differ: in artifact-p2's evaluate.py, the cosine similarity between the two model outputs is approximately 0.81. I made sure that each backend uses the same input point clouds. By contrast, the cosine similarity between the ME and TorchSparse++ outputs is approximately 0.99. I am not very familiar with this field and may have made some naive mistakes. Looking forward to your reply.
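
One thing worth ruling out in a comparison like this is row ordering. If two backends emit the same points in a different order, a row-by-row cosine similarity looks poor even when the features match. Whether that explains the 0.81 reported here is not known; the sketch below uses made-up toy data and hypothetical helper names, and only shows how aligning rows by coordinate before comparing changes the result.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equally long flat lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flatten_in_coord_order(coords, feats):
    """Sort rows by coordinate, then flatten the features row by row."""
    order = sorted(range(len(coords)), key=lambda i: coords[i])
    return [value for i in order for value in feats[i]]


# Toy data (made up): the same three points, emitted in two different orders.
coords_a = [(0, 0, 0), (0, 1, 0), (1, 0, 2)]
feats_a = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
coords_b = [(1, 0, 2), (0, 0, 0), (0, 1, 0)]
feats_b = [[3.0, 1.0], [1.0, 0.0], [0.0, 2.0]]

# Naive comparison: flatten in the order each backend happened to emit rows.
naive = cosine_similarity(
    [v for row in feats_a for v in row],
    [v for row in feats_b for v in row],
)

# Aligned comparison: put both outputs into the same coordinate order first.
aligned = cosine_similarity(
    flatten_in_coord_order(coords_a, feats_a),
    flatten_in_coord_order(coords_b, feats_b),
)

print(f"naive={naive:.3f} aligned={aligned:.3f}")
```

If the aligned similarity is still well below 1 on real outputs, ordering is not the cause and the difference lies elsewhere.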