traveller59 / spconv

Spatial Sparse Convolution Library
Apache License 2.0
1.89k stars 366 forks

occasionally crash #147

Closed xmyqsh closed 3 years ago

xmyqsh commented 4 years ago

Hello,

I occasionally get the following error when using spconv. Oddly, it almost always works when I disconnect my internet connection.

[33750/39521]
cudahash: Completely failed to build
Cuda error in file '/xxxxxx/spconv/src/cuhash/hash_table.cpp' in line 194 : an illegal memory access was encountered.

Any advice for this error?

By the way, I'm planning to wrap this repo into a TensorFlow version. I hope to get further help here.

traveller59 commented 4 years ago

cuhash may have a problem. Currently it's recommended to disable the CUDA hash by setting use_hash=False. TensorFlow: implementing spconv in TensorFlow is straightforward, but it is not a "wrap"; you still need to write a lot of code because the TensorFlow C++ operator interface is quite different from PyTorch's.
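
For reference, a minimal sketch of what that looks like at the Python layer, assuming the conv modules accept a use_hash keyword as described above (channel sizes and indice_key are placeholders for illustration):

```python
import spconv

# Hypothetical layer: pass use_hash=False so rule generation avoids the
# cuhash path, per the recommendation above.
conv = spconv.SubMConv3d(16, 32, kernel_size=3, indice_key="subm0", use_hash=False)
```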

xmyqsh commented 4 years ago

That's true. I need to read the code in depth.

xmyqsh commented 4 years ago

@traveller59 Indeed, it is straightforward.

[image]

What do you think about the description above? 30 sparse 3D convs, 7 sparse poolings, and 7 sparse unpoolings on 200k Waymo lidar points take only 12ms. Can you summarize the key factors in that description that give it a speedup spconv does not have?

traveller59 commented 4 years ago
  1. I don't know how many voxels are generated from the 200k Waymo point cloud.
  2. I don't know the kernel size/stride of the pooling operation. The number of active points generated by 3/2/1 (k, s, p) is much higher than by 2/2/0 (k, s, p); see the toy sketch after this list.
  3. They don't seem to use regular sparse convolution; this can bring a speedup.
  4. Their pooling operations use a better algorithm to achieve better speed.
  5. They seem to use fused gemm to compute the convolution. I'm not sure whether fused gemm can achieve better performance. (I have tested the performance of fused gemm from another sparse conv implementation; when the channel count is small, fused gemm seems faster than gather-gemm-scatterAdd in spconv.)
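
To make point 2 concrete, here is a toy sketch (my own illustration, not code from either library) that counts how many active output sites a strided sparse pooling/conv produces for 2/2/0 versus 3/2/1 on a random set of voxel coordinates; the grid size and occupancy are made up:

```python
import numpy as np

def active_outputs(coords, size, k, s, p):
    """Count output sites receiving at least one input voxel for kernel k,
    stride s, padding p on a cubic grid of the given size."""
    out_size = (size + 2 * p - k) // s + 1
    offsets = np.array(list(np.ndindex(k, k, k)))       # (k^3, 3) kernel offsets
    q = coords[:, None, :] + p - offsets[None, :, :]    # candidate output*stride positions
    o, rem = np.divmod(q, s)
    valid = (rem == 0).all(-1) & (o >= 0).all(-1) & (o < out_size).all(-1)
    return len(np.unique(o[valid], axis=0))

rng = np.random.default_rng(0)
coords = np.unique(rng.integers(0, 40, size=(4000, 3)), axis=0)  # ~6% occupancy
print("2/2/0:", active_outputs(coords, 40, 2, 2, 0))
print("3/2/1:", active_outputs(coords, 40, 3, 2, 1))  # noticeably more active sites
```
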
traveller59 commented 4 years ago

The subm indice generation speed is greatly increased (4x faster) in the newest code. You can give it a try.

xmyqsh commented 4 years ago

Interesting! I'm following your work!

Do you think voxel generation could be sped up with CUDA? DOPS uses a pooling op to do it.

For 2, I guess they do not use 2/2/0, because 2/2/0 gives a poor receptive field in their UNet, and the receptive field is harder to grow in a sparse conv net than in a normal conv net.

What is fused gemm? Can you give a brief introduction?

traveller59 commented 4 years ago

Yes, voxel generation (one point per voxel) is equivalent to a max pool operation. The maxpool in spconv isn't optimized. I will try to optimize it, but I'm not sure what the "shuffle" in their paper means.
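
As a concrete picture of that equivalence, here is a minimal single-threaded sketch (my own toy code, not the spconv kernel): keeping one entry per voxel is just a max reduction over all points that map to the same voxel key.

```python
import numpy as np

def voxelize_max(points, voxel_size):
    """Toy CPU voxelization: group points by voxel key and max-reduce them,
    i.e. the same computation as a max pool over each voxel's points."""
    voxels = {}
    for p in points:
        key = tuple(np.floor(p[:3] / voxel_size).astype(np.int64))
        voxels[key] = np.maximum(voxels[key], p) if key in voxels else p.copy()
    coords = np.array(list(voxels.keys()))
    feats = np.array(list(voxels.values()))
    return coords, feats

# Usage with a made-up point cloud (x, y, z, intensity), 0.1 m voxels:
pts = np.random.rand(100_000, 4) * [50.0, 50.0, 4.0, 1.0]
coords, feats = voxelize_max(pts, voxel_size=0.1)
```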

Fused gemm: perform gather-gemm-scatter in one kernel, with no additional buffer needed. You can check the fused gemm implemented in SparseConvNet.
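
For reference, the unfused gather-gemm-scatterAdd path (as described here for spconv) looks roughly like this at the PyTorch level for a single kernel offset. This is a schematic sketch with made-up names; a fused gemm does all three steps inside one CUDA kernel, so the staging buffers disappear:

```python
import torch

def gather_gemm_scatter(features, weight, in_idx, out_idx, num_out):
    """Schematic sparse conv contribution for ONE kernel offset.
    features: (N_in, C_in), weight: (C_in, C_out),
    in_idx/out_idx: rulebook pairs (input row -> output row) for this offset."""
    out = features.new_zeros(num_out, weight.shape[1])
    buf = features[in_idx]               # gather: stage the needed input rows
    out_buf = buf @ weight               # gemm: dense matmul on staged rows (cuBLAS)
    out.index_add_(0, out_idx, out_buf)  # scatterAdd: accumulate into output rows
    return out
```

A full layer repeats this once per kernel offset with that offset's weight slice and rulebook, accumulating into the same output tensor.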

xmyqsh commented 4 years ago

It seems I have to revise my understanding of point 2. You are correct: they use 2/2/0 (k, s, p) instead of 3/2/1 (k, s, p) for pooling. A 3/2/1 pool is the same as a conv with 3/2/1, but with a 2/2/0 pool each voxel is looked up only once, as the DOPS paper says. Voxel generation is the same kind of operation as the 2/2/0 pool, i.e. an s/s/0 (k, s, p) pool. The last obstacle to treating voxel generation as an s/s/0 pool is that the stride s is discrete while the points' positions are continuous. There are two ways to resolve this, and both exploit the look-up-once property (see the sketch at the end of this comment):

Solution 1: sort/shuffle points with a radix sort. Radix-sort on the first several digits of the points' positions so they are ordered at the stride resolution, then take the max or average within each slot/bucket. The CUB library has a radix sort API.

Solution 2: use a hash multimap. Your CPU single-thread voxelization is exactly a hash multimap; the only question is how to implement it in CUDA. A few atomic operations should solve that.

So the meaning of "shuffle" is that the result is unaware of the input order; it is another way of describing the look-up-once property.
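
A rough CPU sketch of Solution 1 (my own illustration, not CUDA): sort points by their quantized voxel key, then reduce each contiguous bucket once. On the GPU I would assume the sort maps to a device radix sort (e.g. CUB) and the per-bucket reduction to a segmented reduce; the dict-based sketch earlier in the thread is essentially Solution 2.

```python
import numpy as np

def sorted_voxel_pool(points, voxel_size, reduce="max"):
    """Toy sort-based voxel pooling (Solution 1, CPU sketch):
    each voxel/bucket is visited exactly once after the sort."""
    keys = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    order = np.lexsort((keys[:, 2], keys[:, 1], keys[:, 0]))  # stands in for the radix sort
    keys, pts = keys[order], points[order]
    uniq, starts = np.unique(keys, axis=0, return_index=True)  # slot/bucket boundaries
    ends = np.append(starts[1:], len(pts))
    agg = np.maximum.reduceat if reduce == "max" else np.add.reduceat
    feats = agg(pts, starts, axis=0)
    if reduce != "max":
        feats = feats / (ends - starts)[:, None]  # average within each bucket
    return uniq, feats
```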

traveller59 commented 4 years ago

Solution 1: I checked the CUB radix sort but found no API for slot access; I still don't know how to use a radix sort to perform pooling. Solution 2: looks OK.

I created a benchmark script with an encoder network and Waymo data; it costs 40ms on a 1080Ti (1ms subm indice gen, 35ms subm conv, and 4ms maxpool). I really don't think they can achieve 12ms with both an encoder and a decoder... The voxel size and point cloud range are both missing from their paper.

xmyqsh commented 4 years ago

1ms for 7 subm indice gens? My intuition was that indice gen would consume a lot of time, which is why we need indice_key. Your 4x faster subm indice generation works! This is a good and helpful benchmark, thank you! 40 / 4 = 10ms, 40 / 8 = 5ms. Amazing result!! I have to say, spconv has a chance to be faster than DOPS.

But you haven't used shared memory and bit operations to partition and cache the filter.

Have you compared your normal sparse conv with subm?

traveller59 commented 4 years ago

What does "40 / 4 = 10ms" mean?

The "shared memory and bit operation to partitioning and caching the filter" is used for gemm. the fused gemm implementation in SparseConvNet use shared memory to cache filter, but spconv use cuBLAS gemm, so we don't need cache for filter.

The speed of normal sparse conv isn't comparable with subm, because normal conv generates too many new points, which greatly impacts the speed of the following layers.

I will benchmark SparseConvNet's fused gemm in this benchmark first, then consider modifying CUTLASS to get a better fused gemm.

xmyqsh commented 4 years ago

I should say 40 / 4 > 10ms: when we downsample the resolution, the sparsity decreases, so the deeper layers don't get proportionally cheaper.
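
A quick toy check of that effect (made-up occupancy, not Waymo numbers): halving the resolution merges each 2x2x2 block of voxels into one, but because the grid gets denser the number of active sites drops by far less than 8x:

```python
import numpy as np

rng = np.random.default_rng(0)
coords = np.unique(rng.integers(0, 200, size=(600_000, 3)), axis=0)  # ~7% occupancy
down = np.unique(coords // 2, axis=0)                                # one 2x downsample
print(len(coords), len(down), len(coords) / len(down))               # ratio well below 8
```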

You are a real guru; I will learn a lot here!

xmyqsh commented 4 years ago

How is it going? Looking at SparseConvNet's RULEBOOKITERATOR macro:

#define RULEBOOKITERATOR(X, Y)                                                 \
  {                                                                            \
    Int rbMaxSize = 0;                                                         \
    for (auto &r : _rules)                                                     \
      rbMaxSize = std::max(rbMaxSize, (Int)r.size());                          \
    at::Tensor rulesBuffer = at::empty({rbMaxSize}, at::CUDA(at_kINT));        \
    Int *rbB = rulesBuffer.data_ptr<Int>();                                    \
    for (int k = 0; k < _rules.size(); ++k) {                                  \
      auto &r = _rules[k];                                                     \
      Int nHotB = r.size() / 2;                                                \
      if (nHotB) {                                                             \
        cudaMemcpy(rbB, &r[0], sizeof(Int) * 2 * nHotB,                        \
                   cudaMemcpyHostToDevice);                                    \
        X                                                                      \
      }                                                                        \
      Y                                                                        \
    }                                                                          \
  }

Do you think the host-to-device copy of the rulebook is the bottleneck of SparseConvNet? A simple optimization would be to copy it all at once instead of kernelVolume times.
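
In PyTorch-level terms, that suggestion could look something like the sketch below (my own illustration, not SparseConvNet code): concatenate the per-offset rule arrays on the host, issue a single host-to-device transfer, and slice per offset on the device.

```python
import torch

def upload_rulebook_once(rules):
    """rules: list of CPU int tensors, one flat rule array per kernel offset.
    Copy them to the GPU with ONE transfer instead of kernelVolume transfers."""
    lengths = [r.numel() for r in rules]
    flat = torch.cat(rules).pin_memory()            # single host buffer
    flat_gpu = flat.to("cuda", non_blocking=True)   # single H2D copy
    offsets = torch.tensor([0] + lengths).cumsum(0).tolist()
    return [flat_gpu[offsets[k]:offsets[k + 1]] for k in range(len(rules))]
```

Whether this helps in practice depends on how small the individual copies are; it just replaces many small synchronous copies with one.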

But one thing can perhaps be concluded: compared with speed, SparseConvNet cares more about GPU memory usage.

Anyway, all of spconv's data already lives on the device, so we don't have this problem.

Kernel call comparison: spconv (gather, mm_out, scatter): kernelVolume x 3 launches, vs. SparseConvNet (convA for the main part of nHot + convB for the remainder): kernelVolume x 2 launches.

Memory copy comparison: spconv (gather copy + implicit scatter copy): all copies are device-to-device, but the buffer is large; vs. SparseConvNet (rulebook copy host-to-device + shared-memory copy, O(K^2)): shared memory is fast, but the rulebook copy from host to device is slow.

All of the copies above happen kernelVolume times.

Memory usage comparison: spconv's gather/scatter buffer is large and needs device-to-device copies; SparseConvNet uses shared memory for this, which should be faster.

Conclusion: spconv's bottleneck is the large device-to-device memory copies for gather/scatter. SparseConvNet's bottleneck is the many small host-to-device copies of the rulebook (kernelVolume times). spconv could try to borrow SparseConvNet's shared-memory conv. Benefit: shared-memory copies are faster, and memory usage is smaller without the large gather/scatter buffer. Cons: the actual speed gain is doubtful and should be benchmarked further; I don't think there is a big difference in the multiply-adds, since those are the same in both. Further: bit operations for further optimization.

traveller59 commented 4 years ago

The major bottleneck of SparseConvNet is their fused gemm. The benchmark I posted before costs 40ms with spconv, but 300ms when using SparseConvNet's fused gemm instead of spconv's gather-cuBLAS-scatterAdd... That's why I've started working on CUTLASS support.