cuhash may have problems. Currently it's recommended to disable the CUDA hash by setting use_hash=False.
TensorFlow:
Implementing spconv in TensorFlow is straightforward, but it is not just a "wrap": you still need to write a lot of code because the TensorFlow C++ operator interface is quite different from PyTorch's.
That's true. I need to read the code in depth.
@traveller59 Indeed, it is straightforward.
What do you think about the description above? 30 sparse 3D convs, 7 sparse poolings, and 7 sparse unpoolings on 200k Waymo lidar points take only 12ms. Can you summarize the key factors in that description that give a speedup spconv does not have?
The subm indice generation speed is greatly increased (4x faster) in the newest code. You can give it a try.
Interesting! I'll follow your work!
Do you think voxel generation could be sped up with CUDA? DOPS uses a pooling op to do it.
For 2, I guess they do not use 2/2/0, because 2/2/0 has a poor receptive field in their UNet, and a large receptive field is harder to achieve in a sparse conv net than in a normal conv net.
What is fused gemm? Could you give a brief introduction?
Yes, voxel generation (one point per voxel) is equivalent to a max pool operation. The maxpool in spconv isn't optimized. I will try to optimize it, but I'm not sure what the "shuffle" in their paper means.
Fused gemm: perform gather-gemm-scatter in one kernel, so no additional buffer is needed. You can check the fused gemm implemented in SparseConvNet; a rough sketch of the idea follows below.
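For illustration only, here is a minimal, naive CUDA sketch of fused gather-gemm-scatter for a single kernel offset, without the shared-memory filter cache that SparseConvNet actually uses. All names (rulebook, inFeat, weight, outFeat) and the layout assumptions are mine, not the real SparseConvNet or spconv code.

// Naive fused gather-gemm-scatter for ONE kernel offset of a sparse conv.
// rulebook: nHot pairs of (inputIdx, outputIdx) for this offset.
// inFeat:   [numInPoints,  C_in]   row-major input features
// weight:   [C_in, C_out]          row-major filter for this offset
// outFeat:  [numOutPoints, C_out]  row-major output, accumulated with atomicAdd
__global__ void fusedGatherGemmScatter(const int *rulebook, int nHot,
                                       const float *inFeat, const float *weight,
                                       float *outFeat, int C_in, int C_out) {
  int pair = blockIdx.x * blockDim.x + threadIdx.x; // one rulebook pair
  int oc = blockIdx.y * blockDim.y + threadIdx.y;   // one output channel
  if (pair >= nHot || oc >= C_out)
    return;
  int inIdx = rulebook[2 * pair];       // "gather": read input row directly
  int outIdx = rulebook[2 * pair + 1];
  float acc = 0.f;
  for (int ic = 0; ic < C_in; ++ic)     // "gemm": dot product over input channels
    acc += inFeat[inIdx * C_in + ic] * weight[ic * C_out + oc];
  // "scatter": several pairs may write to the same output point
  atomicAdd(&outFeat[outIdx * C_out + oc], acc);
}

In contrast, the un-fused path materializes a gathered input buffer, runs one cuBLAS gemm over it, and scatter-adds the result back, which is why it needs the extra buffer.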
It seems that I have to revise my understanding of 2. You are correct: they use 2/2/0 (k,s,p) instead of 3/2/1 (k,s,p) for pooling.
A pool with 3/2/1 (k,s,p) behaves like a conv with 3/2/1 (k,s,p). But with a 2/2/0 (k,s,p) pool, each voxel is looked up only once, as DOPS says.
Voxel generation is the same as the 2/2/0 (k,s,p) case, generalized to s/s/0 (k,s,p). The remaining difference between voxel generation and the s/s/0 (k,s,p) pool is that the stride s in s/s/0 (k,s,p) is discrete, while the points' positions are continuous.
There are two ways to resolve this problem, and both of them use the only-looked-up-once property:
Solution 1: sort/shuffle points with a radix sort: run a radix sort on the first several digits of the points' positions so they are sorted at stride resolution, then take the max or average over each slot/bucket of the radix sort. The CUB library has a radix sort API.
Solution 2: use a HashMultiMap: your CPU single-thread version of voxelization is exactly a HashMultiMap; the only question is how to implement it in CUDA. A few atomic operators should solve the problem (see the sketch after this explanation).
So the meaning of "shuffle" is that the result does not depend on the order of the sequence; it is another way to describe the only-looked-up-once property.
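To make Solution 2 concrete, here is a minimal CUDA sketch of point-to-voxel pooling with atomics. To keep it short it uses a dense voxel grid instead of a real HashMultiMap and average (rather than max) pooling; all names and the dense-grid assumption are illustrative, not DOPS or spconv code. The order of the points does not matter, which is exactly the shuffle / only-looked-up-once property.

// Average-pool points into voxels with atomics (dense grid, illustrative only).
// points:   [numPoints, 3] xyz, row-major
// voxelSum: [numVoxels, 3] accumulated xyz per voxel (pre-zeroed)
// voxelCnt: [numVoxels]    number of points per voxel (pre-zeroed)
__global__ void voxelizeAvg(const float *points, int numPoints,
                            float3 origin, float voxelSize, int3 gridDims,
                            float *voxelSum, int *voxelCnt) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= numPoints)
    return;
  float x = points[3 * i], y = points[3 * i + 1], z = points[3 * i + 2];
  int vx = (int)((x - origin.x) / voxelSize);
  int vy = (int)((y - origin.y) / voxelSize);
  int vz = (int)((z - origin.z) / voxelSize);
  if (vx < 0 || vx >= gridDims.x || vy < 0 || vy >= gridDims.y ||
      vz < 0 || vz >= gridDims.z)
    return;                                          // point outside the range
  int v = (vz * gridDims.y + vy) * gridDims.x + vx;  // flat voxel index
  atomicAdd(&voxelSum[3 * v], x);                    // each point touches one voxel
  atomicAdd(&voxelSum[3 * v + 1], y);
  atomicAdd(&voxelSum[3 * v + 2], z);
  atomicAdd(&voxelCnt[v], 1);                        // divide sums by counts afterwards
}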
Solution 1: I checked the CUB radix sort, but I found no API for slot access, so I still don't know how to use a radix sort to perform pooling. Solution 2: looks OK.
I created a benchmark script with an encoder network and Waymo data; it costs 40ms on a 1080Ti (1ms subm indice gen, 35ms subm, and 4ms maxpool). I really don't think they can achieve 12ms with both encoder and decoder... The voxel size and point cloud range are both missing from their paper.
1ms for 7 subm indice generations?
My intuition was that indice generation consumes a lot of time, which is why we need indice_key.
Your 4x faster subm indice generation works!
This is a good and helpful benchmark, thank you!
40 / 4 = 10ms
40 / 8 = 5ms
Amazing result!!
I have to say, spconv has the chance to be faster than DOPS.
But you haven't used shared memory and bit operations to partition and cache the filter.
Have you compared your normal sparse conv with subm?
What does "40 / 4 = 10ms" mean?
The "shared memory and bit operation to partitioning and caching the filter" is used for gemm. the fused gemm implementation in SparseConvNet use shared memory to cache filter, but spconv use cuBLAS gemm, so we don't need cache for filter.
The speed of normal sparse conv isn't comparable with subm because normal conv generates too many new points, which greatly impacts the speed of the following layers.
I will benchmark SparseConvNet's fused gemm in this benchmark first, then consider modifying CUTLASS to get a better fused gemm.
40 / 4 > 10ms. When we downsample the resolution, the sparsity decreases.
You are a true expert; I will learn a lot of things here!
How is it going?
#define RULEBOOKITERATOR(X, Y)                                            \
  {                                                                       \
    Int rbMaxSize = 0;                                                    \
    for (auto &r : _rules)                                                \
      rbMaxSize = std::max(rbMaxSize, (Int)r.size());                     \
    at::Tensor rulesBuffer = at::empty({rbMaxSize}, at::CUDA(at_kINT));   \
    Int *rbB = rulesBuffer.data_ptr<Int>();                               \
    for (int k = 0; k < _rules.size(); ++k) {                             \
      auto &r = _rules[k];                                                \
      Int nHotB = r.size() / 2;                                           \
      if (nHotB) {                                                        \
        cudaMemcpy(rbB, &r[0], sizeof(Int) * 2 * nHotB,                   \
                   cudaMemcpyHostToDevice);                               \
        X                                                                 \
      }                                                                   \
      Y                                                                   \
    }                                                                     \
  }
      if (nHotB) {                                      \
        cudaMemcpy(rbB, &r[0], sizeof(Int) * 2 * nHotB, \
                   cudaMemcpyHostToDevice);             \
        X                                               \
      }
Do you think the copy of the rulebook from host to device is the bottleneck of SparseConvNet? A simple optimization could copy it all at once instead of kernelVolume times; a rough sketch follows below.
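As a sketch of that optimization, based only on the macro quoted above and assuming the same _rules, Int, and at_kINT context (so this is illustrative, not actual SparseConvNet code), one could concatenate all rules on the host and issue a single cudaMemcpy, then index into the device buffer per kernel offset:

// Copy the whole rulebook to the device once instead of kernelVolume times.
std::vector<Int> hostRules;
std::vector<size_t> offsets;               // start of each kernel offset's rules
for (auto &r : _rules) {
  offsets.push_back(hostRules.size());
  hostRules.insert(hostRules.end(), r.begin(), r.end());
}
at::Tensor rulesBuffer =
    at::empty({(int64_t)hostRules.size()}, at::CUDA(at_kINT));
Int *rbB = rulesBuffer.data_ptr<Int>();
cudaMemcpy(rbB, hostRules.data(), sizeof(Int) * hostRules.size(),
           cudaMemcpyHostToDevice);        // single host-to-device transfer
// The rules for kernel offset k then live at rbB + offsets[k],
// with _rules[k].size() / 2 (inputIdx, outputIdx) pairs.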
But one thing that may be concluded is that, compared with speed, SparseConvNet cares more about GPU memory usage.
Anyway, all data in spconv lives on the device, so we do not have this problem.
Kernel call comparison: spconv (gather, mm_out, scatter): 3 × kernelVolume, vs SparseConvNet (convA (the main part of nHot) + convB (the remainder)): 2 × kernelVolume.
Memory copy comparison:
spconv (gather copy + implicit scatter copy): all copies are device-to-device, but the buffer is too large,
vs
SparseConvNet (rulebook copy (host to device) + shared memory copy (O(K^2))): shared memory is fast, but the rulebook copy from host to device is too slow.
All of the copies above happen kernelVolume times.
Memory usage comparison: spconv's gather/scatter buffer is too large and needs a device-to-device copy; SparseConvNet uses shared memory for this, which should be faster.
Conclusion: spconv's bottleneck is the large device-to-device memory copy for gather/scatter. SparseConvNet's bottleneck is the small rulebook host-to-device copy, repeated kernelVolume times. spconv could try to borrow SparseConvNet's shared memory conv. Benefits: the shared memory copy is faster, and memory usage is smaller without the large gather/scatter buffer. Cons: whether it is actually faster is debatable and should be benchmarked further. I don't think there is a big difference in multiply-adds; their multiply-add counts are the same. Further: bit operations for further optimization.
The major bottleneck of SparseConvNet is their fused gemm. The benchmark I posted before costs 40ms in spconv, but 300ms when using SparseConvNet's fused gemm instead of spconv's gather-cuBLAS-scatterAdd... That's why I started working on CUTLASS support.
Hello,
I occasionally get the following error when using spconv. It mostly works well when I disconnect my internet connection.
Any advice on this error?
By the way, I'm planning to wrap this repo into a TensorFlow version. I hope for further help here.