torch_scatter is not included in TorchEx; it comes from https://github.com/rusty1s/pytorch_scatter. Have you successfully deployed spconv? I think that is the most important part.
Unfortunately, I am stuck on exporting the aten::all operation to ONNX. It seems that a PyTorch update is needed, but I can't launch FSDv2 even with PyTorch 1.9.0.
https://github.com/open-mmlab/mmcv/pull/2216
Collate and scatter have been removed, and it is unclear what to do...
Perhaps I have to merge FSD into the latest mmdetection3d 1.3.0? In that case I would at least like to have a patch of FSD over mmdetection3d 0.5.0, so that I can apply it to mmdetection3d 1.3.0 (and then resolve the conflicts by hand).
There is a 'first release' commit (fb8c92fa2aaef6b76aed0d60ce3df3f94b0604cd); is it vanilla mmdetection3d 0.5.0?
Could you give some advice on the right way to deploy FSDv2 with TensorRT, if this is even possible?
I would not recommend deploying a sparse-conv-based algorithm unless you are an experienced engineer... There are many things you need to deal with before deploying it, such as
It took quite a long time for a couple of professional engineers at TuSimple, including the author of spconv, to make it work. I can't help you much here because I did not join that effort.
It is really an odyssey for beginners to deal with all these things... However, I truly appreciate your effort and attention to our work.
I am trying to deploy spconv, but there is one limitation that seems fundamental to me: the spconv output shape is data-dependent, i.e.
Dense 1D matrix: 00001110000
Sparse 1D matrix (feature, index): (1, 4) (1, 5) (1, 6)
Dense 1D kernel: 111
Dense result: 001232100
Sparse result (feature, index): (1, 2) (2, 3) (3, 4) (2, 5) (1, 6)

Dense 1D matrix: 00010101000
Sparse 1D matrix (feature, index): (1, 3) (1, 5) (1, 7)
Dense 1D kernel: 111
Dense result: 011212110
Sparse result (feature, index): (1, 1) (1, 2) (2, 3) (1, 4) (2, 5) (1, 6) (1, 7)
So the output shape depends not only on the input shape, but also on the data inside the input matrices (specifically, on the input indices).
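To make the shape behaviour concrete, here is a toy sketch of the 1D example above in plain Python. It only illustrates that the number of output entries depends on where the non-zero inputs sit; it is not how spconv actually computes anything.

```python
def sparse_conv1d_valid(features, indices, kernel, input_len):
    """Toy 1D 'valid' convolution in sparse (feature, index) form."""
    k = len(kernel)
    out = {}
    for f, i in zip(features, indices):
        # An input at position i contributes to output windows o = i-k+1 .. i.
        for o in range(max(0, i - k + 1), min(input_len - k, i) + 1):
            out[o] = out.get(o, 0) + f * kernel[i - o]
    return sorted(out.items())  # list of (output_index, value) pairs

# First example above: non-zeros at 4, 5, 6 -> 5 output entries.
print(sparse_conv1d_valid([1, 1, 1], [4, 5, 6], [1, 1, 1], 11))
# Second example: non-zeros at 3, 5, 7 -> 7 output entries, same input length.
print(sparse_conv1d_valid([1, 1, 1], [3, 5, 7], [1, 1, 1], 11))
```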
TensorRT has limited dynamic shape support: an input tensor can have an arbitrary shape (within a min/opt/max range), but the output tensor shape must be a strict function of the input tensor shape and is calculated at the time the TRT engine is built.
A solution is to pre-calculate an upper bound for the output tensor shape, fill only part of the output elements at runtime, and let the rest remain zeros. But at the next step I need to slice this padded tensor, and it is unclear how to do this in the general case.
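A sketch of that padding idea (my own illustration, not spconv code): reserve a worst-case number of output entries and report how many of them are actually valid.

```python
import torch

def pad_sparse_output(out_feats, out_coors, max_out):
    """Pad data-dependent spconv outputs to a fixed, engine-friendly size.

    out_feats: (N_real, C), out_coors: (N_real, D), where N_real is only
    known at runtime. Returns fixed-size tensors plus the real count that
    a later layer needs for slicing.
    """
    n_real = out_feats.shape[0]
    feats = out_feats.new_zeros((max_out, out_feats.shape[1]))
    coors = out_coors.new_zeros((max_out, out_coors.shape[1]))
    feats[:n_real] = out_feats
    coors[:n_real] = out_coors
    num_out = torch.tensor([n_real], dtype=torch.int32)
    return feats, coors, num_out
```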
Deploying spconv for FSD implies writing two plugins, a GetIndicePairs plugin and an ImplicitGemm plugin; ImplicitGemm uses the result of GetIndicePairs (a GetIndicePairs result can be reused by more than one ImplicitGemm layer). I tried to slice the GetIndicePairs result in this way:
Add an additional 1x1 output tensor to GetIndicePairs and populate it with real_indices_num, then perform dynamic slicing in PyTorch like this:
num_act_out_real = res[8][0]
return (res[0][:num_act_out_real, :],
...
where res[0] is the output indices tensor and res[8] is the additional tensor holding real_indices_num.
This model can be exported to ONNX with some Slice layers.
The problem is that I can build a TRT engine from this ONNX, but I can't run the engine (engine->createExecutionContext() returns nullptr). Meanwhile, if I slice by a constant:
num_act_out_real = 24 #for test
return (res[0][:num_act_out_real, :],
...
the TRT engine runs normally. So I suppose that ONNX supports dynamic slicing, but TRT does not...
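For what it's worth, the pattern itself can be reproduced with a tiny self-contained model, assuming a recent PyTorch where scripted data-dependent slices export cleanly (names and opset below are just examples); whether the resulting engine then runs is purely a TensorRT question.

```python
import torch

class DynamicSlice(torch.nn.Module):
    def forward(self, feats: torch.Tensor, num_out: torch.Tensor):
        # num_out is a 1-element tensor (e.g. a plugin output), so the slice
        # length is only known at runtime.
        n = int(num_out[0])
        return feats[:n, :]

# Scripting (rather than tracing) keeps the slice bound symbolic instead of
# baking the example value into the graph as a constant.
model = torch.jit.script(DynamicSlice())
feats = torch.randn(32, 4)
num_out = torch.tensor([24], dtype=torch.int64)

torch.onnx.export(
    model, (feats, num_out), "dynamic_slice.onnx",
    input_names=["feats", "num_out"], output_names=["sliced"],
    dynamic_axes={"feats": {0: "n_max"}, "sliced": {0: "n_real"}},
    opset_version=13,
)
```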
I can work around this and can imagine some schemes for passing real_indices_num from GetIndicePairs to ImplicitGemm:
1. Add an additional 1x1 output tensor to GetIndicePairs and a 1x1 input tensor to ImplicitGemm. After the index pairs are calculated in the GetIndicePairs layer, copy real_indices_num from the CPU to that output tensor on the GPU. In the ImplicitGemm plugin, copy real_indices_num from the 1x1 input tensor on the GPU back to a CPU variable and perform the index slicing based on it (the graph-level wiring is sketched after this list).
2. Enumerate the GetIndicePairs and ImplicitGemm layers so that each ImplicitGemm layer knows the number of its parent GetIndicePairs layer. Allocate a static CPU array with one element per GetIndicePairs layer. Each ImplicitGemm layer then knows the real_indice_pairs count needed for slicing.
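At the ONNX-graph level, the wiring for scheme 1 could look roughly like the stubs below. This is only a sketch: the domain and op names are made up, the signatures are simplified, and the real computation lives in the TensorRT plugins; forward() just produces dummies of the right shapes so that tracing works.

```python
import torch

class GetIndicePairsStub(torch.autograd.Function):
    @staticmethod
    def forward(ctx, coors, max_out):
        # Dummy outputs for tracing only: padded output coordinates plus the
        # 1x1 tensor that will carry real_indices_num at runtime.
        out_coors = coors.new_zeros((max_out, coors.shape[1]))
        num_act_out = torch.zeros(1, dtype=torch.int32)
        return out_coors, num_act_out

    @staticmethod
    def symbolic(g, coors, max_out):
        return g.op("trt_plugins::GetIndicePairs", coors,
                    max_out_i=max_out, outputs=2)


class ImplicitGemmStub(torch.autograd.Function):
    @staticmethod
    def forward(ctx, feats, out_coors, num_act_out, weight):
        return feats.new_zeros((out_coors.shape[0], weight.shape[0]))

    @staticmethod
    def symbolic(g, feats, out_coors, num_act_out, weight):
        # num_act_out is an explicit input of the ImplicitGemm node, so the
        # plugin can read the real number of index pairs without any
        # out-of-band CPU state shared between plugins.
        return g.op("trt_plugins::ImplicitGemm",
                    feats, out_coors, num_act_out, weight)
```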
But the problem remains: how do I perform slicing after the ImplicitGemm layer? What happens if I don't perform slicing and simply pass the zero-padded features and indices tensors further?
(feature, index): (1, 0), (3, 5), (4, 6), (0, 0) [pad], (0, 0) [pad], (0, 0) [pad]
If slicing is not performed, this is technically an inconsistent sparse tensor: the feature at index 0 could be either 1 or 0.
It seems that a TRT plugin for traveller59's spconv exists: https://github.com/traveller59/spconv/blob/master/docs/TENSORRT_INT8_GUIDE.md
But the discussion at https://github.com/traveller59/spconv/issues is almost dead...
I think I must feed num_act_out_real (real_indices_num) into some FSD-specific layers so that they can perform the slicing.
Or perform padding like this:
(feature, index): (1, 0), (3, 5), (4, 6), (0, -1) [pad], (0, -1) [pad], (0, -1) [pad]
And the consumers of the sparse input must know that entries with index -1 do not really exist.
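For the -1 padding variant, the consumer-side filter could be as simple as the sketch below (my illustration; note that the boolean-mask indexing is itself a data-dependent-shape operation, so it only moves the problem rather than removing it).

```python
import torch

def drop_padding(feats, coors):
    """Remove padded entries marked with index -1 before further processing.

    feats: (N_max, C), coors: (N_max,) or (N_max, D), where padded rows are
    filled with -1 indices.
    """
    valid = (coors != -1).all(dim=-1) if coors.dim() > 1 else coors != -1
    return feats[valid], coors[valid]
```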
> Fix the dynamic input size of backbone and rcnn.
As far as I understand, there are 2 backbones: the backbone of the segmentor, which is SimpleSparseUNet, and the backbone of SingleStageFSDv2, which is VirtualVoxelMixer. @Abyssaledge do you mean the input sizes of both backbones need to be fixed?
Yes, I believe all the inputs should have fixed sizes.
It seems that TensorRT 8.6 supports data-dependent operations (for example NonZero). Data-dependent shapes are still not supported in plugins, but there is a workaround: add an additional output tensor num_out and perform the slicing after the plugin. So perhaps there is no need to fix the backbone input size.
This way I translated FSDv2 to TensorRT up to the
combined_out = self.combine_classes(...)
point in simple_test(...).
Output tensors are:
return combined_out['seg_points'], combined_out['seg_logits'], combined_out['seg_vote_preds'], combined_out['seg_feats'], combined_out['center_preds'], combined_out['batch_idx']
I verified that the TensorRT and PyTorch outputs are the same given the same input (with acceptable precision, in my opinion). I also tried an input that the network had not seen during export.
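A rough sketch of the comparison I mean (the tolerances here are simply what I consider acceptable):

```python
import numpy as np

def compare_outputs(torch_outs, trt_outs, rtol=1e-3, atol=1e-3):
    """Pairwise comparison of PyTorch and TensorRT outputs."""
    for i, (ref, out) in enumerate(zip(torch_outs, trt_outs)):
        ref, out = np.asarray(ref), np.asarray(out)
        max_err = float(np.max(np.abs(ref - out))) if ref.size else 0.0
        ok = np.allclose(ref, out, rtol=rtol, atol=atol)
        print(f"output {i}: shape={ref.shape}, max_abs_err={max_err:.6f}, allclose={ok}")
```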
Now I am trying to translate FSDv2 up to
voxel_feats = extract_output['virtual_feats']
voxel_coors = extract_output['virtual_coors']
voxel_xyz = extract_output['virtual_centers']
But there is another problem: insufficient workspace. FSDv2 seems to be too large for TensorRT. I tried to set the max workspace size to 1 << 50 (which is effectively infinity), but it didn't help... Can you suggest how I could simplify FSDv2 to reduce the workspace TRT needs (if this is possible at all)?
I set GetIndicePairsImplicitGemmPlugin to require 1,700,000 bytes of workspace, and I believe this is not the problem. The other plugins require 0 bytes of workspace.
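For reference, the builder-side setting I am referring to is this one (TensorRT 8.x Python API; the pool size below is just an example value):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()
# TensorRT >= 8.4: the workspace is configured via memory pool limits
# (older versions use config.max_workspace_size instead).
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 8 << 30)  # 8 GiB
```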
The zipped ONNX is about 160 MB, so it's hard to attach it here.
If I set multiscale_features=None in
extract_output = self.extract_feat(combined_out, dict_to_sample, multiscale_features=None)
would it be a serious problem for FSDv2?
It seems that I can't deploy FSDv2 due to memory limitations (TensorRT requests > 30 GB of GPU memory, which is impossible for me). @Abyssaledge is there any chance to simplify FSDv2 so that it can be deployed to TensorRT? Maybe FSD (not v2) would require less memory; can you estimate?
Why not try fewer channels or less depth?
I am trying to convert FSDv2 to ONNX (and then to TensorRT), but there is an error:
RuntimeError: ONNX export failed on an operator with unrecognized namespace torch_scatter::scatter_max. If you are trying to export a custom operator, make sure you registered it with the right domain and version.
It seems that I must convert the TorchEx operations to ONNX first. How difficult is that?
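For the torch_scatter::scatter_max error, the usual route seems to be registering a custom symbolic so that the export emits a custom-domain node. This is only a sketch: the domain and op name are placeholders, the exact argument list depends on the torch_scatter version, and TensorRT then needs a matching plugin for that node.

```python
import torch
from torch.onnx import register_custom_op_symbolic

def scatter_max_symbolic(g, src, index, dim, out, dim_size):
    # Emit a node in a custom domain; scatter_max returns (values, argmax),
    # hence outputs=2.
    return g.op("custom_domain::ScatterMax", src, index, dim, outputs=2)

# The opset version must match the one passed to torch.onnx.export.
register_custom_op_symbolic("torch_scatter::scatter_max",
                            scatter_max_symbolic, 13)
```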
Do you have any plans to speed up FSDv2?
FSDv2 timings on our point clouds range from 120 to 180 ms. I want to speed it up to at least 50 ms; otherwise it seems impossible to integrate FSDv2 into a real autonomous driving system...