tusen-ai / SST

Code for a series of work in LiDAR perception, including SST (CVPR 22), FSD (NeurIPS 22), FSD++ (TPAMI 23), FSDv2, and CTRL (ICCV 23, oral).
Apache License 2.0

FSDv2 speedup #165

Open ArseniuML opened 1 year ago

ArseniuML commented 1 year ago

I am trying to convert FSDv2 to ONNX (and next to TensorRT), but there is an error:

RuntimeError: ONNX export failed on an operator with unrecognized namespace torch_scatter::scatter_max. If you are trying to export a custom operator, make sure you registered it with the right domain and version.

It seems that I must convert the TorchEx operations to ONNX first. How difficult is that?
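As far as I understand, one direction is to register a custom symbolic for the op so that the exporter emits a node instead of failing. A rough sketch (the trt_plugins::ScatterMax domain and op name are placeholders I made up; a backend plugin would still have to implement it):

import torch
from torch.onnx import register_custom_op_symbolic

def scatter_max_symbolic(g, src, index, dim, out, dim_size):
    # torch_scatter's scatter_max returns (values, argmax), hence outputs=2;
    # dim is forwarded as a graph input, out/dim_size are ignored in this sketch.
    return g.op("trt_plugins::ScatterMax", src, index, dim, outputs=2)

register_custom_op_symbolic("torch_scatter::scatter_max", scatter_max_symbolic, opset_version=13)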

Do you have any plans to speed up FSDv2?

FSDv2 timings on our point clouds range from 120 to 180 ms. I want to speed it up to at least 50 ms - otherwise it seems impossible to integrate FSDv2 into a real autonomous driving system...

Abyssaledge commented 1 year ago

torch_scatter is not included in TorchEx. It comes from https://github.com/rusty1s/pytorch_scatter. Have you successfully deployed spconv? I think that is the most important thing.

ArseniuML commented 1 year ago

Unfortunately, I am stuck on exporting the aten::all operation to ONNX. It seems that a PyTorch update is needed, but I can't even launch FSDv2 with PyTorch 1.9.0.

ArseniuML commented 1 year ago
  1. Export to ONNX is hard because of errors like "div() takes 3 positional arguments but 4 were given" (related to the aten::div operation). It seems that I have to re-implement all the changes in the torch.onnx.symbolic_opset... files since PyTorch 1.8.1 and do it via register_custom_op_symbolic(...). And I'm not even considering spconv yet...
  2. In order to upgrade PyTorch to the newest version (2.1) I have to upgrade MMCV first. But there are breaking changes in MMCV 2.x:

https://github.com/open-mmlab/mmcv/pull/2216

Collate and scatter have been removed, and it is unclear what to do...

Perhaps I have to merge FSD into the latest mmdetection3d 1.3.0? In that case I would at least want a patch of FSD over mmdetection3d 0.5.0, so that I can apply it to mmdetection3d 1.3.0 (and then resolve conflicts by hand).

There is a 'first release' commit (fb8c92fa2aaef6b76aed0d60ce3df3f94b0604cd) - is it vanilla mmdetection3d 0.5.0?

  3. I visualized FSDv2 via torchview, and the graph seems to be huge. I have no hope of implementing it in TensorRT by hand...

Could you give some advice on the right way to deploy FSDv2 in TensorRT, if this is possible at all?

Abyssaledge commented 1 year ago

I would not recommend deploying a sparse-conv-based algorithm if you are not an experienced engineer... There are many things you need to deal with before deploying it, such as:

  1. Simplify the code. I expose many APIs and hyper-parameters to make FSD easy to modify. However, these APIs are not deployment-friendly.
  2. Fix the dynamic input size of backbone and rcnn.
  3. Deploy spconv
  4. ...

It took a couple of professional engineers at TuSimple, including the author of spconv, quite a long time to make it work. I can't help you much here because I did not join them.

It is really an odyssey for beginners to deal with all these things... However, I truly appreciate your effort and attention to our work.

ArseniuML commented 11 months ago

I am trying to deploy spconv, but there is one limitation that seems fundamental to me: the spconv output shape is data-dependent, i.e.

Dense 1D matrix:                    0 0 0 0 1 1 1 0 0 0 0
Sparse 1D matrix (feature, index):  (1, 4) (1, 5) (1, 6)
Dense 1D kernel:                    1 1 1
Dense result:                       0 0 1 2 3 2 1 0 0
Sparse result:                      (1, 2) (2, 3) (3, 4) (2, 5) (1, 6)

Dense 1D matrix:                    0 0 0 1 0 1 0 1 0 0 0
Sparse 1D matrix (feature, index):  (1, 3) (1, 5) (1, 7)
Dense 1D kernel:                    1 1 1
Dense result:                       0 1 1 2 1 2 1 1 0
Sparse result:                      (1, 1) (1, 2) (2, 3) (1, 4) (2, 5) (1, 6) (1, 7)

So the output matrix shape depends not only on the input matrix shape, but also on the data in the input matrices (specifically on the data in the input indices matrix).
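For illustration, the same data dependence can be checked with dense 1D convolutions: both inputs above have 3 non-zero entries, but the number of non-zero (i.e. active) outputs differs.

import torch
import torch.nn.functional as F

kernel = torch.ones(1, 1, 3)
a = torch.tensor([0., 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]).view(1, 1, -1)
b = torch.tensor([0., 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]).view(1, 1, -1)
print(int((F.conv1d(a, kernel) != 0).sum()))  # 5 active output sites
print(int((F.conv1d(b, kernel) != 0).sum()))  # 7 active output sites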

TensorRT has limited dynamic shape support: an input tensor can have an arbitrary shape (within a min/opt/max range), but the output tensor shape must strictly depend on the input tensor shape and is calculated at the time the TRT engine is built.

A solution is to pre-calculate an upper bound for the output tensor shape and fill only part of the output tensor elements at runtime, letting the others remain zero. But at the next step I need to slice this padded tensor, and it's unclear how to do this in the general case.
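To show what I mean by padding, a minimal sketch (my own helper names; it assumes the number of active output sites can be bounded, e.g. by num_active_in * kernel_volume capped by the grid size):

import torch

def pad_sparse_output(out_feats, out_indices, max_out):
    # out_feats: [n_real, C], out_indices: [n_real, ndim + 1];
    # both are padded with zeros up to a fixed max_out rows.
    n_real = out_feats.shape[0]
    feats_pad = out_feats.new_zeros(max_out, out_feats.shape[1])
    feats_pad[:n_real] = out_feats
    inds_pad = out_indices.new_zeros(max_out, out_indices.shape[1])
    inds_pad[:n_real] = out_indices
    return feats_pad, inds_pad, torch.tensor([n_real], dtype=torch.int32)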

Deploying spconv for FSD implies writing two plugins: a GetIndicePairs plugin and an ImplicitGemm plugin. ImplicitGemm uses the result of GetIndicePairs (a GetIndicePairs result can be reused by more than one ImplicitGemm layer). I tried to slice the GetIndicePairs result in this way:

Add an additional 1x1 output tensor to GetIndicePairs and populate it with real_indices_num, then perform dynamic slicing in PyTorch like this:

num_act_out_real = res[8][0]
return (res[0][:num_act_out_real, :],
        ...

where res[0] is the output indices tensor and res[8] is the additional tensor holding real_indices_num.

This model can be exported to ONNX with some Slice layers.

[attached image: spconv_deploy_test]

The problem is that I can build a TRT engine from this ONNX, but I can't run the engine (engine->createExecutionContext() returns nullptr). Meanwhile, if I perform slicing with a constant:

num_act_out_real = 24 #for test
return (res[0][:num_act_out_real, :],
        ...

the TRT engine runs normally. So I suppose that ONNX supports dynamic slicing, but TRT does not...

I can work around this and can imagine some schemes for passing real_indices_num from GetIndicePairs to ImplicitGemm:

  1. Straightforward method.

Add an additional 1x1 output tensor to GetIndicePairs and a 1x1 input tensor to ImplicitGemm. After the indice pairs are calculated in the GetIndicePairs layer, copy real_indices_num from the CPU to this output tensor on the GPU. In the ImplicitGemm plugin, copy real_indices_num from the 1x1 input tensor on the GPU to a CPU variable. Perform indices slicing based on real_indices_num.

  2. The CPU-GPU-CPU copying seems inelegant to me, so perhaps there is another way:

Enumerate the GetIndicePairs and ImplicitGemm layers so that each ImplicitGemm layer knows the number of its parent GetIndicePairs layer. Allocate a static CPU array with one element per GetIndicePairs layer. Each ImplicitGemm layer will then know the real_indices_num needed for slicing.

But the problem is: how do I perform slicing after the ImplicitGemm layer? What happens if I don't perform slicing and simply pass the zero-padded features and indices tensors further?

  1. Won't we lose all the benefit of sparsity?
  2. The sparse tensor could look like this (feature, index):

(1, 0) (3, 5) (4, 6) (0, 0) - pad (0, 0) - pad (0, 0) - pad

If slicing is not performed, this is technically an inconsistent sparse tensor - the feature at index 0 can be both 1 (the real entry) and 0 (the padding).

It seems that a TRT plugin for traveller59's spconv exists: https://github.com/traveller59/spconv/blob/master/docs/TENSORRT_INT8_GUIDE.md

But the discussion at https://github.com/traveller59/spconv/issues is almost dead...

ArseniuML commented 11 months ago

I think I must feed num_act_out_real (real_indices_num) to some FSD-specific layers so that they can perform the slicing.

Or perform padding like this:

(1, 0) (3, 5) (4, 6) (0, -1) - pad (0, -1) - pad (0, -1) - pad

And consumers of the sparse input must know that features with index -1 do not really exist.
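A minimal sketch of what a consumer could do with such padding (my own hypothetical helper; masking is shape-preserving, unlike slicing):

import torch

def mask_padding(feats, indices):
    # Padded rows are marked with -1 in the index column; zero out their features
    # so they contribute nothing downstream, without changing tensor shapes.
    valid = (indices[:, 0] >= 0).unsqueeze(-1).to(feats.dtype)
    return feats * valid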

ArseniuML commented 11 months ago

"Fix the dynamic input size of backbone and rcnn."

As I understand it, there are 2 backbones - the backbone of the segmentor, which is SimpleSparseUNet, and the backbone of SingleStageFSDv2, which is VirtualVoxelMixer. @Abyssaledge, do you mean the input sizes of both backbones need to be fixed?

Abyssaledge commented 11 months ago

Yes, I believe all the inputs should be in fixed sizes.
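For example, a rough sketch of fixing one input size by padding/truncating the voxel features and coordinates to a fixed bound (the names and the bound here are only illustrative):

import torch

MAX_VOXELS = 60000  # illustrative upper bound

def fix_input_size(feats, coors):
    # Truncate if there are too many voxels, otherwise pad with zeros
    # (coordinates of padded rows are set to -1 so they can be ignored).
    n = feats.shape[0]
    if n >= MAX_VOXELS:
        return feats[:MAX_VOXELS], coors[:MAX_VOXELS]
    pad = MAX_VOXELS - n
    feats = torch.cat([feats, feats.new_zeros(pad, feats.shape[1])], dim=0)
    coors = torch.cat([coors, coors.new_full((pad, coors.shape[1]), -1)], dim=0)
    return feats, coors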

ArseniuML commented 10 months ago

It seems that TensorRT 8.6 supports data-dependent operations (for example NonZero). Data-dependent shapes are still not supported in plugin outputs, but there is a workaround for this: add an additional output tensor num_out and perform the slicing after the plugin. So perhaps there is no need to fix the backbone input size.
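On the PyTorch side the workaround is the same dynamic slicing as before; a tiny sketch (hypothetical helper, where num_out is the extra 1x1 plugin output):

import torch

def slice_by_count(padded, num_out):
    # num_out is a 1-element tensor produced by the plugin; indexing with its
    # value is what becomes a data-dependent Slice in the exported graph.
    return padded[: num_out[0]]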

In this way I translated FSDv2 to TensorRT up to the combined_out = self.combine_classes(...) point in simple_test(...).

The output tensors are: combined_out['seg_points'], combined_out['seg_logits'], combined_out['seg_vote_preds'], combined_out['seg_feats'], combined_out['center_preds'], combined_out['batch_idx']

I verified that the TensorRT and PyTorch outputs are the same given the same input (with acceptable precision, in my opinion). I tried an input that the network didn't see during export.

Now I am trying to translate FSDv2 up to

voxel_feats = extract_output['virtual_feats']
voxel_coors = extract_output['virtual_coors']
voxel_xyz = extract_output['virtual_centers']

But there is another problem: insufficient workspace. FSDv2 seems to be too large for TensorRT. I tried setting the max workspace size to 1 << 50 (which is effectively infinite), but it didn't help... Can you suggest how I could simplify FSDv2 to reduce the workspace TRT needs (if this is possible at all)?

I set GetIndicePairsImplicitGemmPlugin to require 1,700,000 bytes of workspace, and I believe this is not the problem. The other plugins require 0 bytes of workspace.
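For reference, this is roughly how the workspace limit can be set when building the engine with the TensorRT Python API (TRT >= 8.4; the size is only an example):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
# Cap the scratch memory TRT may use when choosing layer implementations.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 34)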

The zipped ONNX is about 160 MB, so it's hard to attach it here.

ArseniuML commented 10 months ago

If I set multiscale_features=None in

extract_output = self.extract_feat(combined_out, dict_to_sample, multiscale_features=None)

would it be a serious problem for FSDv2?

ArseniuML commented 10 months ago

It seems that I can't deploy FSDv2 due to memory limitations (TensorRT requests > 30 GB of GPU memory, which is impossible for me). @Abyssaledge, is there any chance to simplify FSDv2 enough to deploy it to TensorRT? Maybe FSD (not v2) would require less memory - can you estimate?

Abyssaledge commented 10 months ago

Why not try smaller channels or depth?