verlab / accelerated_features

Implementation of XFeat (CVPR 2024). Do you need robust and fast local feature extraction? You are in the right place!
https://www.verlab.dcc.ufmg.br/descriptors/xfeat_cvpr24
Apache License 2.0

[FEATURE] Support ONNX and TensorRT export #4

Open · IamShubhamGupto opened 7 months ago

IamShubhamGupto commented 7 months ago

Hello @guipotje and team,

In this PR, I am adding support for exporting ONNX and TensorRT models. When running the TensorRT engine, we can observe an improvement in FPS. Since the evaluation code is currently unavailable, I cannot quantify the accuracy, but it appears similar to the performance of the provided xfeat.pt weights. The following is currently supported and changed:

We can possibly further improve FPS by simplifying the ONNX export using onnx-simplifier, but that can be an iterative PR.
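For reference, the simplification pass is only a few lines with onnx-simplifier's Python API. A minimal sketch, assuming the export is saved as `xfeat.onnx` (the file names here are placeholders, not from this PR):

```python
# Minimal sketch: simplify an exported ONNX graph with onnx-simplifier.
# "xfeat.onnx" is a placeholder file name, not necessarily this PR's output.
import onnx
from onnxsim import simplify

model = onnx.load("xfeat.onnx")
# simplify() folds constants and removes redundant nodes; `ok` reports
# whether the simplified graph still matches the original's outputs.
model_simplified, ok = simplify(model)
assert ok, "simplified model diverged from the original"
onnx.save(model_simplified, "xfeat_simplified.onnx")
```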

I will be attaching logs and performance observations in comments below. Let me know if there are any changes required. Thank you again for the amazing work!

IamShubhamGupto commented 7 months ago

TensorRT export logs and statistics - trt.log

| mode | observed FPS |
| --- | --- |
| XFeat (cuda) | 21.4 |
| XFeat (tensorrt) | 24.7 |

[Screenshot: XFeat (cuda), 2024-05-03 03-17-08]

[Screenshot: XFeat (tensorrt), 2024-05-03 03-25-10]

TensorRT matches appear sparser for nearly identical poses, but performance generally seems the same, if not better.

Device: Nvidia Jetson AGX Xavier, JetPack 5.1
Docker image: dustynv/l4t-ml:r35.4.1
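For anyone reproducing these numbers, a rough sketch of building an engine from the exported ONNX with the TensorRT Python API follows; the file names are placeholders, and the PR's actual code may differ:

```python
# Rough sketch: build a TensorRT engine from an exported ONNX file.
# File names are placeholders, not necessarily this PR's exact code.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("xfeat.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # FP16 helps on Jetson-class GPUs

serialized = builder.build_serialized_network(network, config)
with open("xfeat.engine", "wb") as f:
    f.write(serialized)
```

(On JetPack images, the bundled trtexec tool can serve the same purpose from the command line.)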

guipotje commented 6 months ago

Hello @IamShubhamGupto and @acai66, thank you so much for your contributions!

I notice that both of you are providing an ONNX export, and it probably doesn't make sense to merge both since they serve the same purpose. Do you have any suggestions on how to proceed? In my opinion, it makes sense to choose the one that supports the earliest operator set, as that would maximize the model's compatibility with the greatest number of devices and libraries.

IamShubhamGupto commented 6 months ago

Sure, maybe @acai66 and I can collaborate on our work. In summary, there is no specific operator set defined in this PR; it is up to the user to provide one. By default it is the latest supported, which is 17. Users can choose an older opset such as 13 by providing a command-line argument:

python3 export.py --onnx_opset 13 
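For context, a sketch of what such an opset-configurable export can look like; the import path, input shape, and output names below are assumptions rather than the PR's exact code:

```python
# Sketch of an opset-configurable ONNX export. The import path, input
# shape, and output names are assumptions, not necessarily the PR's code.
import argparse
import torch
from modules.model import XFeatModel  # assumed location of the network

parser = argparse.ArgumentParser()
parser.add_argument("--onnx_opset", type=int, default=17,
                    help="older opsets (e.g. 13) widen device/library support")
args = parser.parse_args()

net = XFeatModel().eval()
dummy = torch.randn(1, 1, 480, 640)  # grayscale input; shape is an example
torch.onnx.export(
    net, dummy, "xfeat.onnx",
    opset_version=args.onnx_opset,
    input_names=["image"],
    output_names=["feats", "keypoints", "heatmap"],  # assumed names
    dynamic_axes={"image": {2: "height", 3: "width"}},
)
```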

Another key difference in our PRs is that @acai66 can quantize both the XFeatModel and XFeat modules, whereas I chose to quantize only XFeatModel. This feature can be merged into my branch, as the code to then export the TensorRT engine is already available here.

Let me know what you guys think.

IamShubhamGupto commented 6 months ago

Hey just bumping this conversation up again @guipotje @acai66

acai66 commented 6 months ago

Thank you for your suggestion. I concur with your viewpoint and believe that merging our work would indeed be a beneficial course of action.

acai66 commented 6 months ago

My PR is aimed at improving the convenience and compatibility of deployment, especially in environments without PyTorch. I've uploaded an example of using the ONNX Runtime Python API, which can be easily reimplemented using the ONNX Runtime C++ API.
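For readers unfamiliar with it, inference through the ONNX Runtime Python API is only a few lines. A minimal sketch, with the file name and input shape assumed:

```python
# Minimal ONNX Runtime inference sketch; the file name and input shape
# are assumptions, not taken from the PR.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("xfeat.onnx",
                               providers=["CPUExecutionProvider"])
image = np.random.rand(1, 1, 480, 640).astype(np.float32)  # stand-in frame
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: image})  # None -> all model outputs
```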

IamShubhamGupto commented 6 months ago

That's amazing! We can find some time this week to start merging the examples. I'll give you editor access to my fork.

guipotje commented 6 months ago

Fantastic, @IamShubhamGupto and @acai66! While I'm not experienced in ONNX, I'm here to help with anything related to the original code — just let me know.

acai66 commented 6 months ago

Thank you very much for the invitation. My only concern is that I have little experience with TensorRT deployment. I have invited you to be a collaborator on the forked repository.

acai66 commented 6 months ago

I apologize for my poor English; I am using translation software to communicate. I have made some modifications to the source code to improve the compatibility and convenience of exporting to ONNX. However, I am unsure whether this affects the original PyTorch functionality or performance, and I am unable to run sufficient tests myself.
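One inexpensive sanity check for that concern is to compare the PyTorch and ONNX Runtime outputs on the same input. A sketch under assumed names (import path, file name, and output ordering are all assumptions):

```python
# Sanity-check sketch: compare PyTorch and ONNX Runtime outputs on one input.
# Import path, file name, and output ordering are assumptions.
import numpy as np
import onnxruntime as ort
import torch
from modules.model import XFeatModel  # assumed location of the network

net = XFeatModel().eval()
x = torch.randn(1, 1, 480, 640)
with torch.no_grad():
    torch_out = net(x)[0].numpy()  # first output head

sess = ort.InferenceSession("xfeat.onnx",
                            providers=["CPUExecutionProvider"])
onnx_out = sess.run(None, {sess.get_inputs()[0].name: x.numpy()})[0]
print("max abs diff:", np.abs(torch_out - onnx_out).max())
```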

IamShubhamGupto commented 6 months ago

@acai66, for now I am pausing development on my branch. We will merge the ONNX export from your PR, and then I will continue TensorRT development here.

I will review and commit changes on your PR in the next couple of hours.