neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

Deploying with TensorRT #364

Closed XiaoJiNu closed 2 years ago

XiaoJiNu commented 3 years ago

Could you provide a tutorial for deploying the sparsified YOLOv5l model with TensorRT?

mfl22 commented 3 years ago

I don't think this feature is implemented here; sparseml is focused on deployment on CPUs, see: https://docs.neuralmagic.com/deepsparse/ (NOTE: I'm just a user of sparseml, not a developer...)

XiaoJiNu commented 3 years ago

@mfproto Thank you. In that case, do you know how to run inference with the sparsified model on a GPU?

mfl22 commented 3 years ago

@XiaoJiNu that's a good question and a subject of ongoing research and development (especially for unstructured sparsity), but I don't think it is the focus here... sparseml focuses on model sparsification; deployment on specific platforms (other than the CPUs supported by the DeepSparse engine) is a separate issue...

XiaoJiNu commented 3 years ago

@mfproto Since we can train on a GPU, we should also be able to run inference on a GPU; the only problem is that the authors don't provide an inference script.

markurtz commented 3 years ago

Hi @XiaoJiNu, as @mfproto said, our main focus is compatibility and performance with the DeepSparse engine. We are working on TensorRT compatibility, though, mainly for internal benchmarking. The main issue is converting the quantized ONNX graph into something TensorRT can accept. The newer Ampere chips like the A100 do support semi-structured sparsity, and the models we've pushed largely meet the criteria that requires. We'll work on getting an example up to walk through this in more detail.

If you're looking to run right now, my recommendation would be to convert the sparse FP32 ONNX files to TensorRT through their supported pipeline. For quantized models, I would recommend taking our sparse (not quantized) models and running post-training quantization on them through the TensorRT API. Let me know if you need any other help!
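A minimal sketch of the FP32 pathway mentioned above, using TensorRT's Python ONNX parser; the file paths are placeholders and the exact builder API varies somewhat between TensorRT versions:

```python
# Hedged sketch: building a TensorRT engine from a sparse FP32 ONNX export.
# Paths are placeholders; API details (e.g. max_workspace_size vs. memory pool limits)
# differ between TensorRT 7.x and 8.x.
import tensorrt as trt

ONNX_PATH = "yolov5l_pruned.onnx"      # hypothetical sparse FP32 export
ENGINE_PATH = "yolov5l_pruned.engine"

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX graph; surface parser errors if any operators are unsupported.
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GiB; adjust to your GPU
# config.set_flag(trt.BuilderFlag.FP16)  # optional FP16 on supported GPUs

# Serialize and save the engine (TensorRT 8+ API).
serialized_engine = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(serialized_engine)
```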

XiaoJiNu commented 3 years ago

Hi @markurtz, thanks for your detailed reply! The other help I need is how to run inference with the sparse YOLOv5 model on a GPU. I can only find the training command, not an inference command.

markurtz commented 3 years ago

Hi @XiaoJiNu, if you're looking to go through the PyTorch APIs for GPU inference you can follow our example annotate application and set the device to cuda: https://github.com/neuralmagic/deepsparse/blob/main/examples/ultralytics-yolo/annotate.py#L102
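For reference, a bare-bones sketch of plain PyTorch GPU inference on a YOLOv5-style checkpoint; the checkpoint path is a placeholder, and it assumes the Ultralytics convention of storing the module under the "model" key (the annotate.py script linked above handles the loading and pre/post-processing for you):

```python
# Hedged sketch of manual PyTorch GPU inference on a sparsified YOLOv5 checkpoint.
# The path and checkpoint layout are assumptions; annotate.py sets the device for you.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ckpt = torch.load("yolov5l_pruned.pt", map_location=device)  # placeholder path
model = ckpt["model"].float().to(device).eval()  # YOLOv5 checkpoints keep the module under "model"

dummy = torch.zeros(1, 3, 640, 640, device=device)  # NCHW input at YOLOv5's default image size
with torch.no_grad():
    outputs = model(dummy)
```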

Unfortunately, PyTorch does not support quantization or sparsity on GPU through its APIs, though. If you'd like to try out speedups through TensorRT using our models, there are conversion pathways that accept ONNX formats. This tutorial, starting at section 4, walks through the process reasonably well: https://learnopencv.com/how-to-convert-a-model-from-pytorch-to-tensorrt-and-speed-up-inference/

We are working on a standardized tutorial and pathway for converting to TensorRT and will update here once that's available. TensorRT's support for ONNX operators can be limited, so if you run into any issues, let us know and we're happy to help more!

hunterchenghx commented 3 years ago

Hi @markurtz, I also want to deploy the sparse quantized model in TensorRT. I believe it is now possible to convert the ONNX model (with Q/DQ nodes) to TensorRT 8. However, TensorRT 8 doesn't support uint8 (int8 is fine), which is the data type produced by your training method. Could you tell me how to adjust the recipe or code so that I can fine-tune a model with int8 weights? Thanks so much!

markurtz commented 3 years ago

Hi @hunterchenghx, thanks for bringing this up. Over the next few weeks, we're going to be working on deployment flows for TensorRT. We'll update here with those fixes; we should be able to introduce an API that converts the uint8 to int8 without any additional editing or retraining of the model.
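For anyone experimenting in the meantime, a rough sketch of the underlying idea (not SparseML's eventual API): a uint8 quantized tensor with zero point z represents the same real values as an int8 tensor with zero point z - 128, so the Q/DQ zero points in the ONNX graph can be shifted. Note that any weights stored pre-quantized as uint8 initializers would need the same -128 shift applied to their values as well (not shown here):

```python
# Hedged sketch: shift uint8 Q/DQ zero points to int8 so the graph is TensorRT 8 friendly.
# File names are placeholders; this does not handle pre-quantized uint8 weight initializers.
import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("model_quant_uint8.onnx")
inits = {init.name: init for init in model.graph.initializer}

for node in model.graph.node:
    if node.op_type not in ("QuantizeLinear", "DequantizeLinear"):
        continue
    # The optional third input of Q/DQ nodes is the zero point.
    zp_name = node.input[2] if len(node.input) > 2 else None
    if zp_name and zp_name in inits:
        zp = numpy_helper.to_array(inits[zp_name])
        if zp.dtype == np.uint8:
            new_zp = (zp.astype(np.int32) - 128).astype(np.int8)
            inits[zp_name].CopyFrom(numpy_helper.from_array(new_zp, zp_name))

onnx.save(model, "model_quant_int8.onnx")
```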

markurtz commented 2 years ago

Hi @hunterchenghx @XiaoJiNu, we're running a bit behind on prioritizing this given our busy schedule. We are planning to add support for this in our 0.10 release, coming out in mid-to-late January.

Thanks, Mark

markurtz commented 2 years ago

Hi everyone, the TensorRT integration has grown in scope quite a bit beyond what we initially thought was a small issue. It is now targeted for our 0.11 release, which will come out in late February. If we push anything to our nightly builds before then, we will update here.

Thanks, Mark

markurtz commented 2 years ago

Hi everyone, we ran into quite a few issues with quantization-aware training and are currently working on a refactor to better support graphs for models like ResNet-50, BERT, and YOLOv5 on DeepSparse and TensorRT. This is being actively worked on now for our 0.12 release, and as soon as things begin landing on nightly we will update here!

Thanks, Mark

markurtz commented 2 years ago

Hi @XiaoJiNu, @mfproto, and @hunterchenghx, with our 0.12 release we've now converted our quantization flows to additionally work with TensorRT from ONNX. We'll be putting up a tutorial a bit later to show the full end-to-end flow, but running any quantization flow should now result in a TensorRT-compatible graph. Closing this ticket out; if there is any other support you need, feel free to open another issue or reach out in our Slack channel!