hmorimitsu / ptlflow

PyTorch Lightning Optical Flow models, scripts, and pretrained weights.
Apache License 2.0

TensorRT support of rapidflow #62

Closed · Cai-RS closed 5 months ago

Cai-RS commented 5 months ago

Hi, thanks for your great work! I want to export rapidflow to ONNX, but it seems there are many custom layers in it (mainly from local_timm), and some ops are not supported by the ONNX opset. Do you know how I can solve this problem?

hmorimitsu commented 5 months ago

Hi,

I updated rapidflow's readme with some instructions about how to export to ONNX. Please take a look and see if it works for you. It requires ONNX opset >= 16.
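For reference, a minimal export sketch could look like the one below. The checkpoint name and the "images"/"flows" dict keys follow common ptlflow conventions, so please double-check them against the readme; this is only a sketch, not the exact script from the repo.

```python
# Minimal sketch: export RAPIDFlow to ONNX with opset >= 16.
# Assumptions: ptlflow.get_model() accepts these arguments, and the
# model consumes/produces dicts keyed by "images"/"flows".
import torch
import ptlflow

class RapidFlowWrapper(torch.nn.Module):
    """Adapts the dict-based ptlflow interface to plain tensors for export."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, images):
        # images: (batch, 2, 3, H, W), the pair of input frames.
        return self.model({"images": images})["flows"]

model = ptlflow.get_model("rapidflow", pretrained_ckpt="things")  # ckpt name is illustrative
wrapper = RapidFlowWrapper(model).eval()

dummy = torch.randn(1, 2, 3, 384, 1280)
torch.onnx.export(
    wrapper,
    dummy,
    "rapidflow.onnx",
    opset_version=16,
    input_names=["images"],
    output_names=["flows"],
)
```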

Best,

Henrique

Cai-RS commented 5 months ago

> Hi,
>
> I updated rapidflow's readme with some instructions about how to export to ONNX. Please take a look and see if it works for you. It requires ONNX opset >= 16.
>
> Best,
>
> Henrique

OMG, thank you so much! It does work, even for a beginner like me! I am considering learning how to write a TensorRT plugin for 'alt_cuda_corr' and registering it. But you mentioned that alt_cuda_corr is much slower than all-pairs corr, and speed is much more important for me. So maybe directly converting the model to ONNX is the best choice for me? BTW, during conversion I noticed that you set (384, 1280) as the default padding image size for KITTI. Why not choose (384, 1248), which is closer to the original resolution (375, 1242)?

hmorimitsu commented 5 months ago

> I am considering learning how to write a TensorRT plugin for 'alt_cuda_corr' and registering it. But you mentioned that alt_cuda_corr is much slower than all-pairs corr, and speed is much more important for me. So maybe directly converting the model to ONNX is the best choice for me?

If you are not going to use large images, then all-pairs corr is probably better.

> BTW, during conversion I noticed that you set (384, 1280) as the default padding image size for KITTI. Why not choose (384, 1248), which is closer to the original resolution (375, 1242)?

Yes, you can use 1248. I just chose 1280 out of habit because it is divisible by higher powers of 2 (1280 = 2^8 × 5, while 1248 = 2^5 × 39), but that is not necessary here.

Actually, since the padding/resizing is inside RAPIDFlow (see self.preprocess_images and self.postprocess_predictions inside RAPIDFlow's forward), I think you can even use (375, 1242) as the size. However, in an ONNX application, it is maybe a better idea to move those resizing operations out of the model to make it more flexible and possibly avoid unnecessary operations.
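As a sketch of what moving the padding outside the model could look like (the stride value and function names here are illustrative, not part of ptlflow):

```python
# Sketch: pad/unpad outside the network, assuming the model wants
# input dimensions divisible by some stride (e.g., 32).
import torch.nn.functional as F

def pad_to_multiple(images, stride=32):
    """Zero-pad (..., H, W) tensors so H and W become multiples of `stride`."""
    h, w = images.shape[-2:]
    pad_h = (stride - h % stride) % stride
    pad_w = (stride - w % stride) % stride
    # Pad only right and bottom so cropping back is a simple slice.
    return F.pad(images, (0, pad_w, 0, pad_h)), (h, w)

def crop_flow(flow, orig_size):
    """Crop a flow field (..., H', W') back to the original size."""
    h, w = orig_size
    return flow[..., :h, :w]
```

Since this pads rather than resizes, the flow values themselves need no rescaling afterwards.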

Cai-RS commented 5 months ago

> Actually, since the padding/resizing is inside RAPIDFlow (see self.preprocess_images and self.postprocess_predictions inside RAPIDFlow's forward), I think you can even use (375, 1242) as the size. However, in an ONNX application, it is maybe a better idea to move those resizing operations out of the model to make it more flexible and possibly avoid unnecessary operations.

Thanks for your quick reply. Yes, I see that the pre- and post-processing are included in the ONNX model. I added the `dynamic_axes` setting when calling the ONNX export function so that the model can accept different input sizes. I think this is good because I don't have to manually resize the raw input for the network, and a resize op running on the GPU inside the ONNX or TensorRT engine should be faster than doing it manually anyway.
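For anyone following along, the `dynamic_axes` argument mentioned above looks roughly like this; `wrapper` and `dummy` are the objects from the export sketch earlier in the thread, and the axis indices for "flows" depend on the exact output shape:

```python
# Sketch: mark batch, height, and width as dynamic so one ONNX file
# accepts different input resolutions.
torch.onnx.export(
    wrapper,
    dummy,
    "rapidflow_dynamic.onnx",
    opset_version=16,
    input_names=["images"],
    output_names=["flows"],
    dynamic_axes={
        "images": {0: "batch", 3: "height", 4: "width"},
        # Adjust the indices below to match the actual output layout.
        "flows": {0: "batch", 3: "height", 4: "width"},
    },
)
```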

Cai-RS commented 5 months ago

> Actually, since the padding/resizing is inside RAPIDFlow (see self.preprocess_images and self.postprocess_predictions inside RAPIDFlow's forward), I think you can even use (375, 1242) as the size. However, in an ONNX application, it is maybe a better idea to move those resizing operations out of the model to make it more flexible and possibly avoid unnecessary operations.

Well, I think you are right. Maybe the pre- and post-processing operations still need to be placed outside the network, because the resolutions of the two image frames in some datasets may be different... Or change the network to accept two inputs and resize them separately, as in the sketch below.
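A sketch of that two-input idea, purely illustrative and not part of ptlflow (the dict keys and target size are assumptions):

```python
# Sketch: accept two frames separately, resize each to a common working
# resolution, then stack them for the model.
import torch
import torch.nn.functional as F

class TwoInputWrapper(torch.nn.Module):
    def __init__(self, model, size=(384, 1280)):
        super().__init__()
        self.model = model
        self.size = size

    def forward(self, img1, img2):
        # img1, img2: (batch, 3, H1, W1) and (batch, 3, H2, W2); the two
        # frames may have different resolutions.
        img1 = F.interpolate(img1, size=self.size, mode="bilinear", align_corners=False)
        img2 = F.interpolate(img2, size=self.size, mode="bilinear", align_corners=False)
        images = torch.stack([img1, img2], dim=1)  # (batch, 2, 3, H, W)
        return self.model({"images": images})["flows"]
```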

Cai-RS commented 5 months ago

> I updated rapidflow's readme with some instructions about how to export to ONNX. Please take a look and see if it works for you. It requires ONNX opset >= 16.

Sorry to bother you again. I ran inference on the same image pair many times with both the ONNX and the PyTorch model. The PyTorch results are identical on every run, but the ONNX results change every time. I reached this conclusion by comparing the post-processed flow results of ONNX and PyTorch.

[Screenshots: three inference results from the same ONNX model on the same image pair, each slightly different]

These are three inference results on the same image pair from one ONNX model. Do you know the reason for this problem?

hmorimitsu commented 5 months ago

Hi, I am sorry, but I am not very familiar with ONNX. Right now I cannot imagine what would cause the results to change. It seems the difference is not very large. Is it possible that ONNX does not compute results exactly (only approximately) to speed up the process, thus causing some fluctuations in the results?

I will try to think more about it and let you know if I find something. If you find out the reason, please let me know. I am also curious.

Cai-RS commented 5 months ago

> Hi, I am sorry, but I am not very familiar with ONNX. Right now I cannot imagine what would cause the results to change. It seems the difference is not very large. Is it possible that ONNX does not compute results exactly (only approximately) to speed up the process, thus causing some fluctuations in the results?
>
> I will try to think more about it and let you know if I find something. If you find out the reason, please let me know. I am also curious.

Thanks for your reply. It seems to be an issue with onnxruntime, which doesn't provide deterministic computations; see https://stackoverflow.com/questions/69053582/onnx-model-inference-produces-different-results-for-the-same-input and https://github.com/microsoft/onnxruntime/issues/4611
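One mitigation sometimes suggested in those threads is pinning the cuDNN convolution algorithm search, which is a documented option of onnxruntime's CUDA execution provider; whether it removes the fluctuation here is something that would need testing:

```python
# Sketch: a GPU session with a fixed cuDNN conv algorithm search,
# a commonly suggested step toward more reproducible results.
# Not guaranteed to make runs bit-identical.
import onnxruntime as ort

providers = [
    ("CUDAExecutionProvider", {"cudnn_conv_algo_search": "DEFAULT"}),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("rapidflow.onnx", providers=providers)
```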

It seems there is nothing we can do to avoid the problem. The images I showed you before compare the results of PyTorch inference on the CPU with the results of ONNX inference on onnxruntime-GPU, using the tolerance rtol=0.001, atol=1e-05. Yes, the difference is totally acceptable. But when I compare the results of PyTorch inference on the GPU with the results of ONNX inference on onnxruntime-GPU, it looks like this:

[Screenshot: mismatch counts for pytorch-GPU vs. onnxruntime-GPU at rtol=0.001, atol=1e-05]

The number of mismatches is pretty large in the pytorch-GPU case. I guessed that maybe the error threshold was too small, so I raised it to rtol=0.001, atol=0.0001, and the result is better:

[Screenshot: mismatch counts at rtol=0.001, atol=0.0001]

Raising the threshold further to rtol=0.01, atol=0.001, the result is "stably" better (the number of mismatches is 0~3):

[Screenshot: mismatch counts at rtol=0.01, atol=0.001]

Apparently almost all mismatches are below 0.01 (rtol) and 0.001 (atol), and there is only one pixel whose rtol is always nearly 10 (I don't know which point it is or why). Anyway, considering that this is a comparison of post-processed results and the smallest practical unit of flow is 1 pixel, we can say the performance of the converted ONNX model is satisfactory.
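For reference, a mismatch count like the ones above can be reproduced with something along these lines (the array names are placeholders):

```python
# Sketch: count elementwise mismatches between two flow outputs, in the
# spirit of the comparisons above. `flow_torch` and `flow_onnx` are
# placeholder numpy arrays of identical shape.
import numpy as np

def count_mismatches(flow_torch, flow_onnx, rtol=1e-3, atol=1e-5):
    close = np.isclose(flow_torch, flow_onnx, rtol=rtol, atol=atol)
    return int((~close).sum())
```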

BTW, from the comparison we know that PyTorch inference on the GPU and on the CPU give different results. And after verification, the difference between PyTorch CPU and GPU inference is close to the difference between onnxruntime-GPU and pytorch-GPU. But why do the results of onnxruntime-GPU look more like the results of PyTorch running on the CPU than like PyTorch on the GPU? I tried both an ONNX model exported from PyTorch running on the CPU and one exported on the GPU, and got the same results. I did install onnxruntime-gpu rather than the CPU version. I can't find the answer to this... (but it doesn't affect the use of ONNX, I'm just curious, hah...)

[Screenshots: pytorch-CPU vs. pytorch-GPU comparison, and onnxruntime-GPU vs. pytorch-CPU comparison]

hmorimitsu commented 5 months ago

Thank you for this detailed analysis; I also learned a lot from it! I didn't know there were so many nuances in how you run the model; it is very interesting for practical applications. Thank you for checking that the ONNX version is still relatively stable, too. I hadn't tested it very extensively before, so it's good to know it works.

Cai-RS commented 5 months ago

> Thank you for this detailed analysis; I also learned a lot from it! I didn't know there were so many nuances in how you run the model; it is very interesting for practical applications. Thank you for checking that the ONNX version is still relatively stable, too. I hadn't tested it very extensively before, so it's good to know it works.

Thank you for providing such an excellent model with the best balance of speed and accuracy. Learning to design such an innovative model structure is too difficult for me (AI theory develops too fast), while learning how to deploy and apply the model is relatively simple but also interesting. Thank you again for your work and your patient replies. Best wishes!