microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

"trt_cuda_graph_enable" bug in tensorrt. #20050

Open hy846130226 opened 5 months ago

hy846130226 commented 5 months ago

Describe the issue

I'm using ONNX Runtime with the TensorRT execution provider.

When I enable `trt_cuda_graph_enable` like this: [screenshot of the provider options]

Subsequently, no matter how many different images I pass in for inference, I always get back the result of the first image. [screenshots of identical outputs]

The following is my inference code: [image failed to upload]

The "input" and "output temp" buffers are reusable.

To reproduce

  1. Use ONNX Runtime 1.16.3 / 1.17.0 / 1.17.1 with the TensorRT execution provider and `trt_cuda_graph_enable` on.
  2. Run inference on different images and compare the outputs.

Urgency

No response

Platform

Windows

OS Version

WIN10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

No response

hy846130226 commented 5 months ago

If I disable `trt_cuda_graph_enable`, I get the correct result for every image.

tianleiwu commented 5 months ago

Make sure you use I/O binding to bind input tensors in GPU memory. During inference, copy the input to the same address that was used in the first inference run (the input shape must also stay the same).

You can get some ideas from the corresponding Python code: https://github.com/microsoft/onnxruntime/blob/4a196d15940b0f328735c888e2e861d67602ffcf/onnxruntime/python/tools/transformers/io_binding_helper.py#L212-L307
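
In C++ terms (the linked helper is Python), binding pre-allocated device buffers looks roughly like the sketch below. This is a minimal sketch, not code from this issue: the tensor names `input`/`output`, the shapes, and the element counts are hypothetical placeholders.

```cpp
#include <onnxruntime_cxx_api.h>
#include <array>

// Sketch: wrap pre-allocated CUDA buffers as ORT tensors and bind them once.
// Tensor names and shapes are placeholders; use your model's actual values.
void BindDeviceBuffers(Ort::IoBinding& binding, float* d_input, float* d_output) {
  // Describes CUDA device memory on GPU 0; ORT wraps the buffers, no copy.
  Ort::MemoryInfo cuda_mem("Cuda", OrtArenaAllocator, /*device_id=*/0,
                           OrtMemTypeDefault);

  const std::array<int64_t, 4> in_shape{1, 3, 224, 224};  // hypothetical
  const std::array<int64_t, 2> out_shape{1, 1000};        // hypothetical

  Ort::Value in_tensor = Ort::Value::CreateTensor<float>(
      cuda_mem, d_input, 1 * 3 * 224 * 224, in_shape.data(), in_shape.size());
  Ort::Value out_tensor = Ort::Value::CreateTensor<float>(
      cuda_mem, d_output, 1 * 1000, out_shape.data(), out_shape.size());

  // The binding keeps its own reference to the bound values, so they may go
  // out of scope here; what matters is that d_input/d_output stay fixed.
  binding.BindInput("input", in_tensor);
  binding.BindOutput("output", out_tensor);
}
```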

hy846130226 commented 5 months ago

Hi @tianleiwu

Thanks for your help!

I can use CUDA graphs in TensorRT directly, but I am confused about how to do it in ONNX Runtime with the TensorRT EP.

I know I'm supposed to copy the input to the same address, but shouldn't that be handled automatically when I call this method?

`std::vector<Value> Run(const RunOptions& run_options, const char* const* input_names, const Value* input_values, size_t input_count, const char* const* output_names, size_t output_count);`

But it seems like I have to use some method to get an IoBinding, and then every time I run inference on an image I have to rebind the address, even though my address is always the same. (Every time I get image data, I copy it to the input address; in other words, my address is reusable.)

hy846130226 commented 5 months ago

And by the way, how can I get an IoBinding from the C++ API when using the TensorRT EP?

tianleiwu commented 5 months ago

For CUDA graph, you should create the I/O binding only once. On the first call, the CUDA graph is captured. For the remaining calls, you only need to copy data to the same address and call Run with the I/O binding API to replay the captured graph.

An example of I/O binding for TRT in C++ is here: https://github.com/microsoft/onnxruntime/blob/4a196d15940b0f328735c888e2e861d67602ffcf/onnxruntime/test/shared_lib/test_inference.cc#L1897-L1909

An example of CUDA graph usage is here: https://github.com/microsoft/onnxruntime/blob/4a196d15940b0f328735c888e2e861d67602ffcf/onnxruntime/test/shared_lib/test_inference.cc#L1975-L1986

https://github.com/microsoft/onnxruntime/blob/4a196d15940b0f328735c888e2e861d67602ffcf/onnxruntime/test/shared_lib/test_inference.cc#L2052-L2081
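
Adapted to this issue's scenario, the end-to-end flow might look like the sketch below. This is a hedged sketch under assumptions, not the code from the linked tests: the model path, tensor names, and shapes are placeholders, and error handling is minimal.

```cpp
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <array>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "trt_cuda_graph");

  // Enable the TensorRT EP with CUDA graph capture (V2 provider options).
  const OrtApi& api = Ort::GetApi();
  OrtTensorRTProviderOptionsV2* trt_options = nullptr;
  Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&trt_options));
  const char* keys[] = {"trt_cuda_graph_enable"};
  const char* values[] = {"1"};
  Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(trt_options, keys, values, 1));

  Ort::SessionOptions so;
  so.AppendExecutionProvider_TensorRT_V2(*trt_options);
  Ort::Session session(env, L"model.onnx", so);  // placeholder path (Windows wide string)

  // Allocate device buffers ONCE; their addresses must stay fixed across runs.
  const size_t in_count = 1 * 3 * 224 * 224, out_count = 1 * 1000;  // placeholders
  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, in_count * sizeof(float));
  cudaMalloc(&d_out, out_count * sizeof(float));

  Ort::MemoryInfo cuda_mem("Cuda", OrtArenaAllocator, /*device_id=*/0,
                           OrtMemTypeDefault);
  const std::array<int64_t, 4> in_shape{1, 3, 224, 224};
  const std::array<int64_t, 2> out_shape{1, 1000};
  Ort::Value in_tensor = Ort::Value::CreateTensor<float>(
      cuda_mem, d_in, in_count, in_shape.data(), in_shape.size());
  Ort::Value out_tensor = Ort::Value::CreateTensor<float>(
      cuda_mem, d_out, out_count, out_shape.data(), out_shape.size());

  // Create the I/O binding ONCE and reuse it for every run.
  Ort::IoBinding binding(session);
  binding.BindInput("input", in_tensor);    // placeholder tensor names
  binding.BindOutput("output", out_tensor);

  std::vector<float> host_in(in_count), host_out(out_count);

  // First run: ORT captures the CUDA graph.
  cudaMemcpy(d_in, host_in.data(), in_count * sizeof(float), cudaMemcpyHostToDevice);
  session.Run(Ort::RunOptions{}, binding);

  // Every later run: copy the NEW image to the SAME device address; Run()
  // then replays the captured graph and d_out holds the new result.
  cudaMemcpy(d_in, host_in.data(), in_count * sizeof(float), cudaMemcpyHostToDevice);
  session.Run(Ort::RunOptions{}, binding);
  cudaMemcpy(host_out.data(), d_out, out_count * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_in);
  cudaFree(d_out);
  api.ReleaseTensorRTProviderOptions(trt_options);
  return 0;
}
```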

hy846130226 commented 5 months ago

Hi @tianleiwu

Thanks for your help!

I modified the code according to the example, but it does not work.

[screenshot of the modified code]

Am I missing something?

tianleiwu commented 5 months ago

@hy846130226, please bind inputs and outputs to buffers in GPU memory instead of CPU memory.
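
Concretely, the difference is in the `Ort::MemoryInfo` the bound tensors are created with; a quick sketch of the distinction (allocator arguments as in the earlier sketches):

```cpp
// CPU-backed memory info: tensors created with this live in host memory,
// which does not satisfy the fixed-device-address requirement of CUDA graph
// capture/replay.
Ort::MemoryInfo cpu_mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator,
                                                     OrtMemTypeDefault);

// CUDA-backed memory info: tensors wrap device buffers, so data copied to the
// same device address before each run is visible to the replayed graph.
Ort::MemoryInfo cuda_mem("Cuda", OrtArenaAllocator, /*device_id=*/0,
                         OrtMemTypeDefault);
```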

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.