microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[TensorRT EP] How can I disable generating cache when using trt execution provider #22822

Open noahzn opened 3 days ago

noahzn commented 3 days ago

I have already generated some TRT caches when running inference on my ONNX model with the TRT Execution Provider. Then, for the online testing of my model, I set so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL, but it seems that new caches are still generated. I only want to reuse the old cache without generating new ones. How can I do that? Thanks in advance!

import os

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
trt_engine_cache_path = "weights/.trtcache_engines"
trt_timing_cache_path = "weights/.trtcache_timings"

# Create the 'weights' directory if it doesn't exist
os.makedirs(os.path.dirname(trt_engine_cache_path), exist_ok=True)

if conf.trt:  # conf is the application's own config object
    # Prepend the TensorRT EP so it takes priority over CUDA/CPU
    providers = [
        (
            "TensorrtExecutionProvider",
            {
                "trt_max_workspace_size": 2 * 1024 * 1024 * 1024,
                "trt_fp16_enable": True,
                "trt_engine_cache_enable": True,
                "trt_timing_cache_enable": True,
                "trt_engine_cache_path": trt_engine_cache_path,
                "trt_timing_cache_path": trt_timing_cache_path,
            },
        )
    ] + providers
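
(For context, a minimal sketch of how this provider list would then be consumed when creating the session; the model path and variable names are assumptions, not part of the original post.)

import onnxruntime as ort

# Sketch only: "model.onnx" is a placeholder path.
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options=so, providers=providers)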
yf711 commented 3 days ago

Hi @noahzn Your old engine/profile might not be reused by the TRT EP if the current inference parameters, cache name, environment variables, or hardware environment change.

Here's more info about engine reusability: https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#trt_engine_cache_enable

I wonder: if you replace your old engine/profile with the newly generated ones, is that new engine reused, or does yet another engine need to be generated?
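
(A sketch of one way to check that empirically, not taken from the thread: snapshot the cache directory before and after an inference run and see whether new files appear. Here "weights" is assumed to be where the engine/profile/timing caches land, and session / inputs are assumed to already exist.)

import os

before = set(os.listdir("weights"))
session.run(None, inputs)
after = set(os.listdir("weights"))
print("newly generated cache files:", sorted(after - before))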

noahzn commented 3 days ago

@yf711 Thanks for your reply! My networks do keypoint detection and matching. I think the issue is that we cannot guarantee the same number of keypoints is extracted from both images. I have warmed up the networks with about 10k pairs of images, but new engines are still generated for some image pairs. I think the old engines are still being used, because inference is indeed faster. What can I do in this case? Will trt_profile_min_shapes and trt_profile_max_shapes help? I tried setting them for the input dimensions, but it's not enough: Following input(s) has no associated shape profiles provided: /Reshape_3_output_0,/norm/Div_output_0,/Resize_output_0,/Unsqueeze_18_output_0,/NonZero_output_0. Maybe some intermediate layers also need to be given dimension ranges?
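
(For reference, a sketch of how explicit shape profiles are passed to the TRT EP via trt_profile_min_shapes / trt_profile_opt_shapes / trt_profile_max_shapes; the tensor names and dimension ranges below are hypothetical placeholders, not taken from this model.)

# Sketch only: "keypoints0"/"keypoints1" and the ranges are made-up examples.
# Each option takes a comma-separated list of "tensor_name:dim1xdim2x..." entries.
trt_ep_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": trt_engine_cache_path,
    "trt_profile_min_shapes": "keypoints0:1x1x2,keypoints1:1x1x2",
    "trt_profile_opt_shapes": "keypoints0:1x512x2,keypoints1:1x512x2",
    "trt_profile_max_shapes": "keypoints0:1x2048x2,keypoints1:1x2048x2",
}
providers = [
    ("TensorrtExecutionProvider", trt_ep_options),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]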