microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai

TensorRT cache is not being re-used with dynamic dimensions #16889

Open talmaj-at-hypothetic opened 1 year ago

talmaj-at-hypothetic commented 1 year ago

Using onnxruntime with the TensorrtExecutionProvider rebuilds the engine cache whenever you pass trt_profile_min_shapes, trt_profile_opt_shapes, and trt_profile_max_shapes. If I first build the cache with dynamic shapes and then do not define the trt_profile shapes in the next inference, the cache is used.

I am using: onnxruntime-gpu==1.15.0

To reproduce

Run the example twice:

import numpy as np
import onnxruntime as ort

ort.set_default_logger_severity(0) # Turn on verbose mode for ORT TRT
sess_options = ort.SessionOptions()

# TRT EP options with explicit dynamic-shape profiles (batch 2 for min/opt, batch 32 for max).
trt_ep_options = {
    "trt_fp16_enable": True,
    "trt_engine_cache_enable": True,
    "trt_profile_min_shapes": "sample:2x4x64x64,encoder_hidden_states:2x77x768",
    "trt_profile_max_shapes": "sample:32x4x64x64,encoder_hidden_states:32x77x768",
    "trt_profile_opt_shapes": "sample:2x4x64x64,encoder_hidden_states:2x77x768",
}

sess = ort.InferenceSession(
    "my_model.onnx",
    providers=[
        ("TensorrtExecutionProvider", trt_ep_options),
        "CUDAExecutionProvider",
    ],
)

batch_size = 1
unet_dim = 4
max_text_len = 77
embed_dim = 768
latent_height = 64
latent_width = 64

# Inputs match the min/opt profile shapes: sample 2x4x64x64, encoder_hidden_states 2x77x768.
args = {
    "sample": np.zeros(
        (2 * batch_size, unet_dim, latent_height, latent_width), dtype=np.float32
    ),
    "timestep": np.ones((1,), dtype=np.float32),
    "encoder_hidden_states": np.zeros(
        (2 * batch_size, max_text_len, embed_dim),
        dtype=np.float32,
    ),
}
sess.run(None, args)

Urgency

Low.

Platform

Linux

OS Version

Ubuntu 20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu==1.15.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

docker image: nvcr.io/nvidia/tensorrt:22.12-py3

tianleiwu commented 1 year ago

I recommend using a different engine cache path for each profile, like this: https://github.com/microsoft/onnxruntime/blob/21a71d52bd2074b770807b209939ec11e2c64fa7/onnxruntime/python/tools/transformers/models/stable_diffusion/onnxruntime_tensorrt_txt2img.py#L94
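
For illustration, a minimal sketch of that approach, assuming a hypothetical make_trt_options helper and cache folder name; the key point is giving each shape profile its own trt_engine_cache_path so engines built for different ranges never overwrite each other:

import onnxruntime as ort

# Hypothetical helper: one dedicated cache folder per shape profile, so engines
# built for different ranges land in different directories.
def make_trt_options(cache_dir, min_shapes, opt_shapes, max_shapes):
    return {
        "trt_fp16_enable": True,
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": cache_dir,
        "trt_profile_min_shapes": min_shapes,
        "trt_profile_opt_shapes": opt_shapes,
        "trt_profile_max_shapes": max_shapes,
    }

# Session for the batch 2..32 profile; its caches live under ./trt_cache_b2_b32
# and are never touched by sessions created for other profiles.
trt_ep_options = make_trt_options(
    "trt_cache_b2_b32",
    "sample:2x4x64x64,encoder_hidden_states:2x77x768",
    "sample:2x4x64x64,encoder_hidden_states:2x77x768",
    "sample:32x4x64x64,encoder_hidden_states:32x77x768",
)
sess = ort.InferenceSession(
    "my_model.onnx",
    providers=[("TensorrtExecutionProvider", trt_ep_options), "CUDAExecutionProvider"],
)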

chilo-ms commented 1 year ago

The TRT EP's current engine-cache logic is: if trt_engine_cache_enable is on and there are matching engine/profile cache files in the cache path, TRT EP will load those files. As tianleiwu suggested, could you use a different trt_engine_cache_path (i.e., maintain a different folder) for each inference session whose EP options differ?

talmaj-at-hypothetic commented 1 year ago

I am keeping the same EP options. This part was just to narrow down / debug the problem:

If I first build the cache with dynamic shapes and then do not define the trt_profile shapes in the next inference, the cache is used.

So we would want that latter behaviour, but with the dynamic-shape EP options always on.

Does that make it clearer?

chilo-ms commented 1 year ago
  • If I keep the same EP options with dynamic shapes, it also always rebuilds the cache, even for the same dynamic input.

This behavior is odd; the engine cache shouldn't be rebuilt if the dynamic input is the same across inference runs. Could you turn on verbose mode and share the log? Or could you share the model so that I can try to repro it?

  • If I first build the cache with dynamic shapes and then switch off the dynamic shapes, it always loads the cache for the predefined shapes.

What do you mean by switching off the dynamic shapes? Does it mean not providing the trt_profile_xxx_shapes EP options? If so, the result you were seeing is expected. The engine cache built in the first inference run stores the associated shape range for each dynamic input in the xxxxx.profile file. In the second inference run, TRT EP compares the current input shapes against the ranges in the xxxxx.profile, finds that they are in range, and so uses the cache directly instead of rebuilding the engine. Only if the shape of the current input is out of range will TRT EP rebuild the engine.
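
To make the two-run flow concrete, here is a sketch based on that explanation (the model path and input names follow the repro script above; the cache folder name is an assumption):

import numpy as np
import onnxruntime as ort

# Shared cache settings; the folder name "trt_cache" is hypothetical.
cache_opts = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "trt_cache",
}

args = {
    "sample": np.zeros((2, 4, 64, 64), dtype=np.float32),
    "timestep": np.ones((1,), dtype=np.float32),
    "encoder_hidden_states": np.zeros((2, 77, 768), dtype=np.float32),
}

# Run 1: session with explicit dynamic-shape profiles. The first run builds the
# engine and writes it plus a .profile file recording each input's shape range.
build_opts = dict(
    cache_opts,
    trt_profile_min_shapes="sample:2x4x64x64,encoder_hidden_states:2x77x768",
    trt_profile_opt_shapes="sample:2x4x64x64,encoder_hidden_states:2x77x768",
    trt_profile_max_shapes="sample:32x4x64x64,encoder_hidden_states:32x77x768",
)
sess = ort.InferenceSession(
    "my_model.onnx",
    providers=[("TensorrtExecutionProvider", build_opts), "CUDAExecutionProvider"],
)
sess.run(None, args)

# Run 2: no trt_profile_* options. TRT EP compares the input shapes against the
# ranges stored in the cached .profile; batch 2 is within [2, 32], so the
# cached engine should be loaded rather than rebuilt.
sess2 = ort.InferenceSession(
    "my_model.onnx",
    providers=[("TensorrtExecutionProvider", cache_opts), "CUDAExecutionProvider"],
)
sess2.run(None, args)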