microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

trt_weight_stripped_engine_enable does not work together with trt_dump_ep_context_model #22179

Open BengtGustafsson opened 1 month ago

BengtGustafsson commented 1 month ago

Describe the issue

We want to use trt_dump_ep_context_model to minimize the setup time and we want to use trt_weight_stripped_engine_enable to protect our models from competitors when we deliver our software.

While both of these features work separately (most of the time, in the case of trt_weight_stripped_engine_enable), we can't get them to work when both are enabled. We get errors from the ort::Session constructor:

Non-zero status code returned while running TRTKernel_graph_TRTKernel_graph_torch-jit-export_5359030231610903815_0_998102994216759885_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_TRTKernel_graph_torch-jit-export_5359030231610903815_0_998102994216759885_0_0' Status Message: C:\gitlab-runner\builds\H1MW1hSx\0\cv\impl\thirdpartylibs\onnxruntime\onnxruntime\core\providers\tensorrt\tensorrt_execution_provider.cc:949 onnxruntime::BindContextInput [ONNXRuntimeError] : 11 : EP_FAIL : TensorRT EP failed to call nvinfer1::IExecutionContext::setInputShape() for input 'img'

Here is the last part of the log output on INFO level:

I onnxruntime: GraphTransformer TransposeOptimizer modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: [TensorRT EP] Model name is _ctx.onnx [tensorrt_execution_provider_utils.h:543 onnxruntime::TRTGenerateId]
I onnxruntime: [TensorRT EP] TensorRT subgraph MetaDef name TRTKernel_graph_TRTKernel_graph_torch-jit-export_5359030231610903815_0_998102994216759885_0 [tensorrt_execution_provider.cc:2058 onnxruntime::TensorrtExecutionProvider::GetSubGraph]
V onnxruntime: [TensorRT EP] GetEpContextFromGraph engine_cache_path: C:\ProgramData\ContextVision\cvn_cache\e9d2977634193884f085e9031fa54f0c24fc45f2./TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_5359030231610903815_0_0_sm86.stripped.engine [onnx_ctx_model_helper.cc:324 onnxruntime::TensorRTCacheModelHandler::GetEpContextFromGraph]
V onnxruntime: [TensorRT EP] TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_5359030231610903815_0_0_sm86.engine exists. [onnx_ctx_model_helper.cc:335 onnxruntime::TensorRTCacheModelHandler::GetEpContextFromGraph]
V onnxruntime: [TensorRT EP] DeSerialized TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_5359030231610903815_0_0_sm86.engine [onnx_ctx_model_helper.cc:358 onnxruntime::TensorRTCacheModelHandler::GetEpContextFromGraph]
I onnxruntime: GraphTransformer Level2_RuleBasedTransformer modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer TransposeOptimizer_CPUExecutionProvider modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer QDQS8ToU8Transformer modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer QDQSelectorActionTransformer modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer GemmActivationFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer MatMulIntegerToFloatFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer DynamicQuantizeMatMulFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer ConvActivationFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer GeluFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer LayerNormFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer SimplifiedLayerNormFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer AttentionFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer EmbedLayerNormFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer GatherSliceToSplitFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer GatherToSliceFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer MatmulTransposeFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer BiasGeluFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer SkipLayerNormFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer FastGeluFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer QuickGeluFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer BiasSoftmaxFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer BiasDropoutFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer MatMulScaleFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer MatMulActivationFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer MatMulNBitsFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer QDQFinalCleanupTransformer modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer NchwcTransformer modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer NhwcTransformer modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer ConvAddActivationFusion modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer RemoveDuplicateCastTransformer modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer CastFloat16Transformer modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
I onnxruntime: GraphTransformer MemcpyTransformer modified: 0 with status: OK [graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply]
V onnxruntime: Node placements [session_state.cc:1146 onnxruntime::VerifyEachNodeIsAssignedToAnEp]
V onnxruntime: All nodes placed on [TensorrtExecutionProvider]. Number of nodes: 1 [session_state.cc:1149 onnxruntime::VerifyEachNodeIsAssignedToAnEp]
V onnxruntime: SaveMLValueNameIndexMapping [session_state.cc:126 onnxruntime::SessionState::CreateGraphInfo]
V onnxruntime: Done saving OrtValue mappings. [session_state.cc:172 onnxruntime::SessionState::CreateGraphInfo]
I onnxruntime: Use DeviceBasedPartition as default [allocation_planner.cc:2567 onnxruntime::IGraphPartitioner::CreateGraphPartitioner]
I onnxruntime: Saving initialized tensors. [session_state_utils.cc:209 onnxruntime::session_state_utils::SaveInitializedTensors]
I onnxruntime: Done saving initialized tensors [session_state_utils.cc:360 onnxruntime::session_state_utils::SaveInitializedTensors]
I onnxruntime: Session successfully initialized. [inference_session.cc:2094 onnxruntime::InferenceSession::Initialize]
Must have redone optimization as it took: 1.0658e+06 ms to create ort::Session for: Device: NVIDIA RTX A5000, Identifier: [Denoising, 1, Denoising, Velvet], TensorRT version: 10.4.0.26, OnnxRuntime version: 1.19.2, SmallSize: img:1x1x64x64,mix_factor:1x1 Hash: e9d2977634193884f085e9031fa54f0c24fc45f2 OptSize: img:1x1x80x80,mix_factor:1x1, MaxSize: img:1x1x90x90,mix_factor:1x1, Date/Time: 2024-09-23 07:37:02.9472181
I onnxruntime: Extending BFCArena for Cuda. bin_num:6 (requested) num_bytes: 25600 (actual) rounded_bytes:25600 [bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal]
I onnxruntime: Extended allocation by 1048576 bytes. [bfc_arena.cc:206 onnxruntime::BFCArena::Extend]
I onnxruntime: Total allocated bytes: 1048576 [bfc_arena.cc:209 onnxruntime::BFCArena::Extend]
I onnxruntime: Allocated memory at 0000000C24B00000 to 0000000C24C00000 [bfc_arena.cc:212 onnxruntime::BFCArena::Extend]
E onnxruntime: [2024-09-23 07:54:55 ERROR] IExecutionContext::setInputShape: Error Code 3: API Usage Error (Parameter check failed, condition: satisfyProfile. Set dimension [1,1,80,80] for tensor img does not satisfy any optimization profiles. Valid range for profile 0: [1,1,45,64]..[1,1,64,90].) [tensorrt_execution_provider.h:88 onnxruntime::TensorrtLogger::log]
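The final error line says that the shape [1,1,80,80] set for `img` falls outside the range [1,1,45,64]..[1,1,64,90] of profile 0. As a rough illustration of that range condition, here is a minimal C++ sketch. The helper name `SatisfiesProfile` is hypothetical, and the code only restates the per-dimension min/max comparison described by the message; it is not TensorRT's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Hypothetical helper: returns true when every dimension of `shape`
// lies within the inclusive range [min_dims, max_dims].
bool SatisfiesProfile(const Shape& shape, const Shape& min_dims, const Shape& max_dims) {
  // A rank mismatch can never satisfy the profile.
  if (shape.size() != min_dims.size() || shape.size() != max_dims.size()) {
    return false;
  }
  for (std::size_t i = 0; i < shape.size(); ++i) {
    if (shape[i] < min_dims[i] || shape[i] > max_dims[i]) {
      return false;
    }
  }
  return true;
}
```

With the values from the log, `SatisfiesProfile({1,1,80,80}, {1,1,45,64}, {1,1,64,90})` is false because 80 exceeds the maximum of 64 on the third dimension, whereas the same shape checked against the requested range {1,1,64,64}..{1,1,90,90} passes.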

The options dump as:

V onnxruntime: [TensorRT EP] TensorRT provider options: device_id: 0, trt_max_partition_iterations: 1000, trt_min_subgraph_size: 1, trt_max_workspace_size: 40737418240, trt_fp16_enable: 0, trt_int8_enable: 0, trt_int8_calibration_cache_name: , int8_calibration_cache_available: 0, trt_int8_use_native_tensorrt_calibration_table: 0, trt_dla_enable: 0, trt_dla_core: 0, trt_dump_subgraphs: 0, trt_engine_cache_enable: 1, trt_weight_stripped_engine_enable: 0, trt_onnx_model_folder_path: , trt_cache_path: ./, trt_global_cache_path: , trt_engine_decryption_enable: 0, trt_engine_decryption_lib_path: , trt_force_sequential_engine_build: 0, trt_context_memory_sharing_enable: 0, trt_layer_norm_fp32_fallback: 0, trt_build_heuristics_enable: 0, trt_sparsity_enable: 0, trt_builder_optimization_level: 3, trt_auxiliary_streams: 0, trt_tactic_sources: , trt_profile_min_shapes: img:1x1x64x64,mix_factor:1x1, trt_profile_max_shapes: img:1x1x90x90,mix_factor:1x1, trt_profile_opt_shapes: img:1x1x80x80,mix_factor:1x1, trt_cuda_graph_enable: 0, trt_dump_ep_context_model: 0, trt_ep_context_file_path: C:\ProgramData\ContextVision\cvn_cache\e9d2977634193884f085e9031fa54f0c24fc45f2, trt_ep_context_embed_mode: 0, trt_cache_prefix: , trt_engine_hw_compatible: 0, trt_onnx_model_bytestreamsize: 1750065 [tensorrt_execution_provider.cc:1728 onnxruntime::TensorrtExecutionProvider::TensorrtExecutionProvider]
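The profile shape options in this dump use the form `name:d1xd2x...`, with entries separated by commas. To compare the requested shapes against the range reported in the error, the strings can be split into per-tensor dimension lists. The sketch below is one way to do that; `ParseProfileShapes` is a hypothetical helper written for this illustration, not an ONNX Runtime function, and it assumes well-formed numeric dimensions.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parses a string such as "img:1x1x64x64,mix_factor:1x1"
// into a map from tensor name to its list of dimensions.
std::map<std::string, std::vector<std::int64_t>> ParseProfileShapes(const std::string& spec) {
  std::map<std::string, std::vector<std::int64_t>> result;
  std::istringstream entries(spec);
  std::string entry;
  while (std::getline(entries, entry, ',')) {
    // Split at the last ':' so the name is everything before the dimensions.
    const auto colon = entry.rfind(':');
    if (colon == std::string::npos) {
      continue;  // skip entries without a "name:dims" separator
    }
    std::vector<std::int64_t> dims;
    std::istringstream dim_stream(entry.substr(colon + 1));
    std::string dim;
    while (std::getline(dim_stream, dim, 'x')) {
      dims.push_back(std::stoll(dim));  // throws on non-numeric input
    }
    result[entry.substr(0, colon)] = dims;
  }
  return result;
}
```

Applied to the `trt_profile_min_shapes` value above, this yields `img` with dimensions 1, 1, 64, 64 and `mix_factor` with dimensions 1, 1.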

Because trt_dump_ep_context_model always writes the resulting .engine files directly to disk, the only other way to protect our models would be to encrypt the resulting file as quickly as possible and then decrypt it again before loading it. This is obviously less than safe, since it would be rather easy to obtain the data by monitoring the file writes.

To reproduce

Enable both features (see the option printout above).

Create an ort::Session. Note the error.
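The reproduction steps can be sketched as a set of provider options with both flags switched on. The option names below are copied from the options dump in this report. The helper names `MakeReproOptions` and `Flatten`, and the choice of which other options to include, are assumptions for illustration, since the exact call sequence used is not shown in the report. The resulting parallel key/value arrays have the layout expected by the ONNX Runtime C API function `UpdateTensorRTProviderOptions`; that call is left out so the sketch compiles without the ONNX Runtime headers.

```cpp
#include <string>
#include <utility>
#include <vector>

using OptionList = std::vector<std::pair<std::string, std::string>>;

// Hypothetical helper: builds TensorRT EP options with both features enabled.
// Shape values are taken from the options dump; context_path is caller-supplied.
OptionList MakeReproOptions(const std::string& context_path) {
  return {
      {"trt_engine_cache_enable", "1"},
      {"trt_weight_stripped_engine_enable", "1"},
      {"trt_dump_ep_context_model", "1"},
      {"trt_ep_context_file_path", context_path},
      {"trt_profile_min_shapes", "img:1x1x64x64,mix_factor:1x1"},
      {"trt_profile_opt_shapes", "img:1x1x80x80,mix_factor:1x1"},
      {"trt_profile_max_shapes", "img:1x1x90x90,mix_factor:1x1"},
  };
}

// Parallel arrays of C strings, the form a key/value options API expects.
struct FlatOptions {
  std::vector<const char*> keys;
  std::vector<const char*> values;
};

// The returned pointers stay valid only while `options` is alive and unmodified.
FlatOptions Flatten(const OptionList& options) {
  FlatOptions flat;
  for (const auto& option : options) {
    flat.keys.push_back(option.first.c_str());
    flat.values.push_back(option.second.c_str());
  }
  return flat;
}
```

Keeping the options in an owning `OptionList` and flattening only at the call site avoids dangling pointers, as long as the list outlives the call that consumes the arrays.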

Urgency

No response

Platform

Windows

OS Version

11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.19.2

ONNX Runtime API

C++

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

TensorRT 10.4.0.26 on CUDA 11.6

BengtGustafsson commented 1 week ago

Is there a way to get this type of issue addressed? We have tried to find a way to pay for support but couldn't.