microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

TensorrtExecutionProvider slower than CUDAExecutionProvider: Faster-rcnn [Performance] #17434

Open datinje opened 1 year ago

datinje commented 1 year ago

Describe the issue

On my Faster-RCNN-RPN models, which detect patterns, I made considerable efforts to run inference with the TensorRT EP (see https://github.com/microsoft/onnxruntime/issues/16886, which shows that I simplified the model and inferred the shapes of the model nodes before submitting it to TRT). I found that the TRT EP is about 30% slower than the CUDA EP in FP32 (and in TF32); only with FP16 does the TRT EP almost catch up.

I only mention the second inference here, not the warm-up run (which is considerably slower, as expected).
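A minimal sketch of how such a steady-state measurement could be taken, so that the warm-up run is excluded from the comparison. The helper name `time_steady_state` and the `run_once` callable are hypothetical; `run_once` is assumed to wrap a single inference call.

```python
import statistics
import time
from typing import Callable, Dict


def time_steady_state(
    run_once: Callable[[], object], warmup: int = 1, repeats: int = 10
) -> Dict[str, float]:
    """Time `run_once`, discarding the first `warmup` calls (engine build, allocations)."""
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    # Warm-up calls are executed but not measured.
    for _ in range(warmup):
        run_once()
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }
```

With the ORT Python API, `run_once` could be something like `lambda: session.run(None, feeds)`, where `session` and `feeds` are your own session and input dictionary.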

After looking at the VERBOSE mode logs, I found that not all the nodes run on TRT: one is still on CPU and 6 on the CUDA EP. That causes many memory transfers between host and GPU, which I suppose is the reason. So my question is: why are there still nodes on the CPU and CUDA EPs? Can this be fixed?
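For reference, a hedged sketch of a session setup that prefers the TRT EP, falls back to the CUDA and CPU EPs, and turns on verbose logging so that node placements are printed. The function names and the default cache directory are illustrative assumptions; the provider option keys (`trt_fp16_enable`, `trt_engine_cache_enable`, `trt_engine_cache_path`) are TensorRT EP options.

```python
from typing import Any, Dict, List, Tuple, Union

Provider = Union[str, Tuple[str, Dict[str, Any]]]


def trt_first_providers(fp16: bool = False, cache_dir: str = "./trt_cache") -> List[Provider]:
    """Provider list in priority order: TensorRT, then CUDA, then CPU."""
    trt_options = {
        "trt_fp16_enable": fp16,
        "trt_engine_cache_enable": True,  # avoid rebuilding engines on every start
        "trt_engine_cache_path": cache_dir,  # hypothetical default location
    }
    return [
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]


def create_verbose_session(model_path: str, fp16: bool = False):
    """Create a session whose logs include the per-EP node placement."""
    # Imported lazily so trt_first_providers() is usable without ORT installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    options.log_severity_level = 0  # 0 = VERBOSE
    return ort.InferenceSession(
        model_path, sess_options=options, providers=trt_first_providers(fp16)
    )
```

Nodes that the first provider does not claim are offered to the next one in the list, which is how a graph ends up split across EPs.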

Here are the logs:

2023-09-06 16:45:59.604024060 [V:onnxruntime:, session_state.cc:1149 VerifyEachNodeIsAssignedToAnEp] Node placements
2023-09-06 16:45:59.604038849 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [TensorrtExecutionProvider]. Number of nodes: 11
2023-09-06 16:45:59.604042765 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_0 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_0_0)
2023-09-06 16:45:59.604046398 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_1 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_1_1)
2023-09-06 16:45:59.604049385 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_2 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_2_2)
2023-09-06 16:45:59.604052381 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_3 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_3_3)
2023-09-06 16:45:59.604055213 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_4 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_4_4)
2023-09-06 16:45:59.604057978 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_5 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_5_5)
2023-09-06 16:45:59.604060720 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_6 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_6_6)
2023-09-06 16:45:59.604063521 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyFromHost (Memcpy)
2023-09-06 16:45:59.604066111 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyToHost (Memcpy_token_422)
2023-09-06 16:45:59.604068754 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyToHost (Memcpy_token_423)
2023-09-06 16:45:59.604078119 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyToHost (Memcpy_token_424)
2023-09-06 16:45:59.604081367 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CPUExecutionProvider]. Number of nodes: 1
2023-09-06 16:45:59.604086459 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] RoiAlign (/model/roi_heads/box_pooler/level_poolers.0/RoiAlign)
2023-09-06 16:45:59.604093948 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CUDAExecutionProvider]. Number of nodes: 5
2023-09-06 16:45:59.604099017 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/proposal_generator/NonZero)
2023-09-06 16:45:59.604103942 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonMaxSuppression (NonMaxSuppression_497)
2023-09-06 16:45:59.604108777 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/roi_heads/NonZero)
2023-09-06 16:45:59.604113159 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonMaxSuppression (NonMaxSuppression_796)
2023-09-06 16:45:59.604117903 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/NonZero)
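Reading these placements by eye gets tedious on larger graphs. The following is a small sketch of a parser for the `VerifyEachNodeIsAssignedToAnEp` lines, assuming the log keeps the format shown above; the function name is hypothetical, and the regular expression is derived only from these sample lines.

```python
import re
from typing import Dict, List, Tuple

# Either an EP header ("Node(s) placed on [EP]. Number of nodes: N")
# or a node entry ("VerifyEachNodeIsAssignedToAnEp] OpType (node_name)").
_EVENT = re.compile(
    r"Node\(s\) placed on \[(?P<ep>\w+)\]\. Number of nodes: (?P<count>\d+)"
    r"|VerifyEachNodeIsAssignedToAnEp\]\s+(?P<op>\w+) \((?P<name>[^)]*)\)"
)


def parse_node_placements(log_text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each execution provider to the (op_type, node_name) pairs placed on it."""
    placements: Dict[str, List[Tuple[str, str]]] = {}
    current_ep = None
    for match in _EVENT.finditer(log_text):
        if match.group("ep"):
            # A new EP header: following node entries belong to this EP.
            current_ep = match.group("ep")
            placements.setdefault(current_ep, [])
        elif current_ep is not None:
            placements[current_ep].append((match.group("op"), match.group("name")))
    return placements
```

Because it scans with `finditer`, it works whether the log has one entry per line or has been flattened into a single line.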

I see the same issue with both the C++ and Python runtime APIs.

To reproduce

I can't share my model for IP reasons, but I see similar issues with the public Detectron2 model zoo faster-rcnn-rpn (see https://github.com/microsoft/onnxruntime/issues/16886 for how to run it). With that model, even more nodes fall back to CPU and CUDA, among them the fallback nodes listed above. So investigating that model may well lead to the same fixes.

Urgency

I have been blocked for several months trying to run the model on the TRT EP (see https://github.com/microsoft/onnxruntime/issues/16886; thanks to the ORT staff who helped me), only to find out now that it may not be worth it. It looks like I am not far off, with only 3 operators/nodes left to move to the TRT EP, but time is running out: in a couple of months I will need to freeze the model to certify the results, with no second chance to certify with TRT FP16 or, better, INT8. I am expecting a 2x performance improvement with TRT FP16 and another 2x improvement with INT8 (accuracy is still excellent in FP16).

Platform

Linux

OS Version

SLES15 SP4

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.15.1+ (using main latest for a fix to build TRT EP)

ONNX Runtime API

Python

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

TensorRT 8.6.1

Model File

I can't share it, but you could use faster-rcnn-rpn from the detectron2 model zoo (see https://github.com/microsoft/onnxruntime/issues/16886)

Is this a quantized model?

No

datinje commented 5 months ago

OK, thanks for the update. On which ORT version would you like me to test memory? 1.17.x, because of the TRT 10 Nvidia issue with Faster-RCNN? Or can I use ORT 1.18 with the latest TRT 8.6?

On 6 June 2024 at 20:53:58 GMT+02:00, Chi Lo @.***> wrote:

As for memory consumption of using TRT DDS ops support, we later didn't see it consumes significant memory. But still if you can help provide the comparison of memory consumption between with DDS nodes placed on TRT and DDS nodes not placed on TRT, that will be great!

Also, for TRT 10, we found an issue when running Faster-RCNN model from ONNX model with DDS nodes placed on TRT. Nvidia is aware of this and is fixing it now.

Lastly, we will be discussing whether to enable the TRT DDS support in ORT release. So we really appreciate your feedback here.


jcdatin commented 5 months ago

For information, ORT 1.18.0 does not compile with gcc 12:

[ 70%] Building CXX object CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/quantization/qordered_ops/qordered_unary_ops.cc.o
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/quantization/moe_quantization.cc: In member function ‘virtual onnxruntime::common::Status onnxruntime::contrib::cuda::QMoE::ComputeInternal(onnxruntime::OpKernelContext*) const’:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/quantization/moe_quantization.cc:99:24: error: ‘moe_params.onnxruntime::contrib::cuda::MoEParameters::inter_size’ may be used uninitialized [-Werror=maybe-uninitialized]
   99 | moe_runner.run_moe_fc(

command is Step 28/40 : RUN CC=gcc-12 CXX=g++-12 ./build.sh --nvcc_threads 2 --config RelWithDebInfo --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=86" --cmake_extra_defines "CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-12"

Namespace(build_dir='/tmp/onnxruntime/build/Linux', config=['RelWithDebInfo'], update=False, build=False, clean=False, parallel=0, nvcc_threads=2, test=False, skip_tests=True, compile_no_warning_as_error=False, enable_nvtx_profile=False, enable_memory_profile=False, enable_training=False, enable_training_apis=False, enable_training_ops=False, enable_nccl=False, mpi_home=None, nccl_home=None, use_mpi=False, enable_onnx_tests=False, path_to_protoc_exe=None, fuzz_testing=False, enable_symbolic_shape_infer_tests=False, gen_doc=None, gen_api_doc=False, use_cuda=True, cuda_version=None, cuda_home='/usr/local/cuda/', cudnn_home='/usr/local/cuda/lib64', enable_cuda_line_info=False, enable_cuda_nhwc_ops=False, enable_pybind=False, build_wheel=False, wheel_name_suffix=None, numpy_version=None, skip_keras_test=False, build_csharp=False, build_nuget=False, msbuild_extra_options=None, build_java=False, build_nodejs=False, build_objc=False, build_shared_lib=True, build_apple_framework=False, cmake_extra_defines=[['CMAKE_CUDA_ARCHITECTURES=86'], ['CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-12']], target=None, x86=False, rv64=False, arm=False, arm64=False, arm64ec=False, buildasx=False, riscv_toolchain_root='', riscv_qemu_path='', msvc_toolset=None, windows_sdk_version=None, android=False, android_abi='arm64-v8a', android_api=27, android_sdk_path='', android_ndk_path='', android_cpp_shared=False, android_run_emulator=False, use_gdk=False, gdk_edition='.', gdk_platform='Scarlett', ios=False, visionos=False, macos=None, apple_sysroot='', ios_toolchain_file='', visionos_toolchain_file='', xcode_code_signing_team_id='', xcode_code_signing_identity='', cmake_generator=None, osx_arch='x86_64', apple_deploy_target=None, enable_address_sanitizer=False, use_binskim_compliant_compile_flags=False, disable_memleak_checker=False, build_wasm=False, build_wasm_static_lib=False, emsdk_version='3.1.57', enable_wasm_simd=False, enable_wasm_threads=False, disable_wasm_exception_catching=False, 
enable_wasm_api_exception_catching=False, enable_wasm_exception_throwing_override=True, wasm_run_tests_in_browser=False, enable_wasm_profiling=False, enable_wasm_debug_info=False, wasm_malloc=None, emscripten_settings=None, use_extensions=False, extensions_overridden_path=None, cmake_path='cmake', ctest_path='ctest', skip_submodule_sync=False, use_mimalloc=False, use_dnnl=False, dnnl_gpu_runtime='', dnnl_opencl_root='', use_openvino=None, dnnl_aarch64_runtime='', dnnl_acl_root='', use_coreml=False, use_webnn=False, use_snpe=False, snpe_root=None, use_nnapi=False, nnapi_min_api=None, use_jsep=False, use_qnn=False, qnn_home=None, use_rknpu=False, use_preinstalled_eigen=False, eigen_path=None, enable_msinternal=False, llvm_path=None, use_vitisai=False, use_tvm=False, tvm_cuda_runtime=False, use_tvm_hash=False, use_tensorrt=True, use_tensorrt_builtin_parser=True, use_tensorrt_oss_parser=True, tensorrt_home='/usr/local/TensorRT', test_all_timeout='10800', use_migraphx=False, migraphx_home=None, use_full_protobuf=False, llvm_config='', skip_onnx_tests=False, skip_winml_tests=False, skip_nodejs_tests=False, enable_msvc_static_runtime=False, enable_language_interop_ops=False, use_dml=False, dml_path='', use_winml=False, winml_root_namespace_override=None, dml_external_project=False, use_telemetry=False, enable_wcos=False, enable_lto=False, enable_transformers_tool_test=False, use_acl=None, acl_home=None, acl_libs=None, use_armnn=False, armnn_relu=False, armnn_bn=False, armnn_home=None, armnn_libs=None, build_micro_benchmarks=False, minimal_build=None, include_ops_by_config=None, enable_reduced_operator_type_support=False, disable_contrib_ops=False, disable_ml_ops=False, disable_rtti=False, disable_types=[], disable_exceptions=False, rocm_version=None, use_rocm=False, rocm_home=None, code_coverage=False, enable_lazy_tensor=False, ms_experimental=False, enable_external_custom_op_schemas=False, external_graph_transformer_path=None, enable_cuda_profiling=False, use_cann=False, 
cann_home=None, enable_rocm_profiling=False, use_xnnpack=False, use_azure=False, use_cache=False, use_triton_kernel=False, use_lock_free_queue=False, allow_running_as_root=True)

reverting to gcc11 (works)

jcdatin commented 5 months ago

Due to the Nvidia bug in TensorRT 10 GA mentioned above, I am still not sure which version of the TensorRT C++ package I should install to run with onnxrt 1.18.0, besides using the deps.txt patch above: 8.6? (Before building onnxrt I install TRT, and then I use the flags --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /usr/local/TensorRT to point the onnxrt build at the TRT install.)

jcdatin commented 5 months ago

@chilo-ms I could not use onnxrt 1.18.0 without TRT 10; either:

- the build fails populating onnx 1.16.1 because of patches with https://github.com/microsoft/onnxruntime/blob/main/cmake/deps.txt#L41 while modifying onnx_tensorrt as shown above, or
- compilation of onnxrt fails because of TRT 8.6, as follows:

NvOnnxParser.cpp:6: /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/onnx_tensorrt-src/Status.hpp: In function ‘std::ostream& nvinfer1::operator<<(std::ostream&, const nvinfer1::DataType&)’:
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/onnx_tensorrt-src/Status.hpp:198:30: error: ‘kBF16’ is not a member of ‘nvinfer1::DataType’
  198 | case nvinfer1::DataType::kBF16: return stream << "bfloat16";

Reverting to onnxruntime 1.17.3 with TRT 8.6 and using --use_tensorrt_oss_parser (to get DDS and the 2x acceleration in the TRT EP). Too bad, as the release notes of ORT 1.18.0 mention "Finalized support for DDS ops". I will wait for the Nvidia TRT 10 bug fix before using ORT 1.18.0 for DDS ops.

jcdatin commented 5 months ago

What is the problem with TRT 10.0 GA and ORT when using --use_tensorrt_oss_parser?

jcdatin commented 5 months ago

I tried ORT 1.18.0 with an installation of TRT 10.0 GA after changing deps.txt as above.

I am getting a TRT EP error near the ScatterND operator:

2024-06-10 15:04:48.739948511 [V:onnxruntime:cad-engine-inference, tensorrt_execution_provider_utils.h:236 SerializeProfileV2] [TensorRT EP] In SerializeProfileV2()
2024-06-10 15:04:48.739963594 [V:onnxruntime:cad-engine-inference, tensorrt_execution_provider_utils.h:241 SerializeProfileV2] [TensorRT EP] input tensor is '/model/my_model/rpn/Gather_32_output_0'
2024-06-10 15:04:48.739967444 [V:onnxruntime:cad-engine-inference, tensorrt_execution_provider_utils.h:246 operator()] [TensorRT EP] profile #0, dim is 0
...
2024-06-10 15:04:48.739974154 [V:onnxruntime:cad-engine-inference, tensorrt_execution_provider_utils.h:241 SerializeProfileV2] [TensorRT EP] input tensor is '/model/my_model/backbone/pretrained_backbone_model/res5/res5.2/relu_1/Relu_output_0'
2024-06-10 15:04:48.739976919 [V:onnxruntime:cad-engine-inference, tensorrt_execution_provider_utils.h:246 operator()] [TensorRT EP] profile #0, dim is 3
...
2024-06-10 15:04:48.739989572 [V:onnxruntime:cad-engine-inference, tensorrt_execution_provider_utils.h:241 SerializeProfileV2] [TensorRT EP] input tensor is '/model/my_model/rpn/Gather_33_output_0'
2024-06-10 15:04:48.739992899 [V:onnxruntime:cad-engine-inference, tensorrt_execution_provider_utils.h:246 operator()] [TensorRT EP] profile #0, dim is 0
..
2024-06-10 15:04:48.739998684 [V:onnxruntime:cad-engine-inference, tensorrt_execution_provider_utils.h:241 SerializeProfileV2] [TensorRT EP] input tensor is '/model/my_model/rpn/ScatterND_output_0'
...
2024-06-10 15:04:48.740066729 [V:onnxruntime:cad-engine-inference, tensorrt_execution_provider.cc:3337 operator()] [TensorRT EP] Serialized ./tensorrtCacheTF32/TensorrtExecutionProvider_TRTKernel_graph_torch_jit_7890731245993728330_1_1_sm86.profile
..
./tensorrtCacheTF32/TensorrtExecutionProvider_TRTKernel_graph_torch_jit_7890731245993728330_1_1_sm86.engine
2024-06-10 15:04:48.766256435 [V:onnxruntime:, execution_steps.cc:98 Execute] stream 0 activate notification with index 3
..
2024-06-10 15:04:48.778499217 [E:onnxruntime:cad-engine-inference, tensorrt_execution_provider.h:82 log] [2024-06-10 15:04:48 ERROR] 4: FillOperation::kLINSPACE only supports output type kINT32 or kINT64 or kFLOAT

Then I also noticed that the ScatterND node of my Faster-RCNN is now allocated to CPU, causing a graph split and the addition of MemcpyFromHost and MemcpyToHost nodes: is that the problem you referenced above with TRT 10?

2024-06-10 15:04:23.831586227 [V:onnxruntime:, session_state.cc:1152 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CPUExecutionProvider]. Number of nodes: 2
2024-06-10 15:04:23.831588667 [V:onnxruntime:, session_state.cc:1154 VerifyEachNodeIsAssignedToAnEp] ScatterND (/model/my_model/rpn/ScatterND)
2024-06-10 15:04:23.831591018 [V:onnxruntime:, session_state.cc:1154 VerifyEachNodeIsAssignedToAnEp] ScatterND (/model/my_model/roi_heads/ScatterND)

I can no longer generate any working image for my Faster-RCNN model (neither with ORT 1.17.3 and TRT 8.6, deps.txt unchanged, nor with ORT 1.18.0 and TRT 10.0, deps.txt changed). What can I try?

jcdatin commented 5 months ago

@chilo-ms: can you advise me on a way to build ONNXRT with TRT 8.6 and TRT DDS? (Nvidia told me the fix for their Faster-RCNN issue will be delivered in 10.2, which is a month away; that is too late for me.) All I have is my archived docker image, in which I built ORT from the main branch with TRT DDS nearly 2 months ago. I am using it in our prototype, but I can't deploy this ORT in a product, since our process requires us to control (and reproduce) our build and runtime environment. But I can no longer regenerate any ORT with TRT (8.6 or 10.0) capable of running my own Faster-RCNN. Either I hit the Nvidia issue with Faster-RCNN on TRT 10, or the DDS operators are no longer allocated to the TRT EP (but to the CPU EP). I need your help with a configuration that makes DDS work again with TRT 8.6 until Nvidia delivers 10.2. It does not have to be an official ORT release: I can download a zip of the source code, which I will archive in house for our process.

jcdatin commented 5 months ago

Using the prototype docker image (the one using TRT DDS with the same model), I don't see higher memory usage with the TRT EP than with the CUDA EP (on the contrary, the CUDA EP uses more memory).

chilo-ms commented 5 months ago

sorry for the late reply.

Please use ORT 1.17.x with TRT 8.6 and manually build ORT with additional --use_tensorrt_oss_parser and modified deps.txt:

- onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/a43ce67187bab219520fd80f21af8bbd4354bc8c.zip;572535aefef477050f86744dfab1fef840198035
+ onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/bacfaaa951653cd4e72efe727a543567cb38f7de.zip;26434329612e804164ab7baa6ae629ada56c1b26

I think this is what you did two months ago to make all nodes placed on TRT and successfully run.
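The deps.txt change above can also be scripted, which helps when the build has to be reproducible. This is a minimal sketch assuming each deps.txt entry is a single `name;url;checksum` line, as in the two lines shown; the function name `replace_dep_entry` is hypothetical.

```python
def replace_dep_entry(deps_text: str, name: str, url: str, checksum: str) -> str:
    """Return deps_text with the `name;url;checksum` line for `name` replaced."""
    lines = deps_text.splitlines(keepends=True)
    replaced = False
    for i, line in enumerate(lines):
        # Compare the whole first field so "onnx" does not match "onnx_tensorrt".
        if line.split(";", 1)[0] == name:
            newline = "\n" if line.endswith("\n") else ""
            lines[i] = f"{name};{url};{checksum}{newline}"
            replaced = True
    if not replaced:
        raise KeyError(f"no entry named {name!r} found")
    return "".join(lines)
```

Typical use would be to read `cmake/deps.txt`, call the function with the onnx-tensorrt archive URL and checksum from the `+` line above, and write the result back before running the build.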

Note: It's TRT parser which decides whether to support DDS nodes (NonZero, NMS, RoiAlign). To double check OSS TRT parser did support DDS nodes, once the ORT build is finished, you can see the OSS parser repo (source code) being downloaded to ./build/Linux/Debug/_deps/onnx_tensorrt-src where isDDSOp() should be removed from ModelImporter.cpp
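The check described in the note can be automated. A small sketch, assuming the parser sources were fetched to the `_deps/onnx_tensorrt-src` directory of your build tree (the exact path depends on your build configuration); the function name is hypothetical.

```python
from pathlib import Path


def parser_still_filters_dds(onnx_tensorrt_src: str) -> bool:
    """Return True if ModelImporter.cpp still references isDDSOp.

    True suggests the downloaded parser still excludes DDS nodes
    (NonZero, NMS, RoiAlign); False suggests the filter has been removed.
    """
    importer = Path(onnx_tensorrt_src) / "ModelImporter.cpp"
    source = importer.read_text(encoding="utf-8", errors="replace")
    return "isDDSOp" in source
```

For a Debug build this could be called as `parser_still_filters_dds("build/Linux/Debug/_deps/onnx_tensorrt-src")`. It is only a text search, so a match in a comment would also count.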

chilo-ms commented 5 months ago

Using the prototype docker image (the one using TRT DDS with the same model) I don't see memory usage higher with TRT EP than CUDA EP (actually on the contrary Cuda EP is using more memory)

Thanks a lot for this feedback! We will have an internal discussion, and we will very likely ask Nvidia to ship a later version of TRT that natively supports DDS ops.

jcdatin commented 5 months ago

@chilo-ms, using ORT 1.17.3 with TRT 8.6.1 and modifying deps.txt as above compiles ORT with DDS (my whole Faster-RCNN graph runs as a single node with all ONNX operators on TRT, including NonZero, RoiAlign and NMS). But inference with TRT gives garbage results (always the same wrong values), whereas the same inference with the CUDA EP is fine and equivalent to the PyTorch results. I do not know what to do. I have a docker image I built in April from the main branch that works fine (good performance, accurate results and good memory usage). But I can't regenerate it, as I did not note the git changeset at the time, nor did I record the ORT source code in our own archive. I am stuck, since my regulatory standard mandates that I be able to regenerate the build environment. Can you help? I would prefer using an official ORT label like 1.17.3, but a nightly tag would do. Do you have this tag? I built it on April 12th or 13th. Also, ORT 1.18.0 is not an option, as it requires TRT 10, which has a bug with Faster-RCNN (is there a PR for that?).

jcdatin commented 5 months ago

I can't do a git log on my archived image that works: I deleted the clone in /tmp after installing onnxruntime to save space. Any idea, please, how to find the delta between 1.17.3 and my image that could explain the regression in the TRT EP for 1.17.3 on Faster-RCNN?

jcdatin commented 5 months ago

btw : why do I need this patch ?

I install TRT 8.6 myself and provide its path to the onnxrt build with --tensorrt_home /usr/local/TensorRT. I thought this patch was meant to select TRT 8.6 instead of TRT 10, but what if I provide the TRT install path to the onnxrt build?

jcdatin commented 5 months ago

Looking in detail at my archived image, in the onnxruntime install dir (/usr/local/lib), I found the following in onnxruntime_config.h:

#define ORT_BUILD_INFO "ORT Build Info: git-branch=main, git-commit-id=bb1972264b, build type=RelWithDebInfo, cmake cxx flags: -ffunction-sections -fdata-sections -Wno-restrict -DCPUINFO_SUPPORTED"

#define ORT_VERSION "1.18.0"
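Since the goal is a reproducible build, it can help to record these values automatically from an installed image. A minimal sketch that extracts the branch, commit id and version from the text of `onnxruntime_config.h`, assuming the two defines keep the layout shown above; the function name is hypothetical.

```python
import re
from typing import Dict

_BUILD_INFO = re.compile(r'ORT_BUILD_INFO\s+"(?P<info>[^"]*)"')
_VERSION = re.compile(r'ORT_VERSION\s+"(?P<version>[^"]*)"')


def read_build_identity(header_text: str) -> Dict[str, str]:
    """Extract git-branch, git-commit-id and version from onnxruntime_config.h text."""
    fields: Dict[str, str] = {}
    info = _BUILD_INFO.search(header_text)
    if info:
        for key in ("git-branch", "git-commit-id"):
            found = re.search(re.escape(key) + r"=([^,\s]+)", info.group("info"))
            if found:
                fields[key] = found.group(1)
    version = _VERSION.search(header_text)
    if version:
        fields["version"] = version.group("version")
    return fields
```

The commit id returned this way is what a later `git checkout` would need in order to rebuild the same sources.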

Also, in the folder /usr/local/lib/onnx_tensorrt-src, I found the following in CMakeLists.txt:

#--------------------------------------------------
# Version information
#--------------------------------------------------
set(ONNX2TRT_MAJOR 8)
set(ONNX2TRT_MINOR 6)
set(ONNX2TRT_PATCH 1)
set(ONNX2TRT_VERSION "${ONNX2TRT_MAJOR}.${ONNX2TRT_MINOR}.${ONNX2TRT_PATCH}" CACHE STRING "ONNX2TRT version")

SET(CPACK_GENERATOR "DEB")
SET(CPACK_DEBIAN_PACKAGE_MAINTAINER "NVIDIA") #required
SET(CPACK_PACKAGE_NAME "onnx-trt-dev")
SET(CPACK_PACKAGE_VERSION "0.5.9")

When building with the ONNXRT 1.18.0 label and TRT 8.6.1, though, ORT + TRT EP crashes on my model. So what is the difference between build bb1972264b and the official 1.18.0? (On my archived image, built from ORT commit bb1972264b, I did not change deps.txt.)

jcdatin commented 4 months ago

I was able to rebuild my onnxruntime with TRT 8.6.1 using commit bb1972264b and no change in deps.txt. Now I need to wait for an official ORT release that works with TRT 8.6. Right now the tricks above do not work, and the official ORT 1.18.0 does not work with TRT 10 on my Faster-RCNN. I would like to understand what the regression is between build bb1972264b and ORT 1.17.3 or 1.18.0 with the change in deps.txt.

chilo-ms commented 4 months ago

btw : why do I need this patch ?

If you go to onnx-tensorrt repos and get the commit history of main, you will have better understanding.

You need to patch https://github.com/onnx/onnx-tensorrt/commit/bacfaaa951653cd4e72efe727a543567cb38f7de in order to have DDS TRT native support. The old commit https://github.com/onnx/onnx-tensorrt/commit/a43ce67187bab219520fd80f21af8bbd4354bc8c doesn't have TRT DDS support.

chilo-ms commented 4 months ago

I would like to understand what is the regression between build https://github.com/microsoft/onnxruntime/commit/bb1972264b89261e98d438367eb54d97eea52c12 and ORT 1.17.3 or 1.18.0 with change in deps.txt

Thanks for raising this issue. I think this DDS related bug fix https://github.com/microsoft/onnxruntime/pull/19575 is missing in ORT 1.17.3 which resulted in corrupted/wrong output that you observed.

So, if you want to use ORT release branch, i suggest you can build ORT 1.18.0 + TRT 8.6 + --use_tensorrt_oss_parser + the patch in deps.txt (+ onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/bacfaaa951653cd4e72efe727a543567cb38f7de.zip;26434329612e804164ab7baa6ae629ada56c1b26)

I could rebuild my onnxrtuntime with TRT 8.6.1 using tag https://github.com/microsoft/onnxruntime/commit/bb1972264b89261e98d438367eb54d97eea52c12 and no change in deps.txt

BTW, can this rebuild with --use_tensorrt_oss_parser reproduce the good performance/accuracy you saw back in mid-April?

jcdatin commented 4 months ago

I think I tried 1.18.0 + 8.6.1 + the deps.txt patch as advised, but it did not work on my model. Now that I understand the build better after archiving bb19722, I will retry.

MiroPsota commented 4 months ago

I successfully compiled ORT 1.18.1 with TRT 8.6.1 and --use_tensorrt_oss_parser (and CUDNN 8.9.2, CUDA 11.8). It can run the RTMDet model which has a custom TRT DDS op BatchedNMS without problems.

Here are the changes I made: The last TRT 8.x commit in onnx_tensorrt:

- onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/06adf4461ac84035bee658c6cf5df39f7ab6071d.zip;46dceef659d75d276e7914a8057c2282269d5e7b
+ onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/0462dc31ae78f48744b6141ae376df1f96d3f459.zip;5ff086361956cceb81ed17453a1fd8db2aa4328d

After cmake downloads the source of the onnx repo, download stl_backports.h to the onnx-src folder. For me, it was in <onnxruntime_dir>/build/Linux/Release/_deps/onnx-src/onnx/common/stl_backports.h.

jcdatin commented 4 months ago

@MiroPsota: why do we need to add stl_backports.h in onnx-src for ORT 1.18.1 and not for previous versions?

jcdatin commented 4 months ago

@chilo-ms, ORT 1.18.1 now uses cuDNN 9 (vs 8 previously in 1.18.0). What about TRT: is TRT 10 fixed for the Faster-RCNN model, or shall we still use TRT 8 (I am using CUDA 12.2)? 1.18.1 is said to include:

- "Now using latest commit of onnx-tensorrt parser, which includes several issue fixes": is this using DDS by default?
- "Support for TensorRT hardware compatible engines": what does this mean (TRT was supported previously)?
- "Additional TensorRT support and performance improvements": is this coming only with TRT 10?
- "Support for INT64 types in TensorRT constant layer calibration": my original PyTorch model uses INT64. Does this remove the need for the INT32 truncation I previously added? Does TRT now support INT64, and from which version (TRT 10)?

MiroPsota commented 4 months ago

Either older versions of ORT were patching it with cmake somewhere, or it wasn't necessary. Version 1.18.1 expects an onnx_tensorrt commit which supports onnx>1.14.1, which is the last version with this file, and onnx_tensorrt commit 0462dc3 expects this file to be present. As this file is just an independent helper file, I copied it and it worked.

TensorRT 10.0 didn't work for me, see this issue.

jcdatin commented 4 months ago

@MiroPsota : did you use cudnn 8 or moved to the new cudnn 9 supported by ort ?

MiroPsota commented 4 months ago

See my post for the exact versions used.

I successfully compiled ORT 1.18.1 with TRT 8.6.1 and --use_tensorrt_oss_parser (and CUDNN 8.9.2, CUDA 11.8). It can run the RTMDet model which has a custom TRT DDS op BatchedNMS without problems.

jcdatin commented 4 months ago

I got a compilation error for ORT 1.18.1 after

  1. changing deps.txt like above to use the latest TRT 8 commit, and using cuDNN 8.9 (in my case with CUDA 12.2)
  2. adding the file stl_backports.h to the onnx deps (otherwise the compiler complains that it cannot find this file; to be fixed in the ORT release). Note that in a docker container, you need to add the file AFTER downloading the third-party onnx modules: for that, run the ORT build script with the option --update and then call the build script again with the option --build
  3. adding --use_tensorrt_oss_parser
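The two-phase flow from step 2 can be sketched as a small driver, so that the header is dropped in only after the dependencies have been fetched. This is an illustration under assumptions: the function name, its parameters and the injectable `run` argument are hypothetical, and the destination `onnx/common/stl_backports.h` follows the location mentioned earlier in this thread.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, List, Sequence, Union


def build_in_two_phases(
    common_flags: Sequence[str],
    header_src: Union[str, Path],
    onnx_src_dir: Union[str, Path],
    run: Callable[[List[str]], object] = subprocess.check_call,
) -> Path:
    """Run build.sh --update, copy stl_backports.h into onnx-src, then run build.sh --build."""
    # Phase 1: configure and download third-party sources only.
    run(["./build.sh", *common_flags, "--update"])

    # Add the missing helper header once onnx-src exists.
    dest = Path(onnx_src_dir) / "onnx" / "common" / "stl_backports.h"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(header_src, dest)

    # Phase 2: compile with the patched dependency tree.
    run(["./build.sh", *common_flags, "--build"])
    return dest
```

Passing the same `common_flags` to both phases keeps the configuration identical between the update and build steps.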

error is:

[ 32%] Building CXX object CMakeFiles/onnxruntime_common.dir/tmp/onnxruntime/onnxruntime/core/common/status.cc.o
In file included from /tmp/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/ModelImporter.cpp:8:
/tmp/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnx_utils.hpp: In function ‘std::ostream& operator<<(std::ostream&, const onnx::Mode
/tmp/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnx_utils.hpp:128:43: error: invalid initialization of reference of type ‘const googm expression of type ‘const onnx::ModelProto’
  128 | stream << pretty_print_onnx_to_string(message);
      | ^~~
/tmp/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnx_utils.hpp:117:83: note: in passing argument 1 of ‘std::string pretty_print_onnx_tobuf::Message&)’
  117 | inline std::string pretty_print_onnx_to_string(::google::protobuf::Message const& message)
      | ~~~~~~~^~~

MiroPsota commented 4 months ago

I will probably try CUDA 12.x in the future. For now, I can only provide the script I just built ORT with on Ubuntu 24.04:

git clone -b v1.18.1 --depth 1 https://github.com/microsoft/onnxruntime.git onnxruntime-1.18.1
cd onnxruntime-1.18.1

sed -i 's@onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/06adf4461ac84035bee658c6cf5df39f7ab6071d.zip;46dceef659d75d276e7914a8057c2282269d5e7b@onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/0462dc31ae78f48744b6141ae376df1f96d3f459.zip;5ff086361956cceb81ed17453a1fd8db2aa4328d@' cmake/deps.txt

export CC=/usr/bin/gcc-11
export CXX=/usr/bin/g++-11
./build.sh \
  --config Release \
  --build_wheel \
  --build_shared_lib \
  --parallel 4 \
  --compile_no_warning_as_error \
  --skip_submodule_sync \
  --use_cuda \
  --cuda_home $LIBS_DIR/cuda-11.8.0 \
  --cudnn_home $LIBS_DIR/cudnn-linux-x86_64-8.9.2.26_cuda11-archive \
  --use_tensorrt \
  --tensorrt_home $LIBS_DIR/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8/TensorRT-8.6.1.6 \
  --use_tensorrt_oss_parser \
  --use_openvino CPU \
  --cmake_generator Ninja \
  --cmake_extra_defines \
    CMAKE_C_COMPILER=/usr/bin/gcc-11 \
    CMAKE_CXX_COMPILER=/usr/bin/g++-11 \
    CMAKE_CUDA_ARCHITECTURES=86 \
    CMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11 \
    OpenVINO_DIR=$LIBS_DIR/l_openvino_toolkit_ubuntu22_2024.1.0.15008.f4afc983258_x86_64/runtime/cmake \
    onnxruntime_BUILD_UNIT_TESTS=OFF \
  --skip_tests \
  --update

wget https://raw.githubusercontent.com/onnx/onnx/v1.14.1/onnx/common/stl_backports.h -O build/Linux/Release/_deps/onnx-src/onnx/common/stl_backports.h

./build.sh \
  --config Release \
  --build_wheel \
  --build_shared_lib \
  --parallel 4 \
  --compile_no_warning_as_error \
  --skip_submodule_sync \
  --use_cuda \
  --cuda_home $LIBS_DIR/cuda-11.8.0 \
  --cudnn_home $LIBS_DIR/cudnn-linux-x86_64-8.9.2.26_cuda11-archive \
  --use_tensorrt \
  --tensorrt_home $LIBS_DIR/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-11.8/TensorRT-8.6.1.6 \
  --use_tensorrt_oss_parser \
  --use_openvino CPU \
  --cmake_generator Ninja \
  --cmake_extra_defines \
    CMAKE_C_COMPILER=/usr/bin/gcc-11 \
    CMAKE_CXX_COMPILER=/usr/bin/g++-11 \
    CMAKE_CUDA_ARCHITECTURES=86 \
    CMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11 \
    OpenVINO_DIR=$LIBS_DIR/l_openvino_toolkit_ubuntu22_2024.1.0.15008.f4afc983258_x86_64/runtime/cmake \
    onnxruntime_BUILD_UNIT_TESTS=OFF \
  --skip_tests

Libraries are from here: CUDA cuDNN TensorRT OpenVINO

chilo-ms commented 4 months ago

@chilo-ms , ORT 1.18.1 now uses Cudnn 9 (vs 8 previously in 1.18.0) . What about TRT : is TRT10 fixed for faster-rcnn model or shall we still use TRT 8 (I am using Cuda 12.2). 1.18.1 is said to -Now using latest commit of onnx-tensorrt parser, which includes several issue fixes : is this using DDS by default ?

The TRT 10 DDS issue in Faster-RCNN should be resolved in TRT 10.3, which is planned for release in August. Also, TRT 10.3 is going to support DDS in the built-in parser, which means DDS is enabled for the out-of-the-box TRT EP (we won't need to build from source with --use_tensorrt_oss_parser).

Right now, if you want to use DDS, please do ORT 1.18.1 + TRT 8.6 with additional --use_tensorrt_oss_parser and modified deps.txt:

- onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/a43ce67187bab219520fd80f21af8bbd4354bc8c.zip;572535aefef477050f86744dfab1fef840198035
+ onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/bacfaaa951653cd4e72efe727a543567cb38f7de.zip;26434329612e804164ab7baa6ae629ada56c1b26
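
For anyone scripting this change, the swap in the diff above could be applied with a small helper. This is only a sketch: the function name `set_onnx_tensorrt_dep` is made up for illustration (it is not part of ORT), and it assumes deps.txt entries keep the `name;url;sha1` layout visible in the diff.

```shell
# Hypothetical helper (not part of ORT): repoint the onnx_tensorrt entry of a
# deps.txt file to another onnx-tensorrt commit.
# Assumes entries use the "name;url;sha1" layout shown in the diff above.
set_onnx_tensorrt_dep() {
  local deps_file="$1" commit="$2" sha1="$3" tmp
  tmp=$(mktemp) || return 1
  awk -F';' -v commit="$commit" -v sha1="$sha1" '
    $1 == "onnx_tensorrt" {
      # Replace the whole entry; every other line passes through unchanged.
      print "onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/" commit ".zip;" sha1
      next
    }
    { print }
  ' "$deps_file" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Write back in place so the file keeps its original permissions.
  cat "$tmp" > "$deps_file"
  rm -f "$tmp"
}
```

From the root of the ORT checkout, the values quoted above would be passed as `set_onnx_tensorrt_dep cmake/deps.txt bacfaaa951653cd4e72efe727a543567cb38f7de 26434329612e804164ab7baa6ae629ada56c1b26`.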

chilo-ms commented 3 months ago

@jcdatin

TRT 10.2 is out, which should fix the Faster-RCNN model issue. Could you share your custom Faster-RCNN model with us to verify? We also want to test your model against the DDS-enabled feature in TRT so that we can evaluate and enable DDS support by default in ORT TRT (meaning DDS would be supported out of the box).

MiroPsota commented 3 months ago

Can I build ORT 1.18.1 with TRT 10.2 or should I use main?

chilo-ms commented 3 months ago

> Can I build ORT 1.18.1 with TRT 10.2 or should I use main?

Both should work

jcdatin commented 3 months ago

I will try, but give me some time.

yf711 commented 3 months ago

I got a compilation error for ORT 1.18.1 after

  1. changing deps.txt as above to use the latest TRT 8 commit, and using cuDNN 8.9 (with CUDA 12.2 in my case)
  2. adding the file stl_backports.h to the onnx deps (otherwise the compiler complains that it cannot find this file; to be fixed in the ORT release). Note that in a Docker container you need to add the file AFTER downloading the third-party onnx modules: run the ORT build script with the option --update, add the file, then call the build script again with the option --build
  3. adding --use_tensorrt_oss_parser

The error is:

    [ 32%] Building CXX object CMakeFiles/onnxruntime_common.dir/tmp/onnxruntime/onnxruntime/core/common/status.cc.o
    In file included from /tmp/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/ModelImporter.cpp:8:
    /tmp/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnx_utils.hpp: In function ‘std::ostream& operator<<(std::ostream&, const onnx::Mode
    /tmp/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnx_utils.hpp:128:43: error: invalid initialization of reference of type ‘const googm expression of type ‘const onnx::ModelProto’
      128 | stream << pretty_print_onnx_to_string(message);
          | ^~~
    /tmp/onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnx_utils.hpp:117:83: note: in passing argument 1 of ‘std::string pretty_print_onnx_tobuf::Message&)’
      117 | inline std::string pretty_print_onnx_to_string(::google::protobuf::Message const& message)
          | ~~~~~~~^~~

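Btw, TRT 8.6 with the OSS parser enabled relies on the full protobuf library, whereas ORT 1.18.1 uses protobuf-lite by default. With protobuf-lite, onnx::ModelProto does not derive from google::protobuf::Message, which is what the error above complains about. You can enable full protobuf by changing this option https://github.com/microsoft/onnxruntime/blob/v1.18.1/cmake/CMakeLists.txt#L120 when building ORT 1.18.1 with TRT 8.6.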
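
Instead of editing CMakeLists.txt, the option could probably be switched from the command line. A sketch, assuming the option behind the link is `onnxruntime_USE_FULL_PROTOBUF` (OFF by default) and that it can be passed through `--cmake_extra_defines` like the other defines in this thread; the function name is hypothetical:

```shell
# Extra build.sh arguments for an OSS-parser build that links full protobuf.
# Assumption: the CMake option behind the link above is
# onnxruntime_USE_FULL_PROTOBUF (OFF by default, which selects protobuf-lite).
full_protobuf_build_args() {
  printf '%s\n' \
    --use_tensorrt \
    --use_tensorrt_oss_parser \
    --cmake_extra_defines onnxruntime_USE_FULL_PROTOBUF=ON
}

# Not executed here; append the arguments to your usual invocation, e.g.:
#   ./build.sh --config Release --use_cuda ... $(full_protobuf_build_args)
```

Keeping the define on the command line leaves the checked-out sources unmodified, which makes it easier to switch back to the protobuf-lite default later.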

jcdatin commented 3 months ago

@yf711: thanks, I will try that, since my attempt to build ORT 1.18.1 with TRT 10, cuDNN 9 and CUDA 12.2 using --use_tensorrt_oss_parser fails.

With ORT 1.18.1 and the default TRT 10 build, I get an error when the TRT EP first loads/parses my model (note: my model works fine with a derivative of 1.17.1, ORT build bb1972264b with TRT 8 and cuDNN 8). The error is:

    terminate called after throwing an instance of 'Ort::Exception'
      what(): User needs to provide all the dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options. Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide shape profiles for the TRT subgraph's input. Following input(s) has no associated shape profiles provided: /model/my_model/rpn/Squeeze_2_output_0,/model/my_model/rpn/Squeeze_1_output_0,/model/my_model/el/my_model/rpn/NonZero_output_0

This is weird, since I ran onnxsim and symbolic_shape_infer.py on the model, which has been the only way to run the TRT EP on my model from the beginning. So I don't understand why new shape profiles are needed with TRT 10 and ORT 1.18.1.

I checked that my Faster-RCNN model works with the CUDA EP and with the TRT EP in "normal cache" mode.

What is not working any more compared to build bb1972264b is the embedded context mode of TRT, which nicely speeds up (x10) the ONNX model load time (from 3s to 300ms). Here is the code:

    // WRAPPED TRT EP
    const auto& api = Ort::GetApi();
    OrtTensorRTProviderOptionsV2* tensorrt_options;
    Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&tensorrt_options));

    std::vector<const char*> option_keys = {
        "trt_fp16_enable",
        "trt_engine_cache_enable",
        "trt_engine_cache_path",
        "trt_dump_ep_context_model",
        "trt_ep_context_file_path",
        "trt_profile_min_shapes",
        "trt_profile_max_shapes",
        "trt_profile_opt_shapes",
    };
    std::vector<const char*> option_values = {
        "0",    // trt_fp16_enable
        "1",   // trt_engine_cache_enable : create the embedded profile and engine in cache path (trt_ep_context_file_path)
        cachePath.c_str(),   // trt_engine_cache_path : relative path to the embedded profile and engine cache
        "1",                       // trt_dump_ep_context_model : create the embedded model context (_ctx.onnx) file that contains names of profile and engine
        contextPath.c_str(),       // trt_ep_context_file_path : path to the embedded context files
        "image:0x0",                 // trt_profile_min_shapes
        "image:3072x2400",     // trt_profile_max_shapes
        "image:2048x1024",    // trt_profile_opt_shapes
    };

Has the TRT EP API for the embedded TRT context, which was introduced and worked well in ORT 1.17.1, changed in ORT 1.18.1?

jcdatin commented 3 months ago

Apologies: my build above was still using TRT 10.0.1 (with cuDNN 9); retrying with TRT 10.3.0.29 (and cuDNN 9).

chilo-ms commented 3 months ago

@jcdatin

> What is not working any more compared to build https://github.com/microsoft/onnxruntime/commit/bb1972264b89261e98d438367eb54d97eea52c12 is the embedded context mode of TRT which is nice to speed up (x10) the onnx model load time (from 3s to 300ms).

That's weird; nothing changed in the EPContext/embedded engine feature between ORT 1.17 and ORT 1.18. What error did you see?

> my build above was still using TRT 10.0.1 (w/ cudnn 9), retrying with TRT 10.3.0.29 (and cudnn 9)

Yes, please use the latest TRT 10.3, which fixes issues when running Faster-RCNN.

jcdatin commented 3 months ago

Rebuilt ORT 1.18.1 with TRT 10.3.0.26 (and cudnn 9.3.0.75) - with cuda 12.2

First observation (when not using the TRT embedded context): I still see not only the NonZero, NonMaxSuppression and RoiAlign nodes on the CPU EP, but now also the ScatterND node (compared to build https://github.com/microsoft/onnxruntime/commit/bb1972264b89261e98d438367eb54d97eea52c12). This slows execution down by about 2x compared to when all these nodes were on TRT. I used this ORT 1.18.1 build command:

    CC=gcc-11 CXX=g++-11 ./build.sh --nvcc_threads 2 --config $ORT_BUILD_MODE --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75" --cmake_extra_defines "CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11"

Why is --use_tensorrt_oss_parser not assigning these nodes to the TRT EP, as build bb19722 did?

First things first: do you know why I still have these nodes on the CPU EP? Should I remove the --use_tensorrt_oss_parser option? (I am going to try.)

Second observation: when using the TRT embedded context with the config above, I get the same error as with TRT 10.0:

    terminate called after throwing an instance of 'Ort::Exception'
      what(): User needs to provide all the dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options. Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide shape profiles for the TRT subgraph's input if it's dynamic shape input. Following input(s) has no associated shape profiles provided: /model/my_model/rpn/Squeeze_2_output_0,/model/my_model/rpn/Squeeze_1_output_0,/model/my_model/rpn/Reshape_17_output_0,/model/my_model/rpn/NonZero_output_0
    Aborted (core dumped)

This used to work with TRT 8.6/cuDNN 8.9 and ORT build bb19722. As I said, I ran symbolic_shape_infer.py on my Faster-RCNN model before running ORT with the TRT EP (the only way to run the TRT EP on it anyway). This is another big issue, since without the embedded context the ONNX model load time is 10x longer.

So far, these regressions are too big for me to use ORT 1.18.1 and beyond.

jcdatin commented 3 months ago

Another question: which ORT graph optimization level should be used in conjunction with the TRT EP (which has its own optimizations), i.e. which value should be passed to sessionOptions.SetGraphOptimizationLevel(optiLevel);?

jcdatin commented 3 months ago

I tried building ORT 1.18.1 with TRT 10.3 without --use_tensorrt_oss_parser, and the following nodes are still on the CPU EP: NonZero, NonMaxSuppression, RoiAlign and ScatterND.

1) It seems ORT 1.18.1 does not work optimally with DDS on TRT 10.3 (for my Faster-RCNN).

2) Similarly, the embedded TRT context is not working with ORT 1.18.1 and TRT 10.3; it crashes with the error above. However, I noticed the following warnings with TRT 10, which may indicate that the problem is a regression in TRT 10. Can you confirm?

    2024-08-17 09:50:08.899501193 [W:onnxruntime:Inference, tensorrt_execution_provider.h:86 log] [2024-08-17 09:50:08 WARNING] /model/my_model/rpn/Reshape: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 3 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
    2024-08-17 09:50:08.899536800 [W:onnxruntime:Inference, tensorrt_execution_provider.h:86 log] [2024-08-17 09:50:08 WARNING] /model/my_model/rpn/Reshape: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 4 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
    2024-08-17 09:50:08.899550675 [W:onnxruntime:Inference, tensorrt_execution_provider.h:86 log] [2024-08-17 09:50:08 WARNING] /model/my_model/rpn/Reshape_2: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 3 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
    2024-08-17 09:50:08.899562173 [W:onnxruntime:Inference, tensorrt_execution_provider.h:86 log] [2024-08-17 09:50:08 WARNING] /model/my_model/rpn/Reshape_2: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 4 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
    2024-08-17 09:50:20.218509763 [W:onnxruntime:Inference, tensorrt_execution_provider.h:86 log] [2024-08-17 09:50:20 WARNING] Profile kMIN values are not self-consistent.
    IShuffleLayer /model/my_model/rpn/Reshape: reshaping failed for tensor: /model/my_model/rpn/head/cls_logits/Conv_output_0 Reshape placeholder 0 has no corresponding input dimension. Instruction: RESHAPE_ZERO_IS_PLACEHOLDERinput dims{1 13 0 0} reshape dims{1 -1 1 0 0}.

First things first: can you investigate why the DDS nodes are not on the TRT EP?

chilo-ms commented 2 months ago

> Rebuilt ORT 1.18.1 with TRT 10.3.0.26 (and cudnn 9.3.0.75) - with cuda 12.2
>
> First observation (when not using the TRT embedded context): I still see not only the NonZero, NonMaxSuppression and RoiAlign nodes on the CPU EP, but now also the ScatterND node (compared to build bb19722). This slows execution down by about 2x compared to when all these nodes were on TRT. I used this ORT 1.18.1 build command:
>
>     CC=gcc-11 CXX=g++-11 ./build.sh --nvcc_threads 2 --config $ORT_BUILD_MODE --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75" --cmake_extra_defines "CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11"
>
> Why is --use_tensorrt_oss_parser not assigning these nodes to the TRT EP, as build bb19722 did?
>
> First things first: do you know why I still have these nodes on the CPU EP? Should I remove the --use_tensorrt_oss_parser option? (I am going to try.)

Let me reply to the first question. ORT 1.18.x and current main, built with --use_tensorrt_oss_parser, do not enable TRT DDS nodes. Build bb19722 (dating back to April) did enable DDS nodes; however, TRT 10 has some DDS-related issues, so we have disabled TRT DDS nodes since then.

I agree it's a bit complicated to enable DDS the way I mentioned here. Please use this branch to build ORT with --use_tensorrt_oss_parser against TRT 10.3. You don't need to modify any additional files, and you can then run the TRT EP with DDS enabled, meaning NonZero, NMS and RoiAlign should be run by TRT.

One thing to note: when running the NMS node, TRT EP + TRT 10.3 takes much longer to finish than with TRT 8.6. We are still investigating the issue. If possible, could you share your model with us to test? Or could you help test from your side?

chilo-ms commented 2 months ago

> Second observation: when using the TRT embedded context with the config above, I get the same error as with TRT 10.0:
>
>     terminate called after throwing an instance of 'Ort::Exception'
>       what(): User needs to provide all the dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options. Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide shape profiles for the TRT subgraph's input if it's dynamic shape input. Following input(s) has no associated shape profiles provided: /model/my_model/rpn/Squeeze_2_output_0,/model/my_model/rpn/Squeeze_1_output_0,/model/my_model/rpn/Reshape_17_output_0,/model/my_model/rpn/NonZero_output_0
>     Aborted (core dumped)

@jcdatin In order to use the embedded context, the whole model must be TRT-eligible, meaning the whole model must be placed on the TRT EP. See https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#more-about-embedded-engine-model--epcontext-model. In your case, some nodes are placed on the CPU. (Please see my previous reply for how to fix this.)
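
A quick way to check this precondition is to summarize the node placements from a VERBOSE session log before enabling the embedded context. The sketch below assumes the `Node(s) placed on [...]. Number of nodes: N` line format seen in the log excerpt at the top of this issue, which may differ in other ORT versions; both function names are hypothetical.

```shell
# Print "<execution provider> <node count>" for each placement line in a
# VERBOSE ORT log. Assumes the line format from the log at the top of this issue.
summarize_node_placement() {
  sed -n 's/.*Node(s) placed on \[\([^]]*\)]\. Number of nodes: \([0-9][0-9]*\).*/\1 \2/p' "$1"
}

# Succeed only if the log has at least one placement line and every placement
# line names TensorrtExecutionProvider.
all_on_tensorrt() {
  local placements
  placements=$(summarize_node_placement "$1")
  [ -n "$placements" ] && ! printf '%s\n' "$placements" | grep -qv '^TensorrtExecutionProvider '
}
```

For example, `all_on_tensorrt session.log || echo "some nodes are not on the TRT EP"`, where `session.log` is a hypothetical file holding the captured verbose output. This is only a rough pre-check on EP assignment; it does not prove that the EPContext dump will succeed.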

jcdatin commented 2 months ago

@chilo-ms: thanks for your answer. I was on vacation. I will try DDS with your branch and investigate the NMS node with the TRT EP on TRT 10.3. I will also check that the TRT embedded context works once all nodes are on the TRT EP.

jcdatin commented 1 month ago

Nvidia informed me that the NMS performance issue is a known problem that will be fixed in TRT 10.6

chilo-ms commented 1 month ago

> Nvidia informed me that the NMS performance issue is a known problem that will be fixed in TRT 10.6

Yeah, the NMS regression in TRT 10 is a known issue and Nvidia has been investigating it. We have been tracking it with them, and hopefully it will be fixed in TRT 10.6.

jcdatin commented 1 week ago

TRT 10.6 is out, as well as ORT 1.20, but I see some restrictions:

Which CUDA versions does ORT support? I am using 12.2, and TRT 10.6 seems to require 12.6. But first things first: does ORT 1.20 support TRT 10.6 and DDS?

chilo-ms commented 5 days ago

Re: ORT 1.20 only supports TRT 10.4 and 10.5 (and I need TRT 10.6)

"ORT 1.20 supports TRT 10.4 and 10.5" means that our CIs were tested against those TRT versions and that the prebuilt package was built against them. You can still run the prebuilt ORT TRT library with TRT 10.6. (Note: add the TRT 10.6 lib path to LD_LIBRARY_PATH.)
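
The LD_LIBRARY_PATH note could look like the sketch below. The helper name is hypothetical, and the directory in the usage line assumes a tarball install under /usr/local/TensorRT (the --tensorrt_home used elsewhere in this thread) with the libraries in its lib subdirectory.

```shell
# Hypothetical helper: put a directory at the front of LD_LIBRARY_PATH so the
# dynamic loader searches it first. If the directory is already listed, the
# variable is left untouched (it is not moved to the front).
prepend_ld_library_path() {
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$1:"*) ;;
    *) export LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
}

# Usage (path is an assumption, adjust to your TRT 10.6 install):
#   prepend_ld_library_path /usr/local/TensorRT/lib
```

Running `ldd` on libonnxruntime_providers_tensorrt.so afterwards should show which libnvinfer the loader actually resolves.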

Re: Previous ORT and TRT 10.x could not dispatch NMS or NonZero ops to TRT, so I have to take TRT 10.6: will ORT still dispatch NMS/NonZero to TRT? I prefer the TRT perf limitation over ORT still dispatching these DDS ops to CPU.

Starting with TRT 10.7 (which is not released yet), TRT will fully enable DDS ops, so ORT will dispatch NMS/NonZero/RoiAlign to TRT by default. Before TRT 10.7, users need to build ORT with the open-source parser to achieve this. Please be aware of the known DDS perf issue from TRT 10.0 to 10.7 (Nvidia likely won't fix the issue in TRT 10.7).
ORT TRT has a PR (to be included in the ORT 1.20.1 patch release) that adds a new provider option, trt_op_types_to_exclude, which excludes the listed op types from running on TRT. This PR also adds NMS/NonZero/RoiAlign to the exclude list by default because of the perf issue. Users can pass an empty string, i.e. trt_op_types_to_exclude="", to override this so that all ops are considered for TRT.

Re: what is the version of Cuda supported by ORT : I am using 12.2 and TRT 10.6 seems to require 12.6.

ORT should be compatible with CUDA 12.x. Did you find any issue running ORT with CUDA 12.6?

jcdatin commented 5 days ago

Thank you @chilo-ms. I am building and testing 1.20.0 with TRT 10.6 and the OSS TRT parser. I will report on the TRT 10.6 DDS operator performance degradation. When TRT 10.7 is available, I will test it with ORT 1.20.1, an empty trt_op_types_to_exclude list and the default TRT parser. I will keep you posted.

jcdatin commented 4 days ago

I am getting an ORT 1.20.0 compilation error when building with TRT 10.6 (TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz) over CUDA 12.2, with this build command:

    CC=gcc-11 CXX=g++-11 ./build.sh --skip_submodule_sync --nvcc_threads 2 --config ${ORT_BUILD_MODE} --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=89" --cmake_extra_defines "CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11"

The error is:

    [ 31%] Building CXX object _deps/onnx_tensorrt-build/CMakeFiles/nvonnxparser_static.dir/onnxErrorRecorder.cpp.
    In file included from /onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.cpp:5:
    /onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.hpp:32:38: error: ‘ILogger’ in namespace ‘nvinfer1’ does not name a type
       32 | using ILogger = nvinfer1::ILogger;
          | ^~~
    /onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.hpp:39:9: error: ‘ILogger’ has not been declared
       39 | ILogger logger, IErrorRecorder otherRecorder = nullptr);
          | ^~~
    /onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.hpp:70:36: error: expected ‘)’ before ‘’ token
       70 | ONNXParserErrorRecorder(ILogger logger, IErrorRecorder otherRecorder = nullptr);
          | ~ ^
          | )
    /onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.hpp:74:26: error: ‘ILogger’ has not been declared
       74 | static void logError(ILogger logger, const char str);
          | ^~~
    /onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.hpp:103:5: error: ‘ILogger’ does not name a type
      103 | ILogger mLogger{nullptr};
          | ^~~
    /onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.cpp:12:26: error: ‘onnx2trt::ONNXParserErrorRecorder onnx2trt::ONNXParserErrorRecorder::create’ is not a static data member of ‘class onnx2trt::ONNXParserErrorRecorder’
       12 | ONNXParserErrorRecorder ONNXParserErrorRecorder::create(
          | ^~~~~~~
    /onnxruntime/build/Linux/Release/_deps/onnx_tensorrt-src/onnxErrorRecorder.cpp:13:15: error: ‘ILogger’ is not a member of ‘nvinfer1’
       13 | nvinfer1::ILogger logger, nvinfer1::IErrorRecorder otherRecorder)
          | ^~~
    gmake[2]: *** Waiting for unfinished jobs....

chilo-ms commented 4 days ago

Please specify the correct onnx-tensorrt commit in cmake/deps.txt of your ORT repo: https://github.com/microsoft/onnxruntime/blob/main/cmake/deps.txt#L40
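
Before rebuilding, it may help to confirm which onnx-tensorrt commit the checkout actually pins. A sketch with a hypothetical helper name, assuming deps.txt entries use the `name;url;sha1` layout quoted earlier in this thread, with the commit as the last URL component:

```shell
# Print the onnx-tensorrt commit pinned in a deps.txt file.
# Assumes a "name;url;sha1" entry whose URL ends in "<commit>.zip".
pinned_onnx_tensorrt_commit() {
  awk -F';' '$1 == "onnx_tensorrt" {
    n = split($2, parts, "/")      # last URL component is "<commit>.zip"
    sub(/\.zip$/, "", parts[n])    # drop the archive extension
    print parts[n]
  }' "$1"
}
```

Run it as `pinned_onnx_tensorrt_commit cmake/deps.txt` from the root of the ORT checkout and compare the output with the commit you intended to build against.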