microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

TensorrtExecutionProvider slower than CUDAExecutionProvider: Faster-rcnn [Performance] #17434

Open datinje opened 1 year ago

datinje commented 1 year ago

Describe the issue

On my Faster-RCNN-RPN models doing pattern detection, after considerable effort to get inference running with the TensorRT EP (see https://github.com/microsoft/onnxruntime/issues/16886, which shows that I simplified the model and inferred the shapes of the model nodes before submitting it to TRT), I found that the TRT EP is about 30% slower than the CUDA EP in FP32 (and in TF32); only with FP16 does the TRT EP almost catch up.

I am only talking about the second inference here, not the warm-up one (which is considerably slower, which is normal).

After looking at the VERBOSE-mode logs, I found out that not all the nodes run on TRT: one is still on the CPU EP and five on the CUDA EP. That causes many memory transfers between host and GPU, which I suppose is the reason. So my question is: why are there still nodes on the CPU and CUDA EPs? Can this be fixed?

Here are the logs:

    2023-09-06 16:45:59.604024060 [V:onnxruntime:, session_state.cc:1149 VerifyEachNodeIsAssignedToAnEp] Node placements
    2023-09-06 16:45:59.604038849 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [TensorrtExecutionProvider]. Number of nodes: 11
    2023-09-06 16:45:59.604042765 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_0 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_0_0)
    2023-09-06 16:45:59.604046398 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_1 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_1_1)
    2023-09-06 16:45:59.604049385 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_2 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_2_2)
    2023-09-06 16:45:59.604052381 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_3 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_3_3)
    2023-09-06 16:45:59.604055213 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_4 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_4_4)
    2023-09-06 16:45:59.604057978 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_5 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_5_5)
    2023-09-06 16:45:59.604060720 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] TRTKernel_graph_torch_jit_15684953649142847852_6 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_6_6)
    2023-09-06 16:45:59.604063521 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyFromHost (Memcpy)
    2023-09-06 16:45:59.604066111 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyToHost (Memcpy_token_422)
    2023-09-06 16:45:59.604068754 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyToHost (Memcpy_token_423)
    2023-09-06 16:45:59.604078119 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] MemcpyToHost (Memcpy_token_424)
    2023-09-06 16:45:59.604081367 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CPUExecutionProvider]. Number of nodes: 1
    2023-09-06 16:45:59.604086459 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] RoiAlign (/model/roi_heads/box_pooler/level_poolers.0/RoiAlign)
    2023-09-06 16:45:59.604093948 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CUDAExecutionProvider]. Number of nodes: 5
    2023-09-06 16:45:59.604099017 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/proposal_generator/NonZero)
    2023-09-06 16:45:59.604103942 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonMaxSuppression (NonMaxSuppression_497)
    2023-09-06 16:45:59.604108777 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/roi_heads/NonZero)
    2023-09-06 16:45:59.604113159 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonMaxSuppression (NonMaxSuppression_796)
    2023-09-06 16:45:59.604117903 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp] NonZero (/model/NonZero)
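For context, a minimal Python sketch of how such a placement log can be obtained by enabling verbose session logging; the model path and provider list are illustrative, not the exact setup used here:

```python
import onnxruntime as ort

# Verbose session logging prints the VerifyEachNodeIsAssignedToAnEp node
# placement summary (like the one above) during session initialization.
sess_options = ort.SessionOptions()
sess_options.log_severity_level = 0  # 0 = VERBOSE

providers = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

# "model.onnx" is a placeholder path, not the actual model from this issue.
sess = ort.InferenceSession("model.onnx",
                            sess_options=sess_options,
                            providers=providers)
```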

I got the same issue in both C++ and python runtime APIs

To reproduce

I can't share my model for IP reasons, but I see similar issues with the public Detectron2 model zoo faster-rcnn-rpn (see https://github.com/microsoft/onnxruntime/issues/16886 for how to run it), except that with that one even more nodes fall back to CPU and CUDA, among them the nodes listed above. So fixes found while investigating that model may well lead to the same fixes here.

Urgency

I have been blocked for several months trying to run the model on the TRT EP (see https://github.com/microsoft/onnxruntime/issues/16886; thanks to the ORT staff that helped me), only to find out that it may not be worth it. It looks like I am not far off, with only 3 operators/nodes left to get onto the TRT EP, but time is up: in a couple of months I will need to freeze the model to certify the results, with no second chance to certify with TRT FP16 or, better, INT8. I am expecting a 2x performance improvement with TRT FP16 and another 2x improvement with INT8 (accuracy is still excellent in FP16).

Platform

Linux

OS Version

SLES15 SP4

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.15.1+ (using latest main for a fix needed to build the TRT EP)

ONNX Runtime API

Python

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

TensorRT 8.6.1

Model File

I can't, but one could use faster-rcnn-rpn from the Detectron2 model zoo (see https://github.com/microsoft/onnxruntime/issues/16886)

Is this a quantized model?

No

chilo-ms commented 1 year ago

The onnx-tensorrt parser filters out NonMaxSuppression, NonZero, and RoiAlign, which is why you see those nodes placed on the CUDA/CPU EP. I also think the many memcpys between CPU and GPU cause this long-latency issue. It is a general issue for RCNN models, meaning you will see many TRT/CUDA/CPU partitions. We will discuss with Nvidia whether it is possible/beneficial to incorporate those nodes (I doubt it) or whether we can find other ways to improve the performance.

chilo-ms commented 1 year ago

TensorRT provides an nmsPlugin and a roiAlignPlugin. Perhaps we can replace the ONNX NonMaxSuppression and RoiAlign nodes with those two TRT plugins to see how the latency changes?

skottmckay commented 1 year ago

Typically the nodes from NonMaxSuppression and on are selecting the best bounding boxes. These are relatively cheap operations where it's more efficient to stay on CPU than go back to GPU. In the NNAPI EP we have the option to set an operator after which NNAPI is not used, and we do that for NonMaxSuppression. Maybe something similar would also work for TRT/CUDA for this type of model.

datinje commented 1 year ago

The onnx-tensorrt parser filters out NonMaxSuppression, NonZero, and RoiAlign, which is why you see those nodes placed on the CUDA/CPU EP. I also think the many memcpys between CPU and GPU cause this long-latency issue. It is a general issue for RCNN models, meaning you will see many TRT/CUDA/CPU partitions. We will discuss with Nvidia whether it is possible/beneficial to incorporate those nodes (I doubt it) or whether we can find other ways to improve the performance.

So, even if, according to @skottmckay, these 3 operators are cheaper on CPU, can we try to keep them on the GPU to avoid the overhead of moving the data between CPU and GPU (in my case images of 13 MB)? Is that the goal/capability of the nmsPlugin and roiAlignPlugin? I am ready to try. Any example of how to do that? Shall I modify the model code, the resulting ONNX, or is it a mere declaration in the onnxruntime TensorRT EP configuration? What about the third operator, NonZero? I could not find a plugin for it; is there any possibility to keep it on the GPU to avoid memory transfers due to yet another subgraph split?

datinje commented 1 year ago

If I want to test the performance I get by not filtering out these operators, by commenting out the lines at https://github.com/onnx/onnx-tensorrt/blob/main/ModelImporter.cpp#L377, where shall I modify the ModelImporter.cpp file before recompiling onnxruntime?

I am recompiling onnxruntime with NVIDIA GPU and the TensorRT EP in my Docker image with:

    RUN git clone https://github.com/microsoft/onnxruntime
    WORKDIR /tmp/onnxruntime
    RUN CC=gcc-11 CXX=g++-11 ./build.sh --config RelWithDebInfo --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root

(I am using the latest main because a bug affecting the TensorRT EP build was fixed after ORT 1.15.1.)

datinje commented 1 year ago

What if I compile onnxruntime with --use_tensorrt_builtin_parser: will the nodes still be filtered out?

datinje commented 1 year ago

No change if I recompile onnxruntime with --use_tensorrt_builtin_parser; the nodes are still placed on CPU.

chilo-ms commented 1 year ago

Here are the steps to build the OSS onnx-tensorrt parser without filtering out those operators:

  1. Add --use_tensorrt_oss_parser as one of the ORT build arguments and start building.
  2. At the beginning of the ORT build, you will find the onnx-tensorrt repo being downloaded to ./build/Linux/Debug/_deps/onnx_tensorrt-src; simply comment out the node-filtering lines in ModelImporter.cpp.
  3. Resume the build. Note: you might encounter a build error about CUDA_INCLUDE_DIR not being found. Modify it here to set(CUDA_INCLUDE_DIR ${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}).

I tested the non-filtering onnx-tensorrt parser with Faster-RCNN from the ONNX model zoo and it can include those nodes for TRT, but it failed to build the TRT engine. I need to investigate further, but you can try your faster-rcnn model.

Update: Checked with Nvidia; those nodes only work with the TRT API enqueueV3, and TRT EP currently uses enqueueV2, so the enqueue error is expected. As for the engine build error that I saw, I will follow up with Nvidia. TensorRT EP is planning to move to the latest TRT APIs, but it is going to take some time.

chilo-ms commented 1 year ago

I think we can try the TRT plugins; please see the doc here. You need to modify the graph and replace the RoiAlign and NonMaxSuppression nodes with custom ops that are later mapped to TRT plugins (remember to set the name and domain of the custom node correctly). Unfortunately, there is no corresponding NonZero plugin for now.
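For illustration only, a rough sketch of that graph edit with the onnx Python helper API. The custom op names (EfficientNMS_TRT, ROIAlign_TRT) and the trt.plugins domain are assumptions here, and the plugins' expected inputs/attributes generally differ from the ONNX ops, so a plain rename is only a starting point; check the linked doc for the exact names, domain, and attributes.

```python
import onnx
from onnx import helper

model = onnx.load("model.onnx")  # placeholder path

# Retarget NMS / RoiAlign nodes to custom ops that the TRT EP can map to
# TensorRT plugins. Op names and domain below are assumptions; the plugins'
# expected inputs/attributes must also be adapted (not shown here).
for node in model.graph.node:
    if node.op_type == "NonMaxSuppression":
        node.op_type = "EfficientNMS_TRT"
        node.domain = "trt.plugins"
    elif node.op_type == "RoiAlign":
        node.op_type = "ROIAlign_TRT"
        node.domain = "trt.plugins"

# Register the custom domain so ORT accepts the modified graph.
model.opset_import.extend([helper.make_opsetid("trt.plugins", 1)])
onnx.save(model, "model_with_trt_plugins.onnx")
```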

datinje commented 1 year ago

Thanks a lot @chilo-ms: I will try to integrate the 2 plugins into my model to test the performance improvement, hoping that the ONNX Runtime TRT EP moves to the TRT enqueueV3 API soon. Expect some time before my next post as I am out of office next week.

datinje commented 1 year ago

After discussing with NVIDIA how to integrate the plugins, we found out that NMS and NonZero ARE implemented in TensorRT, cf.

For RoiAlign, the only way is via the TRT plugin, but is there a way to have the TRT EP call the native TRT implementation to avoid data transfers between CPU and GPU?

datinje commented 1 year ago

In 1.16.0 there is a new session option, disable_cpu_ep_fallback. How can we set it, and will it prevent NonZero and NMS from falling back to the CPU EP?

chilo-ms commented 1 year ago

@datinje Last time we checked with Nvidia, they mentioned NMS and NonZero are natively supported only via enqueueV3 (TRT EP currently uses enqueueV2). I am currently working on a dev branch to use enqueueV3. Before that dev branch is merged to main, I think you can only try the TRT NMS/NonZero plugins; please see my previous reply for how to use them. (Note: I encountered an engine build error, so I might have to update the engine build API as well. I will let you know once the dev branch is ready and merged to main.)

Please see here for how to use disable_cpu_ep_fallback. But in your case, you still need the CUDA EP or CPU to run those three nodes if you don't want to use the TRT plugins. If you use the TRT plugins, then because the whole model can be run by TRT (whether through native TRT or TRT plugins), there should be no data transfer between CPU and GPU except for the model input/output.
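For reference, a minimal Python sketch of setting that option, assuming the session config key is "session.disable_cpu_ep_fallback" (the model path is a placeholder):

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Fail session creation instead of silently falling back to the CPU EP.
sess_options.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    sess_options=sess_options,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
```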

datinje commented 11 months ago

As stated above by @chilo-ms, in 1.16 I tried disabling CPU EP fallback to avoid ONNX operators being moved to CPU when the onnxruntime parser decides so, but the effect is not to keep the operators on the GPU with TRT as I expected; it simply prevents the program from continuing.

Then what is the purpose of this option? The main interest for me would be for ONNX Runtime to keep the operators on the GPU even if they are faster on CPU, because the overhead of transferring the data offsets the benefit.

    2023-10-31 11:27:23.916547026 [E:onnxruntime:, inference_session.cc:1678 Initialize] This session contains graph nodes that are assigned to the default CPU EP, but fallback to CPU EP has been explicitly disabled by the user.

    Traceback (most recent call last):
      File "/cad-engine/run-onnx-pytorch.model.py", line 299, in <module>
        main()
      File "/cad-engine/run-onnx-pytorch.model.py", line 60, in main
        sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=providers)
      File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
        self._create_inference_session(providers, provider_options, disabled_optimizers)
      File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 471, in _create_inference_session
        sess.initialize_session(providers, provider_options, disabled_optimizers)
    onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : This session contains graph nodes that are assigned to the default CPU EP, but fallback to CPU EP has been explicitly disabled by the user.

datinje commented 11 months ago

Something went wrong in the copy-paste above, sorry; the "File ..." lines got mangled.

chilo-ms commented 10 months ago

@datinje

Then what is the purpose of this option ?

One of the purposes of disable_cpu_ep_fallback is to make sure all the nodes are placed on GPU before ORT starts to run inference. ORT may place some nodes on CPU for performance reasons, but in some cases that is not what you want, so this option works as a check.

However, in your case the error you got is expected, because current ORT TRT doesn't support NonZero, NMS and RoiAlign, and CPU is the only EP that can run these nodes. So you should only use disable_cpu_ep_fallback once all the nodes in your model are supported by ORT TRT; otherwise, you will get this error.

As I mentioned previously, you can try the steps above (building with the OSS onnx-tensorrt parser and commenting out the node filtering); then you will see that ORT TRT can run all the nodes of your FasterRCNN model except RoiAlign.

jcdatin commented 6 months ago

Closing, since I realized that with ORT 1.16.3 I succeeded in running my model with TRT and it is now faster than the CUDA EP in TF32.

jcdatin commented 5 months ago

I tested my model again with the latest onnxruntime 1.17.1 and got the same performance results between the TRT EP and the CUDA EP. I would have expected the TRT EP to use the TRT NMS and NonZero operators, since onnxruntime 1.17.1 now supports the TRT enqueueV3 API which would allow calling them. Is there any date planned for integrating these TRT ops into the onnxruntime TRT EP?

jcdatin commented 5 months ago

Even the NonZero op seems to be implemented in TRT: could it be supported in the ONNXRT TRT EP? With these 3 operators, ALL of the faster-rcnns would run on TRT and avoid host-to-device memory transfers!

jcdatin commented 5 months ago

https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_non_zero_layer.html

chilo-ms commented 5 months ago

I tested my model again with the latest onnxruntime 1.17.1 and got the same performance results between the TRT EP and the CUDA EP. I would have expected the TRT EP to use the TRT NMS and NonZero operators, since onnxruntime 1.17.1 now supports the TRT enqueueV3 API which would allow calling them. Is there any date planned for integrating these TRT ops into the onnxruntime TRT EP?

@jcdatin Unfortunately, for ORT 1.17.x, TRT EP doesn't include those DDS operators (NMS/NonZero/RoiAlign). But current ORT main branch + the OSS onnx-tensorrt parser will make TRT EP use the NMS/NonZero/RoiAlign TRT operators. You can simply build ORT main with --use_oss_trt_parser to achieve this.

We are testing TRT EP + TRT DDS output support (meaning including the NMS/NonZero/RoiAlign operators) to see the performance and then decide whether to enable this feature in the ORT official release.

If you could help test it and provide feedback, that would be great! Thank you!

jcdatin commented 5 months ago

Sure ! I will help.

jcdatin commented 5 months ago

Shall --use_oss_trt_parser REPLACE --use_tensorrt_builtin_parser, or simply complement it?

chilo-ms commented 5 months ago

If no parser-related option is specified, or --use_tensorrt_builtin_parser is specified, TRT EP will dynamically link against the built-in parser. If --use_oss_trt_parser is specified, ORT will build the onnx-tensorrt parser and TRT EP will statically link against it.

jcdatin commented 5 months ago

For those who read this later: the flag is actually --use_tensorrt_oss_parser. Retrying.

jcdatin commented 5 months ago

Tested 👍

    2024-04-13 13:51:39.673832151 [V:onnxruntime:, session_state.cc:1149 VerifyEachNodeIsAssignedToAnEp] All nodes placed on [TensorrtExecutionProvider]. Number of nodes: 1

Inference is now 3 times faster than before, first because there is NO device-to-host transfer anymore, and also because it seems that more graph node fusion optimizations occur. This is incredible; even in my dreams I could not have believed this. Congratulations onnxruntime team, and a big thanks to @chilo-ms who supported me all this time! This onnxruntime release with the TRT EP is a MAJOR improvement!

jcdatin commented 5 months ago

can't wait for the official release

jcdatin commented 5 months ago

Forgot to say: of course, the accuracy of the results is the same between the CUDA EP and this new TRT EP.

jcdatin commented 5 months ago

@chilo-ms one question though: I am getting the same inference results between ONNXRT + CUDA EP and ONNXRT + TRT EP for the same model and inputs on the same GPU. But the results are totally different between 2 GPUs (Turing sm75 and ADA sm89). My ONNXRT was compiled with the flag --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75"; is this the reason? I was under the impression that this flag meant that all GPUs ABOVE the sm75 architecture would also work the same, albeit not as optimized in performance. Note I started using this flag to fix this issue: https://github.com/microsoft/onnxruntime/issues/18579

Here is now my current onnxrt build command.

RUN CC=gcc-11 CXX=g++-11 ./build.sh --config RelWithDebInfo --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75" --cmake_extra_defines "CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11"

I intend to use the onnxrt on GPUs turing, ampere and ada (sm75, sm86 and sm89) : so is this flag correct ? Should I use a list of GPU architectures I would run onnxrt on as "-DCMAKE_CUDA_ARCHITECTURES=75;80;90" as stated by @snnn in https://github.com/microsoft/onnxruntime/issues/19606 ?

jcdatin commented 5 months ago

When using -DCMAKE_CUDA_ARCHITECTURES="75;86;89", or without using -DCMAKE_CUDA_ARCHITECTURES at all, the onnxruntime build fails on my Turing (sm_75) build machine on an sm80 CUDA file:

    cc error : 'cicc' died due to signal 9 (Kill signal)
    gmake[2]: *** [CMakeFiles/onnxruntime_providers_cuda.dir/build.make:5808: CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_hdim160_bf16_sm80.cu.o] Error 9

I will try to build on the ADA board, but I would like to understand what is going on.

chilo-ms commented 5 months ago

But the results are totally different between 2 GPUs (Turing sm75 and ADA sm89). My ONNXRT was compiled with the flag --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75"; is this the reason?

The output difference might be caused by ADA (SM89), which supports TF32. Please set the environment variable NVIDIA_TF32_OVERRIDE=0 as the following link suggests: https://github.com/microsoft/onnxruntime/issues/19288#issuecomment-1912724866
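A small sketch of how one might apply that in a Python run; assuming the variable must be set before CUDA is initialized, it is exported before onnxruntime creates the session (model path is a placeholder):

```python
import os

# Disable TF32 math on Ampere/Ada GPUs; set this before any CUDA context
# is created, i.e. before the onnxruntime session is built.
os.environ["NVIDIA_TF32_OVERRIDE"] = "0"

import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
```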

jcdatin commented 5 months ago

Turing (sm75) also supports TF32. I forgot to say that I am comparing the results of the ADA and Turing boards both in TF32, and even so the results would not be as different as what I am seeing.

Btw, I think the topic of this issue is fixed. Shall I close this issue and create a new problem report for this one (difference of results between ADA and Turing), or shall I reopen https://github.com/microsoft/onnxruntime/issues/18579 ?

jcdatin commented 5 months ago

Another question: with the CUDA execution provider I am seeing these warnings:

    2024-04-16 12:45:19.834391451 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 19 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
    2024-04-16 12:45:19.839297350 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.

This issue is about TensorRT EP performance, and we have seen that mapping all ONNX ops to TensorRT gives a dramatic performance improvement (a single subgraph is created, with NO host-device memcpy).

Still, can we do the same for the CUDA EP and avoid mapping some nodes to CPU?

chilo-ms commented 5 months ago

Turing (sm75) also supports TF32. I forgot to say that I am comparing the results of the ADA and Turing boards both in TF32, and even so the results would not be as different as what I am seeing.

As far as I know, Turing architecture (sm75) doesn't support TF32. TF32 is supported only on Ampere architecture or newer GPU architecture.

TF32 might introduce accuracy degradation, so I think the output difference could be due to Turing using FP32 and ADA using TF32. Can you disable TF32 by setting the environment variable NVIDIA_TF32_OVERRIDE=0 and test again?

chilo-ms commented 5 months ago

My ONNXRT was compiled with the flag --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75"; is this the reason? I was under the impression that this flag meant that all GPUs ABOVE the sm75 architecture would also work the same, albeit not as optimized in performance. Note I started using this flag to fix this issue: #18579

Here is now my current onnxrt build command.

RUN CC=gcc-11 CXX=g++-11 ./build.sh --config RelWithDebInfo --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75" --cmake_extra_defines "CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11"

I intend to use the onnxrt on GPUs turing, ampere and ada (sm75, sm86 and sm89) : so is this flag correct ? Should I use a list of GPU architectures I would run onnxrt on as "-DCMAKE_CUDA_ARCHITECTURES=75;80;90" as stated by @snnn in #19606 ?

First of all, CMAKE_CUDA_ARCHITECTURES is only used for the CUDA EP, not the TensorRT EP. TensorRT EP doesn't rely on this cmake variable to build the proper binary for the underlying GPU architecture. When TRT EP builds the TRT engine, TRT detects the underlying architecture internally and generates the engine accordingly; that's why an engine built by TRT EP can only run on that specific GPU. (Note: we will be supporting TRT hardware compatibility in the future.)

CMAKE_CUDA_ARCHITECTURES specifies exactly those compute capabilities (SMs), not a range or anything above them. So, if you want the CUDA EP to support Turing and ADA, please specify --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75;89".

chilo-ms commented 5 months ago

When using -DCMAKE_CUDA_ARCHITECTURES="75;86;89", or without using -DCMAKE_CUDA_ARCHITECTURES at all, the onnxruntime build fails on my Turing (sm_75) build machine on an sm80 CUDA file:

    cc error : 'cicc' died due to signal 9 (Kill signal)
    gmake[2]: *** [CMakeFiles/onnxruntime_providers_cuda.dir/build.make:5808: CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_hdim160_bf16_sm80.cu.o] Error 9

I will try to build on the ADA board, but I would like to understand what is going on.

Signal 9 likely means the system ran out of virtual memory and killed nvcc.

So you can either reduce the parallelism (nvcc threads) when building ORT, or reduce the number of compute capabilities specified in CMAKE_CUDA_ARCHITECTURES (I think that's why it works when you only specify CMAKE_CUDA_ARCHITECTURES=75 in your discussion https://github.com/microsoft/onnxruntime/issues/18579#issuecomment-1858246881).

jcdatin commented 5 months ago

Tried with the environment variable NVIDIA_TF32_OVERRIDE=0: results on the RTX8000 (Turing) are the same as without the env var. It has to be something else. I will try rebuilding onnxruntime with sm89 and compare results on the ADA board with the results on Turing with onnxruntime built with sm_75.

chilo-ms commented 5 months ago

Tried with the environment variable NVIDIA_TF32_OVERRIDE=0: results on the RTX8000 (Turing) are the same as without the env var.

Hmm, just to confirm: you should set NVIDIA_TF32_OVERRIDE=0 on the ADA GPU to disable TF32, not on the Turing GPU, right? Or did I misunderstand your meaning here?

jcdatin commented 5 months ago

How stupid of me, you are right. But I also tested the TRT EP with FP16 (which Turing supports) on both Turing and Ada and the results are still different, for the same model and same input. Also, I have seen that there is an official 1.17.3; are the TRT enqueueV3 API and --use_tensorrt_oss_parser part of it, or else in which release will they land?

jcdatin commented 5 months ago

Rebuilt onnxruntime + TRT EP with the sm_75 flag, rebuilt my application and reinstalled my ADA target. Then retested all configurations on both Turing and ADA GPUs (CUDA EP, TRT EP FP32, TRT EP TF32, TRT EP FP16). Now the results are similar in all configs (Turing and ADA)! My mistake, I guess. Btw, the Turing results in TF32 are 5% slower than with FP32 (NVIDIA_TF32_OVERRIDE=0); you were right, Turing does not support TF32. Still, results are blindingly faster with this onnxruntime release for the TRT EP, thanks to being able to run all nodes in TRT.

You can close the case. In which official release will this improvement be available?

chilo-ms commented 5 months ago

You can close the case. In which official release will this improvement be available?

We also tested the combination of TRT EP + DDS ops (NonZero, RoiAlign, NMS) support and we do see a performance gain in terms of latency in some cases. But we also saw memory usage increase when testing the FasterRCNN model from the ONNX model zoo (that model, unlike your Faster-RCNN variant, still has some nodes that cannot be placed on TRT EP). Just wondering: in your case, how is the memory usage?

Since we are still investigating where this memory usage increase comes from, we can't put this feature into the official release yet.

chilo-ms commented 5 months ago

Also, is it possible for you to ship a TRT EP built from the official release branch + the OSS onnx-tensorrt parser (which supports DDS ops)?

jcdatin commented 5 months ago

I will check the memory, good point, I did not check yet. What do you mean by "me to ship TRT EP ..."? You said you cannot get this feature into the official release.

jcdatin commented 5 months ago

Btw: I did see a crash (likely TRT resource exhaustion, according to the internet) on my newly retrained model when not all the nodes are on TRT (with onnxruntime 1.17.1), whereas with the DDS ops all on TRT I have no problem. Note that with the CUDA EP, onnxruntime 1.17.1 has no problem with my newly trained model. I guess TRT (or the onnxruntime TRT EP?) does have a problem with memory when not all the nodes fit on the GPU. For info, here was the error:

    tensorrt Could not find any implementation for node {ForeignNode[/model/my_model/roi_heads/Greater]}

chilo-ms commented 5 months ago

You said you cannot get this feature into the official release.

We are still investigating the memory increase issue seen in some cases when using DDS ops support, so we can't put it into the official release yet.

But still, if you really want this feature, you can build ORT manually and ship it.

ORT release package: TRT EP dynamically links against the TRT built-in parser. The built-in parser is released with the other TRT runtime libraries but does not support DDS ops.

Manually building the ORT release branch with DDS ops support: TRT EP statically links against the OSS onnx-tensorrt parser. The OSS parser is the same as the built-in parser but with DDS ops support.

jcdatin commented 4 months ago

Is this fix now integrated in ORT 1.18? If not, when will it be available? (It is not clean to use an unofficial version from the main branch.)

jcdatin commented 3 months ago

I am building 1.18 : I will see

jcdatin commented 3 months ago

I guess if this issue is not closed, then it is not integrated? Or shall I close it myself?

chilo-ms commented 3 months ago

@jcdatin Let me give you an update here.

The ORT 1.17 and 1.18 releases don't natively support DDS ops on TRT. The way to enable DDS support with TRT is to build ORT with the TRT OSS parser, like you did for ORT 1.17.

Unfortunately, for ORT 1.18 you first need to manually modify deps.txt, which now points to the 10.0 GA branch of onnx-tensorrt. Modify deps.txt to make it point to the main branch of onnx-tensorrt, so that it can support DDS ops on TRT:

- onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/06adf4461ac84035bee658c6cf5df39f7ab6071d.zip;46dceef659d75d276e7914a8057c2282269d5e7b
+ onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/7ecb49a435bd881b9ac4011450315192885e5cc3.zip;fa0b13a3b5420d36c3f39f0d050bcf91d1ad0063 
chilo-ms commented 3 months ago

As for the memory consumption of the TRT DDS ops support, we later did not see it consume significant memory. But still, if you can help provide a comparison of memory consumption between DDS nodes placed on TRT and DDS nodes not placed on TRT, that would be great!

Also, for TRT 10, we found an issue when running the Faster-RCNN model from the ONNX model zoo with DDS nodes placed on TRT. Nvidia is aware of this and is fixing it now.

Lastly, we will be discussing whether to enable TRT DDS support in the ORT release, so we really appreciate your feedback here.