I want to use pinned memory instead of pageable memory to reduce transfer overhead. How is this done? Based on the report below, do you have any other optimization recommendations?
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/cuda_memcpy_async.py]...
** CUDA Async Memcpy with Pageable Memory (cuda_memcpy_async):
The following APIs use PAGEABLE memory which causes asynchronous CUDA memcpy
operations to block and be executed synchronously. This leads to low GPU
utilization.
Suggestion: If applicable, use PINNED memory instead.
Duration (ns) Start (ns) Src Kind Dst Kind Bytes (MB) PID Device ID Context ID Green Context ID Stream ID API Name
------------- ----------- -------- -------- ---------- ------- --------- ---------- ---------------- --------- ---------------------
139392 10747229588 Pageable Device 1.573 3319892 3 1 65 cudaMemcpyAsync_v3020
70432 10170086276 Pageable Device 0.786 3319892 3 1 65 cudaMemcpyAsync_v3020
68896 10590939730 Pageable Device 0.786 3319892 3 1 65 cudaMemcpyAsync_v3020
68736 10431056837 Pageable Device 0.786 3319892 3 1 65 cudaMemcpyAsync_v3020
68064 10301303104 Pageable Device 0.786 3319892 3 1 65 cudaMemcpyAsync_v3020
2784 10191130019 Device Pageable 0.000 3319892 3 1 65 cudaMemcpyAsync_v3020
2112 10766534718 Device Pageable 0.000 3319892 3 1 65 cudaMemcpyAsync_v3020
2048 10599240255 Device Pageable 0.000 3319892 3 1 65 cudaMemcpyAsync_v3020
1952 10310477096 Device Pageable 0.000 3319892 3 1 65 cudaMemcpyAsync_v3020
1952 10445651628 Device Pageable 0.000 3319892 3 1 65 cudaMemcpyAsync_v3020
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/cuda_memcpy_sync.py]...
** CUDA Synchronous Memcpy (cuda_memcpy_sync):
The following are synchronous memory transfers that block the host. This does
not include host to device transfers of a memory block of 64 KB or less.
Suggestion: Use cudaMemcpy*Async() APIs instead.
Duration (ns) Start (ns) Src Kind Dst Kind Bytes (MB) PID Device ID Context ID Green Context ID Stream ID API Name
------------- ----------- -------- -------- ---------- ------- --------- ---------- ---------------- --------- ----------------
3624394 10150740955 Device Pageable 11.016 3319892 3 1 7 cudaMemcpy_v3020
2427601 8943637529 Pageable Device 19.661 3319892 3 1 7 cudaMemcpy_v3020
2363281 10681843173 Pageable Device 19.661 3319892 3 1 7 cudaMemcpy_v3020
2254802 10541563808 Pageable Device 19.661 3319892 3 1 7 cudaMemcpy_v3020
2054196 10389230725 Pageable Device 19.661 3319892 3 1 7 cudaMemcpy_v3020
1807221 8653801127 Pageable Device 15.925 3319892 3 1 7 cudaMemcpy_v3020
1791541 10261128630 Pageable Device 19.661 3319892 3 1 7 cudaMemcpy_v3020
1665270 10736044697 Device Pageable 11.016 3319892 3 1 7 cudaMemcpy_v3020
1339224 8673840652 Pageable Device 11.944 3319892 3 1 7 cudaMemcpy_v3020
1311991 10585455572 Device Pageable 11.016 3319892 3 1 7 cudaMemcpy_v3020
1225209 10293971597 Device Pageable 11.016 3319892 3 1 7 cudaMemcpy_v3020
1043418 8810549223 Pageable Device 9.437 3319892 3 1 7 cudaMemcpy_v3020
1040730 8822024001 Pageable Device 9.437 3319892 3 1 7 cudaMemcpy_v3020
1034298 8831456039 Pageable Device 9.437 3319892 3 1 7 cudaMemcpy_v3020
1034138 10424051664 Device Pageable 11.016 3319892 3 1 7 cudaMemcpy_v3020
932826 8813562453 Pageable Device 8.389 3319892 3 1 7 cudaMemcpy_v3020
866586 8664962499 Pageable Device 7.963 3319892 3 1 7 cudaMemcpy_v3020
576253 8640188058 Pageable Device 5.308 3319892 3 1 7 cudaMemcpy_v3020
575676 8667852529 Pageable Device 5.308 3319892 3 1 7 cudaMemcpy_v3020
574396 8686669150 Pageable Device 5.308 3319892 3 1 7 cudaMemcpy_v3020
573532 8635239129 Pageable Device 5.308 3319892 3 1 7 cudaMemcpy_v3020
573181 8670750399 Pageable Device 5.308 3319892 3 1 7 cudaMemcpy_v3020
492893 8679199148 Pageable Device 4.719 3319892 3 1 7 cudaMemcpy_v3020
450109 8819611856 Pageable Device 4.194 3319892 3 1 7 cudaMemcpy_v3020
449853 8833241757 Pageable Device 4.194 3319892 3 1 7 cudaMemcpy_v3020
445885 8826315495 Pageable Device 4.194 3319892 3 1 7 cudaMemcpy_v3020
442398 8834404309 Pageable Device 4.194 3319892 3 1 7 cudaMemcpy_v3020
441917 8803559090 Pageable Device 4.194 3319892 3 1 7 cudaMemcpy_v3020
316958 8630075256 Pageable Device 2.986 3319892 3 1 7 cudaMemcpy_v3020
299358 8658523786 Pageable Device 2.986 3319892 3 1 7 cudaMemcpy_v3020
299070 8661013819 Pageable Device 2.986 3319892 3 1 7 cudaMemcpy_v3020
296990 8684124269 Pageable Device 2.986 3319892 3 1 7 cudaMemcpy_v3020
294910 8677164344 Pageable Device 2.986 3319892 3 1 7 cudaMemcpy_v3020
292254 8690762949 Pageable Device 2.986 3319892 3 1 7 cudaMemcpy_v3020
271166 8628805152 Pageable Device 2.654 3319892 3 1 7 cudaMemcpy_v3020
236766 8814933229 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
235167 8816324004 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
234590 8818658518 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
234302 8642206478 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
233503 8825265101 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
233310 8829448756 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
232510 8688164085 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
231359 8835947788 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
208031 8828207867 Pageable Device 2.097 3319892 3 1 7 cudaMemcpy_v3020
205663 8824605361 Pageable Device 2.097 3319892 3 1 7 cudaMemcpy_v3020
156320 8659611459 Pageable Device 1.769 3319892 3 1 7 cudaMemcpy_v3020
155839 8680366404 Pageable Device 1.769 3319892 3 1 7 cudaMemcpy_v3020
153631 8622599270 Pageable Device 1.769 3319892 3 1 7 cudaMemcpy_v3020
118336 8626308751 Pageable Device 1.327 3319892 3 1 7 cudaMemcpy_v3020
118207 8689864170 Pageable Device 1.327 3319892 3 1 7 cudaMemcpy_v3020
Only the top 50 results are displayed. More data may be available.
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/cuda_memset_sync.py]...
** CUDA Synchronous Memset (cuda_memset_sync):
There were no problems detected related to synchronization APIs.
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/cuda_api_sync.py]...
** CUDA Synchronization APIs (cuda_api_sync):
The following are synchronization APIs that block the host until all issued
CUDA calls are complete.
Suggestions:
1. Avoid excessive use of synchronization.
2. Use asynchronous CUDA event calls, such as cudaStreamWaitEvent() and
cudaEventSynchronize(), to prevent host synchronization.
Duration (ns) Start (ns) PID TID API Name
------------- ----------- ------- ------- ---------------------------
716152 10149921840 3319892 3319892 cudaStreamSynchronize_v3020
97341 8820586915 3319892 3319892 cudaStreamSynchronize_v3020
96950 8817616389 3319892 3319892 cudaStreamSynchronize_v3020
96910 8815553951 3319892 3319892 cudaStreamSynchronize_v3020
96859 8816843971 3319892 3319892 cudaStreamSynchronize_v3020
96667 8827067807 3319892 3319892 cudaStreamSynchronize_v3020
96486 8804318562 3319892 3319892 cudaStreamSynchronize_v3020
96332 8828899699 3319892 3319892 cudaStreamSynchronize_v3020
96120 8817215611 3319892 3319892 cudaStreamSynchronize_v3020
95916 8835356784 3319892 3319892 cudaStreamSynchronize_v3020
95511 10423859074 3319892 3319892 cudaStreamSynchronize_v3020
95358 8680837669 3319892 3319892 cudaStreamSynchronize_v3020
95335 8830008957 3319892 3319892 cudaStreamSynchronize_v3020
95059 8834756823 3319892 3319892 cudaStreamSynchronize_v3020
95033 10262832301 3319892 3319892 cudaStreamSynchronize_v3020
94942 8826671643 3319892 3319892 cudaStreamSynchronize_v3020
94803 8828326309 3319892 3319892 cudaStreamSynchronize_v3020
94693 8833602053 3319892 3319892 cudaStreamSynchronize_v3020
94562 8822975316 3319892 3319892 cudaStreamSynchronize_v3020
94423 8819972705 3319892 3319892 cudaStreamSynchronize_v3020
94147 8803912394 3319892 3319892 cudaStreamSynchronize_v3020
94050 8823837674 3319892 3319892 cudaStreamSynchronize_v3020
94016 8824722220 3319892 3319892 cudaStreamSynchronize_v3020
93983 8832401494 3319892 3319892 cudaStreamSynchronize_v3020
93941 8811504167 3319892 3319892 cudaStreamSynchronize_v3020
93547 8814407163 3319892 3319892 cudaStreamSynchronize_v3020
92413 8635725240 3319892 3319892 cudaStreamSynchronize_v3020
91310 8671237171 3319892 3319892 cudaStreamSynchronize_v3020
91179 8668341929 3319892 3319892 cudaStreamSynchronize_v3020
90694 8669198121 3319892 3319892 cudaStreamSynchronize_v3020
90596 8687157903 3319892 3319892 cudaStreamSynchronize_v3020
90373 10293808926 3319892 3319892 cudaStreamSynchronize_v3020
89683 8640679794 3319892 3319892 cudaStreamSynchronize_v3020
87970 8689899373 3319892 3319892 cudaStreamSynchronize_v3020
87033 8672031522 3319892 3319892 cudaStreamSynchronize_v3020
86955 8625256022 3319892 3319892 cudaStreamSynchronize_v3020
86839 8685520647 3319892 3319892 cudaStreamSynchronize_v3020
86726 8669632750 3319892 3319892 cudaStreamSynchronize_v3020
86667 8641657526 3319892 3319892 cudaStreamSynchronize_v3020
86633 8818812120 3319892 3319892 cudaStreamSynchronize_v3020
86492 8684960530 3319892 3319892 cudaStreamSynchronize_v3020
86224 8661860266 3319892 3319892 cudaStreamSynchronize_v3020
86083 8643460858 3319892 3319892 cudaStreamSynchronize_v3020
86001 8676570903 3319892 3319892 cudaStreamSynchronize_v3020
85726 8636526451 3319892 3319892 cudaStreamSynchronize_v3020
85636 8829601418 3319892 3319892 cudaStreamSynchronize_v3020
85343 8816478835 3319892 3319892 cudaStreamSynchronize_v3020
85232 8655528119 3319892 3319892 cudaStreamSynchronize_v3020
85112 8643025339 3319892 3319892 cudaStreamSynchronize_v3020
84880 8688316637 3319892 3319892 cudaStreamSynchronize_v3020
Only the top 50 results are displayed. More data may be available.
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/gpu_gaps.py]...
** GPU Gaps (gpu_gaps):
The following are ranges where a GPU is idle for more than 500ms. Addressing
these gaps might improve application performance.
Suggestions:
1. Use CPU sampling data, OS Runtime blocked state backtraces, and/or OS
Runtime APIs related to thread synchronization to understand if a sluggish or
blocked CPU is causing the gaps.
2. Add NVTX annotations to CPU code to understand the reason behind the gaps.
Row# Duration (ns) Start (ns) PID Device ID Context ID
---- ------------- ---------- ------- --------- ----------
1 1141573011 8951073195 3319892 3 1
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/gpu_time_util.py]...
** GPU Time Utilization (gpu_time_util):
The following are time regions with an average GPU utilization below 50%.
Addressing the gaps might improve application performance.
Suggestions:
1. Use CPU sampling data, OS Runtime blocked state backtraces, and/or OS
Runtime APIs related to thread synchronization to understand if a sluggish or
blocked CPU is causing the gaps.
2. Add NVTX annotations to CPU code to understand the reason behind the gaps.
Row# In-Use (%) Duration (ns) Start (ns) PID Device ID Context ID
---- ---------- ------------- ---------- ------- --------- ----------
1 9.3 2192698573 8573838257 3319892 3 1
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/dx12_mem_ops.py]...
SKIPPED: report3.sqlite could not be analyzed because it does not contain the required DX12 data. Does the application use DX12 APIs?
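The pageable host-to-device copies flagged above typically come from passing NumPy arrays straight into `session.run()`, which makes ONNX Runtime stage them through pageable host memory. One common mitigation is IOBinding: copy the input to the device (or into an `OrtValue`) once and keep intermediate results on the GPU, so the pageable `cudaMemcpy` calls in the report disappear. The sketch below is untested here (it needs `onnxruntime-gpu` and a CUDA device), and the model path and the `input`/`output` tensor names are placeholders, not taken from the report:

```python
# Hedged sketch: requires onnxruntime-gpu and a CUDA-capable device.
# "model.onnx", "input", and "output" are hypothetical placeholder names.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

x = np.random.rand(1, 3, 640, 640).astype(np.float32)

# Copy the input to the GPU up front; reusing this OrtValue avoids a
# pageable host-to-device memcpy on every run.
x_dev = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

io = sess.io_binding()
io.bind_ortvalue_input("input", x_dev)
# Keep the output on the device too, so no device-to-host copy happens
# unless you explicitly fetch it (useful between your two model stages).
io.bind_output("output", device_type="cuda", device_id=0)

sess.run_with_iobinding(io)
result = io.copy_outputs_to_cpu()[0]
```

For a two-stage pipeline, binding stage 1's output OrtValue directly as stage 2's input keeps the crops on the GPU and removes both the device-to-host and host-to-device legs of the round trip.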
To reproduce
I have a two-stage model that first localizes and then classifies the crops.
providers = [
    (
        "CUDAExecutionProvider",
        {
            "device_id": 3,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "gpu_mem_limit": 2 * 1024 * 1024 * 1024,
            "cudnn_conv_algo_search": "DEFAULT",
            "do_copy_in_default_stream": True,
        },
    ),
    "CPUExecutionProvider",
]
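For reference, `gpu_mem_limit` is specified in bytes, so the 2 GiB cap used in this provider config works out to:

```python
# gpu_mem_limit is a byte count; 2 GiB expressed explicitly:
gpu_mem_limit = 2 * 1024 * 1024 * 1024
print(gpu_mem_limit)  # 2147483648
```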
Urgency
No response
Platform
Linux
OS Version
20.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.18
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.04
Model File
No response
Is this a quantized model?
No