I want to use pinned memory instead of pageable memory to reduce transfer overhead. How is this done? Based on the report below, do you have any other optimization recommendations?
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/cuda_memcpy_async.py]...
** CUDA Async Memcpy with Pageable Memory (cuda_memcpy_async):
The following APIs use PAGEABLE memory which causes asynchronous CUDA memcpy
operations to block and be executed synchronously. This leads to low GPU
utilization.
Suggestion: If applicable, use PINNED memory instead.
Duration (ns) Start (ns) Src Kind Dst Kind Bytes (MB) PID Device ID Context ID Green Context ID Stream ID API Name
------------- ----------- -------- -------- ---------- ------- --------- ---------- ---------------- --------- ---------------------
139392 10747229588 Pageable Device 1.573 3319892 3 1 65 cudaMemcpyAsync_v3020
70432 10170086276 Pageable Device 0.786 3319892 3 1 65 cudaMemcpyAsync_v3020
68896 10590939730 Pageable Device 0.786 3319892 3 1 65 cudaMemcpyAsync_v3020
68736 10431056837 Pageable Device 0.786 3319892 3 1 65 cudaMemcpyAsync_v3020
68064 10301303104 Pageable Device 0.786 3319892 3 1 65 cudaMemcpyAsync_v3020
2784 10191130019 Device Pageable 0.000 3319892 3 1 65 cudaMemcpyAsync_v3020
2112 10766534718 Device Pageable 0.000 3319892 3 1 65 cudaMemcpyAsync_v3020
2048 10599240255 Device Pageable 0.000 3319892 3 1 65 cudaMemcpyAsync_v3020
1952 10310477096 Device Pageable 0.000 3319892 3 1 65 cudaMemcpyAsync_v3020
1952 10445651628 Device Pageable 0.000 3319892 3 1 65 cudaMemcpyAsync_v3020
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/cuda_memcpy_sync.py]...
** CUDA Synchronous Memcpy (cuda_memcpy_sync):
The following are synchronous memory transfers that block the host. This does
not include host to device transfers of a memory block of 64 KB or less.
Suggestion: Use cudaMemcpy*Async() APIs instead.
Duration (ns) Start (ns) Src Kind Dst Kind Bytes (MB) PID Device ID Context ID Green Context ID Stream ID API Name
------------- ----------- -------- -------- ---------- ------- --------- ---------- ---------------- --------- ----------------
3624394 10150740955 Device Pageable 11.016 3319892 3 1 7 cudaMemcpy_v3020
2427601 8943637529 Pageable Device 19.661 3319892 3 1 7 cudaMemcpy_v3020
2363281 10681843173 Pageable Device 19.661 3319892 3 1 7 cudaMemcpy_v3020
2254802 10541563808 Pageable Device 19.661 3319892 3 1 7 cudaMemcpy_v3020
2054196 10389230725 Pageable Device 19.661 3319892 3 1 7 cudaMemcpy_v3020
1807221 8653801127 Pageable Device 15.925 3319892 3 1 7 cudaMemcpy_v3020
1791541 10261128630 Pageable Device 19.661 3319892 3 1 7 cudaMemcpy_v3020
1665270 10736044697 Device Pageable 11.016 3319892 3 1 7 cudaMemcpy_v3020
1339224 8673840652 Pageable Device 11.944 3319892 3 1 7 cudaMemcpy_v3020
1311991 10585455572 Device Pageable 11.016 3319892 3 1 7 cudaMemcpy_v3020
1225209 10293971597 Device Pageable 11.016 3319892 3 1 7 cudaMemcpy_v3020
1043418 8810549223 Pageable Device 9.437 3319892 3 1 7 cudaMemcpy_v3020
1040730 8822024001 Pageable Device 9.437 3319892 3 1 7 cudaMemcpy_v3020
1034298 8831456039 Pageable Device 9.437 3319892 3 1 7 cudaMemcpy_v3020
1034138 10424051664 Device Pageable 11.016 3319892 3 1 7 cudaMemcpy_v3020
932826 8813562453 Pageable Device 8.389 3319892 3 1 7 cudaMemcpy_v3020
866586 8664962499 Pageable Device 7.963 3319892 3 1 7 cudaMemcpy_v3020
576253 8640188058 Pageable Device 5.308 3319892 3 1 7 cudaMemcpy_v3020
575676 8667852529 Pageable Device 5.308 3319892 3 1 7 cudaMemcpy_v3020
574396 8686669150 Pageable Device 5.308 3319892 3 1 7 cudaMemcpy_v3020
573532 8635239129 Pageable Device 5.308 3319892 3 1 7 cudaMemcpy_v3020
573181 8670750399 Pageable Device 5.308 3319892 3 1 7 cudaMemcpy_v3020
492893 8679199148 Pageable Device 4.719 3319892 3 1 7 cudaMemcpy_v3020
450109 8819611856 Pageable Device 4.194 3319892 3 1 7 cudaMemcpy_v3020
449853 8833241757 Pageable Device 4.194 3319892 3 1 7 cudaMemcpy_v3020
445885 8826315495 Pageable Device 4.194 3319892 3 1 7 cudaMemcpy_v3020
442398 8834404309 Pageable Device 4.194 3319892 3 1 7 cudaMemcpy_v3020
441917 8803559090 Pageable Device 4.194 3319892 3 1 7 cudaMemcpy_v3020
316958 8630075256 Pageable Device 2.986 3319892 3 1 7 cudaMemcpy_v3020
299358 8658523786 Pageable Device 2.986 3319892 3 1 7 cudaMemcpy_v3020
299070 8661013819 Pageable Device 2.986 3319892 3 1 7 cudaMemcpy_v3020
296990 8684124269 Pageable Device 2.986 3319892 3 1 7 cudaMemcpy_v3020
294910 8677164344 Pageable Device 2.986 3319892 3 1 7 cudaMemcpy_v3020
292254 8690762949 Pageable Device 2.986 3319892 3 1 7 cudaMemcpy_v3020
271166 8628805152 Pageable Device 2.654 3319892 3 1 7 cudaMemcpy_v3020
236766 8814933229 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
235167 8816324004 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
234590 8818658518 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
234302 8642206478 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
233503 8825265101 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
233310 8829448756 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
232510 8688164085 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
231359 8835947788 Pageable Device 2.359 3319892 3 1 7 cudaMemcpy_v3020
208031 8828207867 Pageable Device 2.097 3319892 3 1 7 cudaMemcpy_v3020
205663 8824605361 Pageable Device 2.097 3319892 3 1 7 cudaMemcpy_v3020
156320 8659611459 Pageable Device 1.769 3319892 3 1 7 cudaMemcpy_v3020
155839 8680366404 Pageable Device 1.769 3319892 3 1 7 cudaMemcpy_v3020
153631 8622599270 Pageable Device 1.769 3319892 3 1 7 cudaMemcpy_v3020
118336 8626308751 Pageable Device 1.327 3319892 3 1 7 cudaMemcpy_v3020
118207 8689864170 Pageable Device 1.327 3319892 3 1 7 cudaMemcpy_v3020
Only the top 50 results are displayed. More data may be available.
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/cuda_memset_sync.py]...
** CUDA Synchronous Memset (cuda_memset_sync):
There were no problems detected related to synchronization APIs.
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/cuda_api_sync.py]...
** CUDA Synchronization APIs (cuda_api_sync):
The following are synchronization APIs that block the host until all issued
CUDA calls are complete.
Suggestions:
1. Avoid excessive use of synchronization.
2. Use asynchronous CUDA event calls, such as cudaStreamWaitEvent() and
cudaEventSynchronize(), to prevent host synchronization.
Duration (ns) Start (ns) PID TID API Name
------------- ----------- ------- ------- ---------------------------
716152 10149921840 3319892 3319892 cudaStreamSynchronize_v3020
97341 8820586915 3319892 3319892 cudaStreamSynchronize_v3020
96950 8817616389 3319892 3319892 cudaStreamSynchronize_v3020
96910 8815553951 3319892 3319892 cudaStreamSynchronize_v3020
96859 8816843971 3319892 3319892 cudaStreamSynchronize_v3020
96667 8827067807 3319892 3319892 cudaStreamSynchronize_v3020
96486 8804318562 3319892 3319892 cudaStreamSynchronize_v3020
96332 8828899699 3319892 3319892 cudaStreamSynchronize_v3020
96120 8817215611 3319892 3319892 cudaStreamSynchronize_v3020
95916 8835356784 3319892 3319892 cudaStreamSynchronize_v3020
95511 10423859074 3319892 3319892 cudaStreamSynchronize_v3020
95358 8680837669 3319892 3319892 cudaStreamSynchronize_v3020
95335 8830008957 3319892 3319892 cudaStreamSynchronize_v3020
95059 8834756823 3319892 3319892 cudaStreamSynchronize_v3020
95033 10262832301 3319892 3319892 cudaStreamSynchronize_v3020
94942 8826671643 3319892 3319892 cudaStreamSynchronize_v3020
94803 8828326309 3319892 3319892 cudaStreamSynchronize_v3020
94693 8833602053 3319892 3319892 cudaStreamSynchronize_v3020
94562 8822975316 3319892 3319892 cudaStreamSynchronize_v3020
94423 8819972705 3319892 3319892 cudaStreamSynchronize_v3020
94147 8803912394 3319892 3319892 cudaStreamSynchronize_v3020
94050 8823837674 3319892 3319892 cudaStreamSynchronize_v3020
94016 8824722220 3319892 3319892 cudaStreamSynchronize_v3020
93983 8832401494 3319892 3319892 cudaStreamSynchronize_v3020
93941 8811504167 3319892 3319892 cudaStreamSynchronize_v3020
93547 8814407163 3319892 3319892 cudaStreamSynchronize_v3020
92413 8635725240 3319892 3319892 cudaStreamSynchronize_v3020
91310 8671237171 3319892 3319892 cudaStreamSynchronize_v3020
91179 8668341929 3319892 3319892 cudaStreamSynchronize_v3020
90694 8669198121 3319892 3319892 cudaStreamSynchronize_v3020
90596 8687157903 3319892 3319892 cudaStreamSynchronize_v3020
90373 10293808926 3319892 3319892 cudaStreamSynchronize_v3020
89683 8640679794 3319892 3319892 cudaStreamSynchronize_v3020
87970 8689899373 3319892 3319892 cudaStreamSynchronize_v3020
87033 8672031522 3319892 3319892 cudaStreamSynchronize_v3020
86955 8625256022 3319892 3319892 cudaStreamSynchronize_v3020
86839 8685520647 3319892 3319892 cudaStreamSynchronize_v3020
86726 8669632750 3319892 3319892 cudaStreamSynchronize_v3020
86667 8641657526 3319892 3319892 cudaStreamSynchronize_v3020
86633 8818812120 3319892 3319892 cudaStreamSynchronize_v3020
86492 8684960530 3319892 3319892 cudaStreamSynchronize_v3020
86224 8661860266 3319892 3319892 cudaStreamSynchronize_v3020
86083 8643460858 3319892 3319892 cudaStreamSynchronize_v3020
86001 8676570903 3319892 3319892 cudaStreamSynchronize_v3020
85726 8636526451 3319892 3319892 cudaStreamSynchronize_v3020
85636 8829601418 3319892 3319892 cudaStreamSynchronize_v3020
85343 8816478835 3319892 3319892 cudaStreamSynchronize_v3020
85232 8655528119 3319892 3319892 cudaStreamSynchronize_v3020
85112 8643025339 3319892 3319892 cudaStreamSynchronize_v3020
84880 8688316637 3319892 3319892 cudaStreamSynchronize_v3020
Only the top 50 results are displayed. More data may be available.
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/gpu_gaps.py]...
** GPU Gaps (gpu_gaps):
The following are ranges where a GPU is idle for more than 500ms. Addressing
these gaps might improve application performance.
Suggestions:
1. Use CPU sampling data, OS Runtime blocked state backtraces, and/or OS
Runtime APIs related to thread synchronization to understand if a sluggish or
blocked CPU is causing the gaps.
2. Add NVTX annotations to CPU code to understand the reason behind the gaps.
Row# Duration (ns) Start (ns) PID Device ID Context ID
---- ------------- ---------- ------- --------- ----------
1 1141573011 8951073195 3319892 3 1
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/gpu_time_util.py]...
** GPU Time Utilization (gpu_time_util):
The following are time regions with an average GPU utilization below 50%.
Addressing the gaps might improve application performance.
Suggestions:
1. Use CPU sampling data, OS Runtime blocked state backtraces, and/or OS
Runtime APIs related to thread synchronization to understand if a sluggish or
blocked CPU is causing the gaps.
2. Add NVTX annotations to CPU code to understand the reason behind the gaps.
Row# In-Use (%) Duration (ns) Start (ns) PID Device ID Context ID
---- ---------- ------------- ---------- ------- --------- ----------
1 9.3 2192698573 8573838257 3319892 3 1
Processing [report3.sqlite] with [/opt/nvidia/nsight-systems/2024.4.1/host-linux-x64/rules/dx12_mem_ops.py]...
SKIPPED: report3.sqlite could not be analyzed because it does not contain the required DX12 data. Does the application use DX12 APIs?
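The pageable host-to-device copies flagged above typically come from passing NumPy arrays straight into `session.run()`, which makes ONNX Runtime stage them through pageable host memory. One common mitigation is IOBinding: copy the input to the device (or into an `OrtValue`) once and keep intermediate results on the GPU, so the pageable `cudaMemcpy` calls in the report disappear. The sketch below is untested here (it needs `onnxruntime-gpu` and a CUDA device), and the model path and the `input`/`output` tensor names are placeholders, not taken from the report:

```python
# Hedged sketch: requires onnxruntime-gpu and a CUDA-capable device.
# "model.onnx", "input", and "output" are hypothetical placeholder names.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

x = np.random.rand(1, 3, 640, 640).astype(np.float32)

# Copy the input to the GPU up front; reusing this OrtValue avoids a
# pageable host-to-device memcpy on every run.
x_dev = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

io = sess.io_binding()
io.bind_ortvalue_input("input", x_dev)
# Keep the output on the device too, so no device-to-host copy happens
# unless you explicitly fetch it (useful between your two model stages).
io.bind_output("output", device_type="cuda", device_id=0)

sess.run_with_iobinding(io)
result = io.copy_outputs_to_cpu()[0]
```

For a two-stage pipeline, binding stage 1's output OrtValue directly as stage 2's input keeps the crops on the GPU and removes both the device-to-host and host-to-device legs of the round trip.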
To reproduce
I have a two-stage model that first localizes and then classifies the crops.
providers = [
    (
        "CUDAExecutionProvider",
        {
            "device_id": 3,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "gpu_mem_limit": 2 * 1024 * 1024 * 1024,
            "cudnn_conv_algo_search": "DEFAULT",
            "do_copy_in_default_stream": True,
        },
    ),
    "CPUExecutionProvider",
]
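For reference, `gpu_mem_limit` is specified in bytes, so the 2 GiB cap used in this provider config works out to:

```python
# gpu_mem_limit is a byte count; 2 GiB expressed explicitly:
gpu_mem_limit = 2 * 1024 * 1024 * 1024
print(gpu_mem_limit)  # 2147483648
```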
Urgency
No response
Platform
Linux
OS Version
20.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.18
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.04
Model File
No response
Is this a quantized model?
No