microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Question or BUG] ONNX Runtime CUDA Sessions in Unity Produce Empty Outputs When Running Multiple Models Sequentially on a Single Graphic Card #22146

Open abysslover opened 1 day ago

abysslover commented 1 day ago

Describe the issue

I encountered an issue with ONNX Runtime when running CUDA sessions in Unity. In Python, I can create three (multiple) CUDA sessions for my models on a single graphics card and run them sequentially for inference without any issues. The GPU is utilized correctly, and each model returns the expected predictions.

However, when attempting to replicate this setup in Unity:

  1. If I create only one CUDA session, the inference runs correctly, and the output is as expected.
  2. If I create two CUDA sessions and run the models sequentially, the inference runs without errors, but the output values are empty.

The same models work perfectly in Python with multiple CUDA sessions, but in Unity, only the first CUDA session seems to work as intended.

Additional context:

    • GPU Model: NVIDIA A6000
    • GPU Memory: 48GB
    • Unity Version: 2023.2.20f1

To reproduce

  1. Create CUDA sessions for two models in Unity using ONNX Runtime.
  2. Load the models into the CUDA sessions.
  3. Run the models sequentially for inference.
  4. Observe that while the output values are produced, they are empty.

Python code equivalent (working):

See: https://github.com/PINTO0309/facemesh_onnx_tensorrt/blob/main/demo_video.py

Unity code (not working as expected):

if (!OrtEnv.IsCreated)
{
    var envOptions = new EnvironmentCreationOptions
    {
        logId = "FaceDetect",
        logLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE,
        loggingFunction = MyCustomLoggingFunction,
        threadOptions = null,
    };
    OrtEnv.CreateInstanceWithOptions(ref envOptions);
}

OrtCUDAProviderOptions cudaOptionFaceDetect = new OrtCUDAProviderOptions();
var providerOptionsDict = new Dictionary<string, string>
{
    ["device_id"] = "0",
    ["gpu_mem_limit"] = "2147483648",
};

cudaOptionFaceDetect.UpdateOptions(providerOptionsDict);

//************************ Face Detection Model **********************
faceDetectSessionOptions = SessionOptions.MakeSessionOptionWithCudaProvider(cudaOptionFaceDetect);
faceDetectSessionOptions.GraphOptimizationLevel = GraphOptimizationLevel.ORT_DISABLE_ALL;
faceDetectSessionOptions.LogVerbosityLevel = 3;
faceDetectSessionOptions.LogSeverityLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE;

faceDetect = new FaceDetect(faceDetectModel.bytes, faceDetectOptions, faceDetectSessionOptions);

// ****************** Face Mesh Model ********************
OrtCUDAProviderOptions cudaOptionFaceMesh = new OrtCUDAProviderOptions();
cudaOptionFaceMesh.UpdateOptions(providerOptionsDict);
faceMeshSessionOptions = SessionOptions.MakeSessionOptionWithCudaProvider(cudaOptionFaceDetect);
faceMeshSessionOptions.GraphOptimizationLevel = GraphOptimizationLevel.ORT_DISABLE_ALL;
faceMeshSessionOptions.LogVerbosityLevel = 3;
faceMeshSessionOptions.LogSeverityLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE;

faceMesh = new FaceMesh(faceMeshModel.bytes, faceMeshOptions, faceMeshSessionOptions);

Explanation of Behavior Change

When faceMesh is commented out: The code initializes and runs only the face detection model (faceDetect), so the application performs face detection without the more detailed face mesh analysis. With only one model loaded, ONNX Runtime manages a single CUDA session, and inference works as expected.

When faceMesh is not commented out: Both the face detection model (faceDetect) and the face mesh model (faceMesh) are initialized. This creates two CUDA sessions from the same OrtCUDAProviderOptions instance (cudaOptionFaceDetect is passed to both MakeSessionOptionWithCudaProvider calls). Initializing multiple sessions with the same CUDA provider settings may cause a conflict inside ONNX Runtime, which could explain why the output values are empty when both models are run sequentially in Unity.
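One way to test the shared-options hypothesis is to give every session its own provider options and session options. The sketch below is a minimal illustration, not a confirmed fix: `CudaSessionFactory` and `Create` are hypothetical names, and whether separate option objects change the behavior reported here has not been verified.

```csharp
using System.Collections.Generic;
using System.Globalization;
using Microsoft.ML.OnnxRuntime;

// Hypothetical helper (not part of ONNX Runtime): builds each CUDA session
// from freshly created option objects so that no options are shared.
public static class CudaSessionFactory
{
    public static InferenceSession Create(byte[] modelBytes, int deviceId, long gpuMemLimitBytes)
    {
        // Same keys as in the report; values are passed to ONNX Runtime as strings.
        var providerOptions = new Dictionary<string, string>
        {
            ["device_id"] = deviceId.ToString(CultureInfo.InvariantCulture),
            ["gpu_mem_limit"] = gpuMemLimitBytes.ToString(CultureInfo.InvariantCulture),
        };

        using (var cudaOptions = new OrtCUDAProviderOptions())
        {
            cudaOptions.UpdateOptions(providerOptions);

            // Assumption: the native session no longer needs the options
            // objects once the InferenceSession has been constructed.
            using (var sessionOptions = SessionOptions.MakeSessionOptionWithCudaProvider(cudaOptions))
            {
                sessionOptions.GraphOptimizationLevel = GraphOptimizationLevel.ORT_DISABLE_ALL;
                return new InferenceSession(modelBytes, sessionOptions);
            }
        }
    }
}
```

With this helper, the two sessions would be created as `CudaSessionFactory.Create(faceDetectModel.bytes, 0, 2147483648L)` and `CudaSessionFactory.Create(faceMeshModel.bytes, 0, 2147483648L)`. If the outputs are still empty, the shared options object is unlikely to be the cause.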

Important logs:

[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:583 onnxruntime::InferenceSession::TraceSessionOptions): Session Options {  execution_mode:0 execution_order:DEFAULT enable_profiling:0 optimized_model_filepath: enable_mem_pattern:1 enable_mem_reuse:1 enable_cpu_mem_arena:1 profile_file_prefix:onnxruntime_profile_ session_logid: session_log_severity_level:0 session_log_verbosity_level:10 max_num_graph_transformation_steps:10 graph_optimization_level:0 intra_op_param:OrtThreadPoolParams { thread_pool_size: 0 auto_set_affinity: 0 allow_spinning: 1 dynamic_block_base_: 0 stack_size: 0 affinity_str:  set_denormal_as_zero: 0 } inter_op_param:OrtThreadPoolParams { thread_pool_size: 0 auto_set_affinity: 0 allow_spinning: 1 dynamic_block_base_: 0 stack_size: 0 affinity_str:  set_denormal_as_zero: 0 } use_per_session_threads:1 thread_pool_allow_spinning:1 use_deterministic_compute:0 config_options: {  } }
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:491 onnxruntime::InferenceSession::ConstructorCommon): Creating and using per session threadpools since use_per_session_threads_ is true
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:509 onnxruntime::InferenceSession::ConstructorCommon): Dynamic block base set to 0
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:1669 onnxruntime::InferenceSession::Initialize): Initializing session.
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:1706 onnxruntime::InferenceSession::Initialize): Adding default CPU execution provider.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena): Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 2147483648 arena_extend_strategy: 0
[FaceDetect-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena): Creating 21 bins of max chunk size 256 to 268435456
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena): Creating BFCArena for CudaPinned with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
[FaceDetect-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena): Creating 21 bins of max chunk size 256 to 268435456
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena): Creating BFCArena for Cpu with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
[FaceDetect-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena): Creating 21 bins of max chunk size 256 to 268435456
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_partitioner.cc:898 onnxruntime::GraphPartitioner::InlineFunctionsAOT): This model does not have any local functions defined. AOT Inlining is not performed
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer RemoveDuplicateCastTransformer modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer CastFloat16Transformer modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer MemcpyTransformer modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:1148 onnxruntime::VerifyEachNodeIsAssignedToAnEp): Node placements
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:1151 onnxruntime::VerifyEachNodeIsAssignedToAnEp):  All nodes placed on [CUDAExecutionProvider]. Number of nodes: 94
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:128 onnxruntime::SessionState::CreateGraphInfo): SaveMLValueNameIndexMapping
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:174 onnxruntime::SessionState::CreateGraphInfo): Done saving OrtValue mappings.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (cuda_execution_provider.cc:184 onnxruntime::CUDAExecutionProvider::PerThreadContext::PerThreadContext): cuDNN version: 90400
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (allocation_planner.cc:2567 onnxruntime::IGraphPartitioner::CreateGraphPartitioner): Use DeviceBasedPartition as default
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (session_state_utils.cc:276 onnxruntime::session_state_utils::SaveInitializedTensors): Saving initialized tensors.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cuda. bin_num:0 (requested) num_bytes: 144 (actual) rounded_bytes:256
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 1048576 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 1048576
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000001410024400 to 0000001410124400
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cpu. bin_num:0 (requested) num_bytes: 144 (actual) rounded_bytes:256
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 1048576 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 1048576
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000026205E21080 to 0000026205F21080
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (session_state_utils.cc:427 onnxruntime::session_state_utils::SaveInitializedTensors): Done saving initialized tensors
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:2106 onnxruntime::InferenceSession::Initialize): Session successfully initialized.
Version: 1.20.0
Input:
[input_g1] shape: 1,3,128,128, type: System.Single isTensor: True

Output:
[classificators_g1] shape: 1,896,1, type: System.Single isTensor: True
[regressors_g1] shape: 1,896,16, type: System.Single isTensor: True

[ImageInference.AllocateTensors] Input: input_g1: shape: 1,3,128,128, type: System.Single isTensor: True
[ImageInference.AllocateTensors] Input: classificators_g1: shape: 1,896,1, type: System.Single isTensor: True
[ImageInference.AllocateTensors] Input: regressors_g1: shape: 1,896,16, type: System.Single isTensor: True
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:583 onnxruntime::InferenceSession::TraceSessionOptions): Session Options {  execution_mode:0 execution_order:DEFAULT enable_profiling:0 optimized_model_filepath: enable_mem_pattern:1 enable_mem_reuse:1 enable_cpu_mem_arena:1 profile_file_prefix:onnxruntime_profile_ session_logid: session_log_severity_level:0 session_log_verbosity_level:10 max_num_graph_transformation_steps:10 graph_optimization_level:0 intra_op_param:OrtThreadPoolParams { thread_pool_size: 0 auto_set_affinity: 0 allow_spinning: 1 dynamic_block_base_: 0 stack_size: 0 affinity_str:  set_denormal_as_zero: 0 } inter_op_param:OrtThreadPoolParams { thread_pool_size: 0 auto_set_affinity: 0 allow_spinning: 1 dynamic_block_base_: 0 stack_size: 0 affinity_str:  set_denormal_as_zero: 0 } use_per_session_threads:1 thread_pool_allow_spinning:1 use_deterministic_compute:0 config_options: {  } }
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:491 onnxruntime::InferenceSession::ConstructorCommon): Creating and using per session threadpools since use_per_session_threads_ is true
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:509 onnxruntime::InferenceSession::ConstructorCommon): Dynamic block base set to 0
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:1669 onnxruntime::InferenceSession::Initialize): Initializing session.
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:1706 onnxruntime::InferenceSession::Initialize): Adding default CPU execution provider.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena): Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 2147483648 arena_extend_strategy: 0
[FaceDetect-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena): Creating 21 bins of max chunk size 256 to 268435456
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena): Creating BFCArena for CudaPinned with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
[FaceDetect-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena): Creating 21 bins of max chunk size 256 to 268435456
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena): Creating BFCArena for Cpu with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
[FaceDetect-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena): Creating 21 bins of max chunk size 256 to 268435456
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_partitioner.cc:898 onnxruntime::GraphPartitioner::InlineFunctionsAOT): This model does not have any local functions defined. AOT Inlining is not performed
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer RemoveDuplicateCastTransformer modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer CastFloat16Transformer modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer MemcpyTransformer modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:1148 onnxruntime::VerifyEachNodeIsAssignedToAnEp): Node placements
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:1151 onnxruntime::VerifyEachNodeIsAssignedToAnEp):  All nodes placed on [CUDAExecutionProvider]. Number of nodes: 498
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:128 onnxruntime::SessionState::CreateGraphInfo): SaveMLValueNameIndexMapping
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:174 onnxruntime::SessionState::CreateGraphInfo): Done saving OrtValue mappings.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (cuda_execution_provider.cc:184 onnxruntime::CUDAExecutionProvider::PerThreadContext::PerThreadContext): cuDNN version: 90400
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (allocation_planner.cc:2567 onnxruntime::IGraphPartitioner::CreateGraphPartitioner): Use DeviceBasedPartition as default
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (session_state_utils.cc:276 onnxruntime::session_state_utils::SaveInitializedTensors): Saving initialized tensors.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cuda. bin_num:1 (requested) num_bytes: 512 (actual) rounded_bytes:512
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 1048576 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 1048576
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000001411400000 to 0000001411500000
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cpu. bin_num:1 (requested) num_bytes: 512 (actual) rounded_bytes:512
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 1048576 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 1048576
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 000002620621E080 to 000002620631E080
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cuda. bin_num:13 (requested) num_bytes: 2936832 (actual) rounded_bytes:2936832
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 4194304 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 5242880
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000001411600000 to 0000001411A00000
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cpu. bin_num:13 (requested) num_bytes: 2936832 (actual) rounded_bytes:2936832
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 4194304 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 5242880
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000026206329080 to 0000026206729080
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cuda. bin_num:9 (requested) num_bytes: 131072 (actual) rounded_bytes:131072
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 4194304 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 9437184
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000001411A00000 to 0000001411E00000
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (session_state_utils.cc:427 onnxruntime::session_state_utils::SaveInitializedTensors): Done saving initialized tensors
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:2106 onnxruntime::InferenceSession::Initialize): Session successfully initialized.
Version: 1.20.0
Input:
[input_12_g2] shape: 1,3,256,256, type: System.Single isTensor: True

Output:
[Identity_g2] shape: 1,1,1,1434, type: System.Single isTensor: True
[Identity_1_g2] shape: 1,1,1,1, type: System.Single isTensor: True
[Identity_2_g2] shape: 1,1, type: System.Single isTensor: True

[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cuda. bin_num:11 (requested) num_bytes: 597040 (actual) rounded_bytes:597248
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 2097152 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 3145728
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000001412600000 to 0000001412800000
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for CudaPinned. bin_num:0 (requested) num_bytes: 16 (actual) rounded_bytes:256
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 1048576 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 1048576
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000000304C00600 to 0000000304D00600

Urgency

This issue is blocking a critical use case in our project. We need to run multiple models sequentially using CUDA sessions in Unity. Any delay in resolving this issue would impact our project timeline significantly.

Platform

Windows

OS Version

Windows 11 Pro

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

291a5352b27ded5714e5748b381f2efb88f28fb9

ONNX Runtime API

C#

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.5.1, CUDNN 9.4, TensorRT 10.4.0.26

abysslover commented 21 hours ago

With only the Face Detect model (single CUDA session on a GPU device):

[FaceDetect.PostProcess] Output classificator[0]: 896, Output_regrssor: 14336, Thr: 0.6
[FaceDetect.NonMaxSuppression] # Original Faces: 5, # Filtered Faces: 1

With face detection and face mesh (multiple CUDA sessions on a GPU device, sequential inference):

[FaceDetect.PostProcess] Output classificator[0]: 896, Output_regrssor: 14336, Thr: 0.6
[FaceDetect.NonMaxSuppression] # Original Faces: 0, # Filtered Faces: 0
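The logs above show output buffers of the expected size (896 and 14336 elements) but no detections, so it may help to check whether the raw values are all zero, NaN, or merely below the threshold. The following is a small diagnostic sketch in plain C#. `OutputStats` and `Summarize` are hypothetical names that do not appear in the project, and the helper assumes the output has already been copied into a float buffer.

```csharp
using System;

// Hypothetical diagnostic helper: summarizes a raw output buffer so that
// "empty" can be reported as all zeros, NaNs, or low scores.
public static class OutputStats
{
    public static (int NonZero, int NaN, float Max) Summarize(ReadOnlySpan<float> values)
    {
        int nonZero = 0;
        int nan = 0;
        // Stays at negative infinity if the buffer is empty or contains only NaNs.
        float max = float.NegativeInfinity;

        foreach (float v in values)
        {
            if (float.IsNaN(v))
            {
                nan++;
                continue;
            }
            if (v != 0f)
            {
                nonZero++;
            }
            if (v > max)
            {
                max = v;
            }
        }
        return (nonZero, nan, max);
    }
}
```

Logging `OutputStats.Summarize(classificators)` once in the single-session run and once in the two-session run would show whether the second configuration yields zeroed buffers or just lower scores.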