microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] SetIntraOpNumThreads does not take effect #21700

Open Zhangts98 opened 1 month ago

Zhangts98 commented 1 month ago

Describe the issue

When using SetIntraOpNumThreads(1) and SetIntraOpNumThreads(10) on GPU, the inference time is about the same, around 30 ms in both cases. I have already done a warm-up run before measuring the time. How should I configure the session to improve inference speed? Do I need to build ORT with OpenMP from source?

The CUDA version cannot be modified.

To reproduce

...

env = Ort::Env(ORT_LOGGING_LEVEL_WARNING, "ONNXRuntime inference");

// ONNX session options
Ort::SessionOptions session_options;
session_options.SetIntraOpNumThreads(1);
//session_options.SetIntraOpNumThreads(10);
OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, 0);
sess = Ort::Session(env, model_buffer, model_buffer_len, session_options);

// warm up
sess.Run(xxxxxx);

...

Urgency

No response

Platform

Linux

OS Version

CentOS 7

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.9.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.2 cudnn 8.1.0

Model File

No response

Is this a quantized model?

No

tianleiwu commented 1 month ago

Intra-op num threads is for the CPU. With the CUDA EP, most of the time is usually spent in CUDA kernels, which are not affected by the CPU thread settings.
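If the goal is to cut end-to-end latency with the CUDA EP, the usual levers are on the GPU side rather than the thread pool. Below is a minimal sketch, assuming a single-input/single-output model; the tensor names ("input", "output"), the shape, and the model path are placeholders, not from this issue. It enables ORT profiling to confirm where the time goes, and uses IoBinding so repeated Run() calls avoid per-call host-device copies:

#include <onnxruntime_cxx_api.h>
#include <array>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "iobinding-sketch");

  Ort::SessionOptions session_options;
  // Optional: write a profile file to confirm where the 30 ms actually goes
  // (CUDA kernels vs. CPU fallback ops vs. memcpy nodes).
  session_options.EnableProfiling("ort_profile");
  OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, 0);

  Ort::Session session(env, "model.onnx", session_options);

  // CUDA device memory descriptor (device 0).
  Ort::MemoryInfo cuda_mem("Cuda", OrtDeviceAllocator, 0, OrtMemTypeDefault);

  // Allocate the input tensor directly on the GPU. The shape and the
  // "input"/"output" names below are placeholders.
  Ort::Allocator cuda_allocator(session, cuda_mem);
  std::array<int64_t, 4> shape{1, 3, 224, 224};
  Ort::Value input = Ort::Value::CreateTensor(
      cuda_allocator, shape.data(), shape.size(),
      ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT);
  // (Fill `input` once via cudaMemcpy, outside the timed loop.)

  Ort::IoBinding binding(session);
  binding.BindInput("input", input);
  binding.BindOutput("output", cuda_mem);  // output stays on the GPU

  Ort::RunOptions run_options;
  session.Run(run_options, binding);  // no per-call host<->device copies
  return 0;
}

Building with OpenMP would only change CPU-side threading, so it is unlikely to help here for the same reason the intra-op setting does not.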

Zhangts98 commented 1 month ago

> Intra-op num threads is for the CPU. With the CUDA EP, most of the time is usually spent in CUDA kernels, which are not affected by the CPU thread settings.

thanks