microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] CUDA EP with Strange Inference Time #14016

Open zhanggd001 opened 1 year ago

zhanggd001 commented 1 year ago

Describe the issue

When testing the inference time with the CUDA EP (Win10 x64, VS2017, CUDA 11.7, onnxruntime 1.12.1), I find that the inference time is unstable and varies enormously. For example, over 1000 runs the inference time will suddenly spike from time to time.

[screenshot: per-call inference times over 1000 runs, showing occasional spikes]

To reproduce

#include <onnxruntime_cxx_api.h>

#include <chrono>
#include <iostream>
#include <vector>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING);

Ort::SessionOptions session_options;
session_options.SetIntraOpNumThreads(4);
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

// use gpu
OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, 0);

Ort::Session* ort_session = new Ort::Session(env, model_path, session_options);

Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

std::vector<float> input_tensor_values(input_tensor_size);

Ort::Value input_tensor = Ort::Value::CreateTensor<float>(memory_info, input_tensor_values.data(), input_tensor_size, input_node_dims.data(), 4);

auto start_inference = std::chrono::steady_clock::now();

auto output_tensors = ort_session->Run(Ort::RunOptions{ nullptr }, input_node_names.data(), &input_tensor, num_input_nodes, output_node_names.data(), num_output_nodes);

auto end_inference = std::chrono::steady_clock::now();
std::chrono::duration<double> time_inference = end_inference - start_inference;
std::cout << "Inference Time : " << time_inference.count() * 1000 << "ms" << std::endl;
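
A minimal sketch of how the 1000-run measurement can be wrapped in a loop, reusing ort_session, input_tensor, and the node-name arrays from the snippet above (the warm-up count of 10 and run count of 1000 are illustrative, not from the original report):

// Warm-up runs: exclude one-time CUDA/cuDNN initialization and first-run
// kernel selection from the measured numbers.
for (int i = 0; i < 10; ++i) {
    ort_session->Run(Ort::RunOptions{ nullptr }, input_node_names.data(), &input_tensor, num_input_nodes, output_node_names.data(), num_output_nodes);
}

// Timed runs: record per-call latency so sporadic spikes are visible.
std::vector<double> latencies_ms;
latencies_ms.reserve(1000);
for (int i = 0; i < 1000; ++i) {
    auto t0 = std::chrono::steady_clock::now();
    auto outputs = ort_session->Run(Ort::RunOptions{ nullptr }, input_node_names.data(), &input_tensor, num_input_nodes, output_node_names.data(), num_output_nodes);
    auto t1 = std::chrono::steady_clock::now();
    latencies_ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
}

// Report min / median / max instead of a single number (needs <algorithm> for std::sort).
std::sort(latencies_ms.begin(), latencies_ms.end());
std::cout << "min " << latencies_ms.front()
          << " ms, median " << latencies_ms[latencies_ms.size() / 2]
          << " ms, max " << latencies_ms.back() << " ms" << std::endl;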

Urgency

No response

Platform

Windows

OS Version

10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.12.1-gpu

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.7

Model File

No response

Is this a quantized model?

No

BitCourier commented 1 year ago

What time period is between the inference calls?

Inference times may fluctuate due to GPU and CPU power control/energy savings.

Try setting your GPU to the PowerMizer "Maximum Performance" mode, manually set the fan to a high speed, and try again. Does the situation improve? What inference times are you targeting?
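
One rough way to check the power-management hypothesis is to compare back-to-back calls with calls separated by an idle gap; a sketch reusing ort_session and input_tensor from the reproduce snippet above (the gap length and run counts are arbitrary):

#include <thread>   // std::this_thread::sleep_for

// If the "with gap" latencies are consistently higher, the GPU is likely
// dropping to a lower power state between calls and paying a clock ramp-up cost.
for (int gap_ms : { 0, 500 }) {
    for (int i = 0; i < 50; ++i) {
        if (gap_ms > 0)
            std::this_thread::sleep_for(std::chrono::milliseconds(gap_ms));
        auto t0 = std::chrono::steady_clock::now();
        auto outputs = ort_session->Run(Ort::RunOptions{ nullptr }, input_node_names.data(), &input_tensor, num_input_nodes, output_node_names.data(), num_output_nodes);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << "gap " << gap_ms << " ms, run " << i << ": "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms" << std::endl;
    }
}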

zhanggd001 commented 1 year ago

Thanks. I found that this is caused by the GPU driver version (my GPU is a GTX 950M, Win10). With a 47X.XX driver the inference time is stable, but with drivers newer than 47X.XX it is unstable and varies enormously.

BitCourier commented 1 year ago

Hi, thanks! That's interesting, because I have the same problem and have never tried older drivers.

Forum post: https://forums.developer.nvidia.com/t/strange-cnn-inference-latency-behavior-with-cuda-and-tensorrt/237501

lvchigo commented 12 months ago