[Performance] CNN Inference has latency spikes with TensorRT EP #13366

Open · BitCourier opened 1 year ago

BitCourier commented 1 year ago

Describe the issue

When running inference, the latency is normally around 1 ms. If the inference frequency is constant, the maximum latency spike is around 3-4 ms, which would fit our application. If the inference frequency varies, latency spikes of up to 18 ms occur after a long pause.

The GPU and CPU clocks are fixed to mitigate latency spikes caused by energy-saving measures. The GPU is used only for inference; no display output is connected. We use the engine cache and run a warm-up step that is excluded from the measurements.

Model: shufflenet v2, nx1x256x256
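For context, the session is set up roughly like this (a minimal sketch; the model path, cache directory, and tensor names are placeholder assumptions, with n = 1):

```cpp
#include <onnxruntime_cxx_api.h>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "latency-test");
  Ort::SessionOptions so;

  // TensorRT EP with the engine cache enabled, as described above.
  OrtTensorRTProviderOptions trt_options{};
  trt_options.device_id = 0;
  trt_options.trt_engine_cache_enable = 1;
  trt_options.trt_engine_cache_path = "./trt_cache";  // placeholder cache directory
  so.AppendExecutionProvider_TensorRT(trt_options);

  Ort::Session session(env, "shufflenet_v2.onnx", so);  // placeholder model path

  // Input nx1x256x256, here with n = 1.
  std::vector<int64_t> shape{1, 1, 256, 256};
  std::vector<float> input(1 * 256 * 256, 0.0f);
  auto mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::Value tensor = Ort::Value::CreateTensor<float>(
      mem, input.data(), input.size(), shape.data(), shape.size());
  const char* in_names[] = {"input"};    // placeholder I/O names
  const char* out_names[] = {"output"};

  // Warm-up run, excluded from all measurements.
  session.Run(Ort::RunOptions{nullptr}, in_names, &tensor, 1, out_names, 1);
  return 0;
}
```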

To reproduce

Run inference at a fixed frequency of ~100 Hz. The latency should show no large spikes.

Run inference at 100 Hz, but occasionally add pauses of ~10 s; the latencies should show spikes (see the sketch below).
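A minimal sketch of the measurement loop for both variants (run_inference() here is a hypothetical stand-in for the session.Run() call from the setup sketch; the pause placement is arbitrary):

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical stand-in for one session.Run() call from the setup sketch.
void run_inference() { /* session.Run(...) */ }

int main() {
  using clock = std::chrono::steady_clock;
  const auto period = std::chrono::milliseconds(10);  // ~100 Hz

  auto next = clock::now() + period;
  for (int i = 0; i < 10000; ++i) {
    // Variant 2: occasionally insert a long pause; remove this for variant 1.
    if (i > 0 && i % 1000 == 0) {
      std::this_thread::sleep_for(std::chrono::seconds(10));
      next = clock::now() + period;  // re-anchor the 100 Hz schedule
    }

    const auto t0 = clock::now();
    run_inference();
    const auto t1 = clock::now();
    const long long us =
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("latency: %lld us\n", us);

    std::this_thread::sleep_until(next);
    next += period;
  }
  return 0;
}
```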

Urgency

Not extremely urgent; deadline: ~2/23

Platform

Linux

OS Version

Debian 11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

rel-1.12.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

Update

Using the CUDA EP shows the same result: inference runs at ~3 ms and the spikes reach ~20 ms.
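For this comparison only the provider registration was swapped, roughly like this (a sketch; all other session options stayed the same):

```cpp
#include <onnxruntime_cxx_api.h>

// Same setup as the TensorRT sketch above, but registering the CUDA EP instead.
Ort::SessionOptions make_cuda_session_options() {
  Ort::SessionOptions so;
  OrtCUDAProviderOptions cuda_options{};  // defaults
  cuda_options.device_id = 0;
  so.AppendExecutionProvider_CUDA(cuda_options);
  return so;
}
```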

BitCourier commented 1 year ago

Update 2:

When using the CPU EP, the latencies show no spikes. I therefore suspect a driver issue, because all energy-saving features are disabled (including PCIe ASPM). I forgot to mention: I'm running a PREEMPT_RT kernel, and the driver was installed with "IGNORE_PREEMPT_RT_PRESENCE=1". Bad driver behaviour could be the problem at this point.

I could use a Tesla card for testing, too. Is there a chance of better behaviour?

BitCourier commented 1 year ago

Update 3:

With proper CPU locking the spikes are lower, but they are still too high. Does anyone have suggestions?
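By CPU locking I mean pinning the inference thread to a fixed core and giving it a real-time priority, roughly like this (a sketch assuming Linux with glibc; the core number and priority are arbitrary example values):

```cpp
#include <pthread.h>  // pthread_setaffinity_np needs _GNU_SOURCE (g++ defines it by default)
#include <sched.h>
#include <cstdio>

// Pin the calling thread to one core and switch it to SCHED_FIFO.
bool lock_thread_to_core(int core, int rt_priority) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  // pthread_* calls return the error number directly instead of setting errno.
  int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
  if (err != 0) {
    std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
    return false;
  }
  sched_param param{};
  param.sched_priority = rt_priority;  // requires CAP_SYS_NICE or root
  err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
  if (err != 0) {
    std::fprintf(stderr, "pthread_setschedparam failed: %d\n", err);
    return false;
  }
  return true;
}

int main() {
  if (!lock_thread_to_core(3, 80))  // arbitrary example values
    return 1;
  // ... run the inference loop from the sketch above ...
  return 0;
}
```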

BitCourier commented 1 year ago

Update 4:

I tried a TensorRT example, which exhibits the same timing behaviour. There's a thread about this on the NVIDIA developer forum: https://forums.developer.nvidia.com/t/strange-cnn-inference-latency-behavior-with-cuda-and-tensorrt/237501/1

Conclusion: it's not an ONNX Runtime problem, and running the NVIDIA driver on PREEMPT_RT is not the problem either.