Open BitCourier opened 1 year ago
Update 2:
When using the CPU EP, the latencies show no spikes. I therefore suspect a driver issue, since all power-saving features are disabled (including PCIe ASPM). I forgot to mention that I'm running a PREEMPT_RT kernel; the driver was installed with "IGNORE_PREEMPT_RT_PRESENCE=1". Misbehaving driver code could be the problem here.
I could also use a Tesla card for testing. Is there a chance of better behaviour?
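For context, the install on a PREEMPT_RT kernel roughly looks like this (a sketch; the installer file name is a placeholder for the actual driver package):

```shell
# IGNORE_PREEMPT_RT_PRESENCE=1 bypasses the installer's check that
# refuses to build against a PREEMPT_RT kernel (unsupported config).
sudo IGNORE_PREEMPT_RT_PRESENCE=1 ./NVIDIA-Linux-x86_64-<version>.run
```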
With proper CPU locking (core pinning) the spikes are lower, but they are still too high. Does anyone have suggestions?
I tried a plain TensorRT example, which exhibits the same timing behaviour. There's a thread about this on the Nvidia developer forum: https://forums.developer.nvidia.com/t/strange-cnn-inference-latency-behavior-with-cuda-and-tensorrt/237501/1
Conclusion: it's not an ONNX Runtime problem, and running the Nvidia driver on PREEMPT_RT is not the problem either.
Describe the issue
When running inference, the latency is normally around 1 ms. If the inference frequency is constant, the maximum latency spike is around 3-4 ms, which would fit our application. If the inference frequency varies, latency spikes of up to 18 ms occur after a long pause.
The GPU and CPU clocks are fixed to mitigate latency spikes caused by power-saving measures. The GPU is used only for inference; no display output is connected. We use the inference cache and an explicit warm-up step that is excluded from the measurements.
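Fixing the clocks can be sketched as follows (a sketch, not the exact setup used here; the clock values are placeholders and the exact flags depend on the driver and tooling versions):

```shell
# Enable persistence mode so the driver keeps the GPU initialized.
sudo nvidia-smi -pm 1

# Lock the GPU core clock to a fixed range (supported on newer
# drivers/GPUs; 1500 MHz is an example value, not a recommendation).
sudo nvidia-smi --lock-gpu-clocks=1500,1500

# Pin the CPU frequency via the performance governor
# (cpupower is in the linux-cpupower package on Debian).
sudo cpupower frequency-set -g performance
```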
Model: shufflenet v2, nx1x256x256
To reproduce
Run inference at a fixed frequency of ~100 Hz: the latencies should show no big spikes.
Run inference at 100 Hz but occasionally insert pauses of ~10 s: the latencies should show spikes.
Urgency
Not extremely urgent; deadline: ~2/23
Platform
Linux
OS Version
Debian 11
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
rel-1.12.1
ONNX Runtime API
C++
Architecture
X64
Execution Provider
TensorRT
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
No
Update
Using the CUDA EP shows the same result: the inference runs at ~3 ms and the spikes reach ~20 ms.