microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] Observing higher memory spikes in C++ when running multiple Inference `Run()` executions on CPU #22920

Open martinkorelic opened 8 hours ago

martinkorelic commented 8 hours ago

Describe the issue

Description:

I am observing large memory spikes after each run of the inference session when the outputs of one inference call are fed back as inputs to the next iteration. This happens when the input values (and shapes) change on each iteration of the generation loop: memory usage jumps after every Run() invocation, and the total allocated memory keeps growing. In contrast, the Python version of the same code does not exhibit these spikes, and its memory usage stays essentially flat. This suggests a memory management issue on the C++ side, possibly tied to the memory arena, memory patterns, or session configuration.
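If the growth really is the CPU memory arena retaining allocations across calls, one documented knob is the per-Run arena-shrinkage config entry. This is only a hedged sketch, assuming the standard C++ API and the run-option key `memory.enable_memory_arena_shrinkage`:

```cpp
#include <onnxruntime_cxx_api.h>

// Sketch: request that the CPU arena (device 0) shrink back after each Run(),
// so memory grabbed for one iteration's larger inputs is returned to the
// system instead of being retained by the arena. Assumes the documented
// run-option key "memory.enable_memory_arena_shrinkage".
Ort::RunOptions MakeShrinkingRunOptions() {
  Ort::RunOptions run_opts;
  run_opts.AddConfigEntry("memory.enable_memory_arena_shrinkage", "cpu:0");
  return run_opts;  // pass this to every session.Run(...) call
}
```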

Environment:

What I've Tried:

Additional Information:

Question:

How can I better manage memory usage across inference runs when using dynamically changing inputs with the ONNX Runtime C++ API? Are there specific settings or techniques for reducing memory spikes that I may have missed? The same issue does not occur in the Python version of the implementation.
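Two session-level settings that are often relevant when input shapes change on every call are memory-pattern planning and the CPU arena itself. A hedged sketch, not a confirmed fix, assuming the standard `Ort::SessionOptions` API and a placeholder model path:

```cpp
#include <onnxruntime_cxx_api.h>

// Sketch: build a session whose options avoid the allocations most affected
// by per-iteration shape changes. Model path is a placeholder.
Ort::Session MakeSession(Ort::Env& env, const char* model_path) {
  Ort::SessionOptions opts;
  // Pre-planned memory patterns are keyed to concrete shapes; with shapes
  // that change every Run(), the plans cannot be reused, so disabling them
  // avoids repeated planning and the allocations that come with it.
  opts.DisableMemPattern();
  // Bypass the CPU arena entirely so per-Run() buffers are freed back to the
  // OS rather than kept at the arena's high-water mark.
  opts.DisableCpuMemArena();
  return Ort::Session(env, model_path, opts);
}
```

Whether these settings help or merely trade peak memory for allocation overhead depends on the model, so it is worth measuring both configurations.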

To reproduce

Call InferenceSession.Run() multiple times in a loop with dynamically increasing input shapes, using a quantized ONNX model exported from torch.
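A minimal repro sketch along those lines, assuming a model with a single dynamic-length integer input and placeholder names (`input`, `output`, `model_quantized.onnx`) rather than the actual model from this issue:

```cpp
#include <onnxruntime_cxx_api.h>
#include <vector>

// Minimal repro sketch: a single dynamic-axis input grows by one element per
// iteration, mimicking a generation loop that feeds results back in.
int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "repro");
  Ort::Session session(env, "model_quantized.onnx", Ort::SessionOptions{});
  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeDefault);

  const char* input_names[] = {"input"};
  const char* output_names[] = {"output"};

  std::vector<int64_t> data{0};
  for (int step = 0; step < 256; ++step) {
    std::vector<int64_t> shape{1, static_cast<int64_t>(data.size())};
    Ort::Value input = Ort::Value::CreateTensor<int64_t>(
        mem_info, data.data(), data.size(), shape.data(), shape.size());

    auto outputs = session.Run(Ort::RunOptions{nullptr}, input_names, &input, 1,
                               output_names, 1);

    // Grow the input for the next step; the per-Run() memory growth is
    // observed across these iterations.
    data.push_back(step);
    // `outputs` and `input` go out of scope here, releasing their handles
    // before the next iteration.
  }
  return 0;
}
```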

Observations:

Expected Behavior:

Actual Behavior:

[Image attachment]

Urgency

No response

Platform

Android

OS Version

-

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

ONNX Runtime 18

ONNX Runtime API

C++

Architecture

ARM64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes

skottmckay commented 5 hours ago

Are you sure you're managing memory correctly in your code? You mention passing outputs from one run to be inputs to the next. How is that memory freed once used as inputs?

We do this kind of execution for LLMs on Android without seeing constant memory growth.

Python will automatically reference count for you and handle the memory. I believe the Java OnnxTensor needs close() to be called.

https://github.com/microsoft/onnxruntime/blob/f6e1d4482941d43737d40723df16a6bf0da43ee5/java/src/main/java/ai/onnxruntime/OnnxTensor.java#L273-L278
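For the C++ API the equivalent concern is the lifetime of the `Ort::Value` handles themselves. A hedged sketch of the output-to-input handoff, with placeholder names and the simplifying assumption that the model's outputs map one-to-one onto its next-step inputs:

```cpp
#include <onnxruntime_cxx_api.h>
#include <vector>

// Sketch for the C++ API, where Ort::Value is an RAII handle and there is no
// close() to call: what matters is that the previous iteration's values are
// destroyed or moved from, rather than accumulating. Real models (e.g. with
// KV caches) need a more careful output-to-input mapping than this.
void GenerationLoop(Ort::Session& session,
                    const std::vector<const char*>& input_names,
                    const std::vector<const char*>& output_names,
                    std::vector<Ort::Value> inputs, int steps) {
  for (int step = 0; step < steps; ++step) {
    std::vector<Ort::Value> outputs = session.Run(
        Ort::RunOptions{nullptr}, input_names.data(), inputs.data(),
        inputs.size(), output_names.data(), output_names.size());

    // Hand the outputs back as next-step inputs. The move assignment frees
    // the old input tensors right here; holding extra copies of prior
    // Ort::Values (or the vectors containing them) is a common cause of
    // run-over-run memory growth.
    inputs = std::move(outputs);
  }
}
```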