microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] Java API lacks functionality to control allocator settings. #18845

Open ivanthewebber opened 10 months ago

ivanthewebber commented 10 months ago

Describe the issue

The Java API is very limited: there is no way to control the arena allocator settings (e.g. setting "arena_extend_strategy" to "kSameAsRequested", or configuring "max_mem", "max_dead_bytes_per_chunk", "initial_chunk_size_bytes").

This of course means that memory is wasted and startup cannot be optimized. It also means that if there is a memory leak, the entire container gets OOMKilled instead of producing a reasonable error message (as it would with a sensible "max_mem").

I've looked for any way to configure these but found nothing. It seems like it would be straightforward to forward some of these configurations to the underlying C implementation.
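For illustration, this is roughly all that can be done today on the Java side (a minimal sketch, assuming `SessionOptions.setCPUArenaAllocator` is the only arena-related switch exposed; the model path is a placeholder and none of the fine-grained keys above can be passed):

```java
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

public class ArenaToggleOnly {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
            // Coarse workaround: turn the CPU memory arena off entirely.
            // There is no way from Java to set "arena_extend_strategy",
            // "max_mem", "max_dead_bytes_per_chunk" or "initial_chunk_size_bytes".
            opts.setCPUArenaAllocator(false);
            try (OrtSession session = env.createSession("model.onnx", opts)) {
                // run(...) now allocates per request instead of growing an arena
            }
        }
    }
}
```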

To reproduce

Use the Java API.

Urgency

It's causing problems for me at work.

Platform

Linux

OS Version

AKS Docker image based on a Mariner image

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.2

ONNX Runtime API

Java

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

ivanthewebber commented 10 months ago

I'm trying to use ONNX Runtime for stream processing with Apache Flink in a low-latency, high-throughput, memory-constrained setting.

See this paper comparing ONNX Runtime and alternatives for this use case; its accompanying source code is similar to my own usage.

Craigacp commented 10 months ago

I think a bunch of those are possible for CUDA, as we expose an add method on the CUDA EP options, but you're right that we don't expose memory allocators at all for CPUs.

It's not straightforward to design an API which exposes the allocators. At the moment there's a single default allocator used everywhere, and it isn't exposed in any of the value construction methods, so it would be a substantial effort to build an API around that, OrtMemoryInfo and OrtArenaCfg. It's on the todo list, as it will enable direct allocation of GPU memory, which can be useful, but it needs careful design.
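For reference, the existing CUDA path looks roughly like this (a sketch only; the option keys and values are taken from the CUDA EP documentation and may differ between versions):

```java
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import ai.onnxruntime.providers.OrtCUDAProviderOptions;

public class CudaArenaOptions {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // String-keyed provider options for the CUDA EP (keys assumed from the CUDA EP docs).
        OrtCUDAProviderOptions cudaOpts = new OrtCUDAProviderOptions(0); // device 0
        cudaOpts.add("arena_extend_strategy", "kSameAsRequested");
        cudaOpts.add("gpu_mem_limit", "2147483648"); // 2 GiB cap on the GPU arena
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
            opts.addCUDA(cudaOpts);
            try (OrtSession session = env.createSession("model.onnx", opts)) {
                // nothing comparable exists for the default CPU allocator
            }
        }
    }
}
```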

ivanthewebber commented 10 months ago

It seems like you could follow the same patterns as the Python API and translate some of the implementation. Let me know if you're able to add this to your backlog and what the timeline would be. Otherwise I will look for a workaround or an alternative like onnx-scala.
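To make the ask concrete, something along these lines would cover my use case (purely hypothetical: it only mirrors the Python OrtArenaCfg keys, and the commented-out method name is invented, nothing like it exists in the Java API today):

```java
import java.util.Map;

// HYPOTHETICAL sketch of the requested surface; only the map construction is real Java.
public class DesiredArenaConfig {
    public static void main(String[] args) {
        Map<String, Long> arenaCfg = Map.of(
                "max_mem", 512L * 1024 * 1024,           // hard cap -> error instead of container OOM
                "arena_extend_strategy", 1L,              // 1 == kSameAsRequested
                "initial_chunk_size_bytes", 16L * 1024 * 1024,
                "max_dead_bytes_per_chunk", 4L * 1024 * 1024);
        // opts.setCPUArenaConfiguration(arenaCfg);       // invented placeholder method, does not exist
        System.out.println(arenaCfg);
    }
}
```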

Craigacp commented 10 months ago

Python is a little easier as it doesn't have to deal with concurrency, so they can get away with a laxer API. I'll scope out the amount of work in the new year.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

Craigacp commented 9 months ago

Keep this issue open; it can track CPU allocator settings.

ivanthewebber commented 9 months ago

Any updates? Also, if I set the number of inter-op and intra-op threads to 1 and share a session object across many threads, would each thread calling run be able to execute in parallel, or would the ONNX Runtime thread's affinity be tied to a single CPU?

Craigacp commented 9 months ago

No updates, I'm waiting for this PR (https://github.com/microsoft/onnxruntime/pull/18556) to be merged before starting on more memory management related issues.

I believe the thread you send in to ORT is used for compute, so if you have concurrent requesting threads then those threads will concurrently execute the model.
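So a setup like the following should work (a rough sketch; the model path, input name and tensor shape are placeholders for your actual model):

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;
import java.util.Map;

public class SharedSessionDemo {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
        opts.setIntraOpNumThreads(1); // no intra-op parallelism inside a single run
        opts.setInterOpNumThreads(1); // no inter-op parallelism either
        try (OrtSession session = env.createSession("model.onnx", opts)) {
            // Concurrency comes from the calling threads, each driving its own run()
            // on the one shared session.
            Runnable task = () -> {
                try (OnnxTensor input = OnnxTensor.createTensor(env, new float[][]{{1f, 2f, 3f}});
                     OrtSession.Result result = session.run(Map.of("input", input))) {
                    // consume result here
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            };
            Thread t1 = new Thread(task);
            Thread t2 = new Thread(task);
            t1.start();
            t2.start();
            t1.join();
            t2.join();
        }
    }
}
```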

ivanthewebber commented 6 months ago

Any updates? I have my fingers crossed that some work on this will get planned.

Craigacp commented 6 months ago

Not yet.