microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Inconsistent inference timing on CPU #10270

Open jamjambles opened 2 years ago

jamjambles commented 2 years ago

Describe the bug

I am deploying a model with ONNX Runtime on CPU, in an environment with a hard real-time budget of 20ms. The average inference time of the model is ~2ms, but I have observed that the inference time sometimes spikes to well over 20ms, which is not acceptable in the deployment environment. More generally, there is quite a lot of variance in timing between inference calls, which is also a concern.

System information

Additional context

These are the session options used to initialise the ORT inference session:

    sess_options = onnxruntime.SessionOptions()
    sess_options.intra_op_num_threads = 1
    sess_options.inter_op_num_threads = 1
    sess_options.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
    sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

Below is an example of inference times per call, showing the inconsistent timing and the spikes: image

Here is a trace using the onnxruntime profiling tool; again we can see where inference locks up, causing a spike: image

I am looking to learn whether onnxruntime is appropriate for such a use-case. Are there any other options or configurations to ensure stability in inference time?

tianleiwu commented 2 years ago

Is your intent to use only one thread for computation while your CPU has 8 cores? You can tune intra_op_num_threads (e.g. 4) and inter_op_num_threads to see whether the spikes become less frequent.
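For illustration, a minimal sketch of such a sweep (the model path, input name, and shape below are placeholders, not taken from this issue):

    import time
    import numpy as np
    import onnxruntime

    # Sweep intra-op thread counts and compare mean vs. worst-case latency.
    for n in (1, 2, 4, 8):
        so = onnxruntime.SessionOptions()
        so.intra_op_num_threads = n
        sess = onnxruntime.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])
        feed = {"input": np.zeros((1, 128), dtype=np.float32)}  # placeholder input name/shape
        times = []
        for _ in range(1000):
            t0 = time.perf_counter()
            sess.run(None, feed)
            times.append(time.perf_counter() - t0)
        print(f"threads={n} mean={np.mean(times):.6f}s max={np.max(times):.6f}s")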

jamjambles commented 2 years ago

@tianleiwu the intention is to run in a single thread on a single core. I ran benchmarks with the options you suggested, but I still see these inference time spikes occurring.

I believe this is due to memory allocations which occur during the forward inference call. Is there a way to avoid non-deterministic operations (such as memory allocations) during inference?

tianleiwu commented 2 years ago

@jamjambles, you can tune the session options to see whether that helps: http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html#sessionoptions

There are some optimizations (for Reshape etc.) in the master branch that could reduce memory allocation. You can try the nightly build https://test.pypi.org/project/ort-nightly/ to see whether it improves things.

If the problem still exists, please share your model and test script so that someone can help debug the issue.

jamjambles commented 2 years ago

@tianleiwu thanks for getting back to me!

Looking at the API docs, I think the following options are related to memory allocation:

sess_options.use_deterministic_compute
sess_options.enable_mem_reuse
sess_options.enable_cpu_mem_arena
sess_options.enable_mem_pattern 

Are you able to tell me how to set these in order to minimise the number of memory allocations?
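For reference, here is how I am currently toggling them, a sketch only; my understanding of each option is in the comments and may be off:

    sess_options = onnxruntime.SessionOptions()
    sess_options.enable_cpu_mem_arena = True        # serve allocations from a pre-allocated arena
    sess_options.enable_mem_pattern = True          # record/replay allocation patterns across runs
    sess_options.enable_mem_reuse = True            # reuse intermediate buffers where possible
    sess_options.use_deterministic_compute = False  # kernel determinism; seems unrelated to allocation count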

I tried the nightly build and the spikes are still occurring.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

SolomidHero commented 1 year ago

Hi @tianleiwu! I would like to know the answers to these questions too:

Are you able to tell me how to set these in order to minimise the number of memory allocations?

Also, @jamjambles, have you managed to solve this instability issue?

I have opened a similar issue about allocations when running with the C++ API: https://github.com/microsoft/onnxruntime/issues/17758

tianleiwu commented 1 year ago

@SolomidHero, to reduce memory allocation, you can try the arena settings (see the C++ API: https://onnxruntime.ai/docs/api/c/struct_ort_1_1_arena_cfg.html), like setting arena_extend_strategy to 1 (kSameAsRequested), which avoids the arena using more memory than needed. If you want to avoid allocations happening at all, you can set a large initial arena buffer.

See https://fs-eire.github.io/onnxruntime/docs/performance/tune-performance/memory.html if you have multiple sessions: you can share an arena among multiple sessions to reduce memory usage. You can also try mimalloc to see whether it helps.
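A minimal Python sketch combining these two suggestions; the concrete sizes and the 4-argument OrtArenaCfg form are illustrative assumptions, so check the API of your onnxruntime version:

    import onnxruntime as ort

    # Arena configured once at the environment level and shared by sessions.
    # Args: max_mem (0 = no cap), arena_extend_strategy (1 = kSameAsRequested),
    # initial_chunk_size_bytes (large upfront buffer), max_dead_bytes_per_chunk (-1 = default).
    arena_cfg = ort.OrtArenaCfg(0, 1, 256 * 1024 * 1024, -1)
    mem_info = ort.OrtMemoryInfo("Cpu", ort.OrtAllocatorType.ORT_ARENA_ALLOCATOR, 0, ort.OrtMemType.DEFAULT)
    ort.create_and_register_allocator(mem_info, arena_cfg)

    sess_options = ort.SessionOptions()
    sess_options.add_session_config_entry("session.use_env_allocators", "1")  # opt in to the shared arena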

If your input/output tensors are large, please use I/O binding (https://fs-eire.github.io/onnxruntime/docs/performance/tune-performance/iobinding.html), which saves extra memory and avoids data copies (thus reducing latency).
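A sketch of CPU-side I/O binding with pre-allocated buffers (the model path, input/output names, and shapes are placeholders):

    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    x = np.zeros((1, 128), dtype=np.float32)  # placeholder input buffer
    y = np.zeros((1, 10), dtype=np.float32)   # placeholder output buffer

    binding = sess.io_binding()
    binding.bind_ortvalue_input("input", ort.OrtValue.ortvalue_from_numpy(x))
    binding.bind_ortvalue_output("output", ort.OrtValue.ortvalue_from_numpy(y))
    sess.run_with_iobinding(binding)  # results land in y; no fresh output allocation per call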

For instability:

https://fs-eire.github.io/onnxruntime/docs/performance/tune-performance/troubleshooting.html also mentions that when latency has high variance, as in this issue, you can try settings like:

    sess_options.add_session_config_entry("session.dynamic_block_base", "4")

Or try enabling/disabling thread spinning:

    sess_options.add_session_config_entry("session.intra_op.allow_spinning", "1")

If your machine has NUMA (usually machines with > 32 cores), try setting thread affinity to pin threads to specific CPU cores: https://fs-eire.github.io/onnxruntime/docs/performance/tune-performance/threading.html#set-intra-op-thread-affinity.
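A sketch of what that looks like via session config; the core ids here are arbitrary and the affinity string format is summarised from the linked docs, so verify the exact syntax there:

    sess_options = onnxruntime.SessionOptions()
    sess_options.intra_op_num_threads = 3
    # Pin the two extra intra-op threads (the first "thread" is the caller itself)
    # to logical processors; each semicolon-separated entry covers one thread.
    sess_options.add_session_config_entry("session.intra_op_thread_affinities", "1;2")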

You can also build from source with the flag --use_lock_free_queue to see whether it helps.

If your application mixes PyTorch, NumPy and OnnxRuntime, try to move everything (like preprocessing and postprocessing) into an OnnxRuntime custom op. Each of these libraries has its own thread pool and memory management, and switching threads between them introduces extra latency.

SolomidHero commented 1 year ago

@tianleiwu Thank you for such a fast response! While such options are useful in a multi-threaded environment, I am trying to resolve instabilities for sequential inference. Say the base config is the same as the topic starter's:

sess_options.intra_op_num_threads = 1
sess_options.inter_op_num_threads = 1
sess_options.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
# and CPU as only execution provider

Can you confirm that the following options (1-4) you described are also helpful for fixing instabilities in a sequential setup?

1. arena_extend_strategy with an excessive amount of arena memory, so no additional allocation is needed. I personally use:

    EnableCpuMemArena();
    EnableMemPattern();

2. "Sharing allocator(s) between sessions" - I don't think it can really affect additional allocs/deallocs when the sessions are defined in isolation. I.e. I guess it doesn't affect instabilities, only memory size. So it is a good technique for reducing size only, right?

3. Sequential -> no thread pool -> no impact of "session.dynamic_block_base" and "session.intra_op.allow_spinning" on instabilities in a sequential setup, right? From https://github.com/microsoft/onnxruntime/issues/11347 I gather that intra_op=1, inter_op=1 with sequential execution does not spawn an additional thread, which is great I think. But the execution time variance is still high, as in the opening message.

4. I/O binding used successfully, it is a really useful thing :) But the instabilities still seem to be there.

tianleiwu commented 1 year ago

@SolomidHero, I think "session.dynamic_block_base" and "session.intra_op.allow_spinning" have an impact on OS thread scheduling, so you can give them a try.

For multiple sessions, I recommend trying one process per session, and using affinity to pin each process to some CPU cores. Without this, your thread might be moved by the OS to another core, which can cause latency spikes.
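A minimal sketch of pinning the current process (os.sched_setaffinity is Linux-only; the core id is an arbitrary example):

    import os

    # Pin this inference process to one fixed core so the OS scheduler
    # cannot migrate it between cores mid-run.
    os.sched_setaffinity(0, {2})  # 0 = current process; {2} = set of allowed core ids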

Latency spikes have many causes; you will need to monitor system-level CPU, memory and process status.

Sometimes, overheating can cause the OS to throttle CPU/GPU speed (but that affects a longer time window, so it is unlikely to be the cause of isolated spikes).

If OS-level memory usage is high, memory paging might cause latency spikes.

Sometimes a thread is suspended by the OS for some reason; check the system event logs to see whether anything happens at that time.

If your input has dynamic shape, you may add some logging to see whether the input shape is large. Also check whether there are periods with no input, so the session becomes "cold" and the next request takes longer.

If no cause is found using the above approaches, I think investigating the root cause might need some advanced profiling tools.

SolomidHero commented 1 year ago

@tianleiwu I am quite confused about the existence of a thread pool. Your words suggest that in the "intra_op=1, inter_op=1 and sequential execution" setup there must be some new threads that can be managed by these options. But from @snnn's answer in https://github.com/microsoft/onnxruntime/issues/11347 I gather that no thread is created.

Some analysis

I also tried profiling a setup of three models running inference sequentially, with equally spaced calls. Say I wanted to run at 50 Hz on a single thread. Here are my results:

  1. isolated - same env for all sessions - no global thread pool

    env = Ort::Env(ORT_LOGGING_LEVEL_WARNING, "common_env");

    image

  2. global - same env for all sessions - env with global thread pool

    OrtThreadingOptions* tp_options = nullptr;
    Ort::GetApi().CreateThreadingOptions(&tp_options);
    Ort::GetApi().SetGlobalIntraOpNumThreads(tp_options, 1);
    Ort::GetApi().SetGlobalInterOpNumThreads(tp_options, 1);
    Ort::GetApi().SetGlobalSpinControl(tp_options, 0); // disabled, since spinning gives high CPU usage and we don't need it
    env = Ort::Env(tp_options, ORT_LOGGING_LEVEL_WARNING, "global_env");
    Ort::GetApi().ReleaseThreadingOptions(tp_options);

    image

See also the distribution difference: image

I just wanted to share this result. For me it is clear that (if @snnn is correct that no new thread is created in the intra_op=1, inter_op=1, sequential setup) the only difference is that (1) runs without any new threads, while (2) uses a single extra thread for processing the models and spends a lot of time synchronizing with the main thread. I don't know whether there are other differences, for example in memory management behaviour. Can you shed light on this diff? Can you also explain why the isolated case might have a bi-modal kde plot?

p.s. I used onnxruntime==1.14.1 on macOS

tianleiwu commented 1 year ago

@SolomidHero, for the global thread pool it is right that no thread pool is created, since the number of threads is set to 1. Each node is executed directly in the main thread.

If you are interested in the details of the thread pool, see the code below:

https://github.com/microsoft/onnxruntime/blob/7417fd41e2c77cac5f09c3224918c9c23598a1cd/onnxruntime/core/util/thread_utils.cc#L87-L89
https://github.com/microsoft/onnxruntime/blob/7417fd41e2c77cac5f09c3224918c9c23598a1cd/onnxruntime/core/session/inference_session.h#L556-L566
https://github.com/microsoft/onnxruntime/blob/7417fd41e2c77cac5f09c3224918c9c23598a1cd/onnxruntime/core/session/environment.cc#L203-L215
https://github.com/microsoft/onnxruntime/blob/7417fd41e2c77cac5f09c3224918c9c23598a1cd/onnxruntime/core/session/inference_session.cc#L364-L406
https://github.com/microsoft/onnxruntime/blob/7417fd41e2c77cac5f09c3224918c9c23598a1cd/include/onnxruntime/core/platform/threadpool.h#L228-L235

SolomidHero commented 1 year ago

I have been reading the code, but I also tried to get an improvement empirically. So I tried different session options and allocator options:

SetInterOpNumThreads(1);
SetIntraOpNumThreads(1);
EnableCpuMemArena(); // DisableCpuMemArena();
EnableMemPattern(); // DisableMemPattern();

AddConfigEntry("session.use_env_allocators", "1");

env = std::make_unique<Ort::Env>(ORT_LOGGING_LEVEL_WARNING, "common_env");
memoryInfo = std::make_unique<Ort::MemoryInfo>(Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU));
auto& api = Ort::GetApi();

const char* keys[] = {
    "max_mem",
    "arena_extend_strategy",
    "initial_chunk_size_bytes",
    "max_dead_bytes_per_chunk",
    "initial_growth_chunk_size_bytes"
};
const size_t values[] = {
    0,  // max_mem
    1,  // arena_extend_strategy: kNextPowerOfTwo==0, kSameAsRequested==1 (I tried both)
    0,  // initial_chunk_size_bytes (also tried 300 * 1024 * 1024 and 1024 * 1024)
    0,  // max_dead_bytes_per_chunk
    0,  // initial_growth_chunk_size_bytes (also tried 1024 * 256)
};  // five values, matching the five keys above
OrtArenaCfg* arena_cfg;
auto status = api.CreateArenaCfgV2(keys, values, sizeof(values) / sizeof(size_t), &arena_cfg);
env->CreateAndRegisterAllocator(*memoryInfo, arena_cfg);

The comments show the other values and options that I tried. So I created 3 sessions of different models on the same env and memoryInfo, and I couldn't get different results! My program always consumed 201-204MB, and the processing time in the loop always had about the same mean and std (the differences were statistically insignificant). So even MemPattern and CpuMemArena have no impact.

Is it possible that the shared allocator is not actually used, and its env options are ignored, in the sequential setup? I am still reading the code to find answers 😔