microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] High thread contention in BFCArena #21916

Open gootorov opened 2 weeks ago

gootorov commented 2 weeks ago

Describe the issue

Hi,

I've noticed that a significant chunk of time is spent waiting on locks inside onnxruntime, specifically inside BFCArena::AllocateRawInternal: https://github.com/microsoft/onnxruntime/blob/01673389b8c51dbea918900c2954966908c7fcaf/onnxruntime/core/framework/bfc_arena.cc#L328

The conditions are as follows:

Flamegraph screenshots: [three images, not preserved]

strace shows that 92% of the measured syscall time is spent in futex calls:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 92.30 7155.519048        1521   4701931    311073 futex
  4.54  351.586941       52798      6659           clock_nanosleep
  2.03  157.700755         874    180328           epoll_wait
  0.75   58.015816          17   3245709           write
  0.23   17.686325          19    884903           sched_yield

Is this an expected BFCArena limitation, or is it something misconfigured on my side?

I expected that having a Session object per worker thread would eliminate the contention. However, I've seen developers here discourage setups like this. Why? What are the drawbacks? I assume increased memory consumption (which is fine for me). Is there anything else?

If this is indeed an expected limitation, I'd say it needs some improvement. For example, a caller could pass their own BFCArena instance to Session.Run(), or BFCArena could track the calling thread's id and keep a separate arena for each thread.
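To illustrate the per-thread idea, here is a minimal, self-contained sketch. It is a toy model and not onnxruntime code: ToyArena, ThreadLocalArena, and RunWorkers are hypothetical names, and a simple bump allocator stands in for BFCArena. The only point is that each thread allocates from its own arena, so the allocation path needs no shared lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Toy bump allocator standing in for an arena. NOT onnxruntime's BFCArena.
class ToyArena {
 public:
  explicit ToyArena(std::size_t capacity) : buffer_(capacity), offset_(0) {}

  // Hands out the next chunk of the buffer; returns nullptr when exhausted.
  // Request sizes are rounded up to a multiple of kChunk bytes.
  void* Allocate(std::size_t bytes) {
    assert(bytes > 0);
    const std::size_t rounded = (bytes + kChunk - 1) / kChunk * kChunk;
    if (rounded > buffer_.size() - offset_) return nullptr;
    void* p = buffer_.data() + offset_;
    offset_ += rounded;
    return p;
  }

  std::size_t used() const { return offset_; }

 private:
  static constexpr std::size_t kChunk = 64;
  std::vector<unsigned char> buffer_;
  std::size_t offset_;
};

// One arena per thread, created lazily on first use (1 MiB is an arbitrary
// size chosen for this sketch). No mutex is needed because no other thread
// ever touches this instance.
inline ToyArena& ThreadLocalArena() {
  thread_local ToyArena arena(1 << 20);
  return arena;
}

// Spawns num_threads workers that each allocate from their own arena and
// returns the total number of successful allocations.
inline std::size_t RunWorkers(int num_threads, int allocs_per_thread) {
  std::atomic<std::size_t> successes{0};
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&successes, allocs_per_thread] {
      ToyArena& arena = ThreadLocalArena();
      std::size_t ok = 0;
      for (int i = 0; i < allocs_per_thread; ++i) {
        if (arena.Allocate(256) != nullptr) ++ok;
      }
      successes += ok;  // single atomic update per thread, not per allocation
    });
  }
  for (auto& w : workers) w.join();
  return successes.load();
}
```

The trade-off matches the one raised above: memory use grows with the number of threads, and this toy does not handle a buffer being allocated on one thread and released on another.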

To reproduce

Initialize a single Session with the following settings:

Then, call Session.Run from many threads concurrently.

Urgency

No response

Platform

Linux

OS Version

NixOS, Gentoo

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

pranavsharma commented 2 weeks ago

Lock contention in the BFC arena is a known issue. You have several options:

  1. Disable BFC arena altogether
  2. Use mimalloc instead of BFC arena (requires building ORT with mimalloc)
  3. Plug in your own allocator
  4. Disable BFC arena altogether and link with your own allocator
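For option 1, a minimal configuration sketch with the C++ API might look like the following. It assumes the public header onnxruntime_cxx_api.h and its Ort::SessionOptions::DisableCpuMemArena() call; the function name MakeSessionWithoutArena and the model path argument are placeholders for illustration, not part of the original report.

```cpp
#include <onnxruntime_cxx_api.h>

// Builds a session whose CPU execution provider does not use the BFC arena,
// so allocations go to the underlying allocator instead.
// `model_path` is a placeholder supplied by the caller.
Ort::Session MakeSessionWithoutArena(Ort::Env& env, const ORTCHAR_T* model_path) {
  Ort::SessionOptions options;
  options.DisableCpuMemArena();  // turn off the arena for CPU allocations
  return Ort::Session(env, model_path, options);
}
```

Whether this helps overall depends on how well the underlying allocator copes with the same concurrent allocation pattern, so it is worth profiling before and after the change.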