microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] How does onnxruntime run in parallel mode? #21259

Open zwyao opened 3 months ago

zwyao commented 3 months ago

Describe the issue

In ONNX Runtime v1.18.1, the session options can be set to ExecutionMode::ORT_PARALLEL, which suggests that ops will run in parallel. However, I can't find any multi-threaded executor in the source code; there is only a single sequential executor. Am I wrong? How does onnxruntime run in parallel mode?
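
For reference, this is the configuration I mean (a minimal sketch; the model path and thread counts are placeholders):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "parallel-demo");
  Ort::SessionOptions opts;
  // The option in question: request parallel execution mode.
  opts.SetExecutionMode(ExecutionMode::ORT_PARALLEL);
  // Threads used within a single op (intra-op) and across ops (inter-op).
  opts.SetIntraOpNumThreads(4);
  opts.SetInterOpNumThreads(4);
  // Hypothetical model path.
  Ort::Session session(env, ORT_TSTR("model.onnx"), opts);
  return 0;
}
```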

To reproduce

No steps.

Urgency

No response

Platform

Linux

OS Version

CentOS

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.18.1 master

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Other / Unknown

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Unknown

zwyao commented 3 months ago

The sequential executor generates one logic stream per provider type; ops in the same logic stream run one by one, while multiple logic streams can run in parallel.

zwyao commented 3 months ago

Is the sequential executor the only executor in onnxruntime?

xadupre commented 3 months ago

Every kernel also runs in parallel: a matrix multiplication, for example, is parallelized. The default is to run kernels in sequence, since every kernel tries to parallelize its computation on its own unless the user disables that behaviour.

pranavsharma commented 2 months ago

There is no need to configure anything. By default, the execution of an individual op is parallelized. Additionally, you can call Run concurrently in multiple threads safely.

zwyao commented 2 months ago

> There is no need to configure anything. By default, the execution of an individual op is parallelized. Additionally, you can call Run concurrently in multiple threads safely.

Thanks for your reply :) but I am confused by the concurrency mechanism. If I run an ONNX model on a host with only a CPU device, the kernels run one by one in topological order even if I set concurrency mode. So the only difference between ORT_SEQUENTIAL and ORT_PARALLEL is that ORT_PARALLEL mode can use intra-op threads within a kernel. Is this correct?

zwyao commented 2 months ago

> Every kernel also runs in parallel: a matrix multiplication, for example, is parallelized. The default is to run kernels in sequence, since every kernel tries to parallelize its computation on its own unless the user disables that behaviour.

Thanks for your reply :) Does "every kernel also runs in parallel" mean the kernel runs on the intra-op thread pool?

pranavsharma commented 2 months ago

> There is no need to configure anything. By default, the execution of an individual op is parallelized. Additionally, you can call Run concurrently in multiple threads safely.

> Thanks for your reply :) but I am confused by the concurrency mechanism. If I run an ONNX model on a host with only a CPU device, the kernels run one by one in topological order even if I set concurrency mode. So the only difference between ORT_SEQUENTIAL and ORT_PARALLEL is that ORT_PARALLEL mode can use intra-op threads within a kernel. Is this correct?

Topological order is the only way ORT executes the graph today. This is also called sequential execution mode, i.e., the kernels are executed sequentially. The execution of each kernel is parallelized using an intra-op thread pool. This, plus the fact that we allow concurrent execution of the model in multiple threads, is all there is to know about CPU concurrency in ORT. ORT_PARALLEL is deprecated.
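
A minimal sketch of that usage pattern (a single shared Ort::Session, Run called from several worker threads; the model path and the input/output names are hypothetical):

```cpp
#include <onnxruntime_cxx_api.h>
#include <array>
#include <cstdint>
#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "concurrent-demo");
  Ort::SessionOptions opts;
  opts.SetIntraOpNumThreads(2);  // each kernel is parallelized over this pool
  Ort::Session session(env, ORT_TSTR("model.onnx"), opts);  // hypothetical model

  auto worker = [&session]() {
    // Assumes a single-input/single-output float model with shape {1, 4}.
    std::array<float, 4> input{0.f, 1.f, 2.f, 3.f};
    std::array<int64_t, 2> shape{1, 4};
    Ort::MemoryInfo mem =
        Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value in = Ort::Value::CreateTensor<float>(
        mem, input.data(), input.size(), shape.data(), shape.size());
    const char* in_names[] = {"input"};    // hypothetical tensor name
    const char* out_names[] = {"output"};  // hypothetical tensor name
    // Run is thread-safe: many threads may share one Session.
    session.Run(Ort::RunOptions{nullptr}, in_names, &in, 1, out_names, 1);
  };

  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i) threads.emplace_back(worker);
  for (auto& t : threads) t.join();
  return 0;
}
```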

zwyao commented 2 months ago

> There is no need to configure anything. By default, the execution of an individual op is parallelized. Additionally, you can call Run concurrently in multiple threads safely.

> Thanks for your reply :) but I am confused by the concurrency mechanism. If I run an ONNX model on a host with only a CPU device, the kernels run one by one in topological order even if I set concurrency mode. So the only difference between ORT_SEQUENTIAL and ORT_PARALLEL is that ORT_PARALLEL mode can use intra-op threads within a kernel. Is this correct?

> Topological order is the only way ORT executes the graph today. This is also called sequential execution mode, i.e., the kernels are executed sequentially. The execution of each kernel is parallelized using an intra-op thread pool. This, plus the fact that we allow concurrent execution of the model in multiple threads, is all there is to know about CPU concurrency in ORT. ORT_PARALLEL is deprecated.

thanks a lot.

Hmm, one more question: how can I speed up inference on a CPU device? If I use the following configuration:

session_options.AddConfigEntry(kNodePartitionConfigFile, xxx)

and manually split the nodes into multiple logic streams (which is very hard), will this method improve inference performance?
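
For context, a sketch of how that entry is set (assuming kNodePartitionConfigFile comes from onnxruntime_session_options_config_keys.h; the partition file and model paths are hypothetical):

```cpp
#include <onnxruntime_cxx_api.h>
#include <onnxruntime_session_options_config_keys.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "partition-demo");
  Ort::SessionOptions opts;
  // Point ORT at a hand-written node-to-stream partition file (hypothetical path).
  opts.AddConfigEntry(kNodePartitionConfigFile, "partition_config.json");
  // Hypothetical model path.
  Ort::Session session(env, ORT_TSTR("model.onnx"), opts);
  return 0;
}
```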

pranavsharma commented 2 months ago

How to speed up inference on CPU device?

Please profile the model first to understand which op is taking the most time.
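
For example, ORT's built-in profiler can be enabled through the session options (a sketch; the file prefix and model path are placeholders). It writes a JSON trace with per-kernel timings that can be inspected in chrome://tracing:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "profile-demo");
  Ort::SessionOptions opts;
  // Emit a JSON trace file with per-kernel timings (hypothetical prefix).
  opts.EnableProfiling(ORT_TSTR("ort_profile"));
  Ort::Session session(env, ORT_TSTR("model.onnx"), opts);  // hypothetical model

  // ... run the model as usual ...

  // Finish profiling and retrieve the generated file name.
  Ort::AllocatorWithDefaultOptions allocator;
  auto profile_path = session.EndProfilingAllocated(allocator);
  return 0;
}
```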

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.