microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Documentation/Performance] Parallelize model execution by chunking batch dimension #17349

Open cbourjau opened 1 year ago

cbourjau commented 1 year ago

Describe the documentation issue

Some models have a "batch dimension" in their inputs, meaning that entries along that dimension are independent of each other. Models of this kind are good candidates for embarrassingly parallel execution: simply chunk the inputs along that dimension, execute each chunk in its own thread, and finally concatenate the outputs. A simple parallelization scheme of this kind can better utilize the available hardware in some use cases.
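For illustration, here is a minimal sketch of that approach, assuming a model file `model.onnx` with a single input whose first axis is the batch dimension (the path, input name, and chunk count are hypothetical placeholders):

```python
import numpy as np
import onnxruntime as ort
from concurrent.futures import ThreadPoolExecutor

# Hypothetical model; adjust path, input name, and dtype for your case.
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name

def run_chunk(chunk: np.ndarray) -> np.ndarray:
    # InferenceSession.run can be called concurrently from multiple threads.
    return session.run(None, {input_name: chunk})[0]

def run_batched(x: np.ndarray, n_chunks: int = 4) -> np.ndarray:
    # Split along the batch (first) axis, run each chunk in its own thread,
    # then concatenate the per-chunk outputs in the original order.
    chunks = np.array_split(x, n_chunks, axis=0)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        outputs = list(pool.map(run_chunk, chunks))
    return np.concatenate(outputs, axis=0)
```

Whether this beats onnxruntime's built-in intra-op parallelism depends on the model and hardware; combining both can oversubscribe the CPU, so the thread counts likely need to be tuned together.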

While onnxruntime has two prominent options for parallelization (intra_op_num_threads and inter_op_num_threads on the SessionOptions object), I did not find any documentation of batch-wise parallelization of this kind. It appears to me that an embarrassingly parallel approach would have significant advantages over the aforementioned options as I understand them. Did I miss a way to tell onnxruntime that a certain dimension is a batch dimension to be used for parallelization, or does that feature simply not exist?

Page / URL

No response

xadupre commented 1 year ago

The batch dimension is usually the first one, but not necessarily; the input may be transposed. Each kernel is able to parallelize its computation, but the strategy can differ based on the input dimensions, and it is not necessarily parallelized along the first dimension. There is no kernel-level parallelization if intra_op_num_threads == 1. Some of the relevant parameters are described at https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.SessionOptions.intra_op_num_threads.
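For reference, a rough sketch of how those two SessionOptions knobs are configured (based on the linked docs; the model path is a hypothetical placeholder, and my understanding is that inter_op_num_threads only takes effect when the parallel execution mode is enabled):

```python
import onnxruntime as ort

opts = ort.SessionOptions()
# Threads used inside a single operator (e.g. a large MatMul).
opts.intra_op_num_threads = 4
# Threads used to run independent graph nodes concurrently;
# this applies when the parallel execution mode is enabled.
opts.inter_op_num_threads = 2
opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL

# Hypothetical model path.
session = ort.InferenceSession("model.onnx", sess_options=opts)
```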