microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Documentation/Performance] Parallelize model execution by chunking batch dimension #17349

Open cbourjau opened 1 year ago

cbourjau commented 1 year ago

Describe the documentation issue

Some models have a "batch dimension" in their inputs, meaning that entries along that dimension are independent of each other. Models of this kind are good candidates for embarrassingly parallel execution: simply chunk the inputs along that dimension, execute each chunk in its own thread, and finally concatenate the outputs. A simple parallelization scheme of this kind can better utilize the available hardware in some use cases.
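For illustration, here is a minimal sketch of that approach, assuming a model file `model.onnx` with a single input whose first axis is the batch dimension (the path, input name, and chunk count are hypothetical placeholders):

```python
import numpy as np
import onnxruntime as ort
from concurrent.futures import ThreadPoolExecutor

# Hypothetical model; adjust path, input name, and dtype for your case.
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name

def run_chunk(chunk: np.ndarray) -> np.ndarray:
    # InferenceSession.run can be called concurrently from multiple threads.
    return session.run(None, {input_name: chunk})[0]

def run_batched(x: np.ndarray, n_chunks: int = 4) -> np.ndarray:
    # Split along the batch (first) axis, run each chunk in its own thread,
    # then concatenate the per-chunk outputs in the original order.
    chunks = np.array_split(x, n_chunks, axis=0)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        outputs = list(pool.map(run_chunk, chunks))
    return np.concatenate(outputs, axis=0)
```

Whether this beats onnxruntime's built-in intra-op parallelism depends on the model and hardware; combining both can oversubscribe the CPU, so the thread counts likely need to be tuned together.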

While onnxruntime has two prominent options for parallelization (intra_op_num_threads and inter_op_num_threads on the SessionOptions object), I did not find any documentation of batch-wise parallelization of this kind. It appears to me that an embarrassingly parallel approach would have significant advantages over the aforementioned options as I understand them. Did I miss a way to tell onnxruntime that a certain dimension is a batch dimension to be used for parallelization, or does that feature simply not exist?

Page / URL

No response

xadupre commented 1 year ago

The batch dimension is usually the first one, but not necessarily; the input may be transposed. Each kernel is able to parallelize its computation, but the strategy can differ based on the input dimensions, and it is not necessarily parallelized along the first dimension. There is no kernel-level parallelization if intra_op_num_threads == 1. Some of the relevant parameters are described at https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.SessionOptions.intra_op_num_threads.
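For reference, a rough sketch of how those two SessionOptions knobs are configured (based on the linked docs; the model path is a hypothetical placeholder, and my understanding is that inter_op_num_threads only takes effect when the parallel execution mode is enabled):

```python
import onnxruntime as ort

opts = ort.SessionOptions()
# Threads used inside a single operator (e.g. a large MatMul).
opts.intra_op_num_threads = 4
# Threads used to run independent graph nodes concurrently;
# this applies when the parallel execution mode is enabled.
opts.inter_op_num_threads = 2
opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL

# Hypothetical model path.
session = ort.InferenceSession("model.onnx", sess_options=opts)
```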