microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Performance reduction due to copying of output OrtValues to numpy arrays #11099

Open vvchernov opened 2 years ago

vvchernov commented 2 years ago

I tested a hand-made onnx-model with multiple inputs and outputs and observed that performance is reduced by the copying of output OrtValues to numpy arrays when the InferenceSession "run" method is used. I used the TVM EP, but the issue does not depend on the execution provider. In my view this is excess copying that can be skipped in common cases. I considered two approaches:

1. Construct numpy arrays after inference on top of the memory already allocated for the output ORT tensors (PyArray_SimpleNewFromData can be used instead of PyArray_SimpleNew).
2. Create numpy arrays with fixed output shapes before inference and construct the output ORT tensors on top of the memory preallocated in those numpy arrays.
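For context, a user-level analogue of the second approach already exists in the Python API through IOBinding, where outputs are written directly into preallocated numpy buffers instead of being copied inside `run`. A minimal sketch, assuming a CPU session; the model path, tensor names, shapes and dtype are placeholders:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")  # placeholder model path

# Preallocate input and output buffers with known (static) shapes.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
y = np.empty((1, 1000), dtype=np.float32)               # assumed output shape

binding = sess.io_binding()
binding.bind_ortvalue_input("input", ort.OrtValue.ortvalue_from_numpy(x))
# Bind the output to memory preallocated by the numpy array, so results do not
# need to be converted into a fresh numpy array after the run.
binding.bind_ortvalue_output("output", ort.OrtValue.ortvalue_from_numpy(y))

sess.run_with_iobinding(binding)
# y now holds the inference results.
```

This only works when the output shapes are known up front, which is the same limitation discussed below for the second approach.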

The first approach requires only small changes in the current code, but it leads to a potential problem with memory ownership: if the InferenceSession is released, its output ORT tensors are released too, and the numpy arrays would then point to freed memory. It should also be noted that this approach does not work for sparse tensors; there the copying is still needed. My questions to the community: can I manage memory ownership for OrtValue and OrtTensor, and if so, how? And how safe is such management?

The second approach solves the ownership problem but has its own issues. It works (I checked it) for an onnx-model with a fixed architecture where all shapes are known. Obviously it does not work for a dynamic onnx-model where some tensor shapes, in particular output tensor shapes, depend on the input tensor data; in that case the current approach with copying should be kept. The most common case in practice is a dynamic onnx-model where only some dimensions are unknown (e.g. the batch size). On the one hand, the output shapes in this case are not known before inference, but they are needed to allocate the numpy arrays correctly. On the other hand, the input shapes are known before inference, so with a shape inference mechanism we could derive the output shapes.

Another issue is that in the general case the InferenceSession "run" method returns a list of python objects, not only numpy arrays, which means we need to know the python object types before inference. I'm not sure, but it looks like the types from the outputs of InferenceSession::GetModelOutputs could be used for this (see the sketch below).

My questions to the community: the main one is whether output shapes can be calculated in ORT for given input shapes; in particular, does ORT have a shape inference method? I found Graph::UpdateShapeInference(node), but it does not change the outputs from InferenceSession::GetModelOutputs, although they are bound to the graph outputs. Can we use the outputs from InferenceSession::GetModelOutputs to check types and construct the correct vector of python objects before inference?
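On the Python side, the metadata exposed by GetModelOutputs is available through `InferenceSession.get_outputs()`. A small sketch of preallocating output buffers from it when all dimensions are static; the model path and the dtype mapping are assumptions for illustration, and the mapping is not exhaustive:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")  # placeholder model path

# Assumed mapping from ONNX tensor type strings to numpy dtypes (incomplete).
DTYPE_MAP = {
    "tensor(float)": np.float32,
    "tensor(int64)": np.int64,
}

outputs = {}
for out in sess.get_outputs():
    # out.shape may contain None or symbolic names (e.g. "batch") for dynamic
    # dimensions; preallocation only works when every dimension is a concrete int.
    if all(isinstance(d, int) for d in out.shape):
        outputs[out.name] = np.empty(out.shape, dtype=DTYPE_MAP[out.type])
    else:
        print(f"{out.name}: dynamic shape {out.shape}, cannot preallocate")
```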

Urgency There are no special deadlines.

System information There are no special system requirements. I performed tests on Linux (Ubuntu 18.04) using the ONNX Runtime python API.

To Reproduce To observe the performance reduction any onnx-model can be used, but the effect is more pronounced for onnx-models with multiple outputs. To evaluate the performance reduction for a specific model, it is enough to measure the time spent in the output-copying loop (with std::chrono, for example) and compare it with the full inference time.
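As a user-level approximation of that measurement (rather than timing the internal copy loop with std::chrono), one can compare `run()` against `run_with_iobinding()` with preallocated output buffers, as in the sketch above. A rough sketch; the model path, tensor names and shapes are placeholders:

```python
import time

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")               # placeholder model path
x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # assumed input shape
N = 100

# Baseline: run() converts every output OrtValue into a freshly allocated numpy array.
t0 = time.perf_counter()
for _ in range(N):
    sess.run(None, {"input": x})                         # assumed input name
t_run = (time.perf_counter() - t0) / N

# IOBinding: outputs are written into buffers we preallocated, avoiding the
# per-call conversion of output OrtValues to new numpy arrays.
y = np.empty((1, 1000), dtype=np.float32)                # assumed output shape
binding = sess.io_binding()
binding.bind_ortvalue_input("input", ort.OrtValue.ortvalue_from_numpy(x))
binding.bind_ortvalue_output("output", ort.OrtValue.ortvalue_from_numpy(y))

t0 = time.perf_counter()
for _ in range(N):
    sess.run_with_iobinding(binding)
t_bind = (time.perf_counter() - t0) / N

print(f"run(): {t_run * 1e3:.3f} ms/iter, "
      f"run_with_iobinding(): {t_bind * 1e3:.3f} ms/iter")
```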

yuslepukhin commented 2 years ago

It is possible to have a numpy array pointing to the original native buffer. In fact, this is already done in the sparse tensors module; however, there the memory is not owned by the numpy array. Transferring ownership can be done with the py::capsule approach, which is much more convenient than working with the Numpy C API directly.

vvchernov commented 2 years ago

Hello @yuslepukhin! Thank you for the good tip! I'm trying to implement and check it.