
[Performance] Get float from Tensor<float> is too slow #15178

Open · youxin1996 opened this issue 1 year ago

youxin1996 commented 1 year ago

Describe the issue

I'm using the NuGet package Microsoft.ML.OnnxRuntime to run inference on a YOLOv7 model from C# (.NET Framework 4.8).

After session.Run I get a Tensor<float> as the result and need to do some post-processing that iterates over it. Reading elements through the [] indexer is too slow: I have 1,867,320 floats to traverse, and it takes almost 300 ms on an i7-10700 CPU. I suspect the indexer's get/set methods are the bottleneck. Can I just get the float* of the tensor's buffer and use an unsafe code block to speed this up? (A sketch of what I have in mind follows the snippets below.)

To reproduce

C# code:

Stopwatch sw = new Stopwatch();
foreach (DisposableNamedOnnxValue res in output)
{
    sw.Start();
    Tensor<float> f_data = res.AsTensor<float>();
    // output dimensions: {1, 143640, 13}
    for (int i = 0; i < 143640; i++)
    {
        for (int j = 0; j < 13; j++)
        {
            float a = f_data[0, i, j];   // element read through the multi-dimensional indexer
        }
    }
    sw.Stop(); // costs 300-310 ms
}

C++ code:

#include <vector>
#include <onnxruntime_cxx_api.h>

void PostProcess(std::vector<Ort::Value>& ort_outputs) {
    const float* pdata = ort_outputs[0].GetTensorMutableData<float>();
    for (int i = 0; i < 143640 * 13; i++) {
        float v = pdata[i];   // raw pointer read over the flat buffer
    } // costs 2-4 ms
}
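
For reference, this is roughly the unsafe access I have in mind. It is only a sketch: it assumes the output tensor is actually a DenseTensor<float> (whose backing Memory<float> is exposed through its Buffer property, in the Microsoft.ML.OnnxRuntime.Tensors namespace), that MemoryHandle from System.Buffers is available, and that unsafe code is enabled for the project.

// Sketch only: assumes res.AsTensor<float>() is backed by a DenseTensor<float>
// and that "Allow unsafe code" is enabled for the project.
var dense = (DenseTensor<float>)res.AsTensor<float>();
unsafe
{
    using (MemoryHandle handle = dense.Buffer.Pin())   // pin the buffer and get a raw pointer
    {
        float* pdata = (float*)handle.Pointer;
        for (int i = 0; i < 143640 * 13; i++)          // {1, 143640, 13} flattened
        {
            float v = pdata[i];
        }
    }
}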

Urgency

No response

Platform

Windows

OS Version

10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.14.1

ONNX Runtime API

C#

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

wangyems commented 1 year ago

@yuslepukhin do you have any insight?

yuslepukhin commented 1 year ago

I do not have insights at this point. I will take a look at it as soon as I can.

yuslepukhin commented 1 year ago

Would it be possible to get the real data that you use internally?

yuslepukhin commented 1 year ago

Would you rather get direct access to the buffer via Memory? You can also access the data with 1-D flat indexing by simply computing the flat offset from the 3-D indices.
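
For illustration, a minimal sketch of both suggestions, assuming the output shape {1, 143640, 13} from the original post and that the result is backed by a DenseTensor<float> (Microsoft.ML.OnnxRuntime.Tensors), whose Buffer property exposes the underlying Memory<float>:

// Sketch only: assumes the tensor behind res is a DenseTensor<float>.
var dense = (DenseTensor<float>)res.AsTensor<float>();
ReadOnlySpan<float> data = dense.Buffer.Span;   // flat, zero-copy view over the tensor buffer

// 1-D flat indexing: for shape {1, 143640, 13}, element [0, i, j] lives at offset i * 13 + j.
for (int i = 0; i < 143640; i++)
{
    int row = i * 13;
    for (int j = 0; j < 13; j++)
    {
        float a = data[row + j];
    }
}

Reading through the Span should avoid the per-element shape/stride arithmetic that the multi-dimensional Tensor<T> indexer performs on every access.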