microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Mobile][Kotlin] OnnxTensor.createTensor from floatBuffer takes up 7 seconds #16937

Open · octavflorescu opened this issue 1 year ago

octavflorescu commented 1 year ago

Describe the issue

I am collecting images from the camera, converting them to an OpenCV Mat, resizing them to the model's input format (640x640x3), and then creating a tensor for prediction.

OnnxTensor.createTensor(
            ortEnvironment,
            floatBuffer,
            shape
        )

takes an impressive 7 seconds to complete. It seems absurd.
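A minimal sketch of how the delay can be measured (assuming the ortEnvironment, floatBuffer, and shape values from the snippet above; measureTimeMillis is in the Kotlin standard library):

import android.util.Log
import kotlin.system.measureTimeMillis

// Time just the tensor-creation call.
var tensor: OnnxTensor? = null
val elapsedMs = measureTimeMillis {
    tensor = OnnxTensor.createTensor(ortEnvironment, floatBuffer, shape)
}
Log.d("TensorTiming", "createTensor took $elapsedMs ms") // ~7000 ms reported here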

To reproduce

fun FloatBufferFromMat(mat: Mat): FloatBuffer {
    // One float per element: width * height * channels.
    val floatArray = FloatArray(mat.width() * mat.height() * mat.channels())
    // Convert the Mat to 32-bit float in place, then copy the data out.
    mat.convertTo(mat, CvType.CV_32F)
    mat.get(0, 0, floatArray)
    // Note: FloatBuffer.wrap returns a heap-backed (non-direct) buffer.
    return FloatBuffer.wrap(floatArray)
}

fun getOnnxTensorFromMat(mat: Mat): OnnxTensor {
    val floatBuffer = FloatBufferFromMat(mat)
    // NCHW shape: batch, channels, height, width.
    val shape = longArrayOf(
        1, mat.channels().toLong(),
        mat.height().toLong(),
        mat.width().toLong()
    )
    return OnnxTensor.createTensor(
        ortEnvironment,
        floatBuffer,
        shape
    ) // <<<<<<<<<<<<<<<<<<<<<<<<<<< this line takes approx 7 seconds
}

Urgency

No response

Platform

Android

OS Version

12

ONNX Runtime Installation

Released Package

Compiler Version (if 'Built from Source')

No response

Package Name (if 'Released Package')

onnxruntime-android

ONNX Runtime Version or Commit ID

latest.release

ONNX Runtime API

Java/Kotlin

Architecture

Other / Unknown

Execution Provider

Default CPU

Execution Provider Library Version

No response

octavflorescu commented 1 year ago

Note: creating the tensor from a ByteBuffer (converted from the FloatBuffer) is ~instant, but I need the float option due to the shape check:
ai.onnxruntime.OrtException: Shape [1, 3, 640, 640], requires 1228800 elements but the buffer has 4915200 elements.
(4915200 is exactly 4 x 1228800, i.e. the check is counting bytes rather than floats.) If there is an alternative, please do tell...

This is the snippet I used to convert the FloatBuffer to a ByteBuffer:

        // Copy the heap FloatBuffer into a direct, native-ordered ByteBuffer.
        val localByteBuffer = ByteBuffer.allocateDirect(floatBuffer.remaining() * 4)
        localByteBuffer.order(ByteOrder.nativeOrder())
        floatBuffer.mark()
        localByteBuffer.asFloatBuffer().put(floatBuffer)
        floatBuffer.reset()
        localByteBuffer.rewind()
        return OnnxTensor.createTensor(
            ortEnvironment,
            localByteBuffer,
            shape,
            OnnxJavaType.FLOAT
        )
Craigacp commented 1 year ago

You can use the asFloatBuffer version of the direct byte buffer like so:

        // Allocate a direct byte buffer, then work through its FloatBuffer view
        // so ORT sees a float element count rather than a byte count.
        val localByteBuffer = ByteBuffer.allocateDirect(floatBuffer.remaining() * 4)
        localByteBuffer.order(ByteOrder.nativeOrder())
        floatBuffer.mark()
        val directFloatBuffer = localByteBuffer.asFloatBuffer()
        directFloatBuffer.put(floatBuffer)
        floatBuffer.reset()
        directFloatBuffer.rewind()
        return OnnxTensor.createTensor(
            ortEnvironment,
            directFloatBuffer,
            shape
        )

On a separate note, the shape-checking logic for incoming buffers is currently being improved; the fix may have landed in the FP16 PR.

Craigacp commented 1 year ago

With the recent merges of the Java FP16 logic, the byte-buffer creation path now checks whether the number of elements lines up with the supplied type rather than treating every buffer as raw bytes, so your code snippet probably works on main. The one I posted will work on both main and older versions of ORT.
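For illustration, a minimal sketch of the typed byte-buffer path that the new check validates (assuming a build of main where the FP16 changes have landed; ortEnvironment is the same environment as in the snippets above):

import java.nio.ByteBuffer
import java.nio.ByteOrder

val shape = longArrayOf(1, 3, 640, 640)
val numFloats = 1 * 3 * 640 * 640
// Size the direct buffer in bytes: 4 bytes per float.
val byteBuffer = ByteBuffer.allocateDirect(numFloats * 4)
    .order(ByteOrder.nativeOrder())
// ... fill byteBuffer with the image data ...
// The creation path now compares the element count for the supplied type
// (4915200 bytes / 4 = 1228800 floats) against the shape, so this passes.
val tensor = OnnxTensor.createTensor(ortEnvironment, byteBuffer, shape, OnnxJavaType.FLOAT)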

msomething123 commented 2 months ago

Hello, I'm having the exact same issue: with an input of 1792x2976, the createTensor method takes over 30 seconds to execute.

msomething123 commented 2 months ago

Unfortunately @Craigacp's answer didn't work; it's still very slow even after converting to a ByteBuffer.

Craigacp commented 2 months ago

How are you calling createTensor and where is the input?

msomething123 commented 2 months ago

I'm calling it using the OnnxTensor class.

val paddedBuffer = getPaddedBuffer()
val paddedTensor = OnnxTensor.createTensor(
                            ortEnv,
                            paddedBuffer,
                            longArrayOf(
                                1L,
                                3L,
                                paddedBitmapHeight.toLong(),
                                paddedBitmapWidth.toLong(),
                            )
                        )

paddedBuffer being a FloatBuffer.

paddedBitmapHeight = 1792
paddedBitmapWidth = 2976

About the getPaddedBuffer method:

fun getPaddedBuffer(imgInput: Mat): FloatBuffer {
    val chanBlue = Mat()
    val chanGreen = Mat()
    val chanRed = Mat()

    // Split the interleaved image into separate channel planes.
    Core.extractChannel(imgInput, chanBlue, 2)
    Core.extractChannel(imgInput, chanGreen, 1)
    Core.extractChannel(imgInput, chanRed, 0)

    // Copy each plane into its own heap array.
    val arrayB = FloatArray(imgInput.rows() * imgInput.cols())
    val arrayG = FloatArray(imgInput.rows() * imgInput.cols())
    val arrayR = FloatArray(imgInput.rows() * imgInput.cols())
    chanBlue.get(0, 0, arrayB)
    chanGreen.get(0, 0, arrayG)
    chanRed.get(0, 0, arrayR)

    // The + concatenation allocates yet another array, and wrap()
    // returns a heap-backed (non-direct) FloatBuffer.
    return FloatBuffer.wrap(arrayB + arrayG + arrayR)
}

I tried reducing the input size (number of elements), but the delay persists even with ~1000x1000 images.

Craigacp commented 2 months ago

FloatBuffer.wrap puts the data into a heap-backed float buffer, but it isn't a direct one, so we need to copy it again inside ORT. I recommend making the (direct) float buffer before the getPaddedBuffer call and then writing the values into it directly by calling floatBuffer.put(arrayB) etc., as in the sketch below.
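For concreteness, a sketch of that suggestion applied to getPaddedBuffer. The names makeDirectFloatBuffer and getPaddedBufferDirect are illustrative, not from this thread, and imgInput is assumed to already be a CV_32FC3 Mat:

import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.FloatBuffer
import org.opencv.core.Core
import org.opencv.core.Mat

// A direct, native-ordered FloatBuffer that ORT can use without
// another copy on the native side.
fun makeDirectFloatBuffer(numFloats: Int): FloatBuffer =
    ByteBuffer.allocateDirect(numFloats * 4)
        .order(ByteOrder.nativeOrder())
        .asFloatBuffer()

fun getPaddedBufferDirect(imgInput: Mat): FloatBuffer {
    val pixels = imgInput.rows() * imgInput.cols()
    val buffer = makeDirectFloatBuffer(3 * pixels)

    val channel = Mat()
    val channelArray = FloatArray(pixels)
    // Write the B, G, R planes straight into the direct buffer,
    // skipping the array concatenation and FloatBuffer.wrap entirely.
    for (c in intArrayOf(2, 1, 0)) {
        Core.extractChannel(imgInput, channel, c)
        channel.get(0, 0, channelArray)
        buffer.put(channelArray)
    }
    buffer.rewind()
    return buffer
}

Because the returned buffer is direct, OnnxTensor.createTensor can hand it to the native layer without the extra copy that made the heap-backed version slow.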

msomething123 commented 2 months ago

You were totally right: using a direct buffer did the trick. The tensor is now created instantly. Thanks a lot for your help!