ivilson / Yolov7net

Yolo Detector for .Net 8

Utils.ExtractPixels is very slow #17

Closed zgabi closed 1 year ago

zgabi commented 1 year ago

Utils.ExtractPixels is very slow: on my machine it takes 300-500ms. The nested Parallel loops are unnecessary and only make the function slower. If I remove the Parallel loops, it takes 70ms, which is still quite a lot. (The Tensor indexer is very slow; use tensor.Buffer instead.)

Your code already assumes that the bitmapData is ARGB, 4 bytes per pixel, so using the Stride is unnecessary, since (from the documentation):

The stride is the width of a single row of pixels (a scan line), rounded up to a four-byte boundary.

And in this case the width of a single row in bytes is always a multiple of 4, so there is no padding at the end of a row.
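
To make the stride argument concrete, here is a minimal sketch of the rounding rule quoted above. ComputeStride is a hypothetical helper written for this illustration, not part of Yolov7net or System.Drawing, and it assumes a top-down bitmap that is locked over its full width.

```csharp
using System;

public static class StrideSketch
{
    // Hypothetical helper: bytes per scan line, rounded up to a four-byte boundary.
    public static int ComputeStride(int width, int bitsPerPixel)
    {
        int bitsPerRow = width * bitsPerPixel;
        return (bitsPerRow + 31) / 32 * 4;
    }

    public static void Main()
    {
        // 32 bits per pixel: every row is already a multiple of 4 bytes, so no padding.
        Console.WriteLine(ComputeStride(641, 32) == 641 * 4); // True
        // 24 bits per pixel: 641 * 3 = 1923 bytes, padded up to 1924.
        Console.WriteLine(ComputeStride(641, 24));            // 1924
    }
}
```

With 4 bytes per pixel the stride equals width * 4 for any width, which is why the loop can walk the buffer as one contiguous run. For other pixel formats the padding has to be skipped at the end of each row.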

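I rewrote the function; this version takes only 3ms and is not "unsafe" code:

    int pixelCount = width * height;
    // The tensor buffer is planar: all R values, then all G, then all B.
    var spanR = tensor.Buffer.Span;
    var spanG = spanR.Slice(pixelCount);
    var spanB = spanG.Slice(pixelCount);

    int sidx = 0; // index into the source bytes (4 per pixel)
    int didx = 0; // index into each destination plane (1 per pixel)
    for (int i = 0; i < pixelCount; i++)
    {
        // The source bytes are ordered B, G, R, A in memory.
        spanR[didx] = data[sidx + 2] / 255.0F;
        spanG[didx] = data[sidx + 1] / 255.0F;
        spanB[didx] = data[sidx] / 255.0F;
        didx++;
        sidx += 4;
    }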

Maybe you can make it even faster by using unsafe code.

This is just an idea of how you could make it faster. If you expect larger model inputs in the future (like 8K * 8K), you can keep the outer parallel loop or make "my" single loop parallel, but a nested parallel loop is overkill. And for 640x640 pixels the parallel loop is unnecessary.
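
For the large-input case mentioned above, a single parallel loop over rows might look like the following sketch. It is an illustration only: ParallelExtractSketch and ToPlanarRgb are hypothetical names, the input is assumed to be tightly packed BGRA bytes (stride == width * 4), and plain arrays are used because a Span cannot be captured by a lambda.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelExtractSketch
{
    // bgra: interleaved B, G, R, A bytes with no row padding (an assumption).
    // Returns planar floats: all R values, then all G, then all B.
    public static float[] ToPlanarRgb(byte[] bgra, int width, int height)
    {
        int pixelCount = width * height;
        var planar = new float[3 * pixelCount];

        // One parallel loop over rows; the loop over columns stays sequential.
        Parallel.For(0, height, y =>
        {
            int sidx = y * width * 4; // first source byte of this row
            int didx = y * width;     // first destination index of this row
            for (int x = 0; x < width; x++, sidx += 4, didx++)
            {
                planar[didx] = bgra[sidx + 2] / 255.0F;                  // R plane
                planar[pixelCount + didx] = bgra[sidx + 1] / 255.0F;     // G plane
                planar[2 * pixelCount + didx] = bgra[sidx] / 255.0F;     // B plane
            }
        });
        return planar;
    }
}
```

Each row writes to its own disjoint range of the output, so no locking is needed. Whether this beats the sequential loop depends on image size and core count, so it is worth measuring before adopting it.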

virgilKing commented 1 year ago

Where does "data" come from? Could you share your complete Utils.ExtractPixels code? Thanks.

zgabi commented 1 year ago

data is a span over the memory at bitmapData.Scan0:

        Span<byte> data;
        unsafe
        {
            data = new Span<byte>((void*)bitmapData.Scan0, bitmapData.Height * bitmapData.Stride);
        }
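
If an unsafe block is not wanted at all, one possible alternative is to copy the locked pixels into a managed array with Marshal.Copy. The sketch below is an assumption-laden illustration, not code from the project: SafeCopySketch and CopyFromPointer are hypothetical names, and unmanaged memory from Marshal.AllocHGlobal stands in for bitmapData.Scan0 so the example runs without System.Drawing.

```csharp
using System;
using System.Runtime.InteropServices;

public static class SafeCopySketch
{
    // Copies byteCount bytes from an unmanaged pointer (such as bitmapData.Scan0)
    // into a managed array, without an unsafe block.
    public static byte[] CopyFromPointer(IntPtr scan0, int byteCount)
    {
        var managed = new byte[byteCount];
        Marshal.Copy(scan0, managed, 0, byteCount);
        return managed;
    }

    public static void Main()
    {
        // Stand-in for a locked bitmap: two BGRA pixels in unmanaged memory.
        byte[] source = { 10, 20, 30, 255, 40, 50, 60, 255 };
        IntPtr fakeScan0 = Marshal.AllocHGlobal(source.Length);
        try
        {
            Marshal.Copy(source, 0, fakeScan0, source.Length);
            byte[] data = CopyFromPointer(fakeScan0, source.Length);
            Console.WriteLine(string.Join(",", data));
        }
        finally
        {
            Marshal.FreeHGlobal(fakeScan0);
        }
    }
}
```

The trade-off is one extra copy of the whole image, which the Span over Scan0 avoids, so this is likely slower than the unsafe version.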

virgilKing commented 1 year ago

And how are spanR, spanG and spanB turned into a DenseTensor?

zgabi commented 1 year ago

This should be the full ExtractPixels method:

        public static Tensor<float> ExtractPixels2(Bitmap bitmap)
        {
            var rectangle = new Rectangle(0, 0, bitmap.Width, bitmap.Height);
            BitmapData bitmapData = bitmap.LockBits(rectangle, ImageLockMode.ReadOnly, PixelFormat.Format32bppPArgb);

            var tensor = new DenseTensor<float>(new[] { 1, 3, bitmap.Height, bitmap.Width });

            Span<byte> data;
            unsafe
            {
                data = new Span<byte>((void*)bitmapData.Scan0, bitmapData.Height * bitmapData.Stride);
            }

            int pixelCount = bitmap.Width * bitmap.Height;
            var spanR = tensor.Buffer.Span;
            var spanG = spanR.Slice(pixelCount);
            var spanB = spanG.Slice(pixelCount);

            int sidx = 0;
            int didx = 0;
            for (int i = 0; i < pixelCount; i++)
            {
                spanR[didx] = data[sidx + 2] / 255.0F;
                spanG[didx] = data[sidx + 1] / 255.0F;
                spanB[didx] = data[sidx] / 255.0F;
                didx++;
                sidx += 4;
            }

            bitmap.UnlockBits(bitmapData);

            return tensor;
        }

A Tensor is just an N-dimensional array. In your case it is 4-dimensional: new DenseTensor<float>(new[] { 1, 3, bitmap.Height, bitmap.Width }); where the 1st dimension has only 1 value, the 2nd has 3 (R, G, B), the 3rd is the height and the 4th is the width. So internally it is only a float[1 * 3 * width * height] array. The spans are not converted into a tensor: they are views over the tensor's own buffer, so writing to them fills the tensor directly.

So in memory it contains RRRRRRR......(count: width * height) GGGGGG......(count: width * height) BBBBBBB......(count: width * height) values (where each R, G, B is a float)
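
The layout described above can be written as an index formula. This is a small sketch assuming the default row-major layout of a { 1, 3, height, width } tensor; PlanarLayoutSketch and Offset are hypothetical names used only for illustration.

```csharp
using System;

public static class PlanarLayoutSketch
{
    // Flat offset of element [0, channel, y, x] in a row-major
    // { 1, 3, height, width } tensor (channel 0 = R, 1 = G, 2 = B).
    public static int Offset(int channel, int y, int x, int height, int width)
        => (channel * height + y) * width + x;

    public static void Main()
    {
        const int h = 640, w = 640;
        Console.WriteLine(Offset(0, 0, 0, h, w));         // 0: start of the R plane
        Console.WriteLine(Offset(1, 0, 0, h, w));         // 409600: start of the G plane
        Console.WriteLine(Offset(2, 0, 0, h, w));         // 819200: start of the B plane
        Console.WriteLine(Offset(2, h - 1, w - 1, h, w)); // 1228799: last element
    }
}
```

The plane starts at 0, width * height and 2 * width * height are exactly where spanR, spanG and spanB begin in the method above.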

iwaitu commented 1 year ago

Good job. However, compared to not using numpy, the performance is still a bit worse, but I still like your modification and will merge it into the project right away.

AVISIX commented 1 year ago

I love you. I was searching for a fix for this for HOURS!