hzeller / rpi-rgb-led-matrix

Controlling up to three chains of 64x64, 32x32, 16x32 or similar RGB LED displays using Raspberry Pi GPIO
GNU General Public License v2.0

C# SetPixel() managed -> unmanaged timing #966

Open wolfi-by opened 4 years ago

wolfi-by commented 4 years ago

I use the lib to build a sports display. To control everything I use .NET Core on an RPi4 with C# as the programming language. The display is controlled from a Blazor web page based on ASP.NET Core (see blazor.net). Everything works fine so far, but I found out that the SetPixel function takes quite a long time.

Current data: I use four 64x64 matrices, two on each of 2 parallel channels. In .NET I use a Bitmap to prepare the image to display in memory. Building up the image currently takes 0.6 milliseconds, but then transferring it to the library takes about 20 to 25 milliseconds. I found out that the colors are prepared for the matrix inside the function in transformer.cc. I think that's the bottleneck. Just for testing I set up a display with 10 matrices on each of 3 lines, and the image copy takes more than 130 milliseconds each time.

So my question is: how must the data be prepared to save time? From my System.Drawing.Bitmap object (https://docs.microsoft.com/en-us/dotnet/api/system.drawing.bitmap.lockbits?view=netframework-4.8#System_Drawing_Bitmap_LockBits_System_Drawing_Rectangle_System_Drawing_Imaging_ImageLockMode_System_Drawing_Imaging_PixelFormat_) I get a byte array in 24bpp format in BGR color order.

Either way, once I have the byte array I have to use the SetPixel function to draw it to the matrix.

I had two ideas to solve the timing problem:

  1. Prepare the data inside the dotnet app before setting a pixel. After that, transfer an array of data to the canvas.

  2. Transfer the raw BGR color array to the canvas and make a new function to prepare the data.

I think the second idea would be better because then the work happens inside the library, which is held on a single processor core, so the problem of interruptions from dotnet goes away. It would also make it easier later on to transfer video data from a webcam to a canvas.

So first I forked the lib to try to add such a function. The problem is that I don't know exactly how the pixel bits have to be set so that the color information ends up in the correct order and place.

Is information available somewhere, or can someone tell me how to convert raw image data into canvas image data?

Sorry for my bad English. I hope it was understandable so far. Greetings from Germany

Wolfgang

hzeller commented 4 years ago

I suspect that the problem with using it from the C# binding is that it is fairly slow to cross the managed->unmanaged code boundary with a call to SetPixel() for every pixel. There was a similar problem in Python, and the solution was essentially to have a function SetImage() that took the language-specific (Python) object and called SetPixel() in the Cython binding by directly reading the Python Pillow representation.

So I suspect you need to do something similar: have a C/C++ function that gets a full C# image buffer passed in and processes that data, calling SetPixel() natively in a loop (don't worry about trying to copy a full framebuffer at once, see below). Then you only have to add the language binding of your new function to C#. The suspected slowness is the managed->unmanaged transition for each function call.

That sounds like your second option: make a new function, that takes the specific data you have, then fill the Canvas on the C-side.

Initially, I'd keep the necessary code in the C# binding code, not the core lib. That allows you to make it very specific to the array that comes from C# (similar to the Python binding). When you're done, send a pull request.
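The native loop suggested above could look roughly like the following sketch. The `Canvas` struct here is only a stand-in so the example compiles on its own, and `SetImage24` with its parameters is a hypothetical name, not an existing function of the library. The `stride` parameter is an assumption based on how locked Bitmap data is usually laid out (rows may be padded beyond `width * 3` bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for the library's canvas interface so this sketch is self-contained.
struct Canvas {
  virtual ~Canvas() {}
  virtual int width() const = 0;
  virtual int height() const = 0;
  virtual void SetPixel(int x, int y, uint8_t r, uint8_t g, uint8_t b) = 0;
};

// Hypothetical helper: copy a packed 24bpp image to the canvas in one native
// call. 'stride' is the number of bytes per image row (>= width * 3).
// 'is_bgr' selects BGR byte order (as delivered by the C# Bitmap) or RGB.
void SetImage24(Canvas *canvas, const uint8_t *data, int width, int height,
                int stride, bool is_bgr) {
  assert(canvas != nullptr && data != nullptr);
  assert(stride >= width * 3);
  // Clip to the canvas so a larger image does not address pixels outside it.
  const int w = width < canvas->width() ? width : canvas->width();
  const int h = height < canvas->height() ? height : canvas->height();
  for (int y = 0; y < h; ++y) {
    const uint8_t *p = data + static_cast<size_t>(y) * stride;
    for (int x = 0; x < w; ++x, p += 3) {
      if (is_bgr) {
        canvas->SetPixel(x, y, p[2], p[1], p[0]);
      } else {
        canvas->SetPixel(x, y, p[0], p[1], p[2]);
      }
    }
  }
}
```

With something like this, the managed side would cross the managed->unmanaged boundary once per frame instead of once per pixel; the C export and the C# declaration for it are left out here.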

Random

Speed limit

In general, SetPixel() is fairly slow (about 3.5 Megapixels/second in C++ on a Pi3) because of the particular way the data needs to be prepared; but in your case that would mean that it still takes less than 5ms per frame (faster on a Pi4 of course).
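The estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and the only input is the roughly 3.5 Megapixels/second figure quoted above for a Pi3.

```cpp
// Rough frame-time estimate from the SetPixel() throughput quoted above.
constexpr double kSetPixelsPerSecond = 3.5e6;  // approximate figure for a Pi3

constexpr double FrameTimeMs(int panels, int panel_width, int panel_height) {
  return 1000.0 * panels * panel_width * panel_height / kSetPixelsPerSecond;
}

// 4 panels of 64x64 are 16384 pixels: about 4.7 ms per frame.
static_assert(FrameTimeMs(4, 64, 64) < 5.0, "four panels stay below 5 ms");
// The 30-panel test setup is 122880 pixels: about 35 ms per frame.
static_assert(FrameTimeMs(30, 64, 64) > 30.0, "thirty panels exceed 30 ms");
```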

Framebuffer setting ?

Don't worry too much about optimizing setting a complete RGB/BGR framebuffer vs. a loop of SetPixel(). It is not as simple as copying an array: the data needs to be luma-corrected, then split into bitplanes, and then each plane is written as a bitset flattened into the memory. This is the reason why it is only 3.5 Megapixels/s, but it makes sure that the data can be written out in the time-critical part of the matrix setting.
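To make these steps concrete, here is a simplified sketch of what writing one color channel of one pixel involves. It is not the library's actual code or memory layout: the gamma 2.2 curve, the 11 bitplanes as a constant, the `plane_stride` layout and both function names are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

constexpr int kBitPlanes = 11;  // assumed number of PWM bits

// Assumed correction curve (plain gamma 2.2); a real implementation would
// normally read this from a precomputed table instead of calling pow().
uint16_t LumaCorrect(uint8_t c) {
  const double out = std::pow(c / 255.0, 2.2) * ((1 << kBitPlanes) - 1);
  return static_cast<uint16_t>(std::lround(out));
}

// Write one color channel of one pixel. Assumed layout: the word for
// bitplane b of this pixel position lives at planes[b * plane_stride], and
// 'gpio_bit' is the bit inside that word belonging to this color line.
void WriteChannel(uint32_t *planes, size_t plane_stride, uint32_t gpio_bit,
                  uint8_t value) {
  assert(planes != nullptr);
  const uint16_t pwm = LumaCorrect(value);
  for (int b = 0; b < kBitPlanes; ++b) {
    uint32_t *word = planes + b * plane_stride;
    if (pwm & (1u << b)) {
      *word |= gpio_bit;   // this plane lights the LED
    } else {
      *word &= ~gpio_bit;  // this plane keeps it dark
    }
  }
}
```

Each SetPixel() then means three such channel writes, each touching 11 words spread across memory, which is why it costs far more than copying three bytes.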

Transformer

The transformer.cc is not the problem: internally, the mapping is prepared once and SetPixel() internally then uses a look-up table. This is the PixelDesignator business inside framebuffer.

Oh, and your English is actually quite good! Cheers, Henner.

wolfi-by commented 4 years ago

Thank you Henner for your quick response!

I will investigate the issue and report back. If a question pops up I will ask you again. Cheers, Wolfgang

wolfi-by commented 4 years ago

Hello again! I analyzed the lib a little and changed the functionality a bit. I added a new function that hands the lib a whole byte array directly from my Bitmap, with parameters Uint8_t rawdata, int length, bool is BGR. The last one chooses which byte order is used. With that I could speed up the lib a little. Most of the improvement came from checking whether a pixel is black, in which case nothing is done.

I made a test by setting up 64x64 matrices, 10 in a row and 3 in parallel. Just for testing I wrote a little text; it took 3ms to generate the text inside the Bitmap. By comparison, setting the whole bitmap on the matrix display using SetPixel took 114ms, whereas the new function took about 31ms. But this time can only be reached if the picture has many black areas.

So finally I looked around in the lib where the bitplanes are set. I think this technique is used for BCM (binary code modulation). I found similar things in some Arduino projects. The difference is that they use predefined arrays for setting the bitplane values.

Thus my idea was to generate some constant arrays, one per bitplane, with gamma-corrected predefined bitplane values. The thought behind it is that the lib would not have to calculate the correct PWM value while setting a pixel but would get the correct value from the predefined array.

What is your opinion on this strategy, and what would be the best place to implement it for a test? I would just need a hint because of my rather poor C++ knowledge...

Greetings!

hzeller commented 4 years ago

This library also uses pre-calculated look-up tables to get the luminance-corrected values.

But we're still limited to about 3.5 Megapixels/s, which translates to about 30ms if you fill 122880 pixels (3 x 10 panels of 64x64), as you determined (but including black pixels).

The main problem is the loop that has to do as many bit settings as there are PWM bits. So this loop, setting bits in 11 32-bit words in different parts of the memory, makes things slow. Timing each of the instructions might be worthwhile, though.

If you have a full framebuffer, it might be worthwhile to set up all the bits of a vertical column in registers first before writing the whole word. However, that only works in the very simplest of cases, in which the pixels are arranged as you think they are; the general case is already taken care of by the pixel designators. So you'd need another look-up table to find all the pixels that correspond to one 32-bit word. It quickly gets complicated, but it might speed things up considerably (or not, due to the more complicated approach).

There might be operations in the ARM instruction set that allow doing that quickly in assembly (some processors have this kind of transposing operation). It might also be worthwhile to consider going one step further with the look-up table and returning 11 pre-masked 32-bit values, but I suspect that could be slower than doing the bit operations directly. If you want to improve the SetPixel() time, I am happy to accept a pull request that speeds that up. You might need to dive into ARM instruction sets and assembly.
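One way to read the "11 pre-masked 32-bit values" idea is sketched below. It is purely illustrative: the names, the `plane_stride` layout and the choice of 11 bitplanes as a constant are assumptions, and whether this beats the direct bit operations would have to be measured on the target hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kBitPlanes = 11;  // assumed number of PWM bits

// One entry per bitplane: all ones if that PWM bit is set, otherwise zero.
using PlaneMasks = std::array<uint32_t, kBitPlanes>;

// In a real table this would be precomputed once per possible color value.
PlaneMasks MakeMasks(uint16_t pwm) {
  PlaneMasks m{};
  for (int b = 0; b < kBitPlanes; ++b) {
    m[b] = ((pwm >> b) & 1u) ? ~0u : 0u;
  }
  return m;
}

// Branch-free update: clear the target bit, then OR in the masked bit.
// Assumed layout: the word for bitplane b lives at planes[b * plane_stride].
void ApplyMasks(uint32_t *planes, size_t plane_stride, uint32_t gpio_bit,
                const PlaneMasks &m) {
  assert(planes != nullptr);
  for (int b = 0; b < kBitPlanes; ++b) {
    uint32_t *word = planes + b * plane_stride;
    *word = (*word & ~gpio_bit) | (m[b] & gpio_bit);
  }
}
```

This removes the per-bit branch, but the 11 scattered memory writes remain, which fits the suspicion above that it may not be faster.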

However, also be aware that the example with 30 64x64 matrices is hypothetical: with this number of panels you won't get a refresh rate high enough to avoid visible flicker, so try to benchmark a practical number of LED panels and see if this is good enough for your performance requirements.

wolfi-by commented 4 years ago

OK! I think I understand that but have to think it over... Anyway, the lib is quite great! I will try to push to GitHub, but first I will do some time measurements so that you can decide whether to adopt the change.