API question: can one write directly into the framebuffer used for display?

hzeller / rpi-rgb-led-matrix

Controlling up to three chains of 64x64, 32x32, 16x32 or similar RGB LED displays using Raspberry Pi GPIO

GNU General Public License v2.0


API question: can one write directly into the framebuffer used for display? #1674

Closed by marcmerlin 4 months ago

marcmerlin commented 4 months ago

Hi @hzeller, this is related to https://github.com/marcmerlin/FastLED_RPIRGBPanel_GFX, which displays my multi-API, multi-OS framebuffer on RGBPanels through your lib. That makes your lib one of its many hardware backends and allows nice things like running all the display code on Linux with a local Linux render, then running the exact same code on rPi on RGBPanels with just the flip of a flag.

I wrote this as a quick and dirty proof of concept and just realized that I never cleaned it up to actually be efficient, as it makes a terrible number of SetPixel() calls: https://github.com/marcmerlin/FastLED_RPIRGBPanel_GFX/blob/ec1ac0b8c436da7c087f6bc218de7403334ae6c2/FastLED_RPIRGBPanel_GFX.cpp#L25

Back to the questions:

1) Am I correct that in the backend you use an RGB888 structure to store pixels? (If so, that's what I do too.)

2) If so, can I find some private array in some file somewhere, make it public, and then point my code's framebuffer pointer directly at your structure? This may cause tearing in some cases, but I'll take that in exchange for zero copy.

3) If option 2 is not easy or feasible, is there some function to copy the entire framebuffer at once with a memcpy instead of setting it pixel by pixel?

hzeller commented 4 months ago

1) No, the framebuffer internally is not an RGB888 data structure, as SetPixel() already expands everything into gamma-corrected bit-planes at whatever location the pixel mappers place it. This is so that it is ready to stream out to GPIO without further processing, as there is no processing time available in the display loop. So the only way to fill a FrameCanvas is indeed by calling SetPixel().

2) does not apply because 1) does not apply.

3) This you can somewhat do: there is a way to pre-compute frame buffers and store them in a 'stream' (which can be in-memory or even on-disk). It essentially does a memcpy() of the internals of the FrameCanvas (which is in its expanded, bit-planed and pixel-positioned internal representation).

https://github.com/hzeller/rpi-rgb-led-matrix/blob/master/include/content-streamer.h

For an example, check out utils/image-viewer.cc. It has a way to stream the expanded data out to disk (with the -O option) and it can read the same format as well. In that use case, the point is to pre-process canned output in order to show video that would otherwise be too expensive to display.

Your case, though, sounds like you might be happy with double or triple buffering, for which you create a few canvases with RGBMatrix::CreateFrameCanvas(), cycle through them, and use SwapOnVSync() to swap them atomically.

But there is no way around using SetPixel()... it does all the complicated things.
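To illustrate why a packed RGB888 buffer cannot simply be memcpy()'d into a bit-planed canvas, here is a minimal sketch of bit-plane expansion. `BitPlaneBuffer` and its layout are hypothetical and chosen for readability only; the library's real internal representation (gamma correction, pixel mapping, GPIO bit positions) is more involved and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified bit-plane buffer (NOT the library's real layout).
// planes[bit][y * width + x] packs bit `bit` of each colour channel:
// bit 0 = red, bit 1 = green, bit 2 = blue.
struct BitPlaneBuffer {
  static constexpr int kBitPlanes = 8;
  int width;
  int height;
  std::vector<std::vector<uint8_t>> planes;

  BitPlaneBuffer(int w, int h)
      : width(w),
        height(h),
        planes(kBitPlanes,
               std::vector<uint8_t>(static_cast<size_t>(w) * h, 0)) {
    assert(w > 0 && h > 0);
  }

  // Scatter one RGB888 pixel across all planes. Gamma correction and pixel
  // mapping, which the real SetPixel() is said to handle, are omitted.
  void SetPixel(int x, int y, uint8_t r, uint8_t g, uint8_t b) {
    if (x < 0 || x >= width || y < 0 || y >= height) return;
    const size_t idx = static_cast<size_t>(y) * width + x;
    for (int bit = 0; bit < kBitPlanes; ++bit) {
      planes[bit][idx] = static_cast<uint8_t>(((r >> bit) & 1) |
                                              (((g >> bit) & 1) << 1) |
                                              (((b >> bit) & 1) << 2));
    }
  }
};
```

In this sketch a single 3-byte pixel ends up in eight separate memory locations, one per plane, so no contiguous copy from an RGB888 array can produce the same layout.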

marcmerlin commented 4 months ago

Thanks @hzeller, that helps. It was wishful thinking on my part that the framebuffers would be compatible :)

What I observed on rPi3 is that it sometimes slows down enough (overheating, or CPU or memory cache running out because of the generation code being run) that I can see the actual frame being copied from my code, and a full frame update takes maybe 1 second. I can see the update progressing linearly, in the order my driver copies everything pixel by pixel. While this is happening, the backend driver correctly refreshes the display at full speed, but the frame content itself takes a second to be updated (I can see the new frame progressing on top of the old one while the entire FB is being refreshed at full speed to the panels). Then it goes away and continues working fast enough that I don't see the frame copy.

I moved all the same code to rPi4 and of course the problem is gone there, likely because the CPU/memory cache is better and the compiler is much newer and generates better-optimized code. At the same time, the issue where your driver starves the CPUs enough that Linux can't run other tasks, including ssh, also goes away. But then I have to change --led-slowdown-gpio=1 all the way up to --led-slowdown-gpio=4, or the display is completely messed up past the 4th 64x32 panel, so the refresh is actually slightly slower, but close enough.

Either way, I was hoping to make the display code zero copy on rPi3, but from what you just said it's not possible, and I'll still end up with a loop of SetPixel() calls, so that's not going to improve anything with the issue I see. Swapping buffers like you said will hide the visible refresh, but the frame will hang for a full second instead, which is just hiding the likely cache stall or overheating in another way.

For now, I think rPi4 is just the solution to all this. I wanted to make sure I didn't have some stupid, inefficient programming that was easy to fix, and it sounds like I do not. Thanks for your answer.

marcmerlin commented 4 months ago

As a side note, I also noticed that running with V-mapper (C:9 P:3, turning W:192 H:288 Physical into W:576 H:96 Virtual) and adding Rotate:90 is so slow on rPi3 that I had to give up on it (I only get a few frames per second). On rPi4, rotate is fast enough that I don't notice whether it's running or not.

hzeller commented 4 months ago

Make sure to use SwapOnVSync() and update a 'dark' (off-screen) frame. Don't call SetPixel() directly on the RGBMatrix (that was the very first interface, still around for backward compatibility), otherwise you see the 'live' update issues (tearing etc.). I can also imagine that it messes with cache updates and so makes things a bit slower if the same memory is shared by two cores.

SwapOnVSync() will allow you to swap two buffers atomically.
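A minimal sketch of that double-buffering pattern, assuming a packed RGB888 source frame. `ShowFrame`, `FakeCanvas` and `FakeMatrix` are hypothetical names; the fakes only stand in for the library's RGBMatrix and FrameCanvas so the sketch compiles on its own. With the real library the off-screen canvas would come from RGBMatrix::CreateFrameCanvas().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copy a packed RGB888 frame into an off-screen canvas, then swap it in.
// Canvas needs SetPixel(x, y, r, g, b); Matrix needs SwapOnVSync(Canvas*),
// which returns the canvas that was displayed before the swap.
template <typename Matrix, typename Canvas>
Canvas* ShowFrame(Matrix* matrix, Canvas* offscreen,
                  const uint8_t* rgb888, int width, int height) {
  assert(matrix && offscreen && rgb888);
  const uint8_t* p = rgb888;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x, p += 3) {
      offscreen->SetPixel(x, y, p[0], p[1], p[2]);
    }
  }
  // The returned canvas is no longer displayed, so it is safe to draw
  // the next frame into it.
  return matrix->SwapOnVSync(offscreen);
}

// Hypothetical stand-ins so the sketch compiles without the library.
struct FakeCanvas {
  int pixels_set = 0;
  void SetPixel(int, int, uint8_t, uint8_t, uint8_t) { ++pixels_set; }
};

struct FakeMatrix {
  FakeCanvas* active;
  explicit FakeMatrix(FakeCanvas* initial) : active(initial) {}
  FakeCanvas* SwapOnVSync(FakeCanvas* next) {
    FakeCanvas* previous = active;
    active = next;
    return previous;
  }
};
```

The caller keeps the returned pointer and passes it as `offscreen` on the next call, so the SetPixel() loop never touches the canvas that is currently being streamed to the panels.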

The other observation is interesting: any arbitrary amount of mapping should not actually affect the runtime, as it is only calculated once into a look-up table, and this look-up table is then used directly at SetPixel() time. I can only imagine that after the mapping, adjacent pixels are so far apart that there might be cache-line thrashing going on?
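The look-up-table idea can be sketched as follows. This is an illustration, not the library's mapper code: `BuildRotate90Lut` and `PhysicalIndex` are hypothetical names, the rotation direction is picked arbitrarily, and physical storage is assumed to be a plain row-major array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build a visible-index -> physical-index table for one 90-degree rotation
// of a phys_w x phys_h panel. The visible canvas is phys_h wide and
// phys_w tall. The table is computed once, up front.
std::vector<size_t> BuildRotate90Lut(int phys_w, int phys_h) {
  assert(phys_w > 0 && phys_h > 0);
  const int vis_w = phys_h;
  const int vis_h = phys_w;
  std::vector<size_t> lut(static_cast<size_t>(vis_w) * vis_h);
  for (int vy = 0; vy < vis_h; ++vy) {
    for (int vx = 0; vx < vis_w; ++vx) {
      const int px = vy;
      const int py = phys_h - 1 - vx;
      lut[static_cast<size_t>(vy) * vis_w + vx] =
          static_cast<size_t>(py) * phys_w + px;
    }
  }
  return lut;
}

// At draw time the mapping costs one table read, however many mappers
// were chained when the table was built.
inline size_t PhysicalIndex(const std::vector<size_t>& lut, int vis_w,
                            int vx, int vy) {
  return lut[static_cast<size_t>(vy) * vis_w + vx];
}
```

In this sketch, horizontally adjacent visible pixels land `phys_w` entries apart in physical storage, so a row-by-row SetPixel() loop strides through memory instead of walking it sequentially. That is the kind of access pattern the cache-line guess refers to.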

marcmerlin commented 4 months ago

So the cache comment is obviously just a guess from simple observation. When I have time I also need to update that old rPi3 Raspbian to an up-to-date dietpi that will have a much better compiler, and that may help.

I get your point on tearing, but I'm not actually getting tearing: it's not high-FPS video being updated while the screen is being pushed, it's a single computed demo frame that is literally taking about one second, or a bit more, to be copied (SetPixel loop) to the framebuffer that is being displayed at full speed. Swapping buffers will indeed remove the visible update, but cause a 1 sec or so hang during which the code waits for the next frame to be ready, and then I'd have no clue whether it's the copy that is slow or whether the generating code stalled and didn't generate a new frame. But you are right that having one core write to the buffer while another reads from it could cause contention and slow things down, mmmmh.

As for cache thrashing, I have 20K pixels and indeed I've only noticed those issues with higher pixel-count displays, so that's another reason why I suspect a cache being overrun, and the same reason why rPi4 just sidesteps this.

I think my first step will be to take your advice from long ago and re-install everything with dietpi, trying the latest compilers first (and the same distro for pi3 and pi4, as right now I have different versions on the two). Then your buffer-flip point about two cores accessing the same area is a very good one; I'll probably rule that out too, even if it causes full-frame hangs for a second or so if it doesn't fix the issue.

Thanks much for the thoughts.

marcmerlin commented 4 months ago

And I'll contribute a picture as thanks 😊

hzeller commented 4 months ago

Nice!

I found that anything that is not in the CPU cache is abysmally slow on the Raspberry Pi, so I suspect that if the active framebuffer is so large that it thrashes the cache, it might be a problem (and, uhm, in that case double-buffering could actually slow things down, as the CPU now deals with two active frame buffers). Worth experimenting with. I suspect the newer Pis have larger caches, which is why they behave better.

The slow non-cached RAM is also the reason why I abandoned early experiments using DMA to send the framebuffer to the panel. It would've been running independently from the CPU (yay), but it would also directly read from slow DRAM not cache (booh). Here were my throughput experiments: https://github.com/hzeller/rpi-gpio-dma-demo

marcmerlin commented 4 months ago

Yes, we agree on the cache issue. Honestly, the other time I saw this was on my even bigger (in pixels) 384x256 array, where I was seeing occasional multi-second refreshes of the screen (like the display here but worse, since that one was almost 100K pixels, around the biggest you can drive with your driver with only 3 channels). There too, switching to rPi4 made the problem go away. All my smaller displays, including my LED outfit which is 128x192, are fine on rPi3, while going up to 288x192 (the display in this post) introduces the problem while changing nothing else.
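For scale, the display sizes mentioned here work out as follows. The helper names are hypothetical, and the byte counts are for a packed RGB888 source frame only; the library's expanded, bit-planed frame has a different size that is not computed here.

```cpp
#include <cassert>
#include <cstddef>

// Number of pixels in a width x height display.
size_t PixelCount(size_t width, size_t height) { return width * height; }

// Size of one frame in a packed source format (3 bytes per pixel for
// RGB888). This is NOT the size of the library's internal frame.
size_t PackedFrameBytes(size_t width, size_t height,
                        size_t bytes_per_pixel = 3) {
  assert(bytes_per_pixel > 0);
  return PixelCount(width, height) * bytes_per_pixel;
}
```

With these helpers, 128x192 is 24,576 pixels (73,728 bytes as RGB888), 288x192 is 55,296 pixels (165,888 bytes), and 384x256 is 98,304 pixels (294,912 bytes).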

Good info and experiment on DMA. Thank you for posting this, which I realize you did many years ago now :) On the plus side, we now have rPi5, which I'm sure will/would take more work to port to, but it may allow extra things, even if at this point rPi4 seems more than fast enough: effectively it's too fast, and I have to significantly slow down its output so that it doesn't overwhelm my panels.

Oooh, and now I understand why I'm getting this: the 4 panels on the left are newer panels that can work at higher speed (slowdown GPIO 2), and I get 140Hz for demo 0 plus corrupted output on the 5 sets of panels on the right (the older panels). Interestingly, --led-slowdown-gpio=3 increases the output to 233Hz (why faster with more slowdown?) and the output looks even more corrupted on the panels on the right (the chains are 3 chains of 9 panels left to right, with 3 sets of 4 new, faster panels on the left), while --led-slowdown-gpio=4 goes back down to 170Hz (faster than --led-slowdown-gpio=2, why?) and the output looks pristine. I'm honestly confused as to what's going on; it feels like black magic to me :) But the point remains that rPi4 seems more than fast enough that rPi5 is probably not needed.

Now we just need one of those new 6-output cards or more to drive even more chains in parallel, and we won't need those FPGA cards after all :) (I did look into FPGAs, and I'll admit that working with them felt so alien and hard to debug that honestly I'm probably going to stick to regular CPUs. I hope it's not quitter talk :) )

marcmerlin commented 4 months ago

Just to see if I could understand slowdown gpio a bit better, I retested the above on rPi3 and got: 1: 156Hz, 2: 110Hz, 3: 84Hz, 4: 68Hz. On rPi3 it behaves as expected, and my panels only need slowdown 1 on that hardware. It's just really weird that slowdown 3 is much faster than 2 on rPi4, that even 4 is faster than 2 (170Hz vs 140Hz), and that somehow I get uncorrupted output with 4/170Hz and slightly corrupted output with 2/140Hz.
Black magic, I tell you :)
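One way to read these numbers is to convert each refresh rate into a frame period. The helper below is a hypothetical one-liner for that, and the interpretation that follows is an inference from the figures quoted in this thread, not something measured separately.

```cpp
#include <cassert>

// Convert a refresh rate in Hz to a frame period in milliseconds.
double FramePeriodMs(double refresh_hz) {
  assert(refresh_hz > 0.0);
  return 1000.0 / refresh_hz;
}
```

On rPi3, slowdown 1 to 4 gives periods of about 6.41, 9.09, 11.90 and 14.71 ms, so each extra slowdown step adds roughly 2.7 to 2.8 ms per frame, which fits a fixed per-step delay. On rPi4, slowdown 2, 3 and 4 give about 7.14, 4.29 and 5.88 ms, which is not monotonic at all.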

hzeller commented 4 months ago

Interesting, I don't have a good idea of what is going on underneath. I can only imagine that the GPIO subsystem is in a different clock domain on the ASIC and thus runs into clock-domain crossing issues, in which something too fast might require some FIFO to wait for the next cycle. I have heard from people who overclock the Pi that they actually get worse performance when interacting with GPIO, which is why I think this might be a plausible explanation.

marcmerlin commented 4 months ago

@hzeller this has become a fascinating hardware architecture discussion and I very much like your explanation: it holds water and would match what I've been seeing, so I will go with that ;) Thanks for sharing your thoughts about this, quite interesting indeed.