jgarff / rpi_ws281x

Userspace Raspberry Pi PWM library for WS281X LEDs
BSD 2-Clause "Simplified" License
1.78k stars 622 forks source link

Optimization of ws2811_render function using precomputed tables #468

Closed mcaralp closed 1 year ago

mcaralp commented 3 years ago

Hi,

I wrote an optimization of the ws2811_render() function, it uses precomputed tables to convert the strip colors to ws2811 symbols. It is much faster than the actual implementation: on my RPI 3A+ with 500 LEDs at 50fps, the CPU consumption goes from 30% to 1% approximately.

It works well with the non-inverted PWM driver, and I will also test the non-inverted SPI and PCM drivers in a few days. But I can't test it with inverted data, can someone do it?

mcaralp commented 3 years ago

I also tested the PCM and SPI drivers, it works well. On my logic analyser, the inverted drivers seem fine too.

Gadgetoid commented 3 years ago

Can confirm - at least - that this still works in rpi-ws281x-python driving Unicorn HAT and saves roughly 5% CPU (Pi 400) taking rainbow.py from 25% to 20% utilization. No doubt makes a huge difference on slower Pi models.

I don't have any way to test inverted signals since I don't have an inverting level shifter, but your logic looks sound.

mcaralp commented 3 years ago

This is a big performance difference with my tests (going from 30% to 1% cpu usage). I use the -O3 option with my builds, can you tell me if rpi-ws281x-python uses this option or not? It can make a huge difference.

Gadgetoid commented 3 years ago

Bear in mind a huge chunk of that 20% (or 25%) is Python crunching all the numbers needed to render a rainbow across the 8x8 display. That said it might be worth me checking a less intensive example and also checking the optimization.

mcaralp commented 3 years ago

Ah yes I see. To be sure I checked the rpi_ws281x CMakeLists.txt and i can't find any optimization option. Is there a reason to not add one?

Gadgetoid commented 3 years ago

No reason that I'm aware of!

Excusing my naive understanding of C for a minute, is there any reason the converted values couldn't be stored as uint32_t like so:

    static uint32_t convert_table[256] = {
        0x924924, 0x924926, 0x924934, 0x924936, 0x9249a4, 0x9249a6, 0x9249b4, 0x9249b6,
        0x924d24, 0x924d26, 0x924d34, 0x924d36, 0x924da4, 0x924da6, 0x924db4, 0x924db6,
        0x926924, 0x926926, 0x926934, 0x926936, 0x9269a4, 0x9269a6, 0x9269b4, 0x9269b6,
        0x926d24, 0x926d26, 0x926d34, 0x926d36, 0x926da4, 0x926da6, 0x926db4, 0x926db6,
        0x934924, 0x934926, 0x934934, 0x934936, 0x9349a4, 0x9349a6, 0x9349b4, 0x9349b6,
        0x934d24, 0x934d26, 0x934d34, 0x934d36, 0x934da4, 0x934da6, 0x934db4, 0x934db6,
        0x936924, 0x936926, 0x936934, 0x936936, 0x9369a4, 0x9369a6, 0x9369b4, 0x9369b6,
        0x936d24, 0x936d26, 0x936d34, 0x936d36, 0x936da4, 0x936da6, 0x936db4, 0x936db6,
        0x9a4924, 0x9a4926, 0x9a4934, 0x9a4936, 0x9a49a4, 0x9a49a6, 0x9a49b4, 0x9a49b6,
        0x9a4d24, 0x9a4d26, 0x9a4d34, 0x9a4d36, 0x9a4da4, 0x9a4da6, 0x9a4db4, 0x9a4db6,
        0x9a6924, 0x9a6926, 0x9a6934, 0x9a6936, 0x9a69a4, 0x9a69a6, 0x9a69b4, 0x9a69b6,
        0x9a6d24, 0x9a6d26, 0x9a6d34, 0x9a6d36, 0x9a6da4, 0x9a6da6, 0x9a6db4, 0x9a6db6,
        0x9b4924, 0x9b4926, 0x9b4934, 0x9b4936, 0x9b49a4, 0x9b49a6, 0x9b49b4, 0x9b49b6,
        0x9b4d24, 0x9b4d26, 0x9b4d34, 0x9b4d36, 0x9b4da4, 0x9b4da6, 0x9b4db4, 0x9b4db6,
        0x9b6924, 0x9b6926, 0x9b6934, 0x9b6936, 0x9b69a4, 0x9b69a6, 0x9b69b4, 0x9b69b6,
        0x9b6d24, 0x9b6d26, 0x9b6d34, 0x9b6d36, 0x9b6da4, 0x9b6da6, 0x9b6db4, 0x9b6db6,
        0xd24924, 0xd24926, 0xd24934, 0xd24936, 0xd249a4, 0xd249a6, 0xd249b4, 0xd249b6,
        0xd24d24, 0xd24d26, 0xd24d34, 0xd24d36, 0xd24da4, 0xd24da6, 0xd24db4, 0xd24db6,
        0xd26924, 0xd26926, 0xd26934, 0xd26936, 0xd269a4, 0xd269a6, 0xd269b4, 0xd269b6,
        0xd26d24, 0xd26d26, 0xd26d34, 0xd26d36, 0xd26da4, 0xd26da6, 0xd26db4, 0xd26db6,
        0xd34924, 0xd34926, 0xd34934, 0xd34936, 0xd349a4, 0xd349a6, 0xd349b4, 0xd349b6,
        0xd34d24, 0xd34d26, 0xd34d34, 0xd34d36, 0xd34da4, 0xd34da6, 0xd34db4, 0xd34db6,
        0xd36924, 0xd36926, 0xd36934, 0xd36936, 0xd369a4, 0xd369a6, 0xd369b4, 0xd369b6,
        0xd36d24, 0xd36d26, 0xd36d34, 0xd36d36, 0xd36da4, 0xd36da6, 0xd36db4, 0xd36db6,
        0xda4924, 0xda4926, 0xda4934, 0xda4936, 0xda49a4, 0xda49a6, 0xda49b4, 0xda49b6,
        0xda4d24, 0xda4d26, 0xda4d34, 0xda4d36, 0xda4da4, 0xda4da6, 0xda4db4, 0xda4db6,
        0xda6924, 0xda6926, 0xda6934, 0xda6936, 0xda69a4, 0xda69a6, 0xda69b4, 0xda69b6,
        0xda6d24, 0xda6d26, 0xda6d34, 0xda6d36, 0xda6da4, 0xda6da6, 0xda6db4, 0xda6db6,
        0xdb4924, 0xdb4926, 0xdb4934, 0xdb4936, 0xdb49a4, 0xdb49a6, 0xdb49b4, 0xdb49b6,
        0xdb4d24, 0xdb4d26, 0xdb4d34, 0xdb4d36, 0xdb4da4, 0xdb4da6, 0xdb4db4, 0xdb4db6,
        0xdb6924, 0xdb6926, 0xdb6934, 0xdb6936, 0xdb69a4, 0xdb69a6, 0xdb69b4, 0xdb69b6,
        0xdb6d24, 0xdb6d26, 0xdb6d34, 0xdb6d36, 0xdb6da4, 0xdb6da6, 0xdb6db4, 0xdb6db6
    };

And avoid the inner loop in lieu of something like this:

            for (j = 0; j < array_size; j++)               // Color
            {
                                        uint32_t val = convert_table[color[j]];
                                        val = driver_mode == SPI ? val : __bswap_32(val) >> 8;

                                        if ((driver_mode != PWM) && channel->invert) val = ~val;

                                        pxl_raw[wordpos * 4 + 0] = (val >> 16) & 0xff;
                                        pxl_raw[wordpos * 4 + 1] = (val >> 8) & 0xff;
                                        pxl_raw[wordpos * 4 + 2] = val & 0xff;
                                        wordpos += driver_mode == PWM ? 2 : 1; 
            }

Looking at the unpacking to bytes and shifts I'm doing I suspect the answer might be that this is slower than a well-optimized, unrolled loop using already unpacked bytes. It feels more understandable, though, albeit since rpi_ws281x is such a monster of a codebase I'm not sure we should necessarily favour approachability over speed.

mcaralp commented 3 years ago

I'm not sure you can drop the bytepos variable, as you have to fill each byte of pxl_raw. Maybe something like this can work:

for (j = 0; j < array_size; j++)               // Color
{
        uint32_t val = convert_table[color[j]];
        val = driver_mode == SPI ? val : __bswap_32(val) >> 8;

        if ((driver_mode != PWM) && channel->invert) val = ~val;

        pxl_raw[wordpos * 4 + bytepos + 0] = (val >> 16) & 0xff;
        pxl_raw[wordpos * 4 + bytepos + 1] = (val >> 8) & 0xff;
        pxl_raw[wordpos * 4 + bytepos + 2] = val & 0xff;

       if(++bytepos == 4) 
       { 
              bytepos = 0; 
              wordpos += driver_mode == PWM ? 2 : 1; 
       }
}

Regarding the performance, with the O3 option I don't think there is a big difference between the two solutions. I originally wrote the first convert table for embedded systems where memory really matters, and the 256 extra bytes of the uint32_t solution can be a real problem.

But I think this solution is nice too, and I agree, a bit more readable.

Gadgetoid commented 3 years ago

Oh you're right about bytepos, I totally misunderstood what it was doing. In this case l is going 0,1,2 and bytepos is going out of phase from 0,1,2,3 to place bytes correctly into the destination.

I was going to say that since pxl_raw is already a pointer into ws2811->device->pxl_raw might as well:

for (j = 0; j < array_size; j++)               // Color
{
    uint32_t val = convert_table[color[j]];
    val = driver_mode == SPI ? val : __bswap_32(val) >> 8;

    if ((driver_mode != PWM) && channel->invert) val = ~val;

    *pxl_raw = (val >> 16) & 0xff;
    *(pxl_raw + 1) = (val >> 8) & 0xff;
    *(pxl_raw + 2) = val & 0xff;
    pxl_raw += driver_mode == PWM ? 8 : 4;
}

But this fails to factor in bytepos and I lack the understanding of C pointers required to make the first two bytes a single uint16_t assignment.

How the heck this appears to be working for me is anyone's guess!

All that said- I think what you've done here is great and works fine as is!

mcaralp commented 3 years ago

I'm not sure the pointer solution will be better than the other solution. The PWM driver makes things difficult, it would be easier if wordpos could be incremented by 1 with every driver.

All that said- I think what you've done here is great and works fine as is!

That's good to hear :)

Gadgetoid commented 2 years ago

Would love for @jgarff to cast an eye over this before I pull the trigger. And perhaps any other major stakeholders in rpi_ws281x... @timonsku perhaps?

Gadgetoid commented 2 years ago

This seems to have been left by the wayside- without so much as a drive-by comment! - which is an awful shame since the potential performance gains are enormous.

Tempted to just merge this for the masses to find any issues we might have overlooked.

@mcaralp do you have a moment to implement the suggested additional comment? I can probably do it otherwise.

Xartrick commented 2 years ago

I merged this PR into current master for my everyday use of the library without any issue so far. I'm using non-inverted PWM driver effectively running at 1% CPU.

timonsku commented 2 years ago

Ah, missed this. Don't think I could have added anything. If its been tested I don't see a reason not to release this, sounds great!

jgarff commented 2 years ago

Oh hey sorry Phillip. I totally missed that you wanted me to review. I'll take a look later tonight. Sounds like a great enhancement.

On Mon, Jan 24, 2022 at 9:29 AM Timon @.***> wrote:

Ah, missed this. Don't think I could have added anything. If its been tested I don't see a reason not to release this, sounds great!

— Reply to this email directly, view it on GitHub https://github.com/jgarff/rpi_ws281x/pull/468#issuecomment-1020287695, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACB55G7BTIE635LMZKPMFPDUXV46VANCNFSM5AEN4UEQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

jgarff commented 2 years ago

Looks good, thanks for the enhancement! The question I have is regarding testing as follows:

I'm just a bit concerned about taking this patch and unintended breaks downstream. Thanks.

mcaralp commented 2 years ago

Sorry for the late reply!

@mcaralp do you have a moment to implement the suggested additional comment? I can probably do it otherwise.

Do you talk about the conversion from uint8_t[3] to uint32_t of the convert_table array? I can write it but I can't test it for now, as I don't have ws2812b strips at home. Maybe next week!

Gadgetoid commented 2 years ago

@mcaralp no worries. No code changes, just a comment next to the LUT to explain all the magic numbers. Something like:

// Pre-computed table of bytes (from 0 - 255) represented as WS281x timing signals.
// These are split into three bytes arranged MSB to LSB.
// 3 bits in a lookup table entry correspond to 1 real bit: 0b100 for a 0, 0b110 for a 1
// So:
// 0 is 100100100100100100100100
// 1 is 100100100100100100100110
// 2 is 100100100100100100110110
// and so on...

If someone wants to finagle this to use uint32_t in future I think this adequately documents what's going on.

Aside: This is why I love PIO on the RP2040- it does the bitstream conversion basically for "free" on the outgoing, untouched databytes. Magic.

I'll see if I can eke out the time to sign off on @jgarff's concerns. However at a glance I'm pretty confident this doesn't break anything low level, since it's just the "high" level conversion of the bitstream.

d8ahazard commented 2 years ago

Looks good, thanks for the enhancement! The question I have is regarding testing as follows:

  • Has this been tested various outputs (PWM, SPI, etc.)?
  • Has this been test with inverted output?

I'm just a bit concerned about taking this patch and unintended breaks downstream. Thanks.

FWIW, I've been using this patch with my code for my project Glimmr. Tested using PWM (dual channels) and SPI with no issues.

d8ahazard commented 2 years ago

@Gadgetoid -

Any chance this one could get merged too? I've verified with SPI and PWM, can't test with inverted.

With the "new" changes, I'd prefer to not have to pull the main, then cherry-pick this on top of it. ;)

Gadgetoid commented 2 years ago

Thanks for weighing in @d8ahazard.

I have still not had an opportunity to grab a scope and check inverted output. As near as I can tell my assertions about it affecting only the bitstream conversion and not the inversion still stand- if it works in normal mode, then inverted must logically also work. (Famous last words?)

It would be useful if someone who's actually using an inverted setup could weigh in, since I'm not particularly well equipped to set one up.

But otherwise- and accepting I can nip in and edit my above clarification comment into the code after merging- I think it might be a good idea to move forward and merge this. @jgarff what say you?

As a thought- I suspect anyone relying on the speed of updates to pace their animations is going to be in for a surprise...

jgarff commented 2 years ago

Thanks Philip. I don't have a way to test the inversion case either. At this point though, I'd say let's merge it, given the risk seems pretty low.

On Mon, Jun 27, 2022 at 2:41 PM Philip Howard @.***> wrote:

Thanks for weighing in @d8ahazard https://github.com/d8ahazard.

I have still not had an opportunity to grab a scope and check inverted output. As near as I can tell my assertions about it affecting only the bitstream conversion and not the inversion still stand- if it works in normal mode, then inverted must logically also work. (Famous last words?)

It would be useful if someone who's actually using an inverted setup could weigh in, since I'm not particularly well equipped to set one up.

But otherwise- and accepting I can nip in and edit my above clarification comment into the code after merging- I think it might be a good idea to move forward and merge this. @jgarff https://github.com/jgarff what say you?

As a thought- I suspect anyone relying on the speed of updates to pace their animations is going to be in for a surprise...

— Reply to this email directly, view it on GitHub https://github.com/jgarff/rpi_ws281x/pull/468#issuecomment-1167866820, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACB55G2NLJTZNISDE4ZWEYTVRIGWTANCNFSM5AEN4UEQ . You are receiving this because you were mentioned.Message ID: @.***>