careyk007 commented 5 years ago

This increases the performance of the I2C interface and possibly the SPI interface as well, but I don't have a SPI-based SSD1306 for verification. The main idea comes from the fact that the SSD1306 has support for writing to a subset of the display memory buffer as if it were contiguous. This means that if you only have a small window within the display to update, you don't have to update the entire display.

This is achieved by keeping track of the locations of the pixels that have changed since the last call to flush or reset. The only values we need to track are the max and min values for the x and y coordinates of each pixel. Once a new call to flush happens (fast_flush in this pull request), the elements of the display buffer on the micro-controller that are within the minimum bounding rectangle of changed pixels are copied to a secondary buffer, the SSD1306 is commanded to work with the window subset of the minimum bounding rectangle, and the secondary buffer is sent to the SSD1306.

I have been developing on the STM32F411-DISCOVERY board since my Blue Pills are still on their way across the ocean so I can't speak to the results with the Blue Pill, but I have seen some truly impressive results with the Discovery board. If you are only updating half of the display size, the fast_flush results in a 28.5% faster update. If you are only updating a quarter of the display, that jumps up to a 61.9% faster update.

The downside of this approach is that the micro-controller does have to do some additional computation while copying the changed pixels into a secondary buffer. This means that as the size of the minimum bounding rectangle approaches the display size, the fast_flush approach actually becomes slower than just updating the entire display. When the update area is greater than 75% of the display area, fast_flush calls flush to do a full display update.

Perhaps the biggest performance jump may come from changing the draw function to take an iterator as a parameter instead of a slice, but I'm not sure how that would work and if that would introduce any breaking changes.

Improvements

Faster screen updates when using an I2C DisplayInterface
- The smaller the area of the screen you are writing to, the faster the update

Problems

Need to double the buffer size
- This would only be for 128x68 displays since smaller displays could reuse space in the buffer
- This may also be negated by using iterators instead of slices
Faster screen update times might not be a good thing
- Since the current flush call updates the entire screen, the function always takes the same amount of time to return
- Maybe some users are using this consistent return time as a key part of their program?

therealprof commented 5 years ago

Very nice!

A few observations from my side:

I don't think it is actually necessary do keep a second buffer. It should be sufficient if the windowing is done while sending out the buffer.
You're doing a lot of divisions in your code, those should be avoided if possible; the compiler is not smart enough to avoid them in some cases and will emit emulated divisions on some MCUs which are abysmally slow and bloaty; ideally the thresholding and separate function should be completely irrelevant and provide close to the same speed.
It would be really cool if this could be simplified so it only comes with negligible overhead so it could always be used as the standard flush function

Re: Faster screen update times might not be a good thing

I don't think any sane person would ever rely on the timing of an I2C bus for their application; there are so many things which can go sideways here that it's completely unusable.

careyk007 commented 5 years ago

Thanks for the feedback!

You're doing a lot of divisions in your code, those should be avoided if possible

Great catch. It looks like most of the divisions are by powers of 2, so I am working on replacing those with bit shift operations.

It would be really cool if this could be simplified so it only comes with negligible overhead so it could always be used as the standard flush function

I definitely agree. My first thought was to use Rust's powerful iterators, but I haven't been having much success with this approach. My other thought was to modify DisplayProperties::draw() and DisplayInterface::send_data() to take the bounds of the update window and apply the filter while sending the data. Both approaches would probably eliminate the need for a secondary buffer, which would be awesome.

therealprof commented 5 years ago

Great catch. It looks like most of the divisions are by powers of 2, so I am working on replacing those with bit shift operations.

Divisions of unsigned integers by powers of 2 are unproblematic. The same for signed integers can sometimes be problematic. It's more the odd divisors I'm worried about...

I also tried iterators a while ago without much success. It's probably best to add the windowing directly to lowlevel send functions.

careyk007 commented 5 years ago

I have moved the windowing into I2cInterface and SpiInterface. I decided to add a new send_bounded_data() method instead of re-writing send_data(), which also required adding a bounded_draw() method to DisplayProperties.

This approach has removed most of the divisions, except for 2 initial divide-by-8 calculations. As far as performance boost goes, updating a quarter of the display is now 68.5% faster than the original flush (increased from 61.9% previously), updating half of the display is now 37.5% faster (increased from 28.5% previously), and updating the full display is actually 0.5% faster (as opposed to a 10% hit previously).

therealprof commented 5 years ago

Great job, sounds fantastic. @jamwaffles do you have time to give it a try? I'm a bit tied up at the moment.

jamwaffles commented 5 years ago

Thanks for this PR! The perf numbers are looking good. I'll test and review it more thoroughly when I have some time to get my Blue Pill out.

This approach has removed most of the divisions, except for 2 initial divide-by-8 calculations.

To satisfy my own curiosity, Rust is smart enough to convert all these to lsrs (shift right) instructions for thumbv7. See this example on Godbolt. I think it's also converted in debug mode, but has some additional bounds checks thrown in.

I don't think any sane person would ever rely on the timing of an I2C bus for their application

And if they do, it's not our problem :joy:. Regular screen updates should be handled by interrupts or something IMO, so I2C timing isn't much to worry about from the driver's side.

therealprof commented 5 years ago

@jamwaffles

If you divide by a non-power of two on thumbv6m you'll get this:

example::divide:
        push    {r7, lr}
        subs    r0, r0, r1
        movs    r1, #7
        bl      __aeabi_uidiv
        pop     {r7, pc}

on thumbv7m it's a good deal better:

example::divide:
        subs    r0, r0, r1
        movw    r1, #18725
        movt    r1, #9362
        umull   r1, r2, r0, r1
        subs    r0, r0, r2
        add.w   r0, r2, r0, lsr #1
        lsrs    r0, r0, #2
        bx      lr

As I said: Divides by power of 2 should be okay for unsigned variables on any platform, anything else depends...

jamwaffles commented 5 years ago

Tested on a physical I2C device and it looks to be behaving as expected. The photo below is two demos run back to back with the clear/in it calls removed in the second. You can see the partial update working so yay!

IMG_20191004_131717

jamwaffles commented 5 years ago

I'll test on an SPI device this evening.

It would be good to:

Use fast_flush() on both SPI and I2C devices and
Integrate the fast_flush() logic into flush() as @therealprof mentioned - it keeps the API cleaner with only one method then.

careyk007 commented 5 years ago

Thanks a bunch for the feedback. I have made the style changes and run cargo fmt. I have kept fast_flush() and flush() separate for now for easier testing, but will integrate the fast_flush() logic into flush() once SPI performance has been benchmarked.

I have been testing in release mode, but ran my tests again in debug mode and got similar performance boosts. For those interested, the performance boost of simply changing from debug to release is about 70%.

jamwaffles commented 5 years ago

I'm seeing some odd behaviour with SPI displays:

I'll do a bit of digging with this SPI display and see where I get.

The image_i2c example runs much faster to the eye when using fast_flush(), however it'd be great to get some numbers - how are you benchmarking this PR?

jamwaffles commented 5 years ago

It also just occurred to me that we should make sure fast_flush() works in rotations other than Rotate0. Has this been tested yet?

jamwaffles commented 5 years ago

The examples use the following pattern to initialise and clear the display:

disp.reset(&mut rst, &mut delay).unwrap();
disp.init().unwrap();
disp.fast_flush().unwrap();

Because no pixels have been set in the buffer at this point, fast_flush() is a noop. To maintain existing behaviour, I think fast_flush() should clear the display if no pixels have been set yet. This could be changed in the future to introduce a clear() method, but for now I'd like to keep the existing behaviour.

Also, here's another example of the odd SPI behaviour. I ported the image_i2c example to use the SPI interface:

careyk007 commented 5 years ago

I found an error in my SPI implementation where I had a ..= instead of a .., which would explain the offset from your last image. I'm not entirely convinced that it would explain the artifacts from your first image though.

I've put together a gist with the logic I've been using for benchmarking. For some reason, I haven't been able to get my machine to compile for the Blue Pill, so there's no guarantee that the gist will compile, but the program logic should be sound. If you'd like, I can share the code I'm using to test on my STM32F4 Discovery board.

I hadn't thought about support for rotated displays, but I've added support in my latest commit. Please let me know if there is a more Rusty way of writing that logic.

As far as fast_flush defaulting to a clear when no pixels have changed, I feel like this might result in slower performance if the user calls a double fast_flush by accident. Would an alternative be to have the init method call clear so that the first call to fast_flush does indeed clear the entire screen, but subsequent calls will operate as fast as possible?

jamwaffles commented 4 years ago

I found an error in my SPI implementation where I had a ..= instead of a .., which would explain the offset from your last image. I'm not entirely convinced that it would explain the artifacts from your first image though.

Yep, that's fixed it! The artifacts are from the power-on state of this display - it seems to just leave some pixels randomly turned on. Using a whole-screen flush as in master clears the whole display which is why this issue hasn't cropped up until this PR.

Would an alternative be to have the init method call clear

Yeah, I think this is a good idea - the artifacts mentioned above would be cleared as before then. I also noticed that the display isn't cleared on restarting the chip, so clear on init is a definite requirement IMO. For example, this is me testing rotations 3 times:

I've put together a gist with the logic I've been using for benchmarking.

Thanks for putting that up in a gist. It looks like you're checking timings with an oscilloscope or logic analyser hooked up to a pin? Unfortunately I don't have either of those bits of kit so I'll trust your I2C numbers :grin:

I hadn't thought about support for rotated displays, but I've added support in my latest commit. Please let me know if there is a more Rusty way of writing that logic.

I'll do another review pass, but rotation results aren't quite right for SPI. I2C works fine strangely enough. This is the image_i2c example running over SPI with a rotation of Rotate270 (ignore the artifacts - I had just powered the display on):

careyk007 commented 4 years ago

I also noticed that the display isn't cleared on restarting the chip, so clear on init is a definite requirement IMO.

I've added a call to clear within the init function, this should get rid of any residual display artifacts at start.

Thanks for putting that up in a gist. It looks like you're checking timings with an oscilloscope or logic analyser hooked up to a pin? Unfortunately I don't have either of those bits of kit so I'll trust your I2C numbers grin

No problem! Unfortunately I don't have an oscilloscope or logic analyzer either, so I'm relying on the "super precise" method of hooking up an ESP32 to the GPIO pin and measuring the time between edges. I probably should have mentioned this in my initial comment.

I'll do another review pass, but rotation results aren't quite right for SPI. I2C works fine strangely enough.

That is quite strange. I'll try to get my hands on a SPI display so that I can do local testing and figure out what's going on.

careyk007 commented 4 years ago

Out of curiosity, what opt-level are you testing at? I ran into an issue with another crate that had bad SPI writes if the opt-level wasn't set to 1, 2, or 'z'. I don't think this is the issue since you are actually getting data transferred, but I figured it wouldn't hurt to ask.

jamwaffles commented 4 years ago

I've added a call to clear within the init function, this should get rid of any residual display artifacts at start.

Mint. I'll do another test run later :)

I've just been running it with cargo run --example <example> on a Blue Pill over OpenOCD. I'll try running with --release and make sure it still works.

jamwaffles commented 4 years ago

Pahaha please ignore my dumb ass, the rotation is working fine - the examples set an offset to center the Rust logo in the center of the screen horizontally. This of course changes when rotated by 90/270 degrees so instead of translating by Point::new(0, 32), the fix is to simply translate by Point::new(32, 0) in the demo code - sorry about that!

I've added a call to clear within the init function, this should get rid of any residual display artifacts at start.

I think the examples and doc examples can have the initial call to display.flush().unwrap(), etc removed in this case. Aside from this last cleanup step I think we're good to merge unless @therealprof has some extra comments.

careyk007 commented 4 years ago

Awesome, glad you got it sorted!

Just to be clear, should I create a new commit with the initial calls to display.flush().unwrap(), etc. in the examples and docs removed? And should I also replace flush with fast_flush at this point?

jamwaffles commented 4 years ago

should I create a new commit with the initial calls to display.flush().unwrap(), etc. in the examples and docs removed?

Yes please

And should I also replace flush with fast_flush at this point?

Ah yes please. I forgot about that

careyk007 commented 4 years ago

Done and done!

jamwaffles commented 4 years ago

Sweet! I'll do a final review/physical device test as soon as I can.

jamwaffles commented 4 years ago

Released in 0.3.0-alpha.2. Thanks!

rust-embedded-community / ssd1306

Improve I2C Flush (Update) Speed #94

Improvements

Problems