rogerclarkmelbourne / Arduino_STM32

Arduino STM32. Hardware files to support STM32 boards, on Arduino IDE 1.8.x including LeafLabs Maple and other generic STM32F103 boards
Other
2.52k stars 1.26k forks source link

Adafruit_ILI9341_STM (stm32f1): Allow user to force the use of DMA. #828

Closed kiki-lamb closed 3 years ago

kiki-lamb commented 3 years ago

After switching from an older version of this library, I noticed some drawing code was running a lot more slowly. After some testing, I narrowed it down to the behaviour of DMA_ON_LIMIT. It seems like, in at least some cases, the current setting can result in a big slowdown: in my tests on an STM32F103C8T6 'blue pill' board, I got the following results:

This adds up: when you're trying to call the function 2000+ times per second (e.g., to draw out a waveform on screen in something approaching realtime), it makes a big difference.

This pull request makes the following changes:

stevstrong commented 3 years ago

This PR changes the intended scope of DMA_ON_LIMT which is to use DMA access above the specific number of bytes to write and use non-DMA access below the value.

Initially this was introduced because DMA is only effective if you write many data in one shot. If you only have few data to write, then DMA introduces an overhead in runtime, compared to simple write, which should be avoided. The non-DMA SPI functions are highly run-time optimized. Your change disables non-DMA access if the value of this define is 0. This is not desired.

If I understand your point, the #ifdefs should speed up the write process because of avoiding the "if" condition check?

In that case I think the best solution would be to make "DMA only" access if this value is, let's say >=250, make "non-DMA only" access for value=0 and do the check as it is now for any other value >0 and <250.

kiki-lamb commented 3 years ago

If you only have few data to write, then DMA introduces an overhead in runtime, compared to simple write, which should be avoided.

That certainly makes sense on the surface, but something about it doesn't seem to be bearing out in practice, at least for me: I'm definitely getting much better performance with the test and non-DMA case disabled.

I'm going to write up a simplified test-case that demonstrates the issue, I'll attach it to the PR when it's done.

stevstrong commented 3 years ago

That would be nice, I could then test it as well.

kiki-lamb commented 3 years ago

While concocting a minimal example to demonstrate the problem I discovered that the problem is actually a bit more intricate than I'd thought - on the upside, the solution might be simpler.

The short version is this: if there are high-frequency timer interrupts enabled, the ISR can interrupt non-DMA SPI transfers mid-transfer, slowing them down. As a result, in these scenarios it will be faster to always use DMA even when the data size is small.

My new solution is as follows: Add an ILI9341_STM_FORCEDMA preprocessor conditional, which, if ILI9341_STM_FORCEDMA is defined, does what it says on the tin and forces the use of DMA in all cases.

I've uploaded a sketch demonstrates the issue here: https://github.com/kiki-lamb/ili9341-stm-test

The attached screenshot shows the board menu options that were in use while testing. Version 1.5.3 of the Adafruit GFX Library was in use.

test-settings

Test results:

I examined the SPI port's output on a logic analyzer, and have placed the results for both cases alongside the .ino file in the sketch linked above. The files are .sr files, viewable in PulseView. Examining unpatched_output.sr, the gaps in the SPI clock whenever the interrupt fires are clearly visible. In patched_output.sr, the SPI clock (and other signals) are much steadier, as DMA can continue the transmission even during the execution of the ISR.

Looking at the behaviour of the chip select pin in the logic analyzer dumps is an easy way to see that the forced-DMA version is indeed completing many more write operations in the same period of time compared to the unpatched version.

Note: The Arduino IDE may not respect sticking #define ILI9341_STM_FORCEDMA at the start of a sketch, and may try to use libraries it's already compiled. To test from the Arduino IDE, it's best to add -DILI9341_STM_FORCEDMA to compiler.cpp.flags in platform.txt or platform.local.txt and restart the application.

In my opinion, these test results show that there are situations (like when high frequency timer interrupts are in use) where it can be beneficial to force the use of DMA even for smaller SPI transmissions.

Since the change only affects behavior when ILI9341_STM_FORCEDMA is defined, it will have no negative impact on anyone not interested in using this feature and shouldn't affect compatibility.

stevstrong commented 3 years ago

OK, I admit that under these circumstances the DMA version is outperforming the non-DMA version, the timer interrupt is simply disturbing too often the SPI data flow. But this effect is only considerably when the number of data to be written is large.

Have you tried the non-patched version and set DMA_ON_LIMIT to a low value (or 0)? How would the timing change in this case for your example?

kiki-lamb commented 3 years ago

But this effect is only considerably when the number of data to be written is large.

Sure - these situations do arise though, as I have shown here... in the real-world case I as working on when I discovered this issue, my application is using the timer interrupt to generate audio and send it to a DAC on SPI2 while drawing the waveform on screen on SPI1 (using repeated calls to drawFastVLine, just as in the test case). When switching from an older version of the core (before DMA_ON_LIMIT had been added), it became unable to draw fast enough to keep up and frame rates became visibly chunky.

So, having some way to force DMA would certainly help in these scenarios.

Have you tried the non-patched version and set DMA_ON_LIMIT to a low value (or 0)? How would the timing change in this case for your example?

I did another quick test and it looks like setting DMA_ON_LIMIT to 0 has the same effect: the draw times in the test become 121 μs each and the logic analyzer output looks much as patched_output.sr does.

So, it looks like either option would work: a binary toggle like my ILI9341_STM_FORCEDMA or allowing configuration of DMA_ON_LIMIT so it can be set to 0 would both allow the possibility of forcing DMA in situations where there's a benefit to doing so.

In either case, having one of these two options would be beneficial in scenarios like this one. Which option do you prefer?

If you prefer the second option, I might suggest changing the name as I had previously to something like ILI9341_STM_DMA_ON_LIMIT: if the user's going to be configuring this somewhere outside of the library (by defining it in platforms.local.txt or platformio.ini or a Makefile or whereever), it would be nice for the name to provide an indicator which library it's affecting.

I'm happy to proceed with whichever option you'd like, as either one allows me the option of my desired (forced DMA) behaviour.

stevstrong commented 3 years ago

I think the second option would be acceptable, including rename.

kiki-lamb commented 3 years ago

Done, let me know if it looks good.

stevstrong commented 3 years ago

it looks fine, I will merge it.