pyocd / pyOCD

Open source Python library for programming and debugging Arm Cortex-M microcontrollers
https://pyocd.io
Apache License 2.0

flash algorithm execution speed #1079

Closed nerdralph closed 3 years ago

nerdralph commented 3 years ago

I've been trying to optimize pyOCD flash write speed with my debug probe. After configuring trace debug logs (thanks Chris!), I noticed less than half the time is spent doing block writes. Most of the time seems to be spent generating the AP read and write commands in between block transfers. Here's an excerpt from the logs:

0002482:DEBUG:ap:_write_block32:001014 }
0002482:DEBUG:ap:read_mem:001018 (ap=0x0; addr=0xe000edf0, size=32) {
0002482:DEBUG:ap:write_ap:001019 cached (ap=0x0; addr=0x00000000) = 0x23000012
0002482:DEBUG:dap:write_ap:001020 (addr=0x00000004) = 0xe000edf0
0002482:DEBUG:dap_access_cmsis_dap:get_request_space(1, 05:w)[wc=7, rc=0, ba=1->0] -> (sz=1, free=5, delta=-247)
0002482:DEBUG:dap_access_cmsis_dap:add(1, 05:w) -> [wc=8, rc=0, ba=0]
0002482:DEBUG:dap_access_cmsis_dap:get_request_space(1, 0f:r)[wc=8, rc=0, ba=0->0] -> (sz=1, free=15, delta=-246)
0002482:DEBUG:dap_access_cmsis_dap:add(1, 0f:r) -> [wc=8, rc=1, ba=0]
0002483:DEBUG:dap:read_ap:001021 (addr=0x0000000c) -> ...
0002483:DEBUG:dap_access_cmsis_dap:New _Command
0002485:DEBUG:dap:read_ap:001021 ...(addr=0x0000000c) -> 0x01030003
0002485:DEBUG:ap:read_mem:001018 (ap=0x0; addr=0xe000edf0, size=32) -> 0x01030003 }
0002485:DEBUG:ap:read_mem:001022 (ap=0x0; addr=0xe000edf0, size=32) {
0002485:DEBUG:ap:write_ap:001023 cached (ap=0x0; addr=0x00000000) = 0x23000012
0002485:DEBUG:dap:write_ap:001024 (addr=0x00000004) = 0xe000edf0
0002485:DEBUG:dap_access_cmsis_dap:get_request_space(1, 05:w)[wc=0, rc=0, ba=1->1] -> (sz=1, free=14)
0002485:DEBUG:dap_access_cmsis_dap:add(1, 05:w) -> [wc=1, rc=0, ba=1]
0002485:DEBUG:dap_access_cmsis_dap:get_request_space(1, 0f:r)[wc=1, rc=0, ba=1->0] -> (sz=1, free=15, delta=-253)
0002485:DEBUG:dap_access_cmsis_dap:add(1, 0f:r) -> [wc=1, rc=1, ba=0]
0002485:DEBUG:dap:read_ap:001025 (addr=0x0000000c) -> ...
0002486:DEBUG:dap_access_cmsis_dap:New _Command
0002488:DEBUG:dap:read_ap:001025 ...(addr=0x0000000c) -> 0x00030003
0002488:DEBUG:ap:read_mem:001022 (ap=0x0; addr=0xe000edf0, size=32) -> 0x00030003 }
0002488:DEBUG:ap:write_mem:001026 (ap=0x0; addr=0xe000edf4, size=32) = 0x00000000 {
0002488:DEBUG:ap:write_ap:001027 cached (ap=0x0; addr=0x00000000) = 0x23000012
0002488:DEBUG:dap:write_ap:001028 (addr=0x00000004) = 0xe000edf4
0002488:DEBUG:dap_access_cmsis_dap:get_request_space(1, 05:w)[wc=0, rc=0, ba=1->1] -> (sz=1, free=14)

It takes until the 2504ms timestamp to get a full packet to write. In total, the gap from the end of one write_block32 to the start of the next is 45ms. pyOCD seems to be doing some sort of caching of the parsed flash algorithm, since the gap between the 1st and 2nd block writes is about twice as long (~100ms). Is there any easy way to speed this up? The host is a 64-bit 3.2GHz Intel Core CPU.

nerdralph commented 3 years ago

I did a comparison between pyOCD and OpenOCD flashing a 20kB file to an STM32F103. pyOCD took 3.9s versus 2.45s for OpenOCD.

flit commented 3 years ago

Obviously a big part of that is simply that Python is slower than C. Are you using Python 3.9? It's noticeably faster than 2.7, or even earlier 3.x versions. Another option is pypy3.

There are probably multiple optimizations that could be implemented within pyocd, but I'm not aware of anything that's low hanging fruit. Profiling like you're doing is the best way to start.

nerdralph commented 3 years ago

My tests were with Python 3.7.9. I'll try 3.9.1 to see if it is any faster. Do you know if pyOCD works with pypy3? My experience with it was not great in the past.

I've noticed pyOCD only lightly loads 2 of 4 cores. If my Python programming skills were better, I'd consider modifying the AP code to use a separate thread for generating the sequence of commands required by the flash algorithm.

nerdralph commented 3 years ago

Python 3.9 doesn't run on Win7. I just tried upgrading to 3.8.7, and pyOCD is a bit faster now, taking 3.6s instead of 3.9s.

flit commented 3 years ago

I haven't tested a ton with pypy3, but I didn't run into any problems. YMMV. (And I've had plenty of trouble with it in other cases.)

The problem with multithreading in CPython is the GIL (Global Interpreter Lock), which about says it all. Multithreading only improves performance in CPython if your threads are mostly IO-bound, or call into C-based libraries that release the GIL while they run. The multiprocessing package makes it easy to use separate processes instead of threads, but its overhead would probably erase any gains. There are benefits to using Python, but also some serious drawbacks.

One option we've talked about is a sort of meta-flash-algorithm: code that runs on the target to orchestrate calls to the algo. For instance, instead of having to call and wait for every sector to be erased, drop an array of sector addresses in RAM and call the orchestrator, which then calls the algo for each address. The question is whether it would really improve performance when double-buffering is being used; usually the flash operations themselves are the real bottleneck.
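A rough host-side cost model of the trade-off (all names and timings here are made up for illustration, not measured pyocd numbers):

```python
# Hypothetical cost model of the "meta-flash-algorithm" idea: instead of one
# call-and-wait round trip per sector erase, the host drops a list of sector
# addresses into target RAM and makes a single call to an on-target
# orchestrator that loops over them itself.

PER_CALL_OVERHEAD_MS = 5.0   # host<->probe<->target round trip (assumed)
ERASE_TIME_MS = 20.0         # time the flash itself needs per sector (assumed)

def erase_per_sector(sector_addrs):
    """Current scheme: one call-and-wait round trip per sector."""
    return len(sector_addrs) * (PER_CALL_OVERHEAD_MS + ERASE_TIME_MS)

def erase_with_orchestrator(sector_addrs):
    """Meta-algorithm: one round trip; the erase loop runs entirely on-target."""
    return PER_CALL_OVERHEAD_MS + len(sector_addrs) * ERASE_TIME_MS

sectors = [0x08000000 + i * 0x400 for i in range(20)]  # 20 x 1 kB sectors
print(erase_per_sector(sectors))         # 500.0
print(erase_with_orchestrator(sectors))  # 405.0
```

The model makes the open question concrete: the savings are exactly one round trip per sector, so if the round-trip overhead is already hidden behind double-buffering or dwarfed by the erase time, the orchestrator buys little.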

It's been several years since I looked at flash programming performance in detail, and there may very well be areas that could be improved that I'm not thinking of.

flit commented 3 years ago

Well, that's something at least…

One thing that may improve pyocd performance is how it handles data buffers. Instead of using an efficient type like array.array or bytearray, it uses lists of ints. This unfortunate design choice was deeply ingrained from the beginning. Fixing it has been on my todo list for a long time (probably best to wait until after dropping Python 2). I wonder if it would even make a measurable difference, since for the most part the buffers that pyocd works with are quite small.
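For what it's worth, a tiny sketch of the representational difference (not pyocd code; the data is illustrative):

```python
# A list of ints holds each 32-bit word as a separate Python object, so
# serializing it costs per-element Python-level work; array.array stores the
# same words contiguously and converts to bytes with one C-level copy.
import array
import sys

words = [0x23000012, 0xE000EDF0, 0x01030003]

# list-of-ints -> bytes: one to_bytes() call per element
as_list = b"".join(w.to_bytes(4, sys.byteorder) for w in words)

# array.array -> bytes: a single tobytes() memcpy
# ('I' is a 4-byte unsigned int on mainstream CPython platforms)
as_array = array.array("I", words).tobytes()

assert as_list == as_array
```

The result is byte-identical either way; the difference is only in per-element overhead, which is why it likely only matters for large transfers.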

nerdralph commented 3 years ago

I also don't think the data buffer handling is a performance problem; if it were, write_block32 would be slow too. It's too bad the flash algorithms aren't simpler (and therefore more efficient). The way I'd do it is have the code loaded into SRAM loop through a chunk of data to write to flash, then jump back to the start. The debug probe would set a breakpoint at the branch instruction and continue execution once the next chunk of data has been transferred to SRAM.

As for simpler ways of getting pyOCD to run faster: I just tried Nuitka with a small program that had failed with pypy on Windows 7, and the Nuitka-compiled version works fine. I'm not sure how to get it to compile a whole package like pyOCD, but from skimming the docs, I think Nuitka can do that.
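Per Nuitka's docs, compiling a program together with everything it imports looks roughly like this (flag names from the Nuitka manual; untested with pyOCD itself, and the script name is a placeholder):

```shell
# Compile a driver script plus all the packages it imports (pyocd included)
# into a self-contained binary; --standalone bundles the Python runtime too.
python -m nuitka --standalone --follow-imports my_flash_script.py
```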

Hoohaha commented 3 years ago

@flit pyOCD is really slow when flashing a large application. Setting keep_unwritten=False improves the flashing performance; we used that approach for the NXP i.MX RT series. I don't know why it helps, but it does.

flit commented 3 years ago

@Hoohaha That's a really good point. By default, pyocd will try to a) only erase and program sectors that change, and b) not lose data in a partially-programmed sector. These behaviours are controlled by the smart_flash and keep_unwritten options, respectively. smart_flash requires reading existing flash contents. Whether it really improves performance or slows it down depends a lot on how you're using pyocd, i.e. writing entirely new code versus frequent updates of mostly the same code in a development cycle. I'm not sure exactly why keep_unwritten would noticeably affect performance (it's been a while since I wrote that code), but the large sector sizes of the QSPI devices used with the RT series probably play a role in what you're seeing.
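For anyone trying this: both options can be turned off per-invocation with -O, which sets pyocd session options on the command line (the firmware filename below is a placeholder):

```shell
# Skip the read-compare of existing flash contents and the read-back of
# partially written sectors; erases and programs every touched sector blindly.
pyocd flash -O smart_flash=false -O keep_unwritten=false firmware.elf
```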

@nerdralph I didn't know about nuitka, very cool! I had been thinking about using cython.

The way I'd do it is have the code that is loaded into SRAM loop through a chunk of data to write to flash, and jump back to the start. I'd have the debug probe set a breakpoint at the branch instruction, and continue execution when the next chunk of data is transferred to SRAM.

That's basically how it works. First the algo is loaded to RAM and initialized (sectors are erased first). Then a buffer of data is copied to RAM, and the algo started running. It loops over the data, writing to flash in whatever size the flash word is (usually 4, 8, or 16 bytes, though more modern ~40nm parts have 256- or 512-byte words). When done, it returns to a bkpt instruction. While the algo is running, a second buffer is filled with the next page to write, and the buffer ping-ponging begins. On almost all devices, pyocd finishes writing the next buffer before the algo completes the previous one.
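That ping-pong pipeline can be modeled roughly like this (illustrative timings only, not real pyocd measurements):

```python
# Toy model of double-buffered flash programming: while the algo programs
# buffer A on-target, the host fills buffer B over the wire, so the fill time
# is hidden behind programming after the first page.

def program_single_buffered(pages, fill_ms, program_ms):
    """No overlap: fill a buffer, program it, repeat."""
    return pages * (fill_ms + program_ms)

def program_double_buffered(pages, fill_ms, program_ms):
    """Ping-pong: only the first fill is exposed, assuming fill_ms <= program_ms
    (true on almost all devices, per the comment above)."""
    return fill_ms + pages * program_ms

print(program_single_buffered(64, 3.0, 10.0))  # 832.0
print(program_double_buffered(64, 3.0, 10.0))  # 643.0
```

This is also why host-side overhead between blocks (as profiled at the top of this issue) matters: whenever fill time exceeds program time, the pipeline stalls and the flash sits idle.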