I did a comparison between pyOCD and OpenOCD flashing a 20 kB file to an STM32F103. pyOCD's runtime was 3.9 s vs. 2.45 s for OpenOCD.
Obviously a big part of that is simply that Python is slower than C. Are you using Python 3.9? It's noticeably faster than 2.7, or even earlier 3.x versions. Another option is pypy3.
There are probably multiple optimizations that could be implemented within pyocd, but I'm not aware of anything that's low hanging fruit. Profiling like you're doing is the best way to start.
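If it helps, here's one way to capture a profile using just the standard library (main() below is only a stand-in for whatever entry point you're timing, not a pyocd function):

```python
import cProfile
import pstats

def main():
    # Stand-in for the actual flash operation being profiled.
    sum(i * i for i in range(1_000_000))

# Dump profile data to a file, then print the hottest call paths.
cProfile.run("main()", "flash.prof")
pstats.Stats("flash.prof").sort_stats("cumulative").print_stats(20)
```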
My tests were with Python 3.7.9. I'll try 3.9.1 to see if it is any faster. Do you know if pyOCD works with pypy3? My experience with it was not great in the past.
I've noticed pyOCD only lightly loads 2 of 4 cores. If my Python programming skills were better, I'd consider modifying the AP code to use a separate thread for generating the sequence of commands required by the flash algorithm.
Python 3.9 doesn't run on Win7. I just tried upgrading to 3.8.7, and pyOCD is a bit faster now, taking 3.6s instead of 3.9s.
I haven't tested a ton with pypy3, but I didn't run into any problems. YMMV. (And I've had plenty of trouble with it in other cases.)
The problem with multithreading in CPython is the GIL: the Global Interpreter Lock. That about says it all. Multithreading only improves performance in CPython if your threads are mostly IO-bound, or if they call into C-based libraries that release the GIL while they run. The multiprocessing package makes it easy to use separate processes for performance instead of threads, but the overhead would probably make any gains disappear. There are benefits to using Python, but also some serious drawbacks.
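Just to make the shape of that concrete, here's a minimal multiprocessing sketch (build_ap_commands() is a made-up stand-in for the CPU-bound command generation, not anything in pyocd):

```python
import multiprocessing as mp

def build_ap_commands(page_addr):
    # Placeholder for the CPU-bound work of generating the AP
    # read/write command sequence for one page.
    return [addr for addr in range(page_addr, page_addr + 1024, 4)]

if __name__ == "__main__":
    pages = [0x08000000 + i * 1024 for i in range(16)]
    with mp.Pool(processes=2) as pool:
        # imap keeps results ordered while the main process consumes them;
        # this is where the pickling/IPC overhead comes in.
        for commands in pool.imap(build_ap_commands, pages):
            pass  # the main process would send these to the probe
```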
One option we've talked about is to have sort of a meta-flash-algorithm, code that runs on the target to orchestrate calls to the algo. For instance, instead of having to call-and-wait for every sector to be erased, drop an array of sector addresses in RAM and call the orchestrator, which will then call the algo for each address. The question is, would it really improve performance when double-buffering is being used? Usually the flash operations are the real bottleneck.
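Roughly what the host side of that could look like (purely hypothetical: run_target_routine() and the RAM table layout are made up for illustration; write_memory_block32() follows pyocd's naming):

```python
SECTOR_LIST_ADDR = 0x20000800   # assumed free target RAM for the table

def erase_sectors(target, sector_addresses, run_target_routine):
    # Table layout assumed by the on-target orchestrator:
    # word 0 = sector count, words 1..n = sector addresses.
    table = [len(sector_addresses)] + list(sector_addresses)
    target.write_memory_block32(SECTOR_LIST_ADDR, table)
    # A single call-and-wait erases every sector; the orchestrator
    # loops over the table on-target instead of the host looping.
    run_target_routine("orchestrate_erase", arg=SECTOR_LIST_ADDR)
```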
It's been several years since I looked at flash programming performance in detail, and there may very well be areas that could be improved that I'm not thinking of.
Well, that's something at least…
One thing that may improve pyocd performance is how it handles data buffers. Instead of using an efficient type like array.array or bytearray, it uses lists of ints. This unfortunate design choice was deeply ingrained from the beginning. I've had fixing it on my todo list for a long time (probably waiting until after dropping Python 2 would be a good idea). I wonder if it would even make a measurable difference, since for the most part the buffers that pyocd works with are quite small.
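For the curious, the difference in representation looks like this (just an illustration, not pyocd code):

```python
import array

# pyocd historically passes data around as lists of ints:
data_as_list = [0xFF] * 1024                     # one Python object per byte

# array.array and bytearray store the same data compactly:
data_as_array = array.array("B", data_as_list)   # contiguous unsigned bytes
data_as_bytes = bytearray(data_as_list)          # also contiguous, mutable

# Conversion at an API boundary is cheap in either direction:
back_to_list = list(data_as_bytes)
```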
I also don't think the data buffer handling is a performance problem; if it were, then I think write_block32 would be slow. It's too bad the flash algorithms weren't simpler (and therefore more efficient). The way I'd do it is have the code that is loaded into SRAM loop through a chunk of data to write to flash, and jump back to the start. I'd have the debug probe set a breakpoint at the branch instruction, and continue execution when the next chunk of data is transferred to SRAM.
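A rough host-side sketch of that scheme (the addresses are made up, and the halt-polling helper is an assumption; pyocd's set_breakpoint()/resume() calls exist, but check the exact API for your version):

```python
import time

CHUNK_ADDR = 0x20000400   # assumed SRAM staging buffer
BRANCH_ADDR = 0x200000F0  # assumed address of the loop's back-branch

def wait_halted(target):
    # Poll until the core halts at the breakpoint. The exact state
    # query may differ between pyocd versions.
    while not target.is_halted():
        time.sleep(0.001)

def program_chunks(target, chunks):
    target.set_breakpoint(BRANCH_ADDR)
    for chunk in chunks:
        # Stage the next chunk while the core sits halted at the branch.
        target.write_memory_block32(CHUNK_ADDR, chunk)
        target.resume()        # the SRAM loop writes this chunk to flash
        wait_halted(target)    # it halts at the branch again when done
    target.remove_breakpoint(BRANCH_ADDR)
```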
As for simpler ways of getting pyOCD to run faster, I just tried nuitka with a small program that failed with pypy on Windows 7. The nuitka-compiled version works fine. I'm not sure how to get it to compile a whole package like pyOCD, but from skimming the docs, I think nuitka can do that.
@flit pyOCD is really slow when flashing a large application. Setting keep_unwritten=False can improve flashing performance; we used that approach for the NXP i.MX RT series. I don't know why, but it helps.
@Hoohaha That's a really good point. By default, pyocd will try to a) only erase and program sectors that change, and b) not lose data in a partially-programmed sector. These behaviours are controlled by the smart_flash and keep_unwritten options, respectively. smart_flash requires reading the existing flash contents. Whether it really improves performance or slows things down depends a lot on how you're using pyocd, i.e. writing entirely new code versus frequent updates of mostly the same code in a development cycle. I'm not sure exactly why keep_unwritten would noticeably affect performance (it's been a while since I wrote that code), but the large sector sizes of the QSPI devices used with the RT series probably play a role in what you're seeing.
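For anyone wanting to try this, both options can be set from the Python API along these lines (a sketch based on pyocd's documented session options; double-check the names against your version, and "firmware.hex" is just a placeholder):

```python
from pyocd.core.helpers import ConnectHelper
from pyocd.flash.file_programmer import FileProgrammer

# Disable both behaviours discussed above.
options = {
    "smart_flash": False,     # don't read/compare existing flash contents
    "keep_unwritten": False,  # don't preserve unwritten parts of sectors
}

with ConnectHelper.session_with_chosen_probe(options=options) as session:
    FileProgrammer(session).program("firmware.hex")
```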
@nerdralph I didn't know about nuitka, very cool! I had been thinking about using cython.
The way I'd do it is have the code that is loaded into SRAM loop through a chunk of data to write to flash, and jump back to the start. I'd have the debug probe set a breakpoint at the branch instruction, and continue execution when the next chunk of data is transferred to SRAM.
That's basically how it works. First the algo is loaded to RAM and inited. (Sectors erased first.) Then a buffer of data is copied to RAM, and the algo started running. It loops over the data, writing to flash in whatever size the flash word is (usually 4, 8, or 16 bytes, though more modern ~40nm parts have 256- or 512-byte words). When done, it returns to a bkpt instruction. While the algo is running, a second buffer is being filled with the next page to write, and the buffer ping-ponging begins. On almost all devices, pyocd is finished writing the next buffer before the algo completes for the previous.
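In pseudo-Python, the ping-ponging looks roughly like this (start_algo() and wait_algo_done() are stand-ins for pyocd's internals, and the buffer addresses are made up):

```python
BUF0, BUF1 = 0x20000400, 0x20000800   # assumed SRAM page buffers

def program_pages(target, pages, start_algo, wait_algo_done):
    busy = False
    for i, page_data in enumerate(pages):
        buf = (BUF0, BUF1)[i % 2]
        # Fill the idle buffer while the algo drains the other one.
        target.write_memory_block32(buf, page_data)
        if busy:
            wait_algo_done(target)   # usually already finished by now
        start_algo(target, buf)      # algo writes this buffer to flash
        busy = True
    if busy:
        wait_algo_done(target)       # let the final page complete
```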
I've been trying to optimize pyOCD flash write speed with my debug probe. After configuring trace debug logs (thanks Chris!), I noticed less than half the time is spent doing block writes. Most of the time seems to be spent generating the AP read and write commands in between block transfers. Here's an excerpt from the logs:
It takes until the 2504ms timestamp to get a full packet to write. In total, from the end of one write_block32 to the start of the next is 45 ms. pyOCD seems to be doing some sort of caching of the parsed flash algorithm, since the time between the 1st and 2nd block writes is about twice as long (~100 ms). Is there any easy way to speed this up? The host CPU is a 64-bit 3.2 GHz Intel Core.