Pyocd speed under Mbed Studio for NUCLEO_F767ZI (25.5sec -> 4.9sec)

I would like to share some recent speedups I made in my pyOCD setup. This was done for an STM32F767 on a NUCLEO_F767ZI board (with ST-Link v2) under Mac OS X, and optimized to run directly from Mbed Studio, so it is a mix of Mbed Studio, Mac and pyOCD issues. I list them all at the end; please advise if you can follow up on any of them.

The default setup for programming a small example program (65KB) under Mbed Studio takes 25.5sec. The STM32F767 has 2MB of flash, and Mbed Studio runs a chip erase by default. Mbed Studio also hardcodes the SWD frequency to 1.8MHz, for a reason unknown to me. Neither is configurable... How can we hack around it with pyOCD? Project user scripts!

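import pyocd.flash.loader

def will_connect(board):
    # Raise the SWD clock to 4MHz. `probe` comes from the user script's
    # global namespace.
    probe.set_clock(int(4e6))

    # Force sector erase instead of chip erase: wrap FlashLoader.__init__
    # and override the erase mode after the original constructor has run.
    flash_loader_init = pyocd.flash.loader.FlashLoader.__init__
    def init_and_overwrite_chip_erase(self, *args, **kwargs):
        flash_loader_init(self, *args, **kwargs)
        self._chip_erase = 'sector'
    pyocd.flash.loader.FlashLoader.__init__ = init_and_overwrite_chip_erase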
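
The wrapping trick in the script above can be tried without any hardware. This is a minimal sketch, assuming a stand-in class `FakeFlashLoader` and a helper `force_attribute` (both hypothetical, neither is part of pyOCD); it only shows the pattern of wrapping `__init__` and overriding an attribute afterwards.

```python
class FakeFlashLoader:
    """Hypothetical stand-in for a loader class; this is not the pyOCD API."""

    def __init__(self, chip_erase="chip"):
        self._chip_erase = chip_erase


def force_attribute(cls, name, value):
    """Wrap cls.__init__ so that `name` is set to `value` after construction."""
    original_init = cls.__init__

    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        setattr(self, name, value)

    cls.__init__ = patched_init
    return original_init  # returned so the patch can be undone later


original = force_attribute(FakeFlashLoader, "_chip_erase", "sector")
loader = FakeFlashLoader(chip_erase="chip")
print(loader._chip_erase)  # the caller asked for "chip", the patch wins
```

Returning the original `__init__` makes it possible to restore the class, which is worth having because the patch is global to the process.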

It's ugly, but it works. It is even uglier because Mbed Studio sets the project directory to... the Python binary directory instead of the project itself, so on Mac I have to put my user script in '/Library/Application Support/Mbed Studio/mbed-studio-tools/python/bin', where it is shared across all projects.

This change alone reduces the time to 12.5sec, but we can do better.

To understand where the time goes, two techniques are useful: log message timestamps and a profiler.

Log message timestamps give a very nice timeline of what is happening, but it is sometimes hard to pinpoint where in the code the problem is.
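
One way to read such a timeline is to rank the gaps between consecutive messages. The sketch below assumes a made-up log format of "seconds message" per line and made-up sample data; the real pyOCD log format is different and would need its own parsing. `largest_gaps` is a hypothetical helper.

```python
def largest_gaps(lines, top=3):
    """Return (gap_seconds, message_before_gap) pairs, largest gap first."""
    stamped = []
    for line in lines:
        timestamp, _, message = line.partition(" ")
        stamped.append((float(timestamp), message))
    gaps = [
        (later[0] - earlier[0], earlier[1])
        for earlier, later in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, reverse=True)[:top]


# Made-up timeline, for illustration only.
sample = [
    "0.000 connect",
    "0.250 erase",
    "2.250 program",
    "2.500 disconnect",
]

for gap, message in largest_gaps(sample):
    print(f"{gap:.3f}s after '{message}'")
```

Each gap is attributed to the message logged just before it, which points at the step that was running while the time passed.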

To run the profiler I simply used cProfile:

python -m cProfile -s cumulative venv3/bin/pyocd ...pyocd...flags...

This immediately shows another bottleneck: disconnecting.

5.006s coresight_target.py:301(disconnect)
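
The same kind of report can be produced from inside Python with `cProfile` and `pstats`. This is a small sketch on a dummy workload; `slow_step`, `fast_step` and `workload` are hypothetical placeholders, not pyOCD functions.

```python
import cProfile
import io
import pstats
import time


def slow_step():
    time.sleep(0.05)  # stands in for a step that waits on a timeout


def fast_step():
    return sum(range(1000))


def workload():
    fast_step()
    slow_step()


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time, as with `-s cumulative` on the command line.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the call chain that contains the wait at the top of the report, which is how a stall such as the disconnect timeout stands out.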

It looks like, for a reason unknown to me, the NUCLEO_F767ZI never confirms the power-down of the debug system, so disconnect waits for the hardcoded 5sec timeout. This should probably be configurable somehow, but it can also be hacked with a user script to save 4.9sec:

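def will_connect(board):
    # If this lives in the same script as the will_connect() above, merge the
    # two bodies: a second definition replaces the first one.
    pyocd.coresight.dap.DP_POWER_REQUEST_TIMEOUT = 0.1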
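
Why the timeout value translates directly into wasted time can be seen from a generic poll-until-acknowledged loop. This is only an illustration of the mechanism, not pyOCD's actual code; `wait_for_ack` and its parameters are hypothetical names.

```python
import time


def wait_for_ack(read_ack, timeout, poll_interval=0.001):
    """Poll read_ack() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_ack():
            return True
        time.sleep(poll_interval)
    return False


# A target that never acknowledges costs the full timeout on every run.
start = time.monotonic()
acked = wait_for_ack(lambda: False, timeout=0.1)
elapsed = time.monotonic() - start
print(acked, f"{elapsed:.2f}s")
```

With a target that acknowledges promptly the loop returns almost immediately, so a short timeout only hurts if the acknowledgement is real but slow.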

At this point there is no major chunk of wasted time, but we can still do better.

The first problem is Mac-specific.

0.813s helpers.py:118(choose_probe)

Probe lookup is slow. The reason is simple: the implementation in 'probe/stlink/detect/darwin.py' is slow. Unfortunately it can't be hacked with a user script, because the user script code is executed after the probe lookup (but thanks to the open-source nature of pyOCD, the file itself can still be tweaked). Ideally Mbed Studio would just pass the already identified probe to pyOCD.

The last easy speedup comes from two pyOCD options: keep_unwritten and smart_flash. Since in most cases I need to program a completely new binary to the chip, these options slow the process down instead of speeding it up. The first one makes sure that if we write only part of a sector, the remaining part is left unchanged. But if the binary you program is all you care about, it causes the existing garbage to be read back from the chip and appended to your binary to pad it to a full sector. The second one (smart_flash) computes a CRC32 of the existing data to avoid writing sectors that have not changed. Again, my binary almost always changes (adding or removing a single byte in the first block of the binary is enough to shift the contents of all following sectors), so disabling it speeds up programming. Disabling both options gives a 0.9sec speedup.
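
The point about a single inserted byte can be checked with a toy comparison of per-sector CRC32 values. This is a sketch of the idea only, not how pyOCD implements smart_flash; the 16-byte `SECTOR_SIZE`, the helper names and the sample images are all made up to keep the example short.

```python
import zlib

SECTOR_SIZE = 16  # hypothetical tiny sector size, for illustration only


def sector_crcs(image, sector_size=SECTOR_SIZE):
    """CRC32 of each sector-sized slice of the image."""
    return [
        zlib.crc32(image[i:i + sector_size])
        for i in range(0, len(image), sector_size)
    ]


def changed_sectors(old, new, sector_size=SECTOR_SIZE):
    """Indices of sectors whose CRC differs or that exist in only one image."""
    old_crcs = sector_crcs(old, sector_size)
    new_crcs = sector_crcs(new, sector_size)
    count = max(len(old_crcs), len(new_crcs))
    return [
        i for i in range(count)
        if i >= len(old_crcs) or i >= len(new_crcs) or old_crcs[i] != new_crcs[i]
    ]


old_image = bytes(range(64))             # 4 sectors
patched = bytes([0xFF]) + old_image[1:]  # one byte changed in place
shifted = bytes([0xFF]) + old_image      # one byte inserted at the start

print(changed_sectors(old_image, patched))  # only the first sector differs
print(changed_sectors(old_image, shifted))  # every sector differs
```

When the change is in place, skipping unchanged sectors pays off; when everything shifts, the CRC pass is pure overhead because all sectors must be rewritten anyway.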

In total, all these tweaks decrease programming time in Mbed Studio from 25.5sec to 4.9sec. So after programming 30 times, you save 10 minutes of your life (and have spent 2 hours on optimizing pyOCD :D ).
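
The arithmetic behind that joke, using only the figures quoted above (the break-even count is derived here and is not a number from the measurements):

```python
import math

before_s = 25.5  # programming time before the tweaks
after_s = 4.9    # programming time after the tweaks
runs = 30
tuning_time_s = 2 * 3600  # the 2 hours spent optimizing

saved_per_run_s = before_s - after_s
saved_minutes = saved_per_run_s * runs / 60
break_even_runs = math.ceil(tuning_time_s / saved_per_run_s)

print(f"saved per run: {saved_per_run_s:.1f}s")
print(f"saved after {runs} runs: {saved_minutes:.1f} min")
print(f"runs needed to pay back the tuning time: {break_even_runs}")
```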

Can we do better?

Let's look at STM32_Programmer_CLI: it takes 2.1sec to write the test binary (1.52sec of actual programming). Where are the next improvements hidden?

The profiler gives some hints: 3.9sec (out of 5.1sec total) is spent in file_programmer.py:94(program). So the speedup can come from two places:

Initialization: 1.2sec to connect to the target is quite a lot.
Programming is done page by page. This takes a lot of back-and-forth communication between host and probe and does not use bulk transfers effectively. It should be possible to use more RAM on the chip to program multiple pages at once.

Issues mentioned above. Please let me know if you can follow up with Mbed Studio or recommend an alternative solution:

Mbed Studio: allow configuration of pyOCD flags.
Mbed Studio: set the project directory correctly (the project directory instead of the Python bin directory).
pyOCD disconnect timeout: make it configurable, or decrease the default to 0.1sec.
Speed up probe/stlink/detect/darwin.py, or allow passing the probe configuration without running the detection code.
Recommend the keep_unwritten=false and smart_flash=false options in the documentation to speed up programming.
Program multiple pages at once without halting the core between pages.

edit: sector vs page


Pyocd speed under Mbed Studio for NUCLEO_F767ZI (25.5sec -> 4.9sec) #892