pyocd / pyOCD

Open source Python library for programming and debugging Arm Cortex-M microcontrollers
https://pyocd.io
Apache License 2.0

PyOCD too pessimistic with occasional SWD communication glitches? #1165

Open · ahhuhtal opened this issue 3 years ago

ahhuhtal commented 3 years ago

I'm receiving occasional memory transfer faults while code is running (seemingly always while querying the target state in gdbserver.py), resulting in the following trace:

0183965:WARNING:gdbserver:Exception while target was running: Memory transfer fault (read) @ 0xe000edf0-0xe000edf7
Traceback (most recent call last):
  File "/home/antti/.local/lib/python3.8/site-packages/pyocd/gdbserver/gdbserver.py", line 609, in resume
    if self.target.get_state() == Target.State.HALTED:
  File "/home/antti/.local/lib/python3.8/site-packages/pyocd/coresight/cortex_m.py", line 831, in get_state
    dhcsr = self.read_memory(CortexM.DHCSR)
  File "/home/antti/.local/lib/python3.8/site-packages/pyocd/coresight/cortex_m.py", line 436, in read_memory
    result = self.ap.read_memory(addr, transfer_size, now)
  File "/home/antti/.local/lib/python3.8/site-packages/pyocd/probe/stlink_probe.py", line 255, in read_memory
    result = conversion.byte_list_to_u32le_list(self._link.read_mem32(addr, 4, self._apsel))[0]
  File "/home/antti/.local/lib/python3.8/site-packages/pyocd/probe/stlink/stlink.py", line 385, in read_mem32
    return self._read_mem(addr, size, Commands.JTAG_READMEM_32BIT, self.MAXIMUM_TRANSFER_SIZE, apsel)
  File "/home/antti/.local/lib/python3.8/site-packages/pyocd/probe/stlink/stlink.py", line 343, in _read_mem
    raise exc
pyocd.core.exceptions.TransferFaultError: Memory transfer fault (read) @ 0xe000edf0-0xe000edf7

I don't quite understand the mechanism, but this fault causes the debug session to fail. This is quite annoying, as the fault sometimes occurs within 5 minutes.

My SWD signal integrity is probably not the best, which may explain the origins of the faults themselves. However, just as a test, I added a retry loop to the read_memory function in stlink_probe.py. It turns out that in my case the read succeeds on the next try, and the debug session continues to work despite a single read failure.

Would it make sense for PyOCD to allow a few retries before raising an exception in such a case?
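
To illustrate, the hack is roughly something like the following. This is only a sketch, not the exact code I have in place: the retry count is arbitrary, the wrapper is simplified into a standalone helper, and the read_memory call follows the probe API as I understand it.

from pyocd.core.exceptions import TransferFaultError

def read_word_with_retries(probe, addr, retries=3):
    # Read one 32-bit word via the probe, retrying a few times if the
    # transfer faults. Only intended for reads that are safe to repeat.
    for attempt in range(retries + 1):
        try:
            return probe.read_memory(addr, transfer_size=32)
        except TransferFaultError:
            if attempt == retries:
                raise  # give up; behave as before on persistent failures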


flit commented 3 years ago

Whether to perform retries on faults is a good question. Generally you'd expect the connection to be solid and want to know immediately if there is a failure. However, I can see the potential need for a little resilience in the face of faults for situations like yours.

A big part of why pyocd doesn't already do this is that it's not necessarily safe to retry memory transfers, specifically of Device memory (system and peripheral registers). Even reads can have side effects.
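
To make that concrete, here is a toy illustration (plain Python, not pyOCD code) of a read-to-clear status register, a common pattern in peripheral memory, where a blind retry silently destroys state:

class ReadToClearStatusRegister:
    # Models a peripheral status register whose flags are cleared by reading it.
    def __init__(self):
        self.flags = 0b1010  # pending event flags

    def read(self):
        value = self.flags
        self.flags = 0       # the read itself clears the register
        return value

reg = ReadToClearStatusRegister()
first = reg.read()   # suppose this response is lost to a glitch on the wire
retry = reg.read()   # a blind retry returns 0; the original flags are gone
print(f"first read: {first:#06b}, retried read: {retry:#06b}")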

Another question: if the root cause of the faults in your case is an unreliable connection caused by the pogo pins, why doesn't the parity check on the SWD transfer catch the error? Of course, that's only one parity bit per 32 bits of data. But if the failures are indeed random, it should trigger a parity failure in at least some of the cases. A parity check failure will appear as a TransferError exception instead of TransferFaultError.
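
For reference, the SWD data phase carries a single even-parity bit over the 32 data bits, i.e. the XOR of all the bits, so any glitch that flips an even number of bits slips through. An illustrative snippet (not pyOCD code):

def swd_data_parity(word):
    # Even parity over 32 data bits: the parity bit is the XOR of all bits.
    return bin(word & 0xFFFFFFFF).count("1") & 1

data = 0xE000EDF0                       # e.g. the DHCSR address from the trace
corrupt_one_bit = data ^ 0x00000010     # one flipped bit: parity mismatch, detected
corrupt_two_bits = data ^ 0x00000011    # two flipped bits: parity matches, undetected
print(swd_data_parity(corrupt_one_bit) != swd_data_parity(data))   # True
print(swd_data_parity(corrupt_two_bits) != swd_data_parity(data))  # False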

Probably the best approach is to have an option that enables a certain number of retries on transfer failures.

That's really interesting that the STLink gdbserver doesn't fail! I'm curious how other debuggers handle this. I'll do some research and see if I find anything useful.

ahhuhtal commented 3 years ago

I of course agree that there are situations where a double read or write would have unexpected results. However, imho it's not very common to read or write such registers in the debugger. Those operations would also almost certainly be manually requested, so seeing a notification of the retry could alert the user to expect weird things. Anyway, an option to set the fault tolerance does sound like the best of both worlds.

Meanwhile, a new observation I made is that the issue seems to be sensitive to SWO activity. Not sure if it's crosstalk in my crappy setup, or if it's the ST-LINK struggling under the SWO load. I can try to acquire a logic trace of the SWD lines to see if there are parity errors.

With only minimal SWO traffic (just prints on ITM channel 0) I didn't see the issue at all over the course of several hours. However, when I enabled interrupt tracing, I immediately started seeing the exceptions again. Even so, with the retry loop hack I have in place, the session itself remained solid.

I didn't notice before, but it appears that the ST-LINK gdbserver is also hiccupping under the same conditions. It is outputting "Target is not responding, retrying..." messages, but the debug session continues without other issues. I have a suspicion that this is the exact same fault.

flit commented 3 years ago

Thanks for the update.

My guess is crosstalk. I'm pretty certain the STLink firmware uses DMA for SWO, and I've never heard of STLink issues related to SWO load.

I'll spend some time this weekend while working on pyocd to think about the optimal place to issue retries.

flit commented 3 years ago

Looking at the code some more, there are some pretty straightforward things that can be done to improve the resilience of certain operations in the gdbserver specifically. More general issuing of retries is a lot harder, and is far more susceptible to the potential undesirable side effects mentioned earlier. (The higher levels of code know what is being accessed and whether it's safe to retry, but that requires performing retries in possibly many places.)

ahhuhtal commented 3 years ago

Yeah, the particular point at which the exception has always occurred with me is on line 609 in gdbserver.py, while calling self.target.get_state(). Just changing the exception handling there would get rid of at least this particular version of the issue.

I got a logic trace of the signals at the ST-LINK end. There are no parity errors, at least if the protocol decoder is to be trusted.

It looks like the ST-LINK just decides to reset the line out of the blue. It requests a write to TAR and gets a good ACK, then proceeds to write 0xE000EDF0 to TAR, followed immediately by a 59-bit reset, some garbage, and a 56-bit reset. All of this is then followed later by a couple of ACK FAULTs from the target and further resets. Unfortunately I'm not familiar enough with the SWD protocol to understand exactly what is going on. Anyway, it makes me think that it might be an issue with the ST-LINK after all and not just my setup.

flit commented 3 years ago

Well, even if it's the STLink, this was a good discussion to have, and as I said, something that's been on my list for a long time.

The feature/fault_resiliance branch on my fork has an early prototype of a more resilient GDBServer.resume() method that will hopefully make it less sensitive to spurious faults. On the first fault, a warning is logged and a timeout timer starts running. If the CPU status is then read successfully, the timer is stopped and everything returns to normal. If the timer expires without a successful read, an error is logged and gdb is told the target faulted (the same thing that happened before on the first fault). The timeout duration is controlled by the debug.status_fault.retry_timeout option, which defaults to 1 minute.
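
In rough pseudocode, the status check behaves something like this. This is a paraphrase of the description above rather than the actual branch code, and the logger argument and helper name are just placeholders:

import time
from pyocd.core.exceptions import TransferError

RETRY_TIMEOUT = 60.0  # seconds; from the debug.status_fault.retry_timeout option

def get_state_with_retry(target, log):
    # Poll the target state, tolerating transfer faults for up to RETRY_TIMEOUT.
    first_fault = None
    while True:
        try:
            state = target.get_state()
        except TransferError as err:
            if first_fault is None:
                first_fault = time.monotonic()
                log.warning("Transfer error while checking target status; retrying: %s", err)
            elif time.monotonic() - first_fault > RETRY_TIMEOUT:
                log.error("Timed out retrying target status; reporting the fault to gdb")
                raise
            time.sleep(0.01)
            continue
        if first_fault is not None:
            log.info("Target control reestablished.")
        return state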

There are plenty of issues with this first pass. It really should be looking for a series of consecutive successful transfers before deciding things are back to normal, instead of just one. And if it gets lots of faults within a certain period it probably should just give up.

There's also the problem that faults are handled like this in only one place: while pyocd is continually checking status, waiting for the target to halt (e.g., a debug event) or the user to interrupt execution. Spurious faults during connection, or during any other gdb commands, are not gracefully handled.

But hopefully it's a start! 😄 If you can try it out, please let me know how it goes.

The STLink pretty clearly thinks something has gone wrong and is trying to get back to a working state. What's missing is why it decides to start sending the resets before the faults happen. (cc @schstm: if you get a chance to look at this, please see the previous comment with the info about the SWD trace. I'm also curious what your take is on handling spurious faults.)

schstm commented 3 years ago

Hello,

At the ST-Link firmware level (ST-Link/V2 and ST-Link/V2-1), a retry is done after an SWD raw communication error (no ACK), up to 3 times, and if the issue persists an SWD init sequence is launched (doing the SWD reset sequence). Note that this retry sequence is not very good, because in some cases (for instance when using the address auto-increment feature of the MCU) it may have unwanted side effects (the case of an ACK correctly sent by the target but badly received by the probe). That's why it was removed on STLINK-V3 probes.

Unless you want to gamble, I recommend securing the SWD communication, considering the first error as fatal, and reacting accordingly (such an issue after a write may be more serious than after a read). Securing the SWD communication means having clean connections: for instance, if you are using wires, choose them as short as possible, and if you suspect crosstalk or are working in a noisy environment, wind a ground signal around them.

flit commented 3 years ago

Hi @schstm, thanks for the comments! I completely agree with the concern about side effects. That's why pyocd originally did not attempt retries.

ahhuhtal commented 3 years ago

Sorry for the delay, I was working on something else for a while and didn't have this board connected to the debugger.

I'm happy to report that your feature/fault_resiliance branch is able to handle the transfer faults and continue running :-).

The output during a fault is the following:

0051121:WARNING:gdbserver:Transfer error while checking target status; retrying: Memory transfer fault (read) @ 0xe000edf0-0xe000edf7
Traceback (most recent call last):
  File "/home/antti/.local/lib/python3.8/site-packages/pyocd/gdbserver/gdbserver.py", line 629, in resume
    state = self.target.get_state()
  File "/home/antti/.local/lib/python3.8/site-packages/pyocd/coresight/cortex_m.py", line 831, in get_state
    dhcsr = self.read_memory(CortexM.DHCSR)
  File "/home/antti/.local/lib/python3.8/site-packages/pyocd/coresight/cortex_m.py", line 436, in read_memory
    result = self.ap.read_memory(addr, transfer_size, now)
  File "/home/antti/.local/lib/python3.8/site-packages/pyocd/probe/stlink_probe.py", line 255, in read_memory
    result = conversion.byte_list_to_u32le_list(self._link.read_mem32(addr, 4, self._apsel))[0]
  File "/home/antti/.local/lib/python3.8/site-packages/pyocd/probe/stlink/stlink.py", line 385, in read_mem32
    return self._read_mem(addr, size, Commands.JTAG_READMEM_32BIT, self.MAXIMUM_TRANSFER_SIZE, apsel)
  File "/home/antti/.local/lib/python3.8/site-packages/pyocd/probe/stlink/stlink.py", line 343, in _read_mem
    raise exc
pyocd.core.exceptions.TransferFaultError: Memory transfer fault (read) @ 0xe000edf0-0xe000edf7
0051143:INFO:gdbserver:Target control reestablished.

For this test I'm saturating the SWO output, which leads to a few faults per minute. However, on the GDB side, no ill effects are seen. Debugging just keeps on working. Looks very good!

flit commented 3 years ago

Thanks for the update, that's great to hear! I will finish up that branch and get it into an upcoming release (not sure if it will make the next one, which will be pretty soon).