riscv-software-src / riscv-tests


Some debug tests like UnavailableHaltedTest are broken #520

Closed aap-sc closed 7 months ago

aap-sc commented 7 months ago

@timsifive , @en-sc , @AnastasiyaChernikova FYI.

The effects of the issue can be observed in https://github.com/riscv/riscv-openocd/pull/928 (though this commit has nothing to do with the issue, the root cause seems to be in the test suite itself).

Basically, if I take the current riscv-openocd TOT, most Unavailable... tests begin to fail in my environment. I've been debugging the issue for a while now and I think I understand the root cause.

The problem is with the def command(self, cmd) function in debug/testlib.py. This function tries to obtain the result of an OpenOCD command by parsing the OpenOCD log, instead of fetching the result directly (more on that later). Moreover, instead of sending only one command, we actually send two "commands": the real command, plus an artificial "command-like" comment that forces OpenOCD to print a prompt-like construct. The issue is that this approach does not always work.
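To illustrate why this is fragile, here is a simplified reconstruction of the log-scraping pattern (the function and its signature are hypothetical, not the exact testlib.py code): the parser waits for the echoed marker comment to appear as a prompt line, and gives up if it never does.

```python
import re

def extract_command_output(log_text, command, marker):
    """Return the text between the echoed command and the echoed marker comment.

    Returns None if the marker never shows up as a prompt line -- which is
    exactly what happens when OpenOCD's own Debug:/Info: messages get
    interleaved with the echoed prompt (as in Example 1 below, where "# 7x"
    appears without the expected "> " prefix).
    """
    # The prompt we hope to see in the log, e.g. "> # 7x"
    prompt_re = re.compile(r"^> " + re.escape(marker) + r"\s*$", re.MULTILINE)
    match = prompt_re.search(log_text)
    if match is None:
        return None  # prompt never matched -> the test hangs or fails
    start = log_text.index(command) + len(command)
    return log_text[start:match.start()].strip()
```

With a clean log the extraction works; as soon as debug output breaks the prompt line, it silently fails.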

Here are some examples (taken from log files after the failure).

Example 1

Debug: 5781 4300 riscv.c:3097 riscv_openocd_poll(): [riscv.cpu] should_remain_halted=0, should_resume=0
Debug: 5782 4300 target.c:3034 handle_target(): [riscv.cpu] target_poll() -> 0, next attempt in 100ms
targets
^@Debug: 5783 4313 command.c:152 script_debug(): command - targets
    TargetName         Type       Endian TapName            State       
--  ------------------ ---------- ------ ------------------ ------------
 0* riscv.cpu          riscv      little riscv.cpu          halted

> Debug: 5784 4399 riscv.c:3041 riscv_openocd_poll(): [riscv.cpu] Polling all harts.
Debug: 5785 4399 riscv.c:2864 riscv_poll_hart(): [riscv.cpu] polling, target->state=2
Debug: 5786 4399 riscv-013.c:393 scan(): 40b r 00000000 @11 -> + 00000000 @11; 0i
Debug: 5787 4399 riscv-013.c:403 scan(): dmstatus=0 { version=none, ...
Debug: 5788 4399 riscv-013.c:393 scan(): 40b - 00000000 @11 -> + 00430382 @11; 0i
Debug: 5789 4399 riscv-013.c:403 scan(): dmstatus=0 { version=none, ...}
Debug: 5790 4399 riscv.c:1664 riscv_flush_registers(): [riscv.cpu] Flushing register cache
Debug: 5791 4399 riscv.c:1682 riscv_flush_registers(): [riscv.cpu] Flush of register cache completed
Debug: 5792 4399 riscv.c:3097 riscv_openocd_poll(): [riscv.cpu] should_remain_halted=0, should_resume=0
Debug: 5793 4399 target.c:3034 handle_target(): [riscv.cpu] target_poll() -> 0, next attempt in 100ms
# 7x
^@
> Debug: 5794 4499 riscv.c:3041 riscv_openocd_poll(): [riscv.cpu] Polling all harts.
Debug: 5795 4499 riscv.c:2864 riscv_poll_hart(): [riscv.cpu] polling, target->state=2

The prompt for # 7x is never matched.

Example 2

Another, more hilarious example is the following:

> targets
^@Debug: 8476 2452 command.c:152 script_debug(): command - targets
Debug: 8477 2452 riscv.c:3041 riscv_openocd_poll(): [riscv.cpu0] Polling all harts.
Debug: 8478 2452 riscv.c:2864 riscv_poll_hart(): [riscv.cpu0] polling, target->state=2
...
Info : 8493 2452 riscv.c:2974 riscv_poll_hart(): [riscv.cpu1] became unavailable.
[riscv.cpu1] became unavailable.
...
Debug: 8516 2453 target.c:3034 handle_target(): [riscv.cpu1] target_poll() -> 0, next attempt in 100ms
    TargetName         Type       Endian TapName            State       
--  ------------------ ---------- ------ ------------------ ------------
 0  riscv.cpu0         riscv      little riscv.cpu          halted
 1* riscv.cpu1         riscv      little riscv.cpu          unavailable

> # 88x
^@
> Debug: 8517 2552 riscv.c:3041 riscv_openocd_poll(): [riscv.cpu0] Polling all harts

Here the prompt is matched, but the output is not parsed correctly, leading to the following error:

Traceback (most recent call last):
  File "/home/ecco/projects/sc_openocd/build/Debug/RISCVTests/debug/testlib.py", line 1333, in run
    result = self.test()    # pylint: disable=no-member
             ^^^^^^^^^^^
  File "/home/ecco/projects/sc_openocd/build/Debug/RISCVTests/debug/gdbserver.py", ...
    self.server.set_available(
  File "/home/ecco/projects/sc_openocd/build/Debug/RISCVTests/debug/testlib.py", line 515, in set_available
    if target["State"] == "unavailable":
       ~~~~~~^^^^^^^^^
KeyError: 'State'

This happens because the targets state is parsed like this:

[
   {'TargetName': 'riscv.cpu0', 'Type': 'riscv', 'Endian': 'little', 'TapName': 'riscv.cpu', 'State': 'halted'},
   {'TargetName': 'riscv.cpu1', 'Type': 'riscv', 'Endian': 'little', 'TapName': 'riscv.cpu', 'State': 'halted'},
   {'TargetName': 'Info', 'Type': ':', 'Endian': '5749', 'TapName': '3031', 'State': 'riscv.c:2974'}, 
   {'TargetName': '\x08\x08[riscv.cpu1]', 'Type': 'became', 'Endian': 'unavailable.'}
]
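The bogus entries fall out of naive whitespace splitting. Here is a simplified reconstruction of that kind of table parsing (hypothetical, not the exact testlib.py code): every line after the header is zipped against the column names, so an interleaved log line becomes a fake target entry, and a short line produces a dict with no 'State' key at all.

```python
def parse_targets_table(lines):
    """Parse the output of OpenOCD's `targets` command by whitespace splitting.

    Fragile by construction: any log line interleaved into the table is
    treated as a row, and zip() silently truncates rows with too few fields.
    """
    header = lines[0].split()  # TargetName, Type, Endian, TapName, State
    targets = []
    for line in lines[2:]:  # skip the header and the "--- ---" separator row
        fields = line.split()
        if not fields:
            continue
        # Drop the leading index column ("0*", "1", ...) when present
        if fields[0].rstrip("*").isdigit():
            fields = fields[1:]
        targets.append(dict(zip(header, fields)))
    return targets
```

Feeding it the table from Example 2 with the interleaved "Info : ... became unavailable." lines reproduces the broken dictionaries shown above, including the final entry that triggers the KeyError.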

Note: these are but a few examples. The number of situations that result in a test failure is quite large, though it all comes down to incorrect parsing of the result.

TL;DR

Proposed solution:

Instead of parsing OpenOCD log files (and connecting to a pipe for input), I suggest communicating with OpenOCD via its Tcl RPC server. The Tcl RPC server is almost the same as the ordinary telnet server, but it is better suited to machine-to-machine interaction (there is a special marker to indicate message end). We don't need to parse OpenOCD logs to get command results; we can fetch the result string directly.
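For reference, OpenOCD's Tcl RPC server (enabled by default on port 6666) terminates every request and response with the byte 0x1a, so framing is trivial compared to scraping the debug log. A minimal sketch of that interaction (the helper names here are illustrative, not the proposed patch):

```python
import socket

TERMINATOR = b"\x1a"  # OpenOCD Tcl RPC end-of-message marker

def frame_command(cmd):
    """Encode a Tcl command for the RPC server: UTF-8 text plus 0x1a."""
    return cmd.encode("utf-8") + TERMINATOR

def read_response(sock):
    """Read from the socket until the 0x1a terminator; return the payload."""
    buf = b""
    while TERMINATOR not in buf:
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("OpenOCD closed the connection")
        buf += chunk
    return buf.split(TERMINATOR, 1)[0].decode("utf-8")

def run_command(host, cmd, port=6666):
    """Send one command to OpenOCD's Tcl RPC server and return its result."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(frame_command(cmd))
        return read_response(sock)
```

The result string arrives directly over the socket, so none of the prompt-matching or table-scraping above is needed.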

P.S.: I'll try to propose a patch to address the issue. This may take some time, though. So if anyone has an urgent desire to fix this, don't be shy ).

aap-sc commented 7 months ago

Oh, and by the way: I think the reason I can reproduce this reliably is that we always run these tests with "-d3" passed to OpenOCD, so the amount of debug logging is increased. This may be why others have had difficulty reproducing it.

timsifive commented 7 months ago

FWIW, even with -d3 this doesn't reproduce for me locally.

aap-sc commented 7 months ago

The patch with the suggested fixes: https://github.com/riscv-software-src/riscv-tests/pull/522

aap-sc commented 7 months ago

Looks like the patch fixes this issue, so closing the ticket.