riscv-collab / riscv-openocd

Fork of OpenOCD that has RISC-V support

segfault if dmi_busy_delay gets too high #213

Closed · mwachs5 closed this issue 6 years ago

mwachs5 commented 6 years ago

I get a segfault if I try to connect to a core that is in reset, such that dmi_busy_delay gets over 4000. I'm not sure what's happening, unless the jtag scan can't fit all the cycles necessary for the delay and runs out of memory?

mwachs5 commented 6 years ago

This is what's happening when I get a segfault:

Debug: 11826 871580 riscv-013.c:346 increase_dmi_busy_delay(): dtmcontrol_idle=5, dmi_busy_delay=4036, ac_busy_delay=0
Debug: 11827 871660 riscv-013.c:335 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x5471
Debug: 11828 871839 riscv-013.c:278 scan(): 41b - 00000000 @11 -> b 00000000 @00
Debug: 11829 871839 riscv-013.c:346 increase_dmi_busy_delay(): dtmcontrol_idle=5, dmi_busy_delay=4440, ac_busy_delay=0
Debug: 11830 871920 riscv-013.c:335 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x5471
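(For reference: the jump from 4036 to 4440 is consistent with the delay growing by a tenth plus one on every busy response. A minimal sketch of that pattern, inferred from the numbers above rather than quoted from riscv-013.c:)

    /* Backoff pattern inferred from the log (4036 + 4036/10 + 1 == 4440),
     * not copied from riscv-013.c. Each unit of dmi_busy_delay becomes an
     * extra Run-Test/Idle cycle queued into the JTAG scan, so a delay in
     * the thousands queues thousands of idle cycles per DMI access. */
    static int dmi_busy_delay;

    static void increase_dmi_busy_delay_sketch(void)
    {
        dmi_busy_delay += dmi_busy_delay / 10 + 1;
    }
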
timsifive commented 6 years ago

What version of the OpenOCD code were you using? If I hack OpenOCD to just use a dmi_busy_delay of 5000 then it works, and if I hack spike to always return 0x5471 when reading DTMCS then OpenOCD exits very differently from your trace.

Can you share the entire log file?

mwachs5 commented 6 years ago

Maybe my environment has tighter process memory limits or something and is killing the process when it runs out of memory, or some malloc is failing in my environment that isn't failing in yours...

mwachs5 commented 6 years ago

Going to try to generate a core dump.

timsifive commented 6 years ago

If you can reproduce it easily, can you run valgrind on OpenOCD? That would probably be pretty insightful.
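Something along these lines should do (the config file name is just a placeholder; the leak-check and track-origins options are the ones valgrind itself suggests in its output):

    valgrind --leak-check=full --track-origins=yes openocd -f your_config.cfg
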

mwachs5 commented 6 years ago

Yeah, it's a pretty deterministic failure. I'll run with valgrind. I may also try switching to remote-bitbang instead of the jtag-vpi interface.

mwachs5 commented 6 years ago

OK, here is the result of running riscv test-compliance against a simulation target (at the part where it checks NDMRESET, but I can make it happen just by connecting to a target that's in reset).

I am again guessing this may be an underlying jtag_vpi problem, so I'll also try just using remote-bitbang:

==18645== Invalid read of size 1
==18645==    at 0x4C353F8: __memcpy_chk (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18645==    by 0x419EF3: memcpy (string3.h:53)
==18645==    by 0x419EF3: jtag_vpi_queue_tdi_xfer (jtag_vpi.c:178)
==18645==    by 0x41A11B: jtag_vpi_runtest (jtag_vpi.c:321)
==18645==    by 0x41A11B: jtag_vpi_execute_queue (jtag_vpi.c:345)
==18645==    by 0x408610: default_interface_jtag_execute_queue (core.c:839)
==18645==    by 0x413AEE: interface_jtag_execute_queue (driver.c:381)
==18645==    by 0x40875F: jtag_execute_queue_noclear (core.c:906)
==18645==    by 0x408D08: jtag_execute_queue (core.c:924)
==18645==    by 0x5577FC: dmi_scan (riscv-013.c:384)
==18645==    by 0x557BEC: dmi_read (riscv-013.c:449)
==18645==    by 0x557FF5: deassert_reset (riscv-013.c:1383)
==18645==    by 0x55CBFA: riscv013_test_compliance (riscv-013.c:2568)
==18645==    by 0x452794: run_command (command.c:611)
==18645==    by 0x452794: script_command_run (command.c:208)
==18645==  Address 0x22a is not stack'd, malloc'd or (recently) free'd
==18645== 
==18645== 
==18645== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==18645==  Access not within mapped region at address 0x22A
==18645==    at 0x4C353F8: __memcpy_chk (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18645==    by 0x419EF3: memcpy (string3.h:53)
==18645==    by 0x419EF3: jtag_vpi_queue_tdi_xfer (jtag_vpi.c:178)
==18645==    by 0x41A11B: jtag_vpi_runtest (jtag_vpi.c:321)
==18645==    by 0x41A11B: jtag_vpi_execute_queue (jtag_vpi.c:345)
==18645==    by 0x408610: default_interface_jtag_execute_queue (core.c:839)
==18645==    by 0x413AEE: interface_jtag_execute_queue (driver.c:381)
==18645==    by 0x40875F: jtag_execute_queue_noclear (core.c:906)
==18645==    by 0x408D08: jtag_execute_queue (core.c:924)
==18645==    by 0x5577FC: dmi_scan (riscv-013.c:384)
==18645==    by 0x557BEC: dmi_read (riscv-013.c:449)
==18645==    by 0x557FF5: deassert_reset (riscv-013.c:1383)
==18645==    by 0x55CBFA: riscv013_test_compliance (riscv-013.c:2568)
==18645==    by 0x452794: run_command (command.c:611)
==18645==    by 0x452794: script_command_run (command.c:208)
==18645==  If you believe this happened as a result of a stack
==18645==  overflow in your program's main thread (unlikely but
==18645==  possible), you can try to increase the size of the
==18645==  main thread stack using the --main-stacksize= flag.
==18645==  The main thread stack size used in this run was 8388608.
==18645== 
==18645== HEAP SUMMARY:
==18645==     in use at exit: 3,208,790 bytes in 5,541 blocks
==18645==   total heap usage: 28,380 allocs, 22,839 frees, 4,046,219,512 bytes allocated
==18645== 
==18645== 9 bytes in 2 blocks are definitely lost in loss record 148 of 1,134
==18645==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18645==    by 0x51B401: riscv_gdb_thread_packet (riscv_debug.c:72)
==18645==    by 0x44867C: gdb_input_inner (gdb_server.c:2996)
==18645==    by 0x44867C: gdb_input (gdb_server.c:3056)
==18645==    by 0x44C073: server_loop (server.c:517)
==18645==    by 0x405B3F: openocd_thread (openocd.c:306)
==18645==    by 0x405B3F: openocd_main (openocd.c:343)
==18645==    by 0x557F82F: (below main) (libc-start.c:291)
==18645== 
==18645== 16 bytes in 1 blocks are definitely lost in loss record 234 of 1,134
==18645==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18645==    by 0x51B1E1: riscv_gdb_v_packet (riscv_debug.c:226)
==18645==    by 0x448F2F: gdb_v_packet (gdb_server.c:2595)
==18645==    by 0x448F2F: gdb_input_inner (gdb_server.c:2964)
==18645==    by 0x448F2F: gdb_input (gdb_server.c:3056)
==18645==    by 0x44C073: server_loop (server.c:517)
==18645==    by 0x405B3F: openocd_thread (openocd.c:306)
==18645==    by 0x405B3F: openocd_main (openocd.c:343)
==18645==    by 0x557F82F: (below main) (libc-start.c:291)
==18645== 
==18645== 436 bytes in 12 blocks are definitely lost in loss record 995 of 1,134
==18645==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18645==    by 0x51B401: riscv_gdb_thread_packet (riscv_debug.c:72)
==18645==    by 0x448E66: gdb_input_inner (gdb_server.c:2865)
==18645==    by 0x448E66: gdb_input (gdb_server.c:3056)
==18645==    by 0x44C073: server_loop (server.c:517)
==18645==    by 0x405B3F: openocd_thread (openocd.c:306)
==18645==    by 0x405B3F: openocd_main (openocd.c:343)
==18645==    by 0x557F82F: (below main) (libc-start.c:291)
==18645== 
==18645== LEAK SUMMARY:
==18645==    definitely lost: 461 bytes in 15 blocks
==18645==    indirectly lost: 0 bytes in 0 blocks
==18645==      possibly lost: 0 bytes in 0 blocks
==18645==    still reachable: 3,208,329 bytes in 5,526 blocks
==18645==         suppressed: 0 bytes in 0 blocks
==18645== Reachable blocks (those to which a pointer was found) are not shown.
==18645== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==18645== 
==18645== For counts of detected and suppressed errors, rerun with: -v
==18645== Use --track-origins=yes to see where uninitialised values come from
==18645== ERROR SUMMARY: 27200 errors from 12 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

timsifive commented 6 years ago

This is definitely a bug in jtag_vpi.c. The code where we crash is:

        memcpy(vpi.buffer_out, bits, nb_bytes);

and vpi.buffer_out is of constant size.

A bit further up the stack we see:

static int jtag_vpi_queue_tdi(uint8_t *bits, int nb_bits, int tap_shift)
{
    int nb_xfer = DIV_ROUND_UP(nb_bits, XFERT_MAX_SIZE * 8);
...
    while (nb_xfer) {

which is clearly an attempt not to overflow that buffer. Evidently there's a bug here somewhere. Are there directions somewhere on how to talk to a VPI target? If I can reproduce this, it's probably pretty straightforward to fix.
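One defensive shape for that copy, sketched here rather than taken from upstream: the faulting address 0x22a sits just past 0x200, which would match a NULL bits pointer advanced by XFERT_MAX_SIZE (assumed here to be 512) once more than XFERT_MAX_SIZE * 8 = 4096 idle cycles are queued, i.e. right around the dmi_busy_delay values where the crash starts. The surrounding definitions below are stand-ins:

    #include <stdint.h>
    #include <string.h>

    #define XFERT_MAX_SIZE 512                  /* assumed value */
    #define DIV_ROUND_UP(m, n) (((m) + (n) - 1) / (n))
    #define ERROR_OK   0
    #define ERROR_FAIL (-4)

    /* Stand-in for the fixed-size transfer buffer from the trace. */
    static struct {
        uint8_t buffer_out[XFERT_MAX_SIZE];
    } vpi;

    /* Guarded version of the copy at jtag_vpi.c:178 (a sketch, not the
     * upstream fix): refuse transfers larger than buffer_out, and tolerate
     * a NULL bits pointer, e.g. idle cycles queued with no TDI data. */
    static int queue_tdi_xfer_guarded(const uint8_t *bits, int nb_bits)
    {
        int nb_bytes = DIV_ROUND_UP(nb_bits, 8);

        if (nb_bytes > (int)sizeof(vpi.buffer_out))
            return ERROR_FAIL;              /* would overflow buffer_out */

        if (bits)
            memcpy(vpi.buffer_out, bits, nb_bytes);
        else
            memset(vpi.buffer_out, 0xff, nb_bytes);  /* idle: shift ones */

        return ERROR_OK;
    }
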

mwachs5 commented 6 years ago

You need VCS.

mwachs5 commented 6 years ago

I've basically stopped using jtag_vpi, so this isn't that critical. Maybe it's better to just file an issue upstream, since this seems like something that would happen any time too large a burst is sent over jtag_vpi.

timsifive commented 6 years ago

Alright. Can you file the upstream issue, since you're the one who can reproduce it?