mwachs5 closed this issue 6 years ago
This is what's happening when I get a segfault:
Debug: 11826 871580 riscv-013.c:346 increase_dmi_busy_delay(): dtmcontrol_idle=5, dmi_busy_delay=4036, ac_busy_delay=0
Debug: 11827 871660 riscv-013.c:335 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x5471
Debug: 11828 871839 riscv-013.c:278 scan(): 41b - 00000000 @11 -> b 00000000 @00
Debug: 11829 871839 riscv-013.c:346 increase_dmi_busy_delay(): dtmcontrol_idle=5, dmi_busy_delay=4440, ac_busy_delay=0
Debug: 11830 871920 riscv-013.c:335 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x5471
What version of the OpenOCD code were you using? If I hack OpenOCD to just use dmi_busy_delay of 5000 then it works, and if I hack spike to always return 0x5471 when reading DTMCS then OpenOCD exits very differently than your trace.
Can you share the entire log file?
Maybe my environment has tighter process memory limits and the process is being killed when it runs out of memory, or some malloc is failing in my environment that isn't failing in yours. I'm going to try to generate a core dump.
If you can reproduce it easily, can you run valgrind on OpenOCD? That would probably be pretty insightful.
Yeah, it's a pretty deterministic failure. I'll run with valgrind. I may also try switching to remote-bitbang instead of the jtag-vpi interface.
Ok, here is the result. This is from running riscv test-compliance against a simulation target (at the part where it checks NDMRESET), but I can make it happen just by connecting to a target that's in reset. I'm again guessing this may be an underlying jtag_vpi problem, so I'll also try just using remote-bitbang:
==18645== Invalid read of size 1
==18645== at 0x4C353F8: __memcpy_chk (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18645== by 0x419EF3: memcpy (string3.h:53)
==18645== by 0x419EF3: jtag_vpi_queue_tdi_xfer (jtag_vpi.c:178)
==18645== by 0x41A11B: jtag_vpi_runtest (jtag_vpi.c:321)
==18645== by 0x41A11B: jtag_vpi_execute_queue (jtag_vpi.c:345)
==18645== by 0x408610: default_interface_jtag_execute_queue (core.c:839)
==18645== by 0x413AEE: interface_jtag_execute_queue (driver.c:381)
==18645== by 0x40875F: jtag_execute_queue_noclear (core.c:906)
==18645== by 0x408D08: jtag_execute_queue (core.c:924)
==18645== by 0x5577FC: dmi_scan (riscv-013.c:384)
==18645== by 0x557BEC: dmi_read (riscv-013.c:449)
==18645== by 0x557FF5: deassert_reset (riscv-013.c:1383)
==18645== by 0x55CBFA: riscv013_test_compliance (riscv-013.c:2568)
==18645== by 0x452794: run_command (command.c:611)
==18645== by 0x452794: script_command_run (command.c:208)
==18645== Address 0x22a is not stack'd, malloc'd or (recently) free'd
==18645==
==18645==
==18645== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==18645== Access not within mapped region at address 0x22A
==18645== at 0x4C353F8: __memcpy_chk (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18645== by 0x419EF3: memcpy (string3.h:53)
==18645== by 0x419EF3: jtag_vpi_queue_tdi_xfer (jtag_vpi.c:178)
==18645== by 0x41A11B: jtag_vpi_runtest (jtag_vpi.c:321)
==18645== by 0x41A11B: jtag_vpi_execute_queue (jtag_vpi.c:345)
==18645== by 0x408610: default_interface_jtag_execute_queue (core.c:839)
==18645== by 0x413AEE: interface_jtag_execute_queue (driver.c:381)
==18645== by 0x40875F: jtag_execute_queue_noclear (core.c:906)
==18645== by 0x408D08: jtag_execute_queue (core.c:924)
==18645== by 0x5577FC: dmi_scan (riscv-013.c:384)
==18645== by 0x557BEC: dmi_read (riscv-013.c:449)
==18645== by 0x557FF5: deassert_reset (riscv-013.c:1383)
==18645== by 0x55CBFA: riscv013_test_compliance (riscv-013.c:2568)
==18645== by 0x452794: run_command (command.c:611)
==18645== by 0x452794: script_command_run (command.c:208)
==18645== If you believe this happened as a result of a stack
==18645== overflow in your program's main thread (unlikely but
==18645== possible), you can try to increase the size of the
==18645== main thread stack using the --main-stacksize= flag.
==18645== The main thread stack size used in this run was 8388608.
==18645==
==18645== HEAP SUMMARY:
==18645== in use at exit: 3,208,790 bytes in 5,541 blocks
==18645== total heap usage: 28,380 allocs, 22,839 frees, 4,046,219,512 bytes allocated
==18645==
==18645== 9 bytes in 2 blocks are definitely lost in loss record 148 of 1,134
==18645== at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18645== by 0x51B401: riscv_gdb_thread_packet (riscv_debug.c:72)
==18645== by 0x44867C: gdb_input_inner (gdb_server.c:2996)
==18645== by 0x44867C: gdb_input (gdb_server.c:3056)
==18645== by 0x44C073: server_loop (server.c:517)
==18645== by 0x405B3F: openocd_thread (openocd.c:306)
==18645== by 0x405B3F: openocd_main (openocd.c:343)
==18645== by 0x557F82F: (below main) (libc-start.c:291)
==18645==
==18645== 16 bytes in 1 blocks are definitely lost in loss record 234 of 1,134
==18645== at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18645== by 0x51B1E1: riscv_gdb_v_packet (riscv_debug.c:226)
==18645== by 0x448F2F: gdb_v_packet (gdb_server.c:2595)
==18645== by 0x448F2F: gdb_input_inner (gdb_server.c:2964)
==18645== by 0x448F2F: gdb_input (gdb_server.c:3056)
==18645== by 0x44C073: server_loop (server.c:517)
==18645== by 0x405B3F: openocd_thread (openocd.c:306)
==18645== by 0x405B3F: openocd_main (openocd.c:343)
==18645== by 0x557F82F: (below main) (libc-start.c:291)
==18645==
==18645== 436 bytes in 12 blocks are definitely lost in loss record 995 of 1,134
==18645== at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18645== by 0x51B401: riscv_gdb_thread_packet (riscv_debug.c:72)
==18645== by 0x448E66: gdb_input_inner (gdb_server.c:2865)
==18645== by 0x448E66: gdb_input (gdb_server.c:3056)
==18645== by 0x44C073: server_loop (server.c:517)
==18645== by 0x405B3F: openocd_thread (openocd.c:306)
==18645== by 0x405B3F: openocd_main (openocd.c:343)
==18645== by 0x557F82F: (below main) (libc-start.c:291)
==18645==
==18645== LEAK SUMMARY:
==18645== definitely lost: 461 bytes in 15 blocks
==18645== indirectly lost: 0 bytes in 0 blocks
==18645== possibly lost: 0 bytes in 0 blocks
==18645== still reachable: 3,208,329 bytes in 5,526 blocks
==18645== suppressed: 0 bytes in 0 blocks
==18645== Reachable blocks (those to which a pointer was found) are not shown.
==18645== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==18645==
==18645== For counts of detected and suppressed errors, rerun with: -v
==18645== Use --track-origins=yes to see where uninitialised values come from
==18645== ERROR SUMMARY: 27200 errors from 12 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)
This is definitely a bug in jtag_vpi.c. The code where we crash is:
memcpy(vpi.buffer_out, bits, nb_bytes);
and vpi.buffer_out is of constant size.
A bit further up the stack we see:
static int jtag_vpi_queue_tdi(uint8_t *bits, int nb_bits, int tap_shift)
{
int nb_xfer = DIV_ROUND_UP(nb_bits, XFERT_MAX_SIZE * 8);
...
while (nb_xfer) {
which is clearly an attempt to avoid overflowing that buffer; evidently there's a bug in it somewhere. Are there directions somewhere on how to talk to a VPI target? If I can reproduce this, it's probably pretty straightforward to fix.
You need VCS.
I've basically stopped using jtag_vpi, so this isn't that critical for me. It's probably better to just file an issue upstream, since this seems like something that would happen any time a too-large burst is sent over jtag_vpi.
Alright. Can you file the upstream issue, since you're the one who can reproduce it?
I get a segfault if I try to connect to a core that is in reset, such that dmi_busy_delay gets over 4000. I'm not sure what's happening, unless the jtag scan can't fit all the cycles necessary for the delay and runs out of memory?