openpmix / prrte

PMIx Reference RunTime Environment (PRRTE)
https://pmix.org

SIGSEGV in pmix_server_iof_pull_fn called from pmix_server_iofdereg while processing PMIx_IOF_deregister #744

Closed drwootton closed 3 years ago

drwootton commented 3 years ago

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

PRRTE latest source from github master

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

PMIX latest source from github master (After PMIx pull request openpmix/openpmix#2073)

Please describe the system on which you are running

Details of the problem

I ran the attach example program with prterun as described in PMIx issue openpmix/openpmix#2072

This resulted in a SIGSEGV in pmix_server_iof_pull_fn at pmix_server_gen.c:1306, called by pmix_server_iofdereg.

The cause was an invalid nprocs: it held a large integer value, which led to a bad memory reference.

pmix_server_iof_pull_fn does not currently handle a request that comes from pmix_server_iofdereg; it needs to handle that case explicitly.

I looked around and did not find any code that looked like it handled stopping I/O forwarding. There is a declaration of prte_io_base_flush in src/mca/iof/base/base.h but I cannot find any implementation.

My guess is that pmix_server_iofdereg calls pmix_server_iof_pull_fn to give the PRRTE implementation a chance to stop I/O forwarding, but that there is nothing for it to do. I therefore added the following code at the start of pmix_server_iof_pull_fn to handle the stop request.

    /* If this is a request to stop forwarding, there is nothing more to do */
    for (i = 0; i < ndirs; i++) {
        if (PMIX_CHECK_KEY(&directives[i], PMIX_IOF_STOP) &&
            PMIX_INFO_TRUE(&directives[i])) {
            return PMIX_OPERATION_SUCCEEDED;
        }
    }
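
To make the intent of that early return easier to see outside the PRRTE tree, here is a minimal standalone sketch of the same pattern. Everything prefixed with mock_ or MOCK_ is a hypothetical stand-in invented for illustration (the real pmix_info_t, PMIX_IOF_STOP, and status codes come from the PMIx headers and are not reproduced here), so treat it as an illustration under those assumptions rather than the actual fix.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for pmix_info_t and the PMIx status codes. */
typedef struct {
    char key[64];
    bool flag;
} mock_info_t;

enum { MOCK_SUCCESS = 0, MOCK_OPERATION_SUCCEEDED = 1 };

/* Hypothetical key name standing in for PMIX_IOF_STOP. */
#define MOCK_IOF_STOP "mock.iof.stop"

/* Scan the directives first. If a stop directive is present and true,
 * report completion immediately, before any proc array is touched.
 * Otherwise fall through to the (elided) pull registration. */
static int mock_iof_pull(const mock_info_t *directives, size_t ndirs)
{
    for (size_t i = 0; i < ndirs; i++) {
        if (0 == strcmp(directives[i].key, MOCK_IOF_STOP) && directives[i].flag) {
            return MOCK_OPERATION_SUCCEEDED;
        }
    }
    /* ... normal pull registration would go here ... */
    return MOCK_SUCCESS;
}
```

Calling mock_iof_pull with a directive array that contains a true MOCK_IOF_STOP entry returns MOCK_OPERATION_SUCCEEDED; an array where that entry is false or absent returns MOCK_SUCCESS.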

I ran my attach test with this change a number of times, sometimes calling PMIx_IOF_deregister just before PMIx_tool_finalize and other times just before attach got the notification that its daemon had terminated.

Most runs succeeded. However, I did get a couple of aborts, one from free and one from malloc. I'm not sure whether they have anything to do with my change. The two backtraces follow.

#0  0x00002000007dfbf0 in raise () from /lib64/libc.so.6
#1  0x00002000007e1f6c in abort () from /lib64/libc.so.6
#2  0x0000200000828d10 in __libc_message () from /lib64/libc.so.6
#3  0x0000200000832344 in malloc_printerr () from /lib64/libc.so.6
#4  0x000020000083a19c in free () from /lib64/libc.so.6
#5  0x000020000049c770 in pdes (p=0x200004045140) at pmix_globals.c:227
#6  0x000020000035ddd8 in pmix_obj_run_destructors (object=0x200004045140)
    at /u/dwootton/git/openpmix/src/class/pmix_object.h:553
#7  0x0000200000363834 in PMIx_server_finalize () at server/pmix_server.c:759
#8  0x00002000000d83d8 in pmix_server_finalize () at prted/pmix/pmix_server.c:685
#9  0x0000200000b561d4 in rte_finalize () at ess_hnp_module.c:649
#10 0x00002000000aa1b8 in prte_finalize () at runtime/prte_finalize.c:171
#11 0x0000000010009250 in main (argc=7, argv=0x7fffc8b594d8) at prte.c:1204

#0  0x00002000007dfbf0 in raise () from /lib64/libc.so.6
#1  0x00002000007e1ee0 in abort () from /lib64/libc.so.6
#2  0x0000200000828d10 in __libc_message () from /lib64/libc.so.6
#3  0x0000200000832344 in malloc_printerr () from /lib64/libc.so.6
#4  0x00002000008374e8 in _int_malloc () from /lib64/libc.so.6
#5  0x000020000083936c in malloc () from /lib64/libc.so.6
#6  0x000020000035de8c in pmix_obj_new_tma (cls=0x2000004fd5a8 <pmix_ptl_queue_t_class>, tma=0x0)
    at /u/dwootton/git/openpmix/src/class/pmix_object.h:575
#7  0x000020000035dba8 in pmix_obj_new_debug_tma (type=0x2000004fd5a8 <pmix_ptl_queue_t_class>, tma=0x0,
    file=0x2000004ac6a8 "server/pmix_server.c", line=2857)
    at /u/dwootton/git/openpmix/src/class/pmix_object.h:324
#8  0x0000200000373890 in op_cbfunc (status=0, cbdata=0x20000406f500) at server/pmix_server.c:2857
#9  0x00002000003a7058 in pmix_server_iofdereg (peer=0x200004045df0, buf=0x20000151e478,
    cbfunc=0x2000003732ac <op_cbfunc>, cbdata=0x20000406f500) at server/pmix_server_ops.c:3721
#10 0x0000200000388c98 in server_switchyard (peer=0x200004045df0, tag=106, buf=0x20000151e478)
    at server/pmix_server.c:4421
#11 0x00002000003898a4 in pmix_server_message_handler (pr=0x200004045df0, hdr=0x20000406eadc,
    buf=0x20000151e478, cbdata=0x0) at server/pmix_server.c:4475
#12 0x000020000047e7f8 in pmix_ptl_base_process_msg (fd=-1, flags=4, cbdata=0x20000406e9e0)
    at base/ptl_base_sendrecv.c:793
#13 0x000020000066e038 in event_base_loop () from /lib64/libevent_core-2.0.so.5
#14 0x00002000003bf73c in progress_engine (obj=0x461489d8) at runtime/pmix_progress_threads.c:235
#15 0x0000200000768b94 in start_thread () from /lib64/libpthread.so.0
#16 0x00002000008c85f4 in clone () from /lib64/libc.so.6

I ran prterun under valgrind once. It reported only two invalid 8-byte reads and one invalid 2-byte read, all at pmix_server_ops.c:3712. It reported no invalid writes or invalid frees.

@rhc54 I can create a PRRTE pull request if my change is correct. However, I'm skeptical that it's right, given the couple of crashes.

rhc54 commented 3 years ago

Yeah, it is actually a two-part problem. I miscast the object in the PMIx server_ops code, and the check for the directive has to be done a little differently over in PRRTE. See the two referenced PRs.

drwootton commented 3 years ago

This is still failing for me, now in the PMIx_IOF_pull processing.

I rebuilt with this afternoon's PMIx and PRRTE master head and configured with --enable-debug.

Sometimes I get an assertion failure in prterun:

prterun: server/pmix_server.c:4016: _iofreg: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == _obj->obj_magic_id' failed.
Abort(coredump)
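
For context, the constant in that assertion is the magic id that debug builds stamp into every object, so a failure here suggests that _iofreg was handed a pointer that is not a live object (for example one that was already released, never constructed, or cast from a different type). The sketch below illustrates that kind of check with hypothetical mock_ stand-ins. It is not the PMIx object code, and it assumes, as I understand the debug build to behave, that the id is set at construction and cleared at destruction.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same constant that appears in the failed assertion text. */
#define MOCK_MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

/* Hypothetical stand-in for an object header. */
typedef struct {
    uint64_t obj_magic_id;
    int refcount;
} mock_object_t;

/* Stamp the magic id when the object is constructed. */
static void mock_construct(mock_object_t *obj)
{
    obj->obj_magic_id = MOCK_MAGIC_ID;
    obj->refcount = 1;
}

/* Clear the magic id when the object is destructed, so that a later
 * use of the stale pointer can be detected. */
static void mock_destruct(mock_object_t *obj)
{
    obj->obj_magic_id = 0;
    obj->refcount = 0;
}

/* The check the assertion performs: is this a live, constructed object? */
static bool mock_is_live(const mock_object_t *obj)
{
    return NULL != obj && MOCK_MAGIC_ID == obj->obj_magic_id;
}
```

In this sketch mock_is_live returns true only between mock_construct and mock_destruct. The storage is deliberately kept alive after mock_destruct so that the check itself stays well defined.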

Once I got a SIGSEGV with this traceback:

#0  0x000000000000006c in ?? ()
#1  0x0000200000352ab4 in _iofreg (sd=-1, args=4, cbdata=0x200004097bc0) at server/pmix_server.c:3959
#2  0x000020000054e038 in event_base_loop () from /lib64/libevent_core-2.0.so.5
#3  0x000020000038fa48 in progress_engine (obj=0x10d54cf8) at runtime/pmix_progress_threads.c:235
#4  0x0000200000bb8b94 in start_thread () from /lib64/libpthread.so.0
#5  0x0000200000d185f4 in clone () from /lib64/libc.so.6

I verified this was in PMIx_IOF_pull processing by setting a breakpoint in the attach example immediately following the PMIx_IOF_pull call.

Once I started attach, prterun failed immediately with a SIGSEGV and attach stopped at the next statement after the PMIx_IOF_pull call.

To reproduce, I run prterun -n 2 --report-uri + sleep 30 in one terminal and copy the URI up to, but not including, the '.'. Then, in another terminal, I start attach as attach <prterun URI>

The attach program is attached.

The PMIx_IOF_deregister call is the call that was the original failure for this issue as well as openpmix/prrte/#744

attach.c.txt (named attach.c.txt because github won't let me paste/attach .c files)

rhc54 commented 3 years ago

Okay, I'll have to circle back around to this one later next week - I need to focus on some of these other issues for a while.