Closed drwootton closed 3 years ago
Yeah, it is actually a two-part problem. I miscast the object in the PMIx server_ops code and the check for directive has to be done a little differently over in PRRTE. See the two referenced PRs.
This is still failing for me, now in the PMIx_IOF_pull processing.
I rebuilt with this afternoon's PMIx and PRRTE master head and configured with --enable-debug.
Sometimes I get an assertion failure in prterun:
prterun: server/pmix_server.c:4016: _iofreg: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == _obj->obj_magic_id' failed.
Abort(coredump)
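For readers unfamiliar with this kind of assertion, here is a minimal sketch of how a debug-build magic-ID check can catch a miscast or already-destructed object. Only the constant is taken from the assertion message above; every `demo_` name is hypothetical and is not the PMIx object system.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same value that appears in the assertion message above. */
#define DEMO_MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

/* Hypothetical base object: the magic ID is the first field and is
 * stamped when the object is constructed. */
typedef struct {
    uint64_t obj_magic_id;
    int refcount;
} demo_object_t;

/* Hypothetical unrelated struct that does NOT start with a magic ID. */
typedef struct {
    uint64_t payload;
    int status;
} demo_other_t;

void demo_construct(demo_object_t *obj)
{
    obj->obj_magic_id = DEMO_MAGIC_ID;
    obj->refcount = 1;
}

void demo_destruct(demo_object_t *obj)
{
    /* Clear the stamp so later use of the dead object is detectable. */
    obj->obj_magic_id = 0;
    obj->refcount = 0;
}

/* Returns true only if the first 8 bytes behind ptr hold the magic ID.
 * memcpy is used so the read is well defined for any pointed-to type
 * that is at least 8 bytes long. */
bool demo_is_valid(const void *ptr)
{
    uint64_t magic;

    if (NULL == ptr) {
        return false;
    }
    memcpy(&magic, ptr, sizeof(magic));
    return DEMO_MAGIC_ID == magic;
}
```

In a debug build, an assert on a check like `demo_is_valid()` fires as soon as a callback is handed a pointer that was cast from the wrong type, which would be consistent with the miscast mentioned earlier in this thread.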
Once I got a SIGSEGV with this traceback:
#0 0x000000000000006c in ?? ()
#1 0x0000200000352ab4 in _iofreg (sd=-1, args=4, cbdata=0x200004097bc0) at server/pmix_server.c:3959
#2 0x000020000054e038 in event_base_loop () from /lib64/libevent_core-2.0.so.5
#3 0x000020000038fa48 in progress_engine (obj=0x10d54cf8) at runtime/pmix_progress_threads.c:235
#4 0x0000200000bb8b94 in start_thread () from /lib64/libpthread.so.0
#5 0x0000200000d185f4 in clone () from /lib64/libc.so.6
I verified this was in PMIx_IOF_pull processing by setting a breakpoint in the attach example immediately following the PMIx_IOF_pull call.
Once I started attach, prterun failed immediately with a SIGSEGV and attach stopped at the next statement after the PMIx_IOF_pull call.
The way I'm running this: in one terminal I run prterun -n 2 --report-uri + sleep 30 and copy the URI up to, but not including, the '.'. Then I start attach in another terminal with attach <prterun URI>
The attach program is attached.
The PMIx_IOF_deregister call is the call that originally failed for this issue as well as for openpmix/prrte/#744
attach.c.txt (named attach.c.txt because github won't let me paste/attach .c files)
Okay, I'll have to circle back around to this one later next week - I need to focus on some of these other issues for a while.
Background information
What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)
PRRTE latest source from github master
What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)
PMIx latest source from github master (after PMIx pull request openpmix/openpmix#2073)
Please describe the system on which you are running
Network type: localhost
Details of the problem
I ran the attach example program with prterun as described in PMIx issue openpmix/openpmix#2072
This resulted in a SIGSEGV in pmix_server_iof_pull_fn at pmix_server_gen.c:1306, called by pmix_server_iof_dereg.
The problem was that nprocs was invalid: it held a large integer value, which led to a bad memory reference.
pmix_server_iof_pull_fn does not currently handle a request coming from pmix_server_iofdereg, and it should.
I looked around and did not find any code that looked like it handled stopping I/O forwarding. There is a declaration of prte_io_base_flush in src/mca/iof/base/base.h but I cannot find any implementation.
My guess is that pmix_server_iofdereg calls pmix_server_iof_pull_fn to give the PRRTE implementation a chance to stop I/O forwarding, but there is nothing for it to do. I therefore added the following code at the start of pmix_server_iof_pull_fn to handle the stop request.
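The change itself is not shown in this thread, so the following is only a minimal sketch of that kind of early return, not the actual PRRTE patch. All types, the `demo.iof.stop` key, and the `demo_` functions are hypothetical stand-ins; the assumption is that a deregistration request can be recognized from its directives before the proc array is touched.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the real types; not the PMIx API. */
typedef struct { const char *key; bool flag; } demo_info_t;
typedef struct { const char *nspace; unsigned rank; } demo_proc_t;

#define DEMO_IOF_STOP      "demo.iof.stop"  /* assumed directive key */
#define DEMO_SUCCESS       0
#define DEMO_ERR_BAD_PARAM (-1)

/* True if any directive asks for forwarding to be stopped. */
bool demo_is_stop_request(const demo_info_t *directives, size_t ndirs)
{
    if (NULL == directives) {
        return false;
    }
    for (size_t n = 0; n < ndirs; n++) {
        if (NULL != directives[n].key &&
            0 == strcmp(directives[n].key, DEMO_IOF_STOP) &&
            directives[n].flag) {
            return true;
        }
    }
    return false;
}

/* Sketch of a pull handler that returns early on a stop request, so an
 * unset or garbage nprocs is never used as a loop bound. */
int demo_iof_pull(const demo_proc_t *procs, size_t nprocs,
                  const demo_info_t *directives, size_t ndirs,
                  size_t *nregistered)
{
    size_t count = 0;

    *nregistered = 0;
    if (demo_is_stop_request(directives, ndirs)) {
        /* Nothing to tear down in this sketch: report success. */
        return DEMO_SUCCESS;
    }
    if (NULL == procs || 0 == nprocs) {
        return DEMO_ERR_BAD_PARAM;
    }
    for (size_t n = 0; n < nprocs; n++) {
        if (NULL == procs[n].nspace) {
            return DEMO_ERR_BAD_PARAM;
        }
        count++;
    }
    *nregistered = count;
    return DEMO_SUCCESS;
}
```

The point of the ordering is that the stop check happens before procs and nprocs are read at all, which is the failure mode described above.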
I ran my attach test with this change a number of times, sometimes calling PMIx_IOF_deregister just before PMIx_tool_finalize and other times just before attach got the notification that its daemon had terminated.
This ran successfully most times. However, I did get a couple of aborts from malloc and free. I'm not sure whether they had anything to do with my change.
I ran prterun under valgrind once, and valgrind only complained about two invalid 8-byte reads and one invalid 2-byte read, all at pmix_server_ops.c:3712. There were no complaints about invalid writes or invalid frees.
@rhc54 I can create a PRRTE pull request if my change is correct. However, I'm skeptical that it's right, given the couple of crashes.