Open NateThornton opened 3 years ago
The crash occurs in strmap_unset when the node to be removed has two children and replacement == node->left, which is the case when the right-most node of the node's left subtree is the left child itself.
In this case the call to strmap_node_extract_single sets node->left to NULL, which causes a crash when it is dereferenced on line 778.
I'm not sure of the best way to fix this, so I will leave that to the authors.
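For anyone following along, the failure pattern can be sketched with a plain binary search tree. This is only an illustration under stated assumptions, not the actual strmap code: the `node` type and the `node_new`, `tree_insert`, `extract_rightmost`, and `tree_remove` helpers are hypothetical names, and any rebalancing the real tree performs is omitted. The point it tries to show is that when the replacement is the removed node's own left child, the extraction step rewrites `node->left`, so that pointer has to be re-read after extraction rather than dereferenced from an earlier copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node type for illustration; not the real strmap node. */
typedef struct node {
    int key;
    struct node *left;
    struct node *right;
} node;

/* Allocate a leaf node; abort on allocation failure to keep the sketch short. */
node *node_new(int key)
{
    node *n = malloc(sizeof *n);
    if (n == NULL) {
        abort();
    }
    n->key = key;
    n->left = NULL;
    n->right = NULL;
    return n;
}

/* Insert key into the tree (duplicates are ignored) and return the root. */
node *tree_insert(node *root, int key)
{
    node **link = &root;
    while (*link != NULL) {
        if (key == (*link)->key) {
            return root;
        }
        link = key < (*link)->key ? &(*link)->left : &(*link)->right;
    }
    *link = node_new(key);
    return root;
}

/* Detach and return the right-most node of the subtree stored at *link.
 * The detached node's left child takes its place, so *link itself changes
 * when the subtree root has no right child. */
node *extract_rightmost(node **link)
{
    assert(*link != NULL);
    while ((*link)->right != NULL) {
        link = &(*link)->right;
    }
    node *max = *link;
    *link = max->left;
    max->left = NULL;
    return max;
}

/* Remove key from the tree (if present) and return the new root. */
node *tree_remove(node *root, int key)
{
    node **link = &root;
    while (*link != NULL && (*link)->key != key) {
        link = key < (*link)->key ? &(*link)->left : &(*link)->right;
    }
    node *target = *link;
    if (target == NULL) {
        return root; /* key not found */
    }

    if (target->left != NULL && target->right != NULL) {
        /* Two children: replace target with its in-order predecessor.
         * If the predecessor is target->left itself, the extraction sets
         * target->left to the predecessor's old left child (possibly NULL).
         * Read target->left only AFTER the extraction and never
         * dereference it without a NULL check. */
        node *repl = extract_rightmost(&target->left);
        repl->left = target->left;
        repl->right = target->right;
        *link = repl;
    } else {
        /* Zero or one child: splice the child (or NULL) into place. */
        *link = target->left != NULL ? target->left : target->right;
    }
    free(target);
    return root;
}
```

Inserting 5, 3, and 8 and then removing 5 exercises exactly the case described above: the left child 3 has no right child, so it is its own subtree's right-most node and becomes the replacement, with a NULL left pointer afterwards.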
Thanks for the report and the debugging work, @NateRoiger.
@NateRoiger, thanks again for the report and the great job in debugging things. That made the fix much easier. I was able to reproduce this segfault.
I think #501 should fix it, and I've optimistically merged that in. It fixes my reproducer.
Would you also please verify that this fixes things for you?
I no longer experience the crash in strmap, but I am now seeing a hang after the broadcast completes. I think that is a different issue, which I can open once I have more information.
The hang begins after "Bcast complete" is printed and lasted until I killed dbcast on worker0.
$ mpirun -hostfile hostfile --mca btl_tcp_if_exclude virbr0,lo,ib0 dbcast /home/mpiuser/cloud/128-files.0.0 /home/mpiuser/dbcast.file
[2021-10-11T10:38:25] Creating destination directories for `/home/mpiuser/dbcast.file`
[2021-10-11T10:38:25] Broadcasting contents of `/home/mpiuser/cloud/128-files.0.0` to `/home/mpiuser/dbcast.file`
[2021-10-11T10:38:25] Progress: 100.0% 3.049585 MB/s 0.0 secs remaining
[2021-10-11T10:38:25] Bcast complete: size=1024, time=0.016042 secs, speed=0.060874 MB/sec
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 68558 on node worker0 exited on signal 15 (Terminated).
--------------------------------------------------------------------------
I encounter a segfault when running the dbcast command. I'm using a main node to manage two worker nodes, and so far the crash has always occurred on the second worker. The first worker succeeds.
The command from the main node is:
mpirun -np 128 --hostfile hostfile --mca btl_tcp_if_exclude virbr0,lo,ib0 dbcast /home/mpiuser/cloud/128-files.0.0 /home/mpiuser/dbcastfile.0.0
Every process backtrace shows the same pattern; it looks like some kind of crash in shared memory. I can try recompiling with debug symbols, and I will post a follow-up as I gather more data.