hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License

dbcast Seg Fault #500

Open NateThornton opened 3 years ago

NateThornton commented 3 years ago

When I run the dbcast command I encounter a segfault. I'm using a main node to manage two worker nodes, and the crash has so far always occurred on the second worker; the first worker succeeds.

The command from the main node is mpirun -np 128 --hostfile hostfile --mca btl_tcp_if_exclude virbr0,lo,ib0 dbcast /home/mpiuser/cloud/128-files.0.0 /home/mpiuser/dbcastfile.0.0

Every process backtrace shows the same pattern; it looks like some kind of crash in shared memory. I can try recompiling with debug symbols and will post a follow-up as I gather more data.

Oct  5 16:14:50 localhost systemd-coredump[89949]: Process 89887 (dbcast) of user 1000 dumped core.

Stack trace of thread 89887:
#0  0x00007f0b1d841187 strmap_unset (libmfu.so.3.0.0)
#1  0x00007f0b1d841421 strmap_unsetf (libmfu.so.3.0.0)
#2  0x0000000000402b04 GCS_Shmem_free (dbcast)
#3  0x00000000004060c7 main (dbcast)
#4  0x00007f0b1c4d1493 __libc_start_main (libc.so.6)
#5  0x000000000040226e _start (dbcast)

Stack trace of thread 89910:
#0  0x00007f0b1c59fa41 __poll (libc.so.6)
#1  0x00007f0b1bd73015 poll_dispatch (libopen-pal.so.40)
#2  0x00007f0b1bd6a5d9 opal_libevent2022_event_base_loop (libopen-pal.so.40)
#3  0x00007f0b1bd26e4e progress_engine (libopen-pal.so.40)
#4  0x00007f0b1b2af14a start_thread (libpthread.so.0)
#5  0x00007f0b1c5aadc3 __clone (libc.so.6)

Stack trace of thread 89915:
#0  0x00007f0b1c5ab0f7 epoll_wait (libc.so.6)
#1  0x00007f0b1bd6653d epoll_dispatch (libopen-pal.so.40)
#2  0x00007f0b1bd6a5d9 opal_libevent2022_event_base_loop (libopen-pal.so.40)
#3  0x00007f0b13d646be progress_engine (mca_pmix_pmix3x.so)
#4  0x00007f0b1b2af14a start_thread (libpthread.so.0)
#5  0x00007f0b1c5aadc3 __clone (libc.so.6)

NateThornton commented 3 years ago

The crash occurs in strmap_unset when the node to be removed has two children and replacement == node->left, which happens when node->left itself is the right-most node of the left subtree (i.e., it has no right child).

In this case, the call to strmap_node_extract_single sets node->left to NULL, which causes a crash when it is dereferenced on line 778.
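
To make the failing shape concrete, here is a minimal, self-contained C sketch (hypothetical names, not the actual strmap code) of the corner case: the in-order predecessor chosen as the replacement is node->left itself, so splicing it out leaves node->left NULL for any later unconditional dereference.

#include <assert.h>

/* Hypothetical tree node; the real strmap_node has more fields. */
struct node {
    const char *key;
    struct node *left, *right;
};

/* Replacement for a two-child delete: the in-order predecessor,
 * i.e. the right-most node of the left subtree. */
static struct node *find_replacement(struct node *n)
{
    struct node *r = n->left;
    while (r->right != NULL)
        r = r->right;
    return r;
}

int main(void)
{
    /* Smallest shape that triggers the case:
     *     B        removing B picks A as the replacement, and
     *    / \       replacement == B->left because A has no
     *   A   C      right child.                              */
    struct node a = { "A", NULL, NULL };
    struct node c = { "C", NULL, NULL };
    struct node b = { "B", &a, &c };

    struct node *replacement = find_replacement(&b);
    assert(replacement == b.left);

    /* Extracting the replacement (what strmap_node_extract_single
     * does in the report) empties the left subtree ... */
    b.left = replacement->left;   /* NULL here */

    /* ... so a later unconditional access such as b.left->right
     * would segfault; a NULL check on node->left avoids it. */
    assert(b.left == NULL);
    return 0;
}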

I'm not sure of the best way to fix this, so I'll leave that to the authors.

adammoody commented 3 years ago

Thanks for the report and the debugging work, @NateRoiger .

adammoody commented 3 years ago

@NateRoiger , thanks again for the report and the great job in debugging things. That made the fix much easier. I was able to reproduce this segfault.

I think #501 should fix it, and I've optimistically merged that in. It fixes my reproducer.

Would you also please verify that this fixes things for you?

NateThornton commented 3 years ago

I no longer see the crash in strmap, but I am now experiencing a hang after the broadcast completes. I think that is a different issue, which I can open once I have more information.

The hang occurs after "Bcast complete" and persists until I kill dbcast on worker0.

$ mpirun -hostfile hostfile --mca btl_tcp_if_exclude virbr0,lo,ib0 dbcast /home/mpiuser/cloud/128-files.0.0 /home/mpiuser/dbcast.file
[2021-10-11T10:38:25] Creating destination directories for `/home/mpiuser/dbcast.file`
[2021-10-11T10:38:25] Broadcasting contents of `/home/mpiuser/cloud/128-files.0.0` to `/home/mpiuser/dbcast.file`
[2021-10-11T10:38:25] Progress: 100.0% 3.049585 MB/s 0.0 secs remaining
[2021-10-11T10:38:25] Bcast complete: size=1024, time=0.016042 secs, speed=0.060874 MB/sec
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 68558 on node worker0 exited on signal 15 (Terminated).
--------------------------------------------------------------------------