Segmentation fault in Open MPI MTL/ofi, possibly related to libfabric or librdmacm

mkre commented 4 years ago

I originally reported this in https://github.com/open-mpi/ompi/issues/7136. I'm going to copy most of the relevant information from the Open MPI ticket here, but you can find more on the original ticket.

This issue happens with Open MPI 3.1.3/4.0.1 and Libfabric 1.8.0/1.8.1 on a dual-socket AMD EPYC 7601 system (shared memory).

We are seeing segmentation faults with both Open MPI 3.1.3 and 4.0.1 when running with -mca pml cm -mca mtl ofi in a shared-memory run. Everything works fine with -mca pml ob1 -mca btl self,vader. Our application fails with the stack trace in epyc_verbose_out_norxm.txt, which seems to indicate a segfault in MTL/ofi.

When we run the application under valgrind, the first seemingly relevant part of the stack trace is this:

==109943== Invalid read of size 4                                                                                                                                                                                                                                                 
==109943==    at 0x24D6C659: rdma_reject (in /usr/lib64/librdmacm.so.1.0.13)             
==109943==    by 0x24AFFCAA: fi_ibv_msg_ep_reject (verbs_cm.c:285)                                                                                                                                                                                                                
==109943==    by 0x2806983D: fi_reject (fi_cm.h:111)                                                                                                                                                     
==109943==    by 0x2806D6E1: rxm_msg_process_connreq (rxm_conn.c:1163)                                                                                                                                                                                                            
==109943==    by 0x2806DECB: rxm_conn_handle_event (rxm_conn.c:1272)                                                                                                                                                                                                              
==109943==    by 0x2806BCCB: rxm_msg_eq_progress (rxm_conn.c:637)                                                                                                                                                                                                                 
==109943==    by 0x2806BF32: rxm_cmap_connect (rxm_conn.c:678)                                                                                                                                                                                                                    
==109943==    by 0x28070B73: rxm_ep_prepare_tx (rxm.h:988)                                                                                                                                   
==109943==    by 0x28075B43: rxm_ep_tinject (rxm_ep.c:1799)                                                                                                                                                                                                                       
==109943==    by 0x23E77409: fi_tinject (fi_tagged.h:134)                                                                                                                
==109943==    by 0x23E78B4D: ompi_mtl_ofi_send_start (mtl_ofi.h:288)                                                                                                                                                                                                              
==109943==    by 0x23E78B4D: ompi_mtl_ofi_isend (mtl_ofi.h:381)                                                                                                                                          
==109943==    by 0x23C699FA: mca_pml_cm_isend (pml_cm.h:314)                                                                                                                                                                                                                      
==109943==  Address 0x2cf82ad0 is 544 bytes inside a block of size 576 free'd                                                                                             
==109943==    at 0x4C2ACDD: free (vg_replace_malloc.c:530)                                                                                                                                                                                                                        
==109943==    by 0x24D6CB4A: rdma_destroy_id (in /usr/lib64/librdmacm.so.1.0.13)                                                                                                                                                                                                  
==109943==    by 0x24B0F288: fi_ibv_ep_close (verbs_ep.c:251)                                                                                                                                                                                                                     
==109943==    by 0x28068E98: fi_close (fabric.h:559)                                                                                                                                                                                                                              
==109943==    by 0x2806C875: rxm_msg_ep_open (rxm_conn.c:869)                                                                                                                                                                                                                     
==109943==    by 0x2806D595: rxm_msg_process_connreq (rxm_conn.c:1141)                                                                                                                                                                                                            
==109943==    by 0x2806DECB: rxm_conn_handle_event (rxm_conn.c:1272)                                                                                                                                                                                                              
==109943==    by 0x2806BCCB: rxm_msg_eq_progress (rxm_conn.c:637)                                                                                                                                                                                                                 
==109943==    by 0x2806BF32: rxm_cmap_connect (rxm_conn.c:678)                                                                                                                                                                                                                    
==109943==    by 0x28070B73: rxm_ep_prepare_tx (rxm.h:988)                                                                                                                                                                                                                        
==109943==    by 0x28075B43: rxm_ep_tinject (rxm_ep.c:1799)                                                                                                                                                                                                                       
==109943==    by 0x23E77409: fi_tinject (fi_tagged.h:134)                                                                                                                                                                                                                         
==109943==  Block was alloc'd at                                                                                                                                                                                                                                                  
==109943==    at 0x4C2B975: calloc (vg_replace_malloc.c:711)                                                                                                                                                                                                                      
==109943==    by 0x24D6CFB2: ??? (in /usr/lib64/librdmacm.so.1.0.13)                                                                                                                                                                                                              
==109943==    by 0x24D6D855: ??? (in /usr/lib64/librdmacm.so.1.0.13)
==109943==    by 0x24B0915D: fi_ibv_eq_read (verbs_eq.c:803)
==109943==    by 0x28069480: fi_eq_read (fi_eq.h:352)
==109943==    by 0x28069D38: rxm_eq_read (rxm_conn.c:98)
==109943==    by 0x2806BC8C: rxm_msg_eq_progress (rxm_conn.c:632)
==109943==    by 0x2806BF32: rxm_cmap_connect (rxm_conn.c:678)
==109943==    by 0x28070B73: rxm_ep_prepare_tx (rxm.h:988)                                                                                                                                                                                                                        
==109943==    by 0x28075B43: rxm_ep_tinject (rxm_ep.c:1799)                                                                                                                                                                                                                      
==109943==    by 0x23E77409: fi_tinject (fi_tagged.h:134)                                                                                                                                                                                                                         
==109943==    by 0x23E78B4D: ompi_mtl_ofi_send_start (mtl_ofi.h:288)                                                                                                     
==109943==    by 0x23E78B4D: ompi_mtl_ofi_isend (mtl_ofi.h:381)

But the valgrind stack trace also contains the following part (searching for ompi_mtl_ofi_progress_no_inline because it appeared in the no-valgrind stack trace):

==80733== Process terminating with default action of signal 15 (SIGTERM)
==80733==    at 0x54646AB: ??? (in /usr/lib64/libpthread-2.17.so)
==80733==    by 0x247586B4: ??? (in /usr/lib64/librdmacm.so.1.0.13)
==80733==    by 0x244F415D: fi_ibv_eq_read (verbs_eq.c:803)
==80733==    by 0x27A54480: fi_eq_read (fi_eq.h:352)
==80733==    by 0x27A54D38: rxm_eq_read (rxm_conn.c:98)                                                                                                                                                                                                                          
==80733==    by 0x27A56C8C: rxm_msg_eq_progress (rxm_conn.c:632)                                                                                                                                                                                                                 
==80733==    by 0x27A68E5B: rxm_ep_do_progress (rxm_cq.c:1421)                                                                                                                                                                                                                   
==80733==    by 0x27A68F24: rxm_ep_progress (rxm_cq.c:1437)                                                                                                                                                                                                                      
==80733==    by 0x27A8E602: ofi_cq_progress (util_cq.c:594)
==80733==    by 0x27A8D875: ofi_cq_readfrom (util_cq.c:245)                                                                                                                                                                                                                      
==80733==    by 0x27A8DB7B: ofi_cq_read (util_cq.c:312)
==80733==    by 0x2812F3E2: ompi_mtl_ofi_progress_no_inline (in /home/moritzk/nightly/STAR-CCM+15.01.065/mpi/openmpi/3.1.3-cda-006/linux-x86_64-2.12/gnu7.1/lib/openmpi/mca_mtl_ofi.so)

Please let me know if you need any further information or if there is anything I should try out.

Thanks, Moritz

shefty commented 4 years ago

Where did you get the libibverbs and librdmacm libraries?

mkre commented 4 years ago

Those are installed from the official CentOS 7 repo:

> rpm -qf /usr/lib64/librdmacm.so.1.0.13
librdmacm-13-7.el7.x86_64
> rpm -qf /usr/lib64/libibverbs.so.1.1.13 
libibverbs-13-7.el7.x86_64

I have tried to download and use the latest version of librdmacm (librdmacm-22.1-3.el7.x86_64.rpm) via LD_LIBRARY_PATH, which also resulted in a failure similar to the one seen in epyc_verbose_out_norxm.txt.

shefty commented 4 years ago

Can you check against libfabric 1.9.0rc3? This looks like some sort of data/stack corruption. I haven't seen any issues myself, but I use Xeon (obviously).

mkre commented 4 years ago

@shefty, same problem with 1.9.0rc3. I also haven't seen similar issues before (mostly on Xeon systems). Is there anything else I can do?

shefty commented 4 years ago

The original post mentions a shared memory run. What does that mean? Are the stock CentOS drivers being used as well? I'm not sure where to go with this at the moment. I'll try analyzing the code from the stack trace.

mkre commented 4 years ago

> The original post mentions a shared memory run. What does that mean?

We're not running over any network, but only inside a single host.

> Are the stock CentOS drivers being used as well?

Which drivers are you talking about exactly?

> I'm not sure where to go with this at the moment. I'll try analyzing the code from the stack trace.

Thanks, let me know if you need any further information or if I should test anything.

Moritz

shefty commented 4 years ago

I don't understand what's happening at all now. If this is only using shared memory, why is librdmacm being accessed at all? By drivers, I was referring to the kernel RDMA drivers.

mkre commented 4 years ago

Maybe the librdmacm part of the valgrind stack trace is a red herring, and we should rather only look at the non-valgrind stack trace from epyc_verbose_out_norxm.txt?

Regarding the RDMA drivers, we are also using the stock CentOS rdma-core package (version 13-7.el7).

shefty commented 4 years ago

How is OMPI actually transferring data? Is it using its own shared memory implementation? Are the transfers going through libfabric?

mkre commented 4 years ago

In this case, it is using Libfabric because it is told to do so with -mca pml cm -mca mtl ofi. If we use Open MPI's built-in Vader transport (-mca pml ob1 -mca btl self,vader), everything is working fine.
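For anyone trying to reproduce this, the two configurations differ only in the MCA options passed to mpirun. A minimal sketch, assuming a placeholder process count and binary name (NP and APP below are illustrative, not taken from our actual job):

```shell
#!/bin/sh
# NP and APP are placeholders, not values from this report.
NP=${NP:-2}
APP=${APP:-./my_mpi_app}

# Failing configuration: CM PML with the OFI MTL (traffic goes through libfabric).
FAILING_CMD="mpirun -np $NP -mca pml cm -mca mtl ofi $APP"
# Working configuration: ob1 PML with the self and vader BTLs.
WORKING_CMD="mpirun -np $NP -mca pml ob1 -mca btl self,vader $APP"

# Print both commands by default; set RUN=1 to actually execute them.
for cmd in "$FAILING_CMD" "$WORKING_CMD"; do
    echo "$cmd"
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    fi
done
```

Without RUN=1 the script only prints the two command lines, so it is safe to run on a machine without Open MPI installed.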

shefty commented 4 years ago

So, the transfers are going through libfabric using the verbs provider? And there is no use of shared memory?

rajachan commented 4 years ago

The OFI MTL in Open MPI does not support shared memory communication by itself, so if you run with the verbs provider, intranode communication would loopback over verbs OFI endpoints.
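To see which providers libfabric can find on a node, the fi_info utility that ships with libfabric may help. A small sketch, assuming fi_info is installed and on PATH (the helper name list_providers is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: list the libfabric providers visible on this host.
list_providers() {
    if command -v fi_info >/dev/null 2>&1; then
        # -l lists the available providers.
        fi_info -l
    else
        echo "fi_info not found in PATH"
    fi
}

list_providers
```

If verbs shows up in that list, `fi_info -p verbs` prints the endpoints that provider offers, which is a quick way to confirm that a verbs path exists on a single host.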

mkre commented 4 years ago

Thanks for chiming in, @rajachan. How do you want to proceed with this issue? Is there anything else I can provide?

shefty commented 4 years ago

We are just working through several other bugs at the moment before we can get to this one.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 360 days with no activity. Remove stale label or comment, otherwise it will be closed in 7 days.

ofiwg / libfabric

Segmentation fault in Open MPI MTL/ofi, possibly related to libfabric or librdmacm #5444