Where did you get the libibverbs and librdmacm libraries?
Those are installed from the official CentOS 7 repo:
> rpm -qf /usr/lib64/librdmacm.so.1.0.13
librdmacm-13-7.el7.x86_64
> rpm -qf /usr/lib64/libibverbs.so.1.1.13
libibverbs-13-7.el7.x86_64
I have tried to download and use the latest version of librdmacm (librdmacm-22.1-3.el7.x86_64.rpm) via LD_LIBRARY_PATH, which also resulted in a failure similar to the one seen in epyc_verbose_out_norxm.txt.
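Roughly, the override looked like the following; extracting the RPM into a local directory is just one way to test a newer library without installing it system-wide, and the paths and application name are placeholders:
# example only: unpack the newer librdmacm locally and prepend its lib dir
> rpm2cpio librdmacm-22.1-3.el7.x86_64.rpm | cpio -idmv
> LD_LIBRARY_PATH=$PWD/usr/lib64:$LD_LIBRARY_PATH mpirun -mca pml cm -mca mtl ofi ./app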
Can you check against libfabric 1.9.0rc3? This looks like some sort of data/stack corruption. I haven't seen any issues myself, but I use Xeon (obviously).
@shefty, same problem with 1.9.0rc3. I also haven't seen similar issues before (mostly on Xeon systems). Is there anything else I can do?
The original post mentions a shared memory run. What does that mean? Are the stock CentOS drivers being used as well? I'm not sure where to go with this at the moment. I'll try analyzing the code from the stack trace.
> The original post mentions a shared memory run. What does that mean?
We're not running over any network; all processes run on a single host.
> Are the stock CentOS drivers being used as well?
Which drivers are you talking about exactly?
> I'm not sure where to go with this at the moment. I'll try analyzing the code from the stack trace.
Thanks, let me know if you need any further information or if I should test anything.
Moritz
I don't understand what's happening at all now. If this is only using shared memory, why is the librdmacm being accessed at all? I was referring to the kernel RDMA drivers.
Maybe the librdmacm part of the valgrind stack trace is a red herring, and we should instead only look at the non-valgrind stack trace from epyc_verbose_out_norxm.txt?
Regarding the RDMA drivers, we are also using the stock CentOS rdma-core package (version 13-7.el7).
How is OMPI actually transferring data? Is it using its own shared memory implementation? Are the transfers going through libfabric?
In this case, it is using libfabric because it is told to do so with -mca pml cm -mca mtl ofi. If we use Open MPI's built-in Vader transport (-mca pml ob1 -mca btl self,vader), everything works fine.
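For completeness, the two launch lines look roughly like this (./app and the rank count are placeholders):
# fails: intra-node traffic goes through libfabric via the OFI MTL
> mpirun -np 64 -mca pml cm -mca mtl ofi ./app
# works: Open MPI's built-in shared-memory (Vader) transport
> mpirun -np 64 -mca pml ob1 -mca btl self,vader ./app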
So, the transfers are going through libfabric using the verbs provider? And there is no use of shared memory?
The OFI MTL in Open MPI does not support shared-memory communication by itself, so if you run with the verbs provider, intra-node communication would loop back over verbs OFI endpoints.
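If the fi_info utility that ships with libfabric is available on the node, it can show what the verbs provider exposes there (output will vary by system):
> fi_info -p verbs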
Thanks for chiming in, @rajachan. How do you want to proceed with this issue? Is there anything else I can provide?
We are just working through several other bugs at the moment before we can get to this one.
This issue is stale because it has been open 360 days with no activity. Remove stale label or comment, otherwise it will be closed in 7 days.
I originally reported this in https://github.com/open-mpi/ompi/issues/7136. I'm going to copy most of the relevant information from the Open MPI ticket here, but you can find more on the original ticket.
This issue happens with Open MPI 3.1.3/4.0.1 and Libfabric 1.8.0/1.8.1 on a dual-socket AMD EPYC 7601 system (shared memory).
We are seeing segmentation faults with both Open MPI 3.1.3 and 4.0.1 and -mca pml cm -mca mtl ofi in a shared-memory run. Everything works fine with -mca pml ob1 -mca btl self,vader. Our application fails with the following stack trace, which seems to indicate a segfault in the OFI MTL: epyc_verbose_out_norxm.txt

When we run the application under valgrind, the first seemingly relevant part of the stack trace is this:
But the valgrind stack trace also contains the following part (searching for ompi_mtl_ofi_progress_no_inline because it appeared in the non-valgrind stack trace):

Please let me know if you need any further information or if there is anything I should try out.
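For reference, a valgrind run of this kind can be launched along these lines (binary name and rank count are placeholders):
> mpirun -np 2 -mca pml cm -mca mtl ofi valgrind ./app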
Thanks, Moritz