Closed chrisdolan closed 3 years ago
I think @a-ilango will need to look at this. Just looking at the function in question, I would have guessed rx_buf->conn would be non-null. I see places in the code where this appears:
rx_buf->conn = rxm_key2conn(...);
if (!rx_buf->conn)
There are a couple of places where the if() check is missing. But I don't understand the code well enough to even know why that check is needed.
@chrisdolan v1.6.2 uses shared receive context by default and that's where rx_buf->conn needs to be set. There might be some serialization issues in the CQ read, AV insert path but if Open MPI uses FI_THREAD_DOMAIN, we shouldn't hit it.
Which IMB test and benchmark is this? How many ranks? I'll check if I get the issue when I enable shared receive context on master.
No action in 2 years, problem is likely already addressed in newer releases. But if reproducible in a later release, please open a new issue. Trying to close all older issues that have not had activity to refocus on problems in current releases
Using Libfabric 1.6.2 with OpenMPI and IMB, I've experienced an intermittent crash when rx_buf->conn is null during early startup with the following stack. I've examined the code at HEAD and it looks like it might have the same issue.
As a workaround, I added
if (!rx_buf->conn) continue;
at the following line and it seems to prevent the crash. But I don't understand the code well enough to know if my workaround is a good idea or a terrible one.https://github.com/ofiwg/libfabric/blob/069c3935cfc0716de17391c362fb48028d031617/prov/rxm/src/rxm_conn.c#L942
Crash stack: libfabric/prov/rxm/src/rxm_cq.c rxm_cq_reprocess_directed_recvs libfabric/prov/rxm/src/rxm_cq.c rxm_cq_reprocess_recv_queues libfabric/prov/rxm/src/rxm_cq.c rxm_ep_progress_one libfabric/prov/util/src/util_cq.c ofi_cq_progress libfabric/prov/util/src/util_cq.c ofi_cq_readfrom libfabric/include/rdma/fi_eq.h ompi_mtl_ofi_progress_no_inline openmpi/opal/runtime/opal_progress.c opal_progress openmpi/ompi/request/request.h ompi_request_default_wait openmpi/ompi/mca/coll/base/coll_base_util.c ompi_coll_base_sendrecv_actual openmpi/ompi/mca/coll/base/coll_base_util.h ompi_coll_base_allgather_intra_recursivedoubling openmpi/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c ompi_coll_tuned_allgather_intra_dec_fixed openmpi/ompi/communicator/comm.c ompi_comm_split openmpi/ompi/mpi/c/profile/pcomm_split.c PMPI_Comm_split intel_mpi_benchmarks/src/IMB_init.c IMB_set_communicator intel_mpi_benchmarks/src/IMB_init.c IMB_init_communicator intel_mpi_benchmarks/src/IMB.c main