ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/
Other
550 stars 376 forks source link

RxM: null rx_buf->conn during reprocessing recv #4893

Closed chrisdolan closed 3 years ago

chrisdolan commented 5 years ago

Using Libfabric 1.6.2 with OpenMPI and IMB, I've experienced an intermittent crash when rx_buf->conn is null during early startup with the following stack. I've examined the code at HEAD and it looks like it might have the same issue.

As a workaround, I added if (!rx_buf->conn) continue; at the following line and it seems to prevent the crash. But I don't understand the code well enough to know if my workaround is a good idea or a terrible one.

https://github.com/ofiwg/libfabric/blob/069c3935cfc0716de17391c362fb48028d031617/prov/rxm/src/rxm_conn.c#L942

Crash stack: libfabric/prov/rxm/src/rxm_cq.c rxm_cq_reprocess_directed_recvs libfabric/prov/rxm/src/rxm_cq.c rxm_cq_reprocess_recv_queues libfabric/prov/rxm/src/rxm_cq.c rxm_ep_progress_one libfabric/prov/util/src/util_cq.c ofi_cq_progress libfabric/prov/util/src/util_cq.c ofi_cq_readfrom libfabric/include/rdma/fi_eq.h ompi_mtl_ofi_progress_no_inline openmpi/opal/runtime/opal_progress.c opal_progress openmpi/ompi/request/request.h ompi_request_default_wait openmpi/ompi/mca/coll/base/coll_base_util.c ompi_coll_base_sendrecv_actual openmpi/ompi/mca/coll/base/coll_base_util.h ompi_coll_base_allgather_intra_recursivedoubling openmpi/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c ompi_coll_tuned_allgather_intra_dec_fixed openmpi/ompi/communicator/comm.c ompi_comm_split openmpi/ompi/mpi/c/profile/pcomm_split.c PMPI_Comm_split intel_mpi_benchmarks/src/IMB_init.c IMB_set_communicator intel_mpi_benchmarks/src/IMB_init.c IMB_init_communicator intel_mpi_benchmarks/src/IMB.c main

shefty commented 5 years ago

I think @a-ilango will need to look at this. Just looking at the function in question, I would have guessed rx_buf->conn would be non-null. I see places in the code where this appears:

rx_buf->conn = rxm_key2conn(...);
if (!rx_buf->conn)

There are a couple of places where the if() check is missing. But I don't understand the code well enough to even know why that check is needed.

arn314 commented 5 years ago

@chrisdolan v1.6.2 uses shared receive context by default and that's where rx_buf->conn needs to be set. There might be some serialization issues in the CQ read, AV insert path but if Open MPI uses FI_THREAD_DOMAIN, we shouldn't hit it.

Which IMB test and benchmark is this? How many ranks? I'll check if I get the issue when I enable shared receive context on master.

shefty commented 3 years ago

No action in 2 years, problem is likely already addressed in newer releases. But if reproducible in a later release, please open a new issue. Trying to close all older issues that have not had activity to refocus on problems in current releases