open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

ibm/pt2pt/sendrecv_big test hangs when using OFI transport and some OFI providers #9160

Closed: hppritcha closed this issue 3 years ago

hppritcha commented 3 years ago

The ibm pt2pt sendrecv_big test hangs when using the OFI MTL with the PSM2 libfabric provider, as well as with other providers that enforce a maximum fi_tsend message length.

The test emits the expected warning message ("Message size 30720000000 bigger than supported by selected transport. Max = 4294963200") and then hangs.
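
For context, here is a hedged sketch of a reproducer along the lines of what the ibm test does (assumed shape only, not the actual test source): a roughly 30 GB self send/receive, so the total payload exceeds the provider's maximum fi_tsend length. The 30,720,000,000-byte size matches the warning above.

    /* Hypothetical reproducer sketch, not the actual ibm/pt2pt/sendrecv_big source. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        /* 1000 elements of a 30,720,000-byte contiguous type = 30,720,000,000 bytes */
        const int count = 1000;
        const int blocklen = 30720000;
        MPI_Datatype bigtype;
        char *sbuf, *rbuf;

        MPI_Init(&argc, &argv);
        MPI_Type_contiguous(blocklen, MPI_CHAR, &bigtype);
        MPI_Type_commit(&bigtype);

        /* Large allocations; this sketch assumes enough memory is available */
        sbuf = malloc((size_t)count * blocklen);
        rbuf = malloc((size_t)count * blocklen);

        /* Self send/receive on MPI_COMM_SELF, consistent with the error output shown later */
        MPI_Sendrecv(sbuf, count, bigtype, 0, 42,
                     rbuf, count, bigtype, 0, 42,
                     MPI_COMM_SELF, MPI_STATUS_IGNORE);

        MPI_Type_free(&bigtype);
        free(sbuf);
        free(rbuf);
        MPI_Finalize();
        return 0;
    }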

The OFI MTL is properly returning an error code to the upper layers of OMPI, but this code in sendrecv.c

 #if OPAL_ENABLE_FT_MPI
         /* If ULFM is enabled we need to wait for the posted receive to
          * complete, hence we cannot return here */
         rcs = rc;
 #else

results in the error return code being ignored.
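
For reference, a paraphrased sketch of the surrounding control flow (an approximation, not a verbatim copy of sendrecv.c): the FT build only records the send error in rcs and falls through to wait on the posted receive, while the non-FT build hands the error to the communicator's error handler right away.

    /* Paraphrased sketch of the MPI_Sendrecv send/error path (approximate, not verbatim) */
    rc = MCA_PML_CALL(send(sendbuf, sendcount, sendtype, dest,
                           sendtag, MCA_PML_BASE_SEND_STANDARD, comm));
    #if OPAL_ENABLE_FT_MPI
        /* FT build: remember the error and fall through to wait on the posted
         * receive. With the OFI "message too big" error that receive is never
         * matched, so the wait hangs. */
        rcs = rc;
    #else
        /* Non-FT build: report the error immediately through the communicator's
         * error handler (MPI_ERRORS_ARE_FATAL aborts, as in the output below). */
        OMPI_ERRHANDLER_CHECK(rc, comm, rc, FUNC_NAME);
    #endif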

If one comments out these OPAL_ENABLE_FT_MPI code blocks, the test errors out as expected:

[x232:00000] *** An error occurred in MPI_Sendrecv
[x232:00000] *** reported by process [2952003585,0]
[x232:00000] *** on communicator MPI_COMM_SELF
[x232:00000] *** MPI_ERR_OTHER: known error not in list
[x232:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[x232:00000] ***    and MPI will try to terminate your MPI job as well)

This problem is observed on the master and v5.0.x branches.

hppritcha commented 3 years ago

From the devel-core email thread:

Howard,

The underlying logic (for the ERR_PROC_FAILED case) is the following:

1. When an error is reported on the send side (e.g., the destination has failed), we still need to match the recv-side operation, because the source may be different from the destination; in that case the source may still be alive, and we need to match its send.

2. If we do post a reception, we need to wait for it to complete, until the recv buffer is de-registered from the device, before returning from a blocking call; otherwise the user may de-allocate the recv buffer while the BTL is still working on it, and the whole program would go up in flames. We operate under the assumption that we will recover from the failure and keep going, so that is not acceptable. (A small illustration of this scenario follows the list.)
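
As a concrete, hypothetical illustration of the two points above, consider three ranks where rank 0 does a Sendrecv with dest = 1 and source = 2, and rank 1 has failed:

    /* Hypothetical 3-rank illustration, not taken from the test suite. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        char sbuf[8] = "payload", rbuf[8];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (0 == rank) {
            /* Suppose rank 1 (the destination) has died: the send raises
             * MPI_ERR_PROC_FAILED, but rank 2 (the source) is still alive,
             * so the receive must still be posted, matched, and completed
             * before Sendrecv returns; only then is rbuf safe to reuse. */
            MPI_Sendrecv(sbuf, 8, MPI_CHAR, /* dest   */ 1, 0,
                         rbuf, 8, MPI_CHAR, /* source */ 2, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (1 == rank) {
            /* In the failure scenario this rank is dead; in a normal run it
             * posts the matching receive so the example completes. */
            MPI_Recv(rbuf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (2 == rank) {
            /* Alive source: its send to rank 0 has to be matched. */
            MPI_Send(sbuf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }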

~~~

This causes issues in your specific scenario because, when there is a 'symmetric' send error (a local error condition on all sides), all processes enter the recv, which is never matched by a send (given that the corresponding send just failed two lines above at the source), and they all wait instead of triggering the errhandler/aborting as they should.

A complete resolution will need to accommodate both scenarios without the use of ifdefs. I am looking at options; I think a simple runtime conditional on ERR_PROC_FAILED will fix the immediate problem.
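
A minimal sketch of what such a runtime conditional might look like (hypothetical; the actual change went in via #9296 and may differ):

    #if OPAL_ENABLE_FT_MPI
        /* Hypothetical sketch of the runtime conditional described above; the
         * real change is in #9296 and may differ. Only defer the error and keep
         * waiting on the posted receive when the send failed because the peer
         * died; any other error (such as the OFI "message too big" case) goes
         * straight to the error handler, as in the non-FT build. The posted
         * receive would still need to be cleaned up on that early-error path. */
        if (MPI_ERR_PROC_FAILED == rc) {
            rcs = rc;
        } else {
            OMPI_ERRHANDLER_CHECK(rc, comm, rc, FUNC_NAME);
        }
    #else
        OMPI_ERRHANDLER_CHECK(rc, comm, rc, FUNC_NAME);
    #endif
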
abouteiller commented 3 years ago

Quick update: I have a patch for this; the correctness testing is slowed down by some FT defects caused by changes introduced in https://github.com/open-mpi/ompi/pull/9198.

hppritcha commented 3 years ago

Reopening, as we need the equivalent of #9296 PR'd to v5.0.x.

abouteiller commented 3 years ago

All merged.