Closed hppritcha closed 3 years ago
from devel-core email thread
Howard,
The underlying logic (for the ERR_PROC_FAILED case) is the following:
1. When an error is reported on the send side (e.g., the destination has failed), we still need to match the recv-side operation: the source may be different from the destination, so the source may still be alive, and we need to match its send.
2. If we do post a receive, we need to wait for it, until the recv-buffer is de-registered from the device, before returning from a blocking call. Otherwise the user may de-allocate the recv-buffer while the BTL is still working on it, and the whole program would go up in flames; we operate under the assumption that we will recover from this and be able to keep going, so that’s not acceptable.
This is causing issues with your specific scenario because, when you have a ‘symmetric’ send error (a local condition on all sides), all processes enter the recv, which is unmatched by any send (given that this send just failed two lines above at the source), and they all wait instead of triggering the errhandler/aborting as they should.
A complete resolution will need to accommodate both scenarios without the use of ‘ifdefs’. I am looking at options; I think a simple runtime conditional for ERR_PROC_FAILED will fix the immediate problem.
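For illustration only, the runtime conditional described above might look roughly like the following sketch. This is not the actual patch: the enum values and the function name are hypothetical stand-ins, not Open MPI symbols. The assumption is that the receive has already been posted when the send returns its error code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for MPI error classes; NOT the Open MPI definitions. */
typedef enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERR_PROC_FAILED, /* the destination process has failed */
    SKETCH_ERR_OTHER        /* any other send error, e.g. message too large */
} sketch_err_t;

/*
 * Decide whether a blocking sendrecv should keep waiting for a matching
 * send on its already-posted receive, given the send-side return code.
 * Hypothetical helper, named for this sketch only.
 */
static bool should_wait_for_matching_send(sketch_err_t send_rc)
{
    /* Success is the normal path. For a process failure, the source may
     * differ from the failed destination and still be alive, so its send
     * must be matched. */
    if (SKETCH_SUCCESS == send_rc || SKETCH_ERR_PROC_FAILED == send_rc) {
        return true;
    }
    /* Any other error may be a local condition hit by every process:
     * no matching send will arrive, so waiting would hang. */
    return false;
}
```

In the non-waiting branch the receive would presumably still have to be cancelled and completed locally before the call returns, so that the recv-buffer is released as required by point 2, and only then would the send error be reported to the errhandler.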
Quick update: I have a patch for this, but the correctness testing is slowed down by some FT defects caused by changes introduced in https://github.com/open-mpi/ompi/pull/9198.
Reopening, as we need the equivalent of #9296 PR'd to v5.0.x.
All merged.
The ibm pt2pt sendrecv_big test hangs when using the OFI MTL and PSM2 libfabric provider, and other providers with a maximum fi_tsend message length.
The test emits the expected warning message, "Message size 30720000000 bigger than supported by selected transport. Max = 4294963200", and then hangs.
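To make the numbers in that warning concrete, here is a minimal sketch of the size comparison. The two constants are taken from the message itself; the macro and function names are hypothetical and are not the OFI MTL's own. The reported maximum equals 2^32 - 4096, and the requested size does not fit in 32 bits, so the sketch uses 64-bit arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values quoted in the warning message. */
#define SKETCH_MSG_SIZE      UINT64_C(30720000000)
#define SKETCH_TRANSPORT_MAX UINT64_C(4294963200) /* 2^32 - 4096 */

/* Hypothetical helper: true if a message of len bytes fits the transport. */
static bool sketch_fits_transport(uint64_t len, uint64_t max_len)
{
    return len <= max_len;
}
```

With these values the check fails, which corresponds to the error code that the MTL hands back to the upper layers.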
The OFI MTL is properly returning an error code to the upper layers of OMPI, but this code in sendrecv.c
results in the error return code being ignored.
If one comments out these OPAL_ENABLE_FT_MPI code blocks the test errors out as would be expected:
This problem is observed on the master and v5.0.x branches.