In terms of fixing the issue, I wonder whether we should keep mca_btl_ofi_flush at all.

From what I can tell, it was provided as an optimization and is used by osc/rdma only. When the btl does not support flush, osc/rdma falls back to its own internal counter, which is per communicator. It is not clear to me that using mca_btl_ofi_flush is faster than using osc/rdma's internal counter.

So I propose to remove this function and not set btl_flush for the btl/ofi module.
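Purely for illustration, here is a toy sketch of the fallback this proposal would rely on: when btl_flush is left unset, the one-sided component has to wait on its own per-communicator counter instead. Types and names are made up for the example; this is not the actual osc/rdma code.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy btl module: the flush hook is optional and may be left NULL. */
typedef struct {
    int (*btl_flush)(void *btl_module, void *endpoint);
} toy_btl_module_t;

/* Toy one-sided completion wait: use the btl-provided flush if it exists,
 * otherwise spin on the component's own per-communicator counter, which is
 * decremented from completion callbacks. */
static int wait_for_remote_completion(toy_btl_module_t *btl, void *endpoint,
                                      _Atomic long *per_comm_outstanding,
                                      void (*progress)(void))
{
    if (NULL != btl->btl_flush) {
        /* Fast path: the btl drains its own inflight RDMA operations. */
        return btl->btl_flush(btl, endpoint);
    }
    /* Fallback path: rely on the component's per-communicator counter. */
    while (atomic_load(per_comm_outstanding) > 0) {
        progress();
    }
    return 0;
}
```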
Would it be sufficient to perform the increment of outstanding_rdma before calling fi_read/write/atomic? That would prevent the observed race condition (outstanding_rdma would be guaranteed to be >= the number of active rdma operations), but I don't know the code, so it might break other things...
Would it be sufficient to perform the increment of outstanding_rdma before calling fi_read/write/atomic?
That would be hard to implement, because a call to fi_read/write/atomic can fail when the provider is temporarily out of resources. If we increase the counter before the call, we will need to decrease it again when the call fails.
The question then is how likely the call is to fail. If it's a rare occurrence, then reverting the increment with an atomic decrement is acceptable.
It would depend on the application and the libfabric provider being used. From my experience with the EFA provider, it is not rare.
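For what it's worth, here is a minimal, self-contained sketch of the pre-increment-with-rollback pattern being discussed. The names submit_rdma and outstanding_rdma are placeholders for this example, not the actual btl/ofi code or the contents of the PR:

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>

/* Placeholder completion counter, analogous to outstanding_rdma. */
static _Atomic long outstanding_rdma = 0;

/* Placeholder for fi_read/fi_write/fi_atomic: may fail transiently when the
 * provider is temporarily out of resources. */
static int submit_rdma(void)
{
    return -EAGAIN;   /* pretend the provider has no resources right now */
}

/* Pre-increment pattern: raise the counter *before* submitting, so a
 * concurrent flush can never observe a submitted-but-uncounted operation.
 * On failure the increment has to be rolled back. */
static int post_rdma_counted(void)
{
    atomic_fetch_add(&outstanding_rdma, 1);

    int rc = submit_rdma();
    if (rc != 0) {
        /* Rollback: the operation was never actually posted. */
        atomic_fetch_sub(&outstanding_rdma, 1);
    }
    return rc;
}

int main(void)
{
    int rc = post_rdma_counted();
    printf("rc=%d outstanding_rdma=%ld\n", rc, (long)atomic_load(&outstanding_rdma));
    return 0;
}
```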
@devreal I think your suggestion is a smaller change and is better than simply removing flush (as I originally proposed), so I opened PR https://github.com/open-mpi/ompi/pull/11656 to implement it. Please take a look! Thank you!
PR has been merged
Background information
I noticed this issue when debugging a data corruption issue with the mt_1sided test in the IBM test suite.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

I was using the main branch. From the mtt test results, this also impacts the v5.0.x and v4.1.x branches.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
compiled from source
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running
Details of the problem
The function mca_btl_ofi_flush() is called to ensure that all inflight RDMA operations have completed. It is used under the following workflow:

1. call fi_read/write/atomic to submit an RDMA operation.
2. increase the counter outstanding_rdma.
3. call mca_btl_ofi_flush(), which drives the libfabric progress engine until outstanding_rdma reaches 0, then return.

mca_btl_ofi_flush() promises that upon its return, the RDMA operations submitted by the caller have completed. However, this promise does not hold in a multi-threaded environment, because step 1 (calling fi_read/write/atomic) and step 2 (increasing outstanding_rdma) are not serialized.

For example, consider the following interleaving of two threads:

1. At the beginning, there is no inflight RDMA operation, and outstanding_rdma is 0.
2. Thread 1 calls fi_read.
3. Thread 2 calls fi_read.
4. Thread 1 increases outstanding_rdma to 1. (Note that at this point outstanding_rdma is wrong, because there are 2 inflight fi_read operations.)
5. Thread 1 calls mca_btl_ofi_flush, which drives the libfabric progress engine. The fi_read submitted by thread 2 completes, which decreases outstanding_rdma by 1 to 0, so mca_btl_ofi_flush returns.

Now that mca_btl_ofi_flush has returned, thread 1 thinks the fi_read it submitted has completed, but that is not the case.
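To make the race window concrete, here is a minimal, self-contained C sketch of the pattern described above. All names (submit_rdma, progress_one, flush_all) are placeholders invented for this example; this is not the actual btl/ofi code. The comment marks the point where a concurrent flush can observe a counter that is smaller than the number of inflight operations.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Placeholder completion counter, analogous to outstanding_rdma. */
static _Atomic long outstanding_rdma = 0;

/* Placeholder for fi_read/fi_write/fi_atomic: pretend submission succeeds. */
static int submit_rdma(void) { return 0; }

/* Placeholder for one pass of the libfabric progress engine: in the real
 * code this polls the completion queue and decrements the counter once per
 * retired completion.  Here we pretend every call retires one completion. */
static void progress_one(void)
{
    if (atomic_load(&outstanding_rdma) > 0) {
        atomic_fetch_sub(&outstanding_rdma, 1);
    }
}

/* The racy pattern described in the issue: submit first, count afterwards. */
static int post_rdma(void)
{
    int rc = submit_rdma();                   /* step 1: operation is inflight */
    if (rc != 0) {
        return rc;
    }
    /* <-- race window: a flush running on another thread right now does not
     *     know about the operation we just submitted */
    atomic_fetch_add(&outstanding_rdma, 1);   /* step 2 */
    return 0;
}

/* Analogous to mca_btl_ofi_flush(): drive progress until the counter drops
 * to 0.  Because of the window above, it can return while an uncounted
 * operation is still inflight. */
static void flush_all(void)
{
    while (atomic_load(&outstanding_rdma) > 0) {
        progress_one();
    }
}

int main(void)
{
    post_rdma();
    flush_all();
    printf("outstanding_rdma after flush: %ld\n",
           (long)atomic_load(&outstanding_rdma));
    return 0;
}
```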