In terms of fixing the issue, I wonder whether we should keep mca_btl_ofi_flush at all.

From what I can tell, it was provided as an optimization and is used by osc/rdma only. When the btl does not support flush, osc/rdma falls back to its own internal counter, which is per communicator. It is not clear to me that using mca_btl_ofi_flush is faster than using osc/rdma's internal counter.

So I propose to remove this function and not set btl_flush for the btl/ofi module.
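Purely for illustration, here is a toy sketch of the fallback this proposal would rely on: when btl_flush is left unset, the one-sided component has to wait on its own per-communicator counter instead. Types and names are made up for the example; this is not the actual osc/rdma code.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy btl module: the flush hook is optional and may be left NULL. */
typedef struct {
    int (*btl_flush)(void *btl_module, void *endpoint);
} toy_btl_module_t;

/* Toy one-sided completion wait: use the btl-provided flush if it exists,
 * otherwise spin on the component's own per-communicator counter, which is
 * decremented from completion callbacks. */
static int wait_for_remote_completion(toy_btl_module_t *btl, void *endpoint,
                                      _Atomic long *per_comm_outstanding,
                                      void (*progress)(void))
{
    if (NULL != btl->btl_flush) {
        /* Fast path: the btl drains its own inflight RDMA operations. */
        return btl->btl_flush(btl, endpoint);
    }
    /* Fallback path: rely on the component's per-communicator counter. */
    while (atomic_load(per_comm_outstanding) > 0) {
        progress();
    }
    return 0;
}
```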
Would it be sufficient to perform the increment of outstanding_rdma before calling fi_read/write/atomic? That would prevent the observed race condition (outstanding_rdma would be guaranteed to be >= the number of active rdma operations), but I don't know the code, so it might break other things...
Would it be sufficient to perform the increment of outstanding_rdma before calling fi_read/write/atomic?
That would be hard to implement, because a call to fi_read/write/atomic can fail when the provider is temporarily out of resources. If we increase the counter before the call, we will need to decrease it again when the call fails.
The question then is how likely the call is to fail. If it's a rare occurrence, then reverting the increment with an atomic decrement is acceptable.
It would depend on the application and the libfabric provider being used. From my experience with the EFA provider, it is not rare.
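For what it's worth, here is a minimal, self-contained sketch of the pre-increment-with-rollback pattern being discussed. The names submit_rdma and outstanding_rdma are placeholders for this example, not the actual btl/ofi code or the contents of the PR:

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>

/* Placeholder completion counter, analogous to outstanding_rdma. */
static _Atomic long outstanding_rdma = 0;

/* Placeholder for fi_read/fi_write/fi_atomic: may fail transiently when the
 * provider is temporarily out of resources. */
static int submit_rdma(void)
{
    return -EAGAIN;   /* pretend the provider has no resources right now */
}

/* Pre-increment pattern: raise the counter *before* submitting, so a
 * concurrent flush can never observe a submitted-but-uncounted operation.
 * On failure the increment has to be rolled back. */
static int post_rdma_counted(void)
{
    atomic_fetch_add(&outstanding_rdma, 1);

    int rc = submit_rdma();
    if (rc != 0) {
        /* Rollback: the operation was never actually posted. */
        atomic_fetch_sub(&outstanding_rdma, 1);
    }
    return rc;
}

int main(void)
{
    int rc = post_rdma_counted();
    printf("rc=%d outstanding_rdma=%ld\n", rc, (long)atomic_load(&outstanding_rdma));
    return 0;
}
```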
@devreal I think your suggestion is a smaller change and is better than simply removing flush (as I originally proposed), so I opened PR https://github.com/open-mpi/ompi/pull/11656 to implement it. Please take a look! Thank you!
PR has been merged
Background information
I noticed this issue when debugging a data corruption issue with the mt_1sided test in the IBM test suite.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

I was using the main branch. From the mtt test results, this also impacts the v5.0.x and v4.1.x branches.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
compiled from source
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running
Details of the problem
The function mca_btl_ofi_flush() is called to ensure that all inflight RDMA operations have completed. It is used under the following workflow:

1. call fi_read/write/atomic to submit an RDMA operation.
2. increase the counter outstanding_rdma.
3. call mca_btl_ofi_flush(), which drives the libfabric progress engine until outstanding_rdma reaches 0, then return.

mca_btl_ofi_flush() promises that upon its return, the RDMA operations submitted by the caller have completed. However, this promise does not hold in a multi-threaded environment, because step 1 (calling fi_read/write/atomic) and step 2 (increasing outstanding_rdma) are not serialized.

For example, consider the following interleaving of two threads:

1. At the beginning, there is no inflight RDMA operation, and outstanding_rdma is 0.
2. Thread 1 calls fi_read.
3. Thread 2 calls fi_read.
4. Thread 1 increases outstanding_rdma to 1. (Note that at this point outstanding_rdma is wrong, because there are 2 inflight fi_read operations.)
5. Thread 1 calls mca_btl_ofi_flush, which drives the libfabric progress engine. The fi_read submitted by thread 2 completes, which decreases outstanding_rdma by 1 to 0, so mca_btl_ofi_flush returns.

Now that mca_btl_ofi_flush has returned, thread 1 thinks the fi_read it submitted has completed, but that is not the case.
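To make the race window concrete, here is a minimal, self-contained C sketch of the pattern described above. All names (submit_rdma, progress_one, flush_all) are placeholders invented for this example; this is not the actual btl/ofi code. The comment marks the point where a concurrent flush can observe a counter that is smaller than the number of inflight operations.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Placeholder completion counter, analogous to outstanding_rdma. */
static _Atomic long outstanding_rdma = 0;

/* Placeholder for fi_read/fi_write/fi_atomic: pretend submission succeeds. */
static int submit_rdma(void) { return 0; }

/* Placeholder for one pass of the libfabric progress engine: in the real
 * code this polls the completion queue and decrements the counter once per
 * retired completion.  Here we pretend every call retires one completion. */
static void progress_one(void)
{
    if (atomic_load(&outstanding_rdma) > 0) {
        atomic_fetch_sub(&outstanding_rdma, 1);
    }
}

/* The racy pattern described in the issue: submit first, count afterwards. */
static int post_rdma(void)
{
    int rc = submit_rdma();                   /* step 1: operation is inflight */
    if (rc != 0) {
        return rc;
    }
    /* <-- race window: a flush running on another thread right now does not
     *     know about the operation we just submitted */
    atomic_fetch_add(&outstanding_rdma, 1);   /* step 2 */
    return 0;
}

/* Analogous to mca_btl_ofi_flush(): drive progress until the counter drops
 * to 0.  Because of the window above, it can return while an uncounted
 * operation is still inflight. */
static void flush_all(void)
{
    while (atomic_load(&outstanding_rdma) > 0) {
        progress_one();
    }
}

int main(void)
{
    post_rdma();
    flush_all();
    printf("outstanding_rdma after flush: %ld\n",
           (long)atomic_load(&outstanding_rdma));
    return 0;
}
```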