mpiwg-rma / rma-issues

Repository to discuss internal RMA working group issues

RFC: Window handle duplication and new info keys for thread-local RMA #13

Open devreal opened 4 years ago

devreal commented 4 years ago

The Problem

MPI RMA currently allows completion control only at the process level. With multiple threads, a flush in one thread may have to wait for operations started by another thread. Implementations may use thread-local endpoints to reduce inter-thread synchronization, but a flush still requires all operations on all endpoints to complete. In many cases of thread-parallel RMA, flushing only the operations issued by the current thread is sufficient, but in some cases users may still want to flush all outstanding operations from all threads (e.g., after joining threads with outstanding RMA operations).
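A minimal sketch of the issue (names are illustrative; this assumes MPI_THREAD_MULTIPLE and a passive-target access epoch already opened on win via MPI_Win_lock_all):

```c
#include <mpi.h>
#include <omp.h>

/* Two threads issue Puts through the same window. Under current MPI
 * semantics, the MPI_Win_flush in thread 0 also completes any Put
 * already issued by thread 1, even though thread 0 never needs it. */
void flush_waits_on_other_thread(MPI_Win win, double *src, int target)
{
    #pragma omp parallel num_threads(2)
    {
        int me = omp_get_thread_num();
        MPI_Put(&src[me], 1, MPI_DOUBLE, target,
                (MPI_Aint)me, 1, MPI_DOUBLE, win);
        if (me == 0) {
            /* completes ALL outstanding operations on win at target,
             * not just the ones issued by this thread */
            MPI_Win_flush(target, win);
        }
    }
}
```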

OpenSHMEM has introduced the concept of contexts, which are essentially explicitly managed endpoints. IIRC, the results have been promising in multi-threaded situations. However, it required a fairly invasive extension of the OpenSHMEM API to add functions accepting contexts. I don't think this is the way to go for MPI.

AFAIK, there have been ideas floating around about thread-local flush operations (like MPI_Win_flush_thread), but I'm not sure what their status is. I would like to put the following ideas out for discussion; maybe something like this has been discussed (and dismissed) before?

The Proposal

The proposal does three things (a sketch of the proposed info keys follows the list):

1) Allow MPI windows to be duplicated in a way that the original and the duplicated window both point to the same memory and have some consistency guarantees (more on that later) but potentially different info hints.

2) Add an info key indicating that all accesses through the window handle will be single-threaded (something like mpi_assert_thread_access with values similar to MPI_THREAD_[SINGLE|FUNNELED|MULTIPLE]). This may allow the implementation to avoid allocating and later handling multiple endpoints if the user asserts that thread-parallel access is not needed on that specific window.

3) Add another info key indicating that only operations of the current thread should be flushed (something like mpi_flush_thread_local with Boolean values). This would allow the implementation to flush only the current thread's operations while still allowing the window handle to be used by multiple threads.
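A sketch of how the two keys might be set at window creation. The key names and values are the ones proposed in this RFC, not part of the MPI standard:

```c
#include <mpi.h>

void create_win_with_proposed_keys(MPI_Aint count)
{
    MPI_Info info;
    MPI_Info_create(&info);
    /* proposed key: the handle may be used by multiple threads... */
    MPI_Info_set(info, "mpi_assert_thread_access", "multiple");
    /* ...but a flush on it completes only the calling thread's ops */
    MPI_Info_set(info, "mpi_flush_thread_local", "true");

    double *base;
    MPI_Win win;
    MPI_Win_allocate(count * (MPI_Aint)sizeof(double), sizeof(double),
                     info, MPI_COMM_WORLD, &base, &win);
    MPI_Info_free(&info);
    /* ... use win ... */
    MPI_Win_free(&win);
}
```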

Currently, MPI windows are tightly coupled to the memory that was allocated or provided at creation time (ignoring dynamic windows here), and there is no functionality for duplicating a window, unlike, for example, communicators. This proposal would add MPI_Win_dup and MPI_Win_dup_with_info, which duplicate the window handle such that both the original and the resulting window provide access to the same memory. This is somewhat similar to communicator duplication, where both the parent and the duplicated communicator represent the same process group.
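One possible C binding, modeled on MPI_Comm_dup and MPI_Comm_dup_with_info; the exact signatures are my reading of the proposal, not standard text:

```c
/* Proposed: newwin aliases the memory of win and inherits its info. */
int MPI_Win_dup(MPI_Win win, MPI_Win *newwin);

/* Proposed: like MPI_Win_dup, but newwin carries the given info hints
 * (e.g., mpi_assert_thread_access, mpi_flush_thread_local). */
int MPI_Win_dup_with_info(MPI_Win win, MPI_Info info, MPI_Win *newwin);
```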

Together with the new info key mpi_assert_thread_access, it would be possible to duplicate a window multiple times and use one duplicated window per thread. AFAICS, this model would be similar to OSHMEM contexts without polluting the API. This may also solve a problem that was brought up in one of the discussions on accumulate info keys: what if different parts of the code use different combinations of accumulate operations? With duplicated windows, it is possible to use different accumulate info values in different parts of the application.
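A usage sketch of this context-like model, assuming the hypothetical MPI_Win_dup_with_info binding above and the proposed info key (whether each duplicate needs its own access epoch is one of the open semantic questions):

```c
#pragma omp parallel
{
    MPI_Info info;
    MPI_Info_create(&info);
    /* each thread promises it is the only user of its duplicate */
    MPI_Info_set(info, "mpi_assert_thread_access", "single");

    MPI_Win mywin;
    MPI_Win_dup_with_info(parent_win, info, &mywin); /* hypothetical */
    MPI_Info_free(&info);

    MPI_Win_lock_all(0, mywin);
    /* ... this thread issues Puts/Gets/Accumulates on mywin ... */
    MPI_Win_flush_all(mywin);  /* completes only this thread's ops */
    MPI_Win_unlock_all(mywin);

    MPI_Win_free(&mywin);      /* releases the duplicate handle */
}
```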

The introduction of the second info key (mpi_flush_thread_local) would allow applications to maintain two window handles for the same memory, one with process-level flush and one with thread-level flush, and to switch between the two modes whenever appropriate. I believe both info keys can be of value to applications.

The duplicated windows are not independent windows with fully overlapping memory (as you would get from calling MPI_Win_create on the same memory); instead, they are meant as aliases for the original window handle with potentially different info values and (possibly) independent completion semantics among duplicates. In particular, accumulate operations on one duplicate should be atomic with respect to accumulate operations on another window duplicated from the same parent window, provided both windows have compatible accumulate info keys. This allows different threads to safely issue accumulate operations to the same target memory address using only thread-local flush semantics.

Ideally, a flush on the parent window should flush all outstanding operations on all windows duplicated from it (unless the parent window was itself created with mpi_flush_thread_local=true, in which case only the current thread's operations issued on the duplicated windows are completed by a flush on the parent). This allows applications to mix thread- and process-level flushes without explicitly switching between windows when issuing operations, i.e., threads may issue their last operation on a window duplicated with mpi_flush_thread_local=true while the completion happens through the parent window (see the sketch below). I believe this can be useful for PGAS abstractions built on top of MPI RMA, where the exact behavior of the application's threads is unknown to the layer that calls MPI. However, the exact completion semantics between duplicated windows are open for discussion.
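A sketch of this mixed mode under the proposed (hypothetical) semantics; dup_win is assumed to have been duplicated from parent_win with mpi_flush_thread_local=true, with access epochs already open on both handles:

```c
#include <mpi.h>
#include <omp.h>

void mixed_mode_flush(MPI_Win parent_win, MPI_Win dup_win,
                      double *src, int target)
{
    #pragma omp parallel
    {
        int me = omp_get_thread_num();
        /* each thread issues its last operation on the duplicate */
        MPI_Put(&src[me], 1, MPI_DOUBLE, target,
                (MPI_Aint)me, 1, MPI_DOUBLE, dup_win);
        /* no per-thread flush needed here */
    }
    /* proposed semantics: one process-level flush on the parent
     * completes the operations issued on all duplicates, across
     * all threads */
    MPI_Win_flush(target, parent_win);
}
```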

This proposal is by no means final but merely an idea I have been sitting on for a while now. The reason I'm interested in thread-local RMA is that it seemed to be one of the roadblocks for efficient operation ordering (#10): having multiple endpoints active in one window requires a full flush on all of them to guarantee ordering at the process level. With thread-local RMA this issue could be avoided. I also believe that the two info keys allow users to provide hints that reduce the overhead of multi-threaded RMA when restricted semantics, such as thread-local completion or single-threaded access in an otherwise multi-threaded application, are sufficient.

hjelmn commented 4 years ago

I have a working prototype of MPI_Win_flush_thread. I am not sure what I think about this proposal, as it seems a lot like the defunct endpoints proposal. We should discuss this proposal and the thread-flush concept and see how best to proceed. We need something, but I am not a fan of either endpoints or the way this is handled in oshmem.

jdinan commented 4 years ago

@hjelmn An important difference from endpoints is that there are no new MPI process IDs. With respect to introducing a new object (e.g. MPI window handle or OpenSHMEM context), how else would you identify threads without tying the API to a specific threading model? My bigger concern would be whether the RMA memory model for overlapping windows is relaxed enough to allow threads to actually get decent performance out of the proposed API.

hjelmn commented 4 years ago

@jdinan The prototype only supports pthreads at the moment, but that is the dominant threading model for our systems. It should be fairly simple to add support for other threading models, as Open MPI has already added support for qthreads and Argobots. The concept is fairly simple: thread-local storage is used to assign a hardware context to the calling thread. When an MPI_Win_flush call is made, all of the hardware contexts are flushed; when MPI_Win_flush_thread is called, only the hardware context of the calling thread is flushed. If pthreads are in use we can use C11 _Thread_local storage; otherwise we can use the equivalent for the threading model in use (if necessary). For performance, the MPI implementation needs to be aware of the application's threading model anyway.
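A rough sketch of the mechanism described here, using C11 _Thread_local. The names (hw_ctx_t, hw_ctx_flush, MAX_CTX) and structure are illustrative, not Open MPI's actual internals:

```c
#include <stdatomic.h>

#define MAX_CTX 64

typedef struct { int state; /* placeholder for real context state */ } hw_ctx_t;

/* stand-in for draining a hardware context's outstanding operations */
static void hw_ctx_flush(hw_ctx_t *ctx) { (void)ctx; }

static hw_ctx_t contexts[MAX_CTX];
static atomic_int next_ctx;                 /* next free context slot */
static _Thread_local int my_ctx = -1;       /* C11 thread-local slot */

/* Lazily bind a hardware context to the calling thread on first use. */
static hw_ctx_t *get_thread_ctx(void)
{
    if (my_ctx < 0)
        my_ctx = atomic_fetch_add(&next_ctx, 1) % MAX_CTX;
    return &contexts[my_ctx];
}

/* MPI_Win_flush path: every context must be flushed. */
void flush_all_contexts(void)
{
    for (int i = 0; i < MAX_CTX; i++)
        hw_ctx_flush(&contexts[i]);
}

/* MPI_Win_flush_thread path: only the caller's context is flushed. */
void flush_my_context(void)
{
    hw_ctx_flush(get_thread_ctx());
}
```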

The prototype shows a decent improvement (last measured at ~10%) in RMA-MT performance when using MPI_Win_flush_thread instead of MPI_Win_flush, with uGNI and osc/rdma at 32 threads.

I admit that more work is needed to ensure that this interface as proposed (I haven't yet opened a bug) will be effective. Once MPI-4 is finalized, we should start the RMA WG meetings back up to discuss these issues.