mpiwg-rma / rma-issues

Repository to discuss internal RMA working group issues

we need private RMA for multithreaded use cases #23

Open jeffhammond opened 2 years ago

jeffhammond commented 2 years ago

Problem

How does one do remote completion in a multi-threaded application? It's impossible, because one cannot do a flush on one thread at the same time as an RMA op on another thread. This is not a theoretical problem, as it has been seen by users:

If we assert one can do remote completion in a multi-threaded application with the current features, then we need to add text to this effect, so that it's clear that Open MPI is incorrectly blaming user programs. @hjelmn
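
For illustration, a minimal sketch (not from the original report) of the pattern in question: one thread issues an RMA operation while another thread concurrently calls MPI_Win_flush on the same window. It assumes MPI_THREAD_MULTIPLE and a window that has already been locked with MPI_Win_lock_all.

#include <mpi.h>
#include <stdint.h>

/* Sketch only: whether these two functions may run concurrently on the
   same window is exactly what this issue is about. */
void producer_thread(MPI_Win win, int target, const uint64_t *src)
{
    /* RMA operation issued from one thread */
    MPI_Put(src, 1, MPI_UINT64_T, target, 0, 1, MPI_UINT64_T, win);
}

void completer_thread(MPI_Win win, int target)
{
    /* Remote completion attempted concurrently from another thread */
    MPI_Win_flush(target, win);
}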

Solution

Request-based remote completion, which I proposed a decade ago. This means we add the following functions, which take two request arguments, one for local completion, and one for remote completion. For completeness, we should make it legal to pass MPI_REQUEST_NULL when these are not needed.

The new functions would be:

MPI_Rrput(.., MPI_REQUEST_NULL, MPI_REQUEST_NULL) behaves like MPI_Put.
MPI_Rrput(.., &request, MPI_REQUEST_NULL) behaves like MPI_Rput.
MPI_Rrput(.., MPI_REQUEST_NULL, &request) can be locally completed with a local flush.
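
For concreteness, a hedged sketch of how the proposed call might be used. MPI_Rrput is hypothetical and not part of any MPI standard, and the exact argument order is an assumption.

#include <mpi.h>
#include <stdint.h>

void example(MPI_Win win, int target, const uint64_t *src)
{
    MPI_Request local_req, remote_req;

    /* Hypothetical: same arguments as MPI_Rput plus a second request.
       The first request tracks local completion, the second tracks
       remote completion. */
    MPI_Rrput(src, 1, MPI_UINT64_T, target, 0, 1, MPI_UINT64_T, win,
              &local_req, &remote_req);

    MPI_Wait(&local_req, MPI_STATUS_IGNORE);   /* origin buffer reusable */
    MPI_Wait(&remote_req, MPI_STATUS_IGNORE);  /* data visible at target */
}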

devreal commented 2 years ago

Locks, fence, and PSCW from multiple threads don't work. However, I believe it is legal to call MPI_Win_lock_all right after window creation and then use MPI_Put|Get + MPI_Win_flush in concurrent threads. The operations and flushes will be executed in some order, the only constraint being that each flush must wait for all operations previously issued by the process. This has worked for me in the past.

This is not ideal and I would like to see a way for thread-scope synchronization (instead of today's process-scope). I had opened https://github.com/mpiwg-rma/rma-issues/issues/13 but I'm sure we can have a more elegant solution than info keys and window duplication...

tschuett commented 2 years ago

I like the idea of OSHMEM contexts. But what is the meaning of "local" and "thread" with, e.g., Argobots? It could be an add-on to the existing APIs. What is the meaning of a thread when an Argobots thread migrates to a different OS thread?

jeffhammond commented 2 years ago

@devreal If that's the case, then Open MPI is broken for multithreaded RMA, and it needs to stop issuing an error about incorrect synchronization usage.

devreal commented 2 years ago

@jeffhammond is there an open issue for it in OMPI? What version of OMPI? A reproducer? I tried the following code and it works with both the generic and the UCX backend:

MPI_Win_lock_all(0, win);

#pragma omp parallel
{
  /* Each thread issues its own atomic op and then flushes the target;
     win, size, and NUM_REPS are assumed to be defined by the
     surrounding code. */
#pragma omp for
  for (int k = 0; k < NUM_REPS; ++k) {
    uint64_t res;
    int target = k % size;
    uint64_t val = 1;
    MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
  }
} // omp parallel

devreal commented 2 years ago

I never understood why request-based operations don't provide remote completion. I guess there were reasons over a decade ago... I think this would be a good addition to RMA. I'm not sure it can entirely solve the problem of thread-scope flushes though because tracking requests for large numbers of operations is potentially costly.

devreal commented 2 years ago

@tschuett The duplicated window handles I proposed can be generalized to single-threaded contexts, without the need for binding their resources to any particular thread and without blowing up the API like shmem contexts did.

jeffhammond commented 2 years ago

@devreal The links above both report user problems. I guess you can reproduce with Kokkos Remote Spaces, but I haven't had time to do that.

jeffhammond commented 2 years ago

@devreal request-based remote completion was rejected because the hardware people didn't like it, and the use cases weren't strong.

tschuett commented 2 years ago

But e.g. MPI_Alloc_Contexts(win, &contexts, 5); should be easy to add.

MPI_Put_with_Context(win, contexts[3], ...);
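
To make the idea concrete, a hedged sketch of how such a per-thread context API might look in use. MPI_Context, MPI_Alloc_Contexts, MPI_Put_with_Context, and MPI_Flush_Context are all hypothetical names, not existing MPI calls.

#include <mpi.h>
#include <omp.h>

void example(MPI_Win win, const void *src, int target, int nthreads)
{
    MPI_Context *contexts;                        /* hypothetical type */
    MPI_Alloc_Contexts(win, &contexts, nthreads); /* hypothetical call */

    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        /* Hypothetical: the operation and the flush are scoped to
           contexts[t], so this flush does not complete other threads'
           outstanding operations. */
        MPI_Put_with_Context(win, contexts[t], src, 1, MPI_BYTE,
                             target, 0, 1, MPI_BYTE);
        MPI_Flush_Context(win, contexts[t], target);
    }
}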

tschuett commented 2 years ago

Another attempt: MPI 4.0 gained Partitioned Communication, i.e., multi-threaded message passing. There is already research on Partitioned Collectives, i.e., multi-threaded collectives. Maybe there is space for Partitioned RMA? The universe of threads goes through phases/epochs, and within an epoch there are no overlapping operations.

devreal commented 2 years ago

For the record: https://github.com/kokkos/kokkos-remote-spaces/pull/51 is a non-issue, OMPI was correct in complaining when window locks are used in concurrent threads. Window locks are not thread-safe when concurrently accessing the same target.

Still, the issue of multi-threaded RMA is real and we should talk about whether we want explicit contexts and/or remote-completing request-based ops :+1:

jdinan commented 2 years ago

This isn't impossible. The solution today is to create a window for each thread, with the same window buffer. An advantage of this approach is that it allows threads to use the flush operations, which could be more efficient for some usage models versus fine-grain remote completion as proposed here.
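
For reference, a minimal sketch of that workaround, assuming the number of threads is known when the windows are created: every process creates one window per thread over the same buffer, and thread t then does all of its operations and flushes on wins[t].

#include <mpi.h>
#include <stdlib.h>

MPI_Win *create_per_thread_windows(void *buf, MPI_Aint size, int nthreads,
                                   MPI_Comm comm)
{
    MPI_Win *wins = malloc(nthreads * sizeof(MPI_Win));
    for (int t = 0; t < nthreads; ++t) {
        /* Collective: every process must create the same number of
           windows in the same order. All windows expose the same buffer. */
        MPI_Win_create(buf, size, 1, MPI_INFO_NULL, comm, &wins[t]);
        MPI_Win_lock_all(0, wins[t]);
    }
    return wins;
}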

devreal commented 2 years ago

That only works with a fixed number of threads known a priori. Dynamically adding threads is impossible due to the collective nature of MPI_Win_create.

My squabble with flush (and why I like remote-completing rput/raccumulate) is that flush is blocking, potentially depending on remote progress. The problem with rput/raccumulate, as I mentioned above (and maybe what @jdinan refers to), is that each individual operation requires remote completion, which may be costly for large numbers of operations. To consolidate the two, a nonblocking flush would provide both: a single request with coarse remote-completion semantics. I'm sure this has been discussed before and I'm curious why it was rejected.
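
A hedged sketch of what that could look like; MPI_Win_iflush is hypothetical and not part of any MPI standard.

#include <mpi.h>

void example(MPI_Win win, int target)
{
    MPI_Request req;

    /* ... issue a batch of MPI_Put/MPI_Accumulate to 'target' here ... */

    /* Hypothetical: the request completes once all previously issued
       operations to 'target' are remotely complete, without blocking
       the calling thread. */
    MPI_Win_iflush(target, win, &req);

    /* Overlap other work, then wait for coarse remote completion. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}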

jdinan commented 2 years ago

It is only efficient with a fixed number of threads. Dynamically adding threads may require some threads to share a given window from a pool of windows created ahead of time. This could have performance consequences, but not correctness ones. What you really want is something like an OpenSHMEM context or maybe an aggregate handle. MPI made the unfortunate choice of tying memory registration/exposure together with the synchronization/memory models. We could introduce something like an MPI_Win_dup_local for situations like this where you want to tease them apart.
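
A hedged sketch of how MPI_Win_dup_local might be used; the call is hypothetical, and the assumption is that it is non-collective and reuses the parent window's registration while giving the caller a separate handle with its own completion scope.

#include <mpi.h>

void worker(MPI_Win parent, int target, const void *src)
{
    MPI_Win mywin;

    MPI_Win_dup_local(parent, &mywin);  /* hypothetical, non-collective */
    MPI_Win_lock_all(0, mywin);

    MPI_Put(src, 1, MPI_BYTE, target, 0, 1, MPI_BYTE, mywin);
    MPI_Win_flush(target, mywin);       /* completes only this handle's ops */

    MPI_Win_unlock_all(mywin);
    /* How a local dup would be freed (collectively or not) would also
       need to be defined. */
}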

tschuett commented 2 years ago

If windows are long-lived, you could at any time do, say, MPI_Win_Alloc_Context(win, &next_context, 1). If I have to choose between duplicating windows to get handles or asking the window for another handle, then I would prefer the second option.

jdinan commented 2 years ago

@tschuett In the RMA memory model, each window context/dup would have overlapping-window semantics, and the object you get back from MPI_Win_alloc_context or MPI_Win_dup_local would be of type MPI_Win so that it is usable with the RMA routines. Do you see places where the semantics would differ, or are these just two names for the same concept?