mpiwg-rma / rma-issues

Repository to discuss internal RMA working group issues

Deprecate MPI_Win_fence #4

Open hjelmn opened 6 years ago

hjelmn commented 6 years ago

The intent of this issue is to start the discussion (if it hasn't already been started) on whether we should look at deprecating support for active-target RMA. Fence seems to be the obvious first one to look at for a couple of reasons:

1) I think (I may be wrong) it can be trivially replaced in user codes by the use of MPI_Win_lock_all(), MPI_Win_flush_all(), and MPI_Barrier(). In Open MPI with an RDMA-capable network, fence is effectively equivalent (with some variation with different asserts) to using these functions. The only real difference is the synchronization checks. Which leads to

2) The existence of active target (especially fence) makes it much more difficult to detect and accurately report synchronization errors.

I am sure there are other reasons but, as above, this issue is intended to start a discussion.

As for PSCW, that one is a little harder to justify deprecation. I can see how it might be useful in some algorithms. I think I may have a good replacement (topology-aware windows) which I hope to bring to the WG later this year.

jeffhammond commented 6 years ago

MPI_Win_flush_all may be O(n) from every process and thus generate O(n^2) packets. MPI_Win_fence can be a lot more efficient. Pavan proposed - if only informally - MPI_Win_all_flush_all that is collective, which would eliminate the issues with bulk synchronous passive target synchronization.
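The scaling argument above can be made concrete with a small counting model. This is only a sketch under two assumptions that do not come from the thread: a per-process flush costs one packet to each of the other n-1 ranks, and a collective completion is built from a reduce tree followed by a broadcast tree. The function names are hypothetical.

```cpp
#include <cstdint>

// Assumption: each rank's flush sends one packet to every other rank,
// so n ranks flushing independently send n * (n - 1) packets in total.
std::uint64_t flush_all_packets(std::uint64_t n) {
  return n == 0 ? 0 : n * (n - 1);
}

// Assumption: a collective completion uses a reduce tree followed by a
// broadcast tree, each costing n - 1 messages.
std::uint64_t tree_collective_messages(std::uint64_t n) {
  return n == 0 ? 0 : 2 * (n - 1);
}
```

Under these assumptions, 1024 ranks give 1,047,552 flush packets against 2,046 tree messages. Real networks may behave quite differently, for example when flush only spins on a local counter.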

hjelmn commented 6 years ago

I don't see how MPI_Win_flush_all() possibly being O(n) is an issue. If it is O(n), then the app is already communicating with O(n) peers and thus is already producing O(n^2) packets. Not sure the extra packets will make things much worse than they already are for those apps. Or am I missing something?

If the general feeling is we need another call to make us feel better then fine. If that is the path to eliminating fence then so be it :).

jeffhammond commented 6 years ago

On Blue Gene, there is likely a huge difference between what one can do with MPI_Win_flush_all; MPI_Barrier and MPI_Win_fence when communicating with all peers. For Cray, there isn't because end-to-end reliability means acks are sent to the origin anyways and shmem_quiet or MPI_Win_flush_all is merely spinning on a local counter (or something like that - it has been a while since I was working with these platforms).

In any case, the benchmark I would use to evaluate various options is an unstructured alltoallv implemented with MPI_Put. Something like the following:

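#include <mpi.h>
#include <tuple>
#include <vector>

// <input, count, target>
typedef std::tuple<void*,int,int> msg;

// comm must be the communicator that win was created over.
// Without FENCE, the caller must already hold a passive-target epoch
// on win (e.g. opened with MPI_Win_lock_all(0, win)).
void RMA_Alltoallv(std::vector<msg> const & messages, MPI_Win const & win, MPI_Comm comm)
{
  MPI_Aint disp = 0;
#ifdef FENCE
  MPI_Win_fence(0, win);   // open the access epoch
#endif
  for (auto const & m : messages) {
    auto buf = std::get<0>(m);
    auto count = std::get<1>(m);
    auto target = std::get<2>(m);
    MPI_Put(buf, count, MPI_BYTE, target, disp, count, MPI_BYTE, win);
    disp += count;
  }
#ifdef FENCE
  MPI_Win_fence(0, win);   // close the epoch
#else
  MPI_Win_flush_all(win);  // complete all puts issued by this process
  MPI_Barrier(comm);       // wait until every process has flushed
#endif
}
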
pavanbalaji commented 6 years ago

@jeffhammond let's separate out the issues here so we can handle them cleanly. So far I see the following:

  1. There's no collective flush_all, i.e., all_flush_all, so we lose some collective completion capabilities. In theory, one can get better performance with collective completion.

Argument against: I'm not sure this is very helpful for RDMA networks, since the target will not know of any remote completion and the origin has to inform it anyway. It'll be beneficial for active-message based networks, e.g., TCP/IP based networks, but I don't think we should care about optimizing such networks.

  2. Fence gives local completion for the operations that I issued, before allowing me to move to a different epoch. With unlock, we only get remote completion.

Argument against: Closing epochs might not be as important in passive-target because we have flush/flush_local. It might be important if the user mixes exclusive and shared modes, but algorithmically, the need for it seems minimal.

I agree with Nathan that detecting errors is harder with Fence, since the only way the MPI implementation can detect a completion epoch (unless the user passes an assert) is to wait and see if the user does another PUT/GET or not.

Performance wise, this also adds an additional branch to check for a request completion since, in MPICH, we do an MPI_Ibarrier inside Fence, and any PUT/GET operations have to check for its completion before they are issued.
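The detection problem described above can be illustrated with a toy epoch checker. This is a sketch, not code from MPICH or Open MPI: the class, the state names, and the simplified rules it encodes are all hypothetical, and `Status::ErrRmaSync` merely stands in for MPI_ERR_RMA_SYNC.

```cpp
// Toy model only; none of these names come from a real MPI implementation.
enum class Epoch { None, FenceMaybe, FenceActive, Passive };
enum class Status { Ok, ErrRmaSync };  // ErrRmaSync stands in for MPI_ERR_RMA_SYNC

class SyncChecker {
public:
  // A fence may close the previous epoch, open a new one, or both.
  // The checker cannot tell which until it sees what the user does next.
  Status fence() {
    if (epoch_ == Epoch::Passive) return Status::ErrRmaSync;
    epoch_ = Epoch::FenceMaybe;
    return Status::Ok;
  }

  // Every PUT/GET pays for these branches.
  Status put() {
    if (epoch_ == Epoch::None) return Status::ErrRmaSync;
    if (epoch_ == Epoch::FenceMaybe) epoch_ = Epoch::FenceActive;  // the fence did open an epoch
    return Status::Ok;
  }

  // Legal after a fence only if no RMA operation has used the fence epoch.
  Status lock_all() {
    if (epoch_ == Epoch::FenceActive || epoch_ == Epoch::Passive) return Status::ErrRmaSync;
    epoch_ = Epoch::Passive;
    return Status::Ok;
  }

  Status unlock_all() {
    if (epoch_ != Epoch::Passive) return Status::ErrRmaSync;
    epoch_ = Epoch::None;
    return Status::Ok;
  }

private:
  Epoch epoch_ = Epoch::None;
};
```

In this model a `fence()` followed directly by `lock_all()` is accepted, while `fence()`, `put()`, `lock_all()` is rejected until another `fence()` closes the epoch. The ambiguous `FenceMaybe` state is the part that would disappear if only passive-target synchronization existed.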

hjelmn commented 6 years ago

@pavanbalaji

1) Yes, not really helpful for RDMA networks. What might be a better semantic for target notification is put with immediate semantics. I believe that such semantics were proposed for the standard by Torsten. It's been a while, but I recall @jeffhammond had some objections to the addition.

I know it is a weaker argument to add in support of removing fence but it is worth noting that in other contexts fence generally is just an ordering thing. This is true both with atomics and in OpenSHMEM.

pavanbalaji commented 6 years ago

@hjelmn

Hmm. RDMA with immediate could be useful in theory. For instance, with InfiniBand, if the origin side did local bounce buffer copies, local completion is trivial. For data coming in, the target could, in theory, check for RDMA immediate buffer notifications instead of getting a separate completion message from the origin.

In practice, however, I doubt this will show any benefit because the multiple additional DMA operations for the immediate data will likely wipe out any benefit compared with a single "I am done" message from the origin at the end.

In any case, I'm agreeing with you that MPI_Win_fence might not serve any useful purpose anymore.

jeffhammond commented 6 years ago

> Fence gives local completion for the operations that I issued, before allowing me to move to a different epoch. With unlock, we only get remote completion.

No. Fence imparts remote completion. It was defined in MPI-2 when local and remote completion were equivalent. We definitely did not change it to mean local completion in MPI-3. That would break existing MPI-2 code.

jeffhammond commented 6 years ago

> It'll be beneficial for active-message based networks, e.g., TCP/IP based networks, but I don't think we should care about optimizing such networks.

Then we should do it. Ethernet is the most popular network by a large margin. I care about MPI adoption in data centers where Ethernet is deployed.

jdinan commented 6 years ago

A collective flush operation is potentially helpful to application developers as a means of determining that the calling process has received all of the data sent to it (this is what fence guarantees). However, we would need to establish that such an API routine provides a performance improvement over a local flush followed by a barrier.

Notified access (Torsten's proposal) had some semantics that were challenging to offload and IIRC the discussion had stalled there. There were a couple other proposals for notifying RMA operations (see: https://github.com/mpi-forum/mpi-issues/issues/59). These were discussed several times in the RMA WG, but the activity on this fell off when the WG became inactive. If there is interest in pursuing a proposal again, I would be happy to revive this topic.

pavanbalaji commented 6 years ago

> No. Fence imparts remote completion.

I don't think so. When a Fence completes, the origin process only knows that the data is out of its local memory, but not that it's available in the remote memory. For example, if you have overlapping windows, and you do fence on one window and lock/get/unlock on the other window, you can get old data.

> Then we should do it. Ethernet is the most popular network by a large margin. I care about MPI adoption in data centers where Ethernet is deployed.

Apart from the fact that Ethernet does not necessarily mean TCP/IP, which I'll ignore for this discussion, my point is that when all the PUT/GET operations are going over active messages, the relative performance impact of doing FLUSH_ALL/BARRIER instead of FENCE would be negligible.

jeffhammond commented 6 years ago

From MPI 3.1 Section 11.5:

The MPI call MPI_WIN_FENCE(assert, win) synchronizes RMA calls on win. The call is collective on the group of win. All RMA operations on win originating at a given process and started before the fence call will complete at that process before the fence call returns. They will be completed at their target before the fence call returns at the target.

So at this point, Pavan is right. However, a few lines down, it says this:

A fence call usually entails a barrier synchronization: a process completes a call to MPI_WIN_FENCE only after all other processes in the group entered their matching call. However, a call to MPI_WIN_FENCE that is known not to end any epoch (in particular, a call with assert equal to MPI_MODE_NOPRECEDE) does not necessarily act as a barrier.

So I contend that MPI_Win_fence imparts remote completion with assert=0 unless somebody can convince me that "fence call usually entails a barrier synchronization" is actually false, in which case we should consider removing it from the standard.

hjelmn commented 6 years ago

@jeffhammond Pavan is correct. When fence returns two things are true:

1) Any operations started before the fence by remote ranks targeting my process will be locally complete (remote completion for them).

2) Any operations started by my process are locally complete.

So it is slightly weaker than remote completion.

The exact wording:

"MPI_Win_fence synchronizes RMA calls on win. The call is collective on the group of win. All RMA operations on win originating at a given process and started before the fence call will complete at that process before the fence call returns. They will be completed at their target before the fence call returns at the target. RMA operations on win started by a process after the fence call returns will access their target window only after MPI_Win_fence has been called by the target process."

hjelmn commented 6 years ago

Hah, oops. @jeffhammond Posted just as I posted :)

pavanbalaji commented 6 years ago

> So I contend that MPI_Win_fence imparts remote completion with assert=0 unless somebody can convince me that "fence call usually entails a barrier synchronization" is actually false, in which case we should consider removing it from the standard.

No, that's not true either. If the FENCE is not an opening epoch, a process needs to know that all PUTs/GETs for which it is the target have been deposited in its memory. That's the reason it needs to know that the remote fence has been called. However, it doesn't actually need to know that the PUTs/GETs that it has issued have been deposited into the target memory.

So, the text that you quoted from the MPI standard is correct, although I agree that it's misleading.

To answer your question about whether the MPI implementation knows -- yes, in MPICH we keep track of the fact that this is the first FENCE and if it is the first FENCE, we simply issue the ibarrier.

jeffhammond commented 6 years ago

@pavanbalaji So how does a user know when remote PUT/GET are complete using only FENCE synchronization?

pavanbalaji commented 6 years ago

They don't. They'll need to do a BARRIER after the FENCE for that information.

pavanbalaji commented 6 years ago

I should also point out that FENCE guarantees that a GET in the next epoch will get the correct data from the previous epoch. So if you are accessing the data only using FENCE epochs, you don't need the additional barrier.

However, if you want to mix FENCE and LOCK/UNLOCK, for example, then you need the extra barrier.

jeffhammond commented 6 years ago

So FENCE orders RMA without remote completion and is thus equivalent to every rank calling a function like shmem_fence? So this program works?

a=0
PUT(1->win)
FENCE(win)
GET(a<-win)
// a is now 1

But this is wrong?

a=0
PUT(1->win)
FENCE(win)
SENDRECV(a) // target of PUT sends window buffer back to origin
// a may not be 1

If so, then we really need to remove "fence call usually entails a barrier synchronization" and do a much better job of explaining ourselves.

hjelmn commented 6 years ago

@jeffhammond One small change:

a=0
PUT(1->win)
FENCE(win)
GET(a<-win)
FENCE(win)
// a is now 1

but otherwise I believe that is correct.

hjelmn commented 6 years ago

@jeffhammond As for the sendrecv example, that will also get a == 1, as the remote side had to finish the fence to get to the send.

hjelmn commented 6 years ago

I think the example @pavanbalaji had in mind was:

a=0
PUT(1->win)
FENCE(win)
LOCK(win)
GET(a<-win)
UNLOCK(win)

In that case a may be 0 or 1.
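The difference between the FENCE-only case and the mixed FENCE plus LOCK case can be sketched with a single-process toy model. This is not MPI code and the class and method names are hypothetical. It assumes that a put only becomes visible at the target once the target's fence has returned, and for determinism it always shows the stale outcome where real MPI allows either value.

```cpp
#include <vector>

// Toy model (not MPI): one int of window memory per rank.
struct PendingPut { int target; int value; };

class ToyWindow {
public:
  explicit ToyWindow(int nranks) : mem_(nranks, 0) {}

  // PUT: the value leaves the origin but is only "in flight".
  void put(int target, int value) { in_flight_.push_back({target, value}); }

  // The fence call returning at `rank`: every put that targets `rank`
  // is applied to its memory. Puts aimed at other ranks stay pending.
  void fence_returns_at(int rank) {
    std::vector<PendingPut> still_pending;
    for (const PendingPut& p : in_flight_) {
      if (p.target == rank) mem_[p.target] = p.value;
      else still_pending.push_back(p);
    }
    in_flight_ = still_pending;
  }

  // A read that does not wait for the target's fence
  // (stands in for LOCK/GET/UNLOCK).
  int passive_get(int target) const { return mem_[target]; }

private:
  std::vector<int> mem_;
  std::vector<PendingPut> in_flight_;
};
```

With two ranks, `put(1, 1)` followed by `fence_returns_at(0)` leaves `passive_get(1)` at 0 in this model. Once `fence_returns_at(1)` has run, the same read yields 1, which mirrors why a GET in the next fence epoch, or a send issued by the target after its own fence, sees the new value.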

pavanbalaji commented 6 years ago

Right. The second example that @hjelmn gave was what I had in mind, when I said that the result might be 0 or 1 (or really any other value, since it's not atomic). This is particularly true when you give the MPI_MODE_NOCHECK hint to lock, since at least MPICH will completely ignore the lock call when that happens and proceed with the next GET operation.

pavanbalaji commented 6 years ago

This is correct, btw:

a=0
PUT(1->win)
FENCE(win)
SENDRECV(a) // target of PUT sends window buffer back to origin
// a must be 1

That's because the target process had to call FENCE too.

hjelmn commented 6 years ago

@jdinan Do you have an opinion on the deprecation of MPI_Win_fence? I think we discussed this a number of forums ago but I can't recall whether you were in favor of it.

jdinan commented 6 years ago

If you're going to deprecate MPI_Win_fence it seems to me like you ought to just deprecate all of active target RMA, since fence is supposedly what gets used most. 🤷‍♂️

I'm not sure I've formed an opinion yet on the pro/con of deprecation. On one hand, I think that MPI_Win_fence actually gets used and the original motivations for the active target model are still valid. On the other, if we can support the same model while removing some of the complexity involved in implementing and supporting RMA, that's a good thing.

jeffhammond commented 6 years ago

@jdinan PSCW with MPI_Accumulate is the best approximation to MPI_Recv_reduce. I don't want to lose that.

jdinan commented 6 years ago

@jeffhammond You could just as well call MPI_Comm_create_group from the pair of processes and then use MPI_Reduce with MPI_IN_PLACE at the receiver to achieve the same thing.

jeffhammond commented 6 years ago

@jdinan You are right. While we're at it, we should deprecate Send-Recv and tell everybody to use MPI_Neighborhood_alltoallw instead 😝

More seriously, this was discussed in the past, but it isn't sufficient because of ANY_SOURCE and tags. Also, if I am using PSCW+Accumulate for halo exchanges, I may need to create a very large number of communicators.

jdinan commented 6 years ago

@jeffhammond If you are doing halo exchange then active target is the right model anyway, so why are we considering deprecating it?

How are you simulating ANY_SOURCE and tags with PSCW+Accumulate? Every process in the group passed to MPI_Win_post needs to call MPI_Win_start and there are no tags involved. I'm not seeing how reduce isn't an equivalent solution to PSCW+Accumulate.

hjelmn commented 6 years ago

@jdinan Then maybe a collective flush + neighborhood aware windows is a good alternative. I could see this as a good way to do a halo exchange:

put_to_neighbors();
MPI_Win_iall_flush_all ();
do_work();
MPI_Wait();

I know this is mixing the concepts of non-blocking synchronization and a collective flush. Seems like a clean way to handle situations where active target might be beneficial.

jeffhammond commented 6 years ago

@hjelmn We first need MPI_Win_iflush and related nonblocking equivalents of the existing MPI-3 synchronization functions. The MPICH team has a paper on that already but they haven't brought forward a proposal to the Forum that I recall.

Note that somebody should trademark iFlush just in case Apple is planning to release a toilet that runs iOS 😆

jeffhammond commented 6 years ago

@jdinan Sketch the implementation of PRK transpose (B+=A using 1D distribution) using MPI_Recv_reduce, PSCW Accumulate, and MPI_Comm_create_group+MPI_Reduce and tell me if you think what you are proposing is reasonable.

hjelmn commented 6 years ago

@jeffhammond To be clear, I want to eliminate PSCW as well. I want to see if all-passive RMA + some existing and new functions will work as an effective replacement.

hjelmn commented 6 years ago

@jeffhammond I agree. We probably should move on getting the non-blocking synchronization into the standard. Will have to see where that is during the June meeting. Will have to make sure there will be RMAWG time then.

jeffhammond commented 6 years ago

@hjelmn PSCW and FENCE were both good ideas. In practice, PSCW is unused because Send-Recv meets the need and is more optimized in implementations. FENCE is a true implementation of BSP but, as we've demonstrated here, is specified in a rather confusing way. While I understand the sentiment (particularly from an implementer) to deprecate everything but passive target, I do not like the precedent it sets. It basically says that we should deprecate features because we've failed as a community to specify and implement them in a way that actually helps users. By that standard, there are many other parts of MPI that we should deprecate 😉

What I'd prefer to do is add an info key that allows users to specify which synchronization modes will (not) be used for a given window (as well as a key to assert something about overlapping windows). I think that will address the implementation issues, since you can then focus on optimizing for the passive-only case and ignore everything else. I don't think it is a burden to maintain existing RMA code that supports the other synchronization modes.

hjelmn commented 6 years ago

@jeffhammond While I agree we don't want to set a precedent, I think we should still consider getting rid of active target. It probably should be discussed when most of us are in the same room in June.

And, yes, it is a bit of a burden to support active target. The code to check for synchronization errors is a nightmare. Then there is trying to figure out when passive target is ok after a fence. It isn't too terrible but it limits performance.

jeffhammond commented 6 years ago

@hjelmn But didn't you already write that code? What new code has to be written here? I want to understand the cost to you versus the users whose code will be broken by PSCW/FENCE removal.

hjelmn commented 6 years ago

@jeffhammond The code is already written but every line of code comes with a support cost. Being able to simplify the standard greatly reduces the support cost of maintaining that code. It also will improve the performance of the passive-target paths as we can remove a bunch of checks trying (and never quite achieving) to detect incorrect synchronization. I know we don't have to detect this but a high quality implementation should do its best to detect synchronization errors and report MPI_ERR_RMA_SYNC. The cost of the synchronization checks in Open MPI on an Aries network with small message latency is ~ 10-20%.

hjelmn commented 6 years ago

I should also add from experience at LANL that code developers often find (and use) MPI_Win_fence () in their codes when passive-target + non-blocking collectives perform much better. Maybe MPI_Win_ifence() would be able to compete but there is no reason for it in any of the cases I have seen. What I really want to provide is MPI_Neighbor_ireduce().

jeffhammond commented 6 years ago

@hjelmn Code isn't like fruit. It doesn't rot if you leave it alone for years. If you want to reduce your support burden, propose to deprecate Fortran support.

jdinan commented 6 years ago

https://en.wikipedia.org/wiki/Software_rot

jeffhammond commented 6 years ago

@jdinan None of the examples given there are applicable to RMA implementations.

hjelmn commented 6 years ago

@jeffhammond Considering I have broken PSCW in Open MPI multiple times now I think software rot is very relevant. Fence is more stable but as has already been brought up, on high performance networks it is not clear there is an advantage to it. It does, however, increase the complexity of the RMA implementation.

jeffhammond commented 6 years ago

@hjelmn If you break your code during refactoring, that is not code rot. Is it really your intent to deprecate the BSP programming model from MPI because your job as an MPI implementer isn't trivial? Why don't we first go to WG14 and propose to deprecate every nasty feature of ISO C that has ever caused us pain?

This has devolved into a truly stupid discussion. "Programming is hard" is not a reason to deprecate anything from any standard. The RMA text clearly needs improving. Propose that.

hjelmn commented 6 years ago

@jeffhammond It's not refactoring. It's changing unrelated paths that just happen to cause PSCW to break. That fits the definition of code rot. I get it, you do not like this proposal. I still intend to move forward with proposing the end to active target. I mainly started this discussion to see what others thought about the idea.

Also, this is not deprecating BSP at all. RMA is just not the place to implement BSP. The only exception I see is maybe an MPI_Win_all_flush_all() call. But even then we have MPI_Barrier() and many other calls to support BSP with RMA.

jeffhammond commented 6 years ago

From The implementation of MPI-2 one-sided communication for the NEC SX-5:

Fence The fence is a collective operation involving all processes sharing a window object. An MPI_Win_fence call closes a preceding epoch and opens a new epoch. In the new epoch each process is exposed to, and has access to, all other processes. When an epoch is closed by a fence, all RMA requests and corresponding target actions must be guaranteed to have completed at origin and target. The fence mode of operation corresponds closely to the BSP model, and is appropriate for applications where each process may need to access memory of many other processes.

(emphasis mine)

hjelmn commented 6 years ago

Except that paper was published before MPI-3. Now we have MPI_Win_lock_all and MPI_Win_flush_all. Any code using fence can replace it with MPI_Win_flush_all + MPI_Barrier without changing the semantics.