mpi-forum / mpi-issues

Tickets for the MPI Forum
http://www.mpi-forum.org/

deprecate (with the intent to delete) BSEND and everything related to it #282

Closed jeffhammond closed 1 year ago

jeffhammond commented 4 years ago

Problem

Buffered send is a pointless exposure of implementation details. Nobody should use it.

The rationale for buffered mode in Section 3.6 indicates that it is basically equivalent to the eager mode that many implementations already use, which works without any input from the user.

Rationale. There is a wide spectrum of possible implementations of buffered communication: buffering can be done at sender, at receiver, or both; buffers can be dedicated to one sender-receiver pair, or be shared by all communications; buffering can be done in real or in virtual memory; it can use dedicated memory, or memory shared by other processes; buffer space may be allocated statically or be changed dynamically; etc. It does not seem feasible to provide a portable mechanism for querying or controlling buffering that would be compatible with all these choices, yet provide meaningful information. (End of rationale.)
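
For reference, the user-visible machinery in question is the explicit staging buffer that buffered mode pushes onto the application. A minimal sketch of the pattern (illustrative only; the function name and fixed buffer size are made up, error handling omitted):

    #include <mpi.h>
    #include <stdlib.h>

    /* Canonical buffered-send lifecycle: the application, not the library,
     * allocates, attaches, and later reclaims the staging buffer. */
    void bsend_once(const double *data, int count, int dest, MPI_Comm comm)
    {
        /* A generous fixed size; careful sizing would use MPI_Pack_size
         * plus MPI_BSEND_OVERHEAD per pending message. */
        int size = 1 << 20;
        void *buf = malloc(size);
        MPI_Buffer_attach(buf, size);

        /* The message is copied into the attached buffer before this returns. */
        MPI_Bsend(data, count, MPI_DOUBLE, dest, /* tag */ 0, comm);

        /* Detach blocks until all buffered messages have been delivered,
         * then hands the buffer back to the application. */
        void *detached;
        int detached_size;
        MPI_Buffer_detach(&detached, &detached_size);
        free(detached);
    }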

Proposal

Remove all references to buffered mode in the standard.

Changes to the Text

Deprecate the following functions and constants:

  • MPI_Bsend
  • MPI_Ibsend
  • MPI_Buffer_attach
  • MPI_Buffer_detach
  • MPI_BSEND_OVERHEAD

Mark the following as pertaining to deprecated features:

Impact on Implementations

Less work for new implementations. Reduces maintenance and testing burden of existing implementations.

Impact on Users

I cannot find a single application that uses buffered mode. All of the references I found on GitHub are wrappers around MPI.

References

N/A

RolfRabenseifner commented 4 years ago

Dear Jeff,

to me, this looks like a weird proposal.

But I may have completely misunderstood it: do you want to remove or deprecate

  • MPI_Bsend
  • MPI_Ibsend
  • MPI_Buffer_attach
  • MPI_Buffer_detach
  • MPI_BSEND_OVERHEAD ?

Of course, we have users using MPI_Bsend.

If I understand correctly, they decided to substitute complex non-contiguous data accesses with the implicit data copying done by MPI_Bsend, instead of relying on the internal data copying of a nonblocking MPI_Isend.

I expect that nobody wants to guarantee that the Isend version is faster on any given hardware and MPI library on that hardware for any non-contiguous derived datatype and count argument.
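
A sketch of this substitution, assuming a strided layout built with MPI_Type_vector (the names matrix, n, and ld are illustrative, not taken from any application):

    #include <mpi.h>

    void send_strided(const double *matrix, int n, int ld,
                      int dest, int tag, MPI_Comm comm)
    {
        MPI_Datatype col;
        MPI_Type_vector(n, 1, ld, MPI_DOUBLE, &col);   /* n blocks of 1, stride ld */
        MPI_Type_commit(&col);

        /* Buffered variant: the strided data is copied out (into a buffer
         * previously attached with MPI_Buffer_attach) before MPI_Bsend
         * returns, so matrix may be modified immediately afterwards. */
        MPI_Bsend(matrix, 1, col, dest, tag, comm);

        /* Nonblocking variant: any packing is internal to the library, and
         * matrix must stay untouched until the request completes. */
        MPI_Request req;
        MPI_Isend(matrix, 1, col, dest, tag + 1, comm, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Type_free(&col);
    }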

I cannot find a single application that uses buffered mode. All of the references I found on GitHub are wrappers around MPI.

This is clearly the wrong way to analyse what MPI users do in proprietary software. We have made this kind of misinterpretation of what people are doing before, in other areas (e.g. C++).

Best regards Rolf

sg0 commented 4 years ago

Despite the obvious caveats around BSend, certain people doing distributed-memory graph analytics like it: there are algorithms in which a number of vertices can be buffered before being sent out to a target process (as I understand it, this is mostly a matter of convenience for them). https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6009071

jsquyres commented 4 years ago

@jeffhammond How about just deprecating bsend?

The first step is admitting you have a problem. Then there's only 11 more steps...

RolfRabenseifner commented 4 years ago

Deprecating is only the beginning of removing it. It does not help if the user wants to continue to use Bsend.

hjelmn commented 4 years ago

I have to agree with Jeff. In the case of Open MPI, small sends (both contiguous and non-contiguous) are buffered without the user needing to know about the implementation details. Adding a user-defined buffer gets the user absolutely nothing while increasing the testing surface and the complexity of the implementation.

FWIW, this function has been on my to-kill list for a while now. Deprecating it makes a lot of sense.

wgropp commented 4 years ago

BSEND was one of the routines that resulted from user feedback after the draft MPI-1.0 specification. Valid MPI implementations have set the eager limit (itself not part of the MPI standard) at 0 bytes. While we can (and should!) discuss the best solution for the users, simply removing Bsend doesn't address the underlying issue, which is that MPI (the standard) doesn't guarantee any buffering and previous efforts to give the user some way to set and/or discover whether the implementation offers any buffering have not succeeded.

jeffhammond commented 4 years ago

@wgropp Is your comment not ultimately about quality-of-implementation issues? If we don't trust that there will be good implementations of MPI send-recv, and we work around that by adding Bsend to the standard, what else do we need to add to the standard to compensate?

jeffhammond commented 4 years ago

@jsquyres @RolfRabenseifner The word "remove" was intended to suggest the process of deprecation with eventual deletion but since this is the MPI Forum and we must argue about the meaning of words instead of the substance of things, I have made the title of the ticket extremely explicit.

jeffhammond commented 4 years ago

@sg0 I am very curious what happens to the performance of such codes if you replace every instance of Bsend with Send. My suspicion is zero impact.

sg0 commented 4 years ago

In the particular case I was talking about, the program would deadlock if Bsend were replaced with Send :)

pavanbalaji commented 4 years ago

I generally dislike MPI_Bsend, but it does have a somewhat neat feature that, if the user ensures that the amount of data sent is within the attached buffer size, then the send operation is guaranteed to complete immediately without waiting for the remote process. MPI_Send does not give that guarantee.

Having said that, it has a major shortcoming: the buffer is shared by all communicators (and thus by other software libraries in my application). During the MPI-2.2 discussion, I brought up a proposal (I can try to dig it up) that provides a new routine called MPI_Comm_buffer_attach, which makes such buffers communicator-specific and would allow enforcing such guarantees more cleanly. But everyone hated MPI_Bsend so much that the proposal was dead on arrival.

jeffhammond commented 4 years ago

Sayan: ok then it can use nonblocking send and immediately free the request, which is legal. I recall some discussion of this pattern with Dries years ago at Argonne.

Edit: This comment is incorrect for non-zero message sizes.
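
For reference, a sketch of that fire-and-forget pattern, with the caveat from the edit spelled out in the comments:

    #include <mpi.h>

    /* Legal MPI: start a nonblocking send and free the request immediately.
     * As noted in the edit above, this is NOT equivalent to MPI_Bsend for
     * non-zero message sizes: without the request there is no way to learn
     * when the send has completed, so the caller must not modify or free
     * buf until completion can be inferred by some other means. */
    void isend_and_forget(const void *buf, int count, MPI_Datatype type,
                          int dest, int tag, MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Isend(buf, count, type, dest, tag, comm, &req);
        MPI_Request_free(&req);
    }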

jeffhammond commented 4 years ago

Pavan: an info assert on the Comm accomplishes the same thing, or even better, since the library can internally allocate a pinned/registered buffer, unlike the user (except via alloc_mem with an info assert). Why would we ever want MPI users to allocate what should be internal state?

But the response to your proposal should tell us all something about the relevance of Bsend in practice.

bosilca commented 4 years ago

@jeffhammond what you describe is a different pattern. In the original example provided by @sg0, one could modify the send buffer right after the return from MPI_Bsend and the data sent would still be the original data. In your example, there is no such guarantee.

pavanbalaji commented 4 years ago

Pavan: an info assert on the Comm accomplishes the same thing, or even better, since the library can internally allocate a pinned/registered buffer, unlike the user (except via alloc_mem with an info assert). Why would we ever want MPI users to allocate what should be internal state?

No, it won't, because the info cannot guarantee that that much memory has been allocated. So the user still has to assume that the MPI library might have to synchronize with the receiver.

But the response to your proposal should tell us all something about the relevance of Bsend in practice.

Agreed. I'm just pointing out what you'd lose by dropping MPI_Bsend. Whether that's useful or not is a separate discussion.

wgropp commented 4 years ago

Replying to Jeff above. No, this is absolutely not a quality-of-implementation issue. As @sg0 points out, applications that use MPI_Send instead of MPI_Bsend may deadlock, and this is in conformance with the MPI standard. This is why users asked for MPI_Bsend - it wasn't something that the committee originally provided.
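
As an illustration (not from any particular application), consider a symmetric exchange between two ranks: with MPI_Send it is only correct if the implementation happens to buffer the messages, whereas MPI_Bsend completes locally by construction:

    #include <mpi.h>
    #include <stdlib.h>

    /* Both ranks send before they receive.  With MPI_Send this may deadlock,
     * and that is a standard-conforming outcome, because MPI guarantees no
     * buffering.  With MPI_Bsend (and a sufficiently large attached buffer)
     * both sends complete locally and the receives always match. */
    void exchange(const double *out, double *in, int count, int peer, MPI_Comm comm)
    {
        int pack_size;
        MPI_Pack_size(count, MPI_DOUBLE, comm, &pack_size);
        int buf_size = pack_size + MPI_BSEND_OVERHEAD;
        void *buf = malloc(buf_size);
        MPI_Buffer_attach(buf, buf_size);

        MPI_Bsend(out, count, MPI_DOUBLE, peer, 0, comm);   /* completes locally */
        /* MPI_Send(out, count, MPI_DOUBLE, peer, 0, comm);    may block here    */
        MPI_Recv(in, count, MPI_DOUBLE, peer, 0, comm, MPI_STATUS_IGNORE);

        void *detached;
        int detached_size;
        MPI_Buffer_detach(&detached, &detached_size);
        free(detached);
    }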

The issue with assuming some buffering is that then there is the question of how much buffering is available. That in turn brings up the question of the implementation model - is the buffering per communicator? Job? connection? Dependent on in-progress messages? The Bsend definition provides an answer to these, by having the user provide the buffer. As Pavan points out, other models might be more in line with other MPI choices.

We could decide that this is something that should be handled with libraries built on top of MPI - that's reasonable, but then there needs to be good implementations for the users that are currently using Bsend.

jeffhammond commented 4 years ago

@pavanbalaji Sorry, I assumed https://github.com/mpi-forum/mpi-standard/pull/14 would give us meaningful info keys to require the implementation to do what the user says. What you want with MPI_Comm_buffer_attach is better implemented as an info key on the communicator requiring a certain eager buffer limit, because this means that small message use cases require zero change in the implementation behavior, and the buffer memory is whatever the implementation knows is best.
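
A sketch of what that could look like; the key name below is hypothetical and is not defined by any MPI standard or implementation:

    #include <mpi.h>

    /* Hypothetical: ask the implementation to buffer sends up to 64 KiB
     * internally on this communicator.  "mpi_minimum_eager_threshold" is a
     * made-up key standing in for whatever key such a proposal would define;
     * the library would choose (and register) the memory itself. */
    void request_eager_buffering(MPI_Comm comm)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "mpi_minimum_eager_threshold", "65536");
        MPI_Comm_set_info(comm, info);
        MPI_Info_free(&info);
    }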

VictorEijkhout commented 4 years ago

A while ago I decided to investigate whether derived datatypes carry a performance penalty. After a lot of experimenting, it seems that the main penalty is MPI's internal buffering: I was getting much lower performance than with manually copying and sending a contiguous buffer.

Notable exception: buffered send of a derived type, because there the buffer is in user space (ish) and so does not have to be allocated and freed (and re-allocated and re-freed in the next iteration) on the spot. Buffered sends performed as well as gather & send. All the other schemes (with one exception) were considerably worse.

An info key for eager sends does not solve that, I think. Besides, I don't care about eager-ness: I care about MPI keeping its buffer around.

In other words: I disagree.

hjelmn commented 4 years ago

That is an implementation detail. I don't believe there is such a penalty with Open MPI unless the send is larger than the internal max send size.

VictorEijkhout commented 4 years ago

@hjelmn If it's an implementation detail, then it's one that every implementation gets wrong. Can you contact me privately if you have access to a cluster with Open MPI? Btw, what is the "internal max send size" typically?

VictorEijkhout commented 4 years ago

@hjelmn In fact I now have word (substantiated by testing) from sources very familiar with Open MPI that they actually handle that case worse than other MPI implementations.

So IMO buffered sends are still the best way to get good performance out of derived datatypes.

hjelmn commented 4 years ago

Could you provide the system details? There are multiple point-to-point implementations and some are better at this than others. I would prefer to make this better rather than rely on the existence of Bsend. Bsend is the wrong abstraction. Implementations need to provide good performance in either case.

VictorEijkhout commented 4 years ago

I have Power9s with IBM's network and IBM Spectrum MPI. Performance is a lot worse than, for instance, MVAPICH on the same machine. Can you give me access to your Open MPI cluster?

hjelmn commented 4 years ago

That suggests pml/ucx is in use (unless IBM has their own pml; they can add --mca pml_base_verbose 100 to the mpirun command line to verify). I don't know much about that pml, but I can test it on a Cray XC and see if I can reproduce the performance bug. I will also double-check pml/ob1 to see whether the performance has changed since the last time I tested non-contiguous performance. pml/ob1 does the same thing as bsend, but with an internal buffer already registered with the network (it should be just as fast as a user buffer).

Do you know the layout of the datatype? Total size, average contiguous chunk size, stride, etc? I can use that to run a reproducer.

Can't provide access but I can see about fixing the performance bug (or passing it off to the UCX team to fix).

VictorEijkhout commented 4 years ago

https://github.com/TACC/mpipacking

Sending every other element, every stream length from 1 element to tens of gigs.
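
For concreteness, a sketch of the two approaches being compared for the every-other-element case (the names and the use of doubles are assumptions, not taken from the repository):

    #include <mpi.h>
    #include <stdlib.h>

    /* Send every other element of an array of 2*n doubles in two ways:
     * (1) gather into a contiguous scratch buffer and send that;
     * (2) hand a strided derived datatype straight to the send call
     *     (the variant that, per the comments above, performed well only
     *     as a buffered send; MPI_Bsend here assumes a sufficiently large
     *     buffer was already attached with MPI_Buffer_attach). */
    void send_every_other(const double *a, int n, int dest, int tag, MPI_Comm comm)
    {
        /* (1) manual gather & send */
        double *tmp = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++)
            tmp[i] = a[2 * i];
        MPI_Send(tmp, n, MPI_DOUBLE, dest, tag, comm);
        free(tmp);

        /* (2) strided datatype: n blocks of 1 double, stride 2 */
        MPI_Datatype every_other;
        MPI_Type_vector(n, 1, 2, MPI_DOUBLE, &every_other);
        MPI_Type_commit(&every_other);
        MPI_Bsend(a, 1, every_other, dest, tag + 1, comm);
        MPI_Type_free(&every_other);
    }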

jeffhammond commented 1 year ago

Do we have the appetite to fix BSEND in MPI-5? Please see https://github.com/mpi-forum/mpi-issues/issues/643.