mpi-forum / mpi-issues

Tickets for the MPI Forum
http://www.mpi-forum.org/

Must MPI_Window objects be freed before MPI_Finalize? #711

Open csubich opened 1 year ago

csubich commented 1 year ago

Per the request of @jeffhammond, I am recreating mpiwg-rma/rma-issues/issues/27 here. I initially reported this as an ambiguity in the MPI specification, but @jeffhammond had doubts about what the specification intends to require.

Greetings,

Problem

Recently, I was surprised^1 by an error at MPI_Finalize on Intel's MPI implementation (MPICH-based, I think) with the following code:

program test_window
   !! Test whether MPI dies when a window is created but not freed before MPI_Finalize
   use mpi_f08
   use, intrinsic :: iso_fortran_env, only: error_unit
   implicit none
   integer, dimension(10) :: window_array
   integer :: myrank, numproc
   type(MPI_Win) :: created_window
   call MPI_Init()
   call MPI_Comm_size(MPI_COMM_WORLD, numproc)
   call MPI_Comm_rank(MPI_COMM_WORLD, myrank)
   write(error_unit,'("Rank ",I0,"/",I0," initialized")') myrank+1, numproc
   ! Expose the whole array; the size argument is in bytes since disp_unit = 1
   call MPI_Win_create(window_array, &
                       int(size(window_array)*storage_size(window_array)/8, MPI_ADDRESS_KIND), &
                       1, MPI_INFO_NULL, MPI_COMM_WORLD, created_window)
   write(error_unit,'("Rank ",I0," created window")') myrank+1
   ! Note: created_window is deliberately not freed before finalizing
   call MPI_Finalize()
   write(error_unit,'("Rank ",I0," finalized")') myrank+1
end program test_window

The error was evidently mine, for not freeing the MPI_Win object before the call to MPI_Finalize. After some long discussion in the above-linked Intel thread, I've convinced myself that the specification does allow implementations to treat this as an error, making the call to MPI_Win_free obligatory.
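For illustration, the minimal change that avoids the error in the program above is to free the window explicitly before finalizing (a sketch reusing the handle name from the reproducer):

   ! Collectively free the window; the handle becomes MPI_WIN_NULL afterwards
   call MPI_Win_free(created_window)
   call MPI_Finalize()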

However, this requirement does not appear to be obvious. The language around MPI_Finalize refers to completing communications, and my mental model of an MPI_Win object was that it was more like a communicator object (which does not need to be explicitly freed) than a request (which must be completed).

Might a future version of the MPI specification highlight this requirement more explicitly?

Proposal

The MPI standard should be more explicit on the interaction between MPI Window objects and the MPI_Finalize call. If an MPI Window object is "like a communicator," then it should be permissible to leave the window dangling (with further references invalid but not erroneous). If the window object is "like a file," then it should be closed/freed before the call to MPI_Finalize.

Changes to the Text

Section 12 should include commentary on the intended lifetime of MPI Window objects. If they must be freed before calling MPI_Finalize, then at minimum MPI_Win_free should note this requirement.

Impact on Implementations

If calling MPI_Finalize without MPI_Win_free is left explicitly undefined, then no implementation should be required to change.

If the call is obligatory and omitting it is erroneous, then implementations that currently accept the omission silently (Open MPI) might need to diagnose the error.

If the call is not obligatory, then implementations (Intel MPI) that currently treat a missing free as an error must stop doing so.

Impact on Users

If freeing each MPI Window before MPI_Finalize is obligatory, then users must track the lifetime of window objects and not allow them to go out of scope unfreed. This might require introducing garbage-collection-like cleanup into otherwise straight-line codes.

My surprising discovery of this issue (with Intel MPI) came about because the legacy code I'm working on implicitly assumes it can always exit with an error by finishing any pending communications (which are always local to the enclosing procedure, never long-lived) and then calling MPI_Finalize. My own workaround was to attach a garbage-collection hook to MPI_COMM_SELF, sketched below.
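A minimal sketch of that hook follows (the module and procedure names are hypothetical, not the actual legacy code). The standard guarantees that MPI_Finalize behaves as if MPI_COMM_SELF were freed first, so an attribute delete callback attached to MPI_COMM_SELF runs before the rest of finalization and can free any window still open, provided every rank registers the hook and reaches MPI_Finalize, since MPI_Win_free is collective.

module window_cleanup
   !! Hypothetical sketch: free a leftover window from an attribute delete
   !! callback on MPI_COMM_SELF, which MPI_Finalize triggers first.
   use mpi_f08
   implicit none
   type(MPI_Win) :: tracked_window      ! window to free at finalize time
   logical :: have_window = .false.     ! set once a window is registered
contains
   subroutine cleanup_callback(comm, keyval, attribute_val, extra_state, ierror)
      type(MPI_Comm) :: comm
      integer :: keyval, ierror
      integer(kind=MPI_ADDRESS_KIND) :: attribute_val, extra_state
      ! Runs when MPI_COMM_SELF is freed at the start of MPI_Finalize
      if (have_window) then
         call MPI_Win_free(tracked_window)   ! collective over the window's group
         have_window = .false.
      end if
      ierror = MPI_SUCCESS
   end subroutine cleanup_callback

   subroutine install_cleanup_hook(win)
      type(MPI_Win), intent(in) :: win
      integer :: keyval
      tracked_window = win
      have_window = .true.
      ! Attach the callback to MPI_COMM_SELF via a new attribute keyval
      call MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, cleanup_callback, &
                                  keyval, 0_MPI_ADDRESS_KIND)
      call MPI_Comm_set_attr(MPI_COMM_SELF, keyval, 0_MPI_ADDRESS_KIND)
   end subroutine install_cleanup_hook
end module window_cleanup

In the reproducer above, calling install_cleanup_hook(created_window) right after MPI_Win_create would make the later bare MPI_Finalize legal under either reading of the standard.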

Additionally, a new (or newly explicit) requirement to free windows might give some perverse user programs performance problems at exit time. MPI_Win_free is collective with barrier-like synchronization, and a program can only free one window at a time. If a program creates millions of windows over tens of thousands of processes, the required synchronization might add substantially to process-exit time even if the underlying implementation needs to do nothing to free a window.

References and Pull Requests

jeffhammond commented 1 year ago

Repeating my comment from the original issue...

First, I don't care whether your program is strictly compliant or not - Intel MPI should not crash like that. They have all the necessary information to clean up windows during finalization and not crash. MPICH built with the necessary debug options will report every single handle that's still active during finalize, so the infrastructure is already there inside of Intel MPI.

Second, I have yet to find anything in the standard that says your program is wrong, but I'll wait until others have weighed in to take a strong position here.

Here's what I found so far:

§ 11.2.2

Before an MPI process invokes MPI_FINALIZE, the process must perform all MPI calls needed to complete its involvement in MPI communications associated with the World Model. It must locally complete all MPI operations that it initiated and must execute matching calls needed to complete MPI communications initiated by other processes.

The call to MPI_FINALIZE does not free objects created by MPI calls; these objects are freed using MPI_XXX_FREE calls.

This does not state that windows must be freed, only that finalize doesn't free them. It says you must complete all communication, but you haven't initiated any in your example program.

A future version of the standard should clarify this situation for all objects.

csubich commented 1 year ago

MPI_Win_free comes up in connection with the notion of disconnecting two processes. In commentary on MPI_Comm_disconnect, the standard states (§11.10 of MPI-4, with the same language elsewhere in MPI-3.1):

Advice to users. To disconnect two processes you may need to call MPI_COMM_DISCONNECT, MPI_WIN_FREE, and MPI_FILE_CLOSE to remove all communication paths between the two processes. Note that it may be necessary to disconnect several communicators (or to free several windows or files) before two processes are completely independent. (End of advice to users.)

That's not too helpful for reasoning about MPI_Finalize, however. We can finalize without disconnecting/freeing communicators, but we must close files before finalizing (§14.2.1):

Before calling MPI_FINALIZE, the user is required to close (via MPI_FILE_CLOSE) all files that were opened with MPI_FILE_OPEN.

So if an MPI Window is like a communicator we can leave it, but if it's like a file we need to clean it up.

jeffhammond commented 1 year ago

Hmm, okay, RMA is similar enough to IO that we should have the same semantics for Windows and Files. I'd argue at that point that we should do the same for everything.

This interpretation would render your program incorrect, but today I think it is ambiguous.

csubich commented 1 year ago

That's a reasonable argument, but to play Devil's Advocate I'll present my view of the other side, that the specification should not require that users explicitly free MPI Window objects.


As I understand the history, the MPI-IO requirements on files are deliberately conservative because the specification makes no assumptions about which (if any) process has direct access to the file system, or what the underlying system's semantics are if a program ends without closing an open file (on disk). Closing the file might directly have externally-visible effects, such as with MPI_MODE_DELETE_ON_CLOSE.

I suppose it's also possible for a program that aborts (without closing files properly) to leave an opened-but-not-closed file in an inconsistent state despite calls to MPI_File_sync, since at a glance I think that call only mandates synchronization and consistency within a set of MPI processes. That is, I see no guarantee that external programs ever see a consistent view of an opened (write mode) MPI-IO file.

RMA windows, however, are internal to the process, and anything externally visible is incidental to the RMA operation. It is like an MPI communicator in this sense: each opens the possibility of communication. If an MPI implementation is expected to keep track of internal resources such that communicators can go out of scope freely, then it's not unreasonable to expect the same of window objects (without active access or exposure epochs).

The only guaranteed effect of a call to MPI_Win_free is that the window's destruction callbacks fire, and that may not happen as part of an MPI_Finalize call. However, the same holds true for a communicator object, where a user-created communicator might not see its destruction hook called during the finalize call (with the specification only guaranteeing a free of MPI_COMM_SELF).


I've also updated the issue-opening comment to more properly follow the issue template.

jeffhammond commented 1 year ago

RMA windows, however, are internal to the process, and anything externally visible is incidental to the RMA operation.

This isn't strictly true due to shared memory. Freeing an allocated window that provides shared memory to other processes would invalidate pointers, which potentially leads to undefined behavior in C.
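As an illustration of that point, here is a sketch assuming MPI_Win_allocate_shared on a shared-memory communicator (the names and sizes are invented): a rank can map another rank's window segment directly into its own address space, and that mapping is only valid while the window exists.

program shared_window_sketch
   !! Hypothetical sketch: a pointer obtained from a shared-memory window
   !! is only valid while the window exists; MPI_Win_free invalidates it.
   use mpi_f08
   use, intrinsic :: iso_c_binding
   implicit none
   type(MPI_Comm) :: shm_comm
   type(MPI_Win)  :: shm_window
   type(c_ptr)    :: my_base, remote_base
   integer(kind=MPI_ADDRESS_KIND) :: local_bytes, remote_bytes
   integer :: disp_unit
   real(c_double), pointer :: remote_data(:)

   call MPI_Init()
   ! Communicator of ranks that can share physical memory
   call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                            MPI_INFO_NULL, shm_comm)
   ! Each rank contributes 100 doubles to a collectively allocated window
   local_bytes = 100_MPI_ADDRESS_KIND * c_sizeof(0.0_c_double)
   call MPI_Win_allocate_shared(local_bytes, int(c_sizeof(0.0_c_double)), &
                                MPI_INFO_NULL, shm_comm, my_base, shm_window)
   ! Map rank 0's segment directly into this process's address space
   call MPI_Win_shared_query(shm_window, 0, remote_bytes, disp_unit, remote_base)
   call c_f_pointer(remote_base, remote_data, [100])

   ! ... loads/stores through remote_data touch rank 0's memory
   !     (RMA synchronization elided for brevity) ...

   call MPI_Win_free(shm_window)   ! after this, remote_data dangles
   call MPI_Finalize()
end program shared_window_sketch

Once MPI_Win_free returns, remote_data points at memory the implementation may have unmapped, so any further dereference is undefined behavior regardless of how MPI_Finalize is later handled.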

A related issue is that not all processes may return from MPI_Finalize. In all the major implementations, they do, but it's not required. Who owns the memory associated with an MPI process that doesn't return from MPI_Finalize (and may have exited)?

Thanks to shared memory at least, RMA has similar issues to IO and thus needs the same restriction. The fact that POSIX shared memory uses a filesystem abstraction reinforces this.