jeffhammond closed this issue 1 year ago.
`MPI_PROC_NULL` needs to specify that it returns the lowest rank in the node. Something like that. `COMM_TYPE_SHARED` is required for `Win_allocate_shared`.

~Note that `Win_allocate` now needs to allocate contiguous shared memory by default, but the user can opt out as with `Win_allocate_shared`. Change that info key to apply to `Win_allocate` as well.~
Edit: see below.
March 2016 mtg: Jeff will fix the ticket.
This sentence adds quite a bit of uncertainty:
> The user can determine the set of MPI processes for which size might be non-zero using MPI_COMM_SPLIT_TYPE with split_type = MPI_COMM_TYPE_SHARED; however, just because a rank is a member of this communicator does not mean that load-store access will be possible.
So even with windows allocated through `MPI_Win_allocate_shared` it might not work? I assume this is meant to say that querying shared memory addresses may not be supported by every window on every implementation? If so, should we add language to allow implementations to not support shared memory for windows allocated through `MPI_WIN_ALLOCATE` and `MPI_WIN_CREATE`, but require support for windows from `MPI_WIN_ALLOCATE_SHARED`?
I have updated the new text to address this.
The motivation for this ticket is that every RMA implementation allocates shared memory in MPI_Win_allocate, yet because MPI_Win_shared_query makes it illegal for me to query it, one has to do the following ridiculous nonsense instead:
```c
int XXX_Win_allocate(MPI_Aint size, int disp_unit, MPI_Info info,
                     MPI_Comm comm, void *baseptr, MPI_Win *win)
{
    MPI_Comm shared_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0 /* key */,
                        MPI_INFO_NULL, &shared_comm);
    MPI_Win shared_win;
    /* baseptr is really a void** in disguise, per MPI convention */
    MPI_Win_allocate_shared(size, disp_unit, info, shared_comm,
                            baseptr, &shared_win);
    MPI_Win_create(*(void **)baseptr, size, disp_unit, info, comm, win);
    // do some nonsense to hide the shared memory window handle
    // as an attribute on the user-facing window so i can free it later
    // in XXX_Win_free in order to not leak memory
    MPI_Win_create_keyval(..);
    MPI_Win_set_attr(..);
    MPI_Comm_free(&shared_comm);
    return MPI_SUCCESS;
}
```
Casper does something along these lines (more complicated) here: https://github.com/pmodels/casper/blob/master/src/user/include/cspu_shmbuf.h.
Thanks, that's clear now :+1: (at least for me)
I started putting this into a PR. I'm not sure about contiguous shared memory in `MPI_Win_allocate`. Should it be optional (the opposite of `MPI_Win_allocate_shared`), or do we just not care about contiguous shared memory from `MPI_Win_allocate`? We shouldn't make it the default (potential performance regression), but I don't see a good reason not to make it an option.
I agree that we do not want to force WA to have contig as the default, but the problem is, info can be ignored, so no app can rely on their request to get contig. We can solve this by saying that WA is never required to be contiguous, and that the contig info is ignored. This means that there are some use cases that work with WAS but not WA (or WC, which couldn't allocate contiguous memory anyways), but I think that's okay. All existing code works, including the ones involving WAS+contig, and we remove a pointless restriction on WSQ related to WA and WC.
WSQ = `MPI_Win_shared_query`, WAS = `MPI_Win_allocate_shared`, WA = `MPI_Win_allocate`, WC = `MPI_Win_create`
This passed a no-no vote.
| Yes | No | Abstain |
| --- | --- | --- |
| 30 | 0 | 1 |
This passed a 1st vote.
| Yes | No | Abstain |
| --- | --- | --- |
| 29 | 0 | 2 |
Had no-no reading on 2023-05-02.
This passed a no-no vote.
| Yes | No | Abstain |
| --- | --- | --- |
| 29 | 0 | 3 |
Summary
Extend the functionality of `MPI_WIN_SHARED_QUERY` to all windows, which will inform the user regarding the MPI shared-memory properties of any window. To what extent this function will return a nontrivial result (i.e., indicate that shared memory has been allocated and is accessible) depends on the implementation. It may be difficult for implementations to use shared memory with `MPI_WIN_CREATE`, although there are multiple existence proofs.

This change permits MPI shared-memory accesses on any window, but nothing new is required. Implementations will now be allowed to provide more if possible. Previously, even if an implementation was able to do this, there was no way for the user to leverage it explicitly.
In the event that the implementation cannot support MPI shared-memory access beyond what MPI-3 defines, the implementation is trivial, because `MPI_WIN_SHARED_QUERY` will tell the user that only the trivial case of local shared-memory access is permitted on windows not allocated by `MPI_WIN_ALLOCATE_SHARED`.

In order to make it possible for the user to access the shared memory associated with windows allocated by any means, it must be valid to use `MPI_WIN_SHARED_QUERY` on windows resulting from `MPI_WIN_ALLOCATE` and `MPI_WIN_CREATE`, not just `MPI_WIN_ALLOCATE_SHARED`.

Motivation
While `MPI_WIN_ALLOCATE_SHARED` is sufficient to allocate shared-memory windows, it is not necessary on some systems, i.e., windows allocated by other means may still support MPI shared-memory accesses.

Because it is illegal to determine whether an implementation has allocated shared memory with `MPI_WIN_ALLOCATE`, if one needs shared-memory access to data associated with a window that spans multiple shared-memory domains - as is the case in both the OpenSHMEM and Global Arrays APIs - then one must do the following, which is tedious and prevents the use of certain scalability optimizations supported by `MPI_WIN_ALLOCATE`.

Casper does something along these lines (more complicated) here: https://github.com/pmodels/casper/blob/master/src/user/include/cspu_shmbuf.h.

Because of the current semantics of MPI RMA, it is impossible to support a useful implementation of OpenSHMEM's `shmem_ptr`, as one example.

Limitations
We exclude `MPI_WIN_CREATE_DYNAMIC` because it is not possible for `MPI_WIN_SHARED_QUERY` to provide useful information in the general case where `MPI_WIN_ATTACH` has been used more than once on the window. A new query function would be required to support this window type.

~One side effect of this is that implementations that use non-contiguous shared-memory allocations for `Win_allocate` windows will either have to not expose that to users (i.e., do what they do now) or change to using contiguous shared memory unless the non-contiguous info key is provided. This may have some impact on performance in some cases due to NUMA, but we note that support for NUMA balancing in operating systems like Linux ameliorates this issue. NUMA balancing did not exist when MPI-3 was standardized, so this was a greater concern then.~ (we won't require contig)

Implementation
All MPI implementations can support `MPI_Win_allocate` by first allocating local memory with `MPI_Win_allocate_shared` and then calling `MPI_Win_create` with the resulting buffer as input. However, this precludes scalability optimizations available to `MPI_Win_allocate`, because no coordination can be done across nodes w.r.t. the underlying buffer allocation. Furthermore, most implementations of `MPI_Win_allocate` already allocate shared memory internally, but this information is not available to the user, since the relevant query function is not permitted to provide it (such a window is an invalid input).

Historically (it may be deprecated now), MVAPICH provided an info key to allocate shared memory via `MPI_Alloc_mem`, and when this buffer was passed into `MPI_Win_create`, it would permit shared-memory optimizations not otherwise available. There is already text encouraging the use of `MPI_Alloc_mem` with RMA, so it is not unreasonable to treat this as the important case.

The following systems permit `MPI_Win_create` to induce the use of shared memory without requiring `MPI_Alloc_mem`:

- `BG_MAPCOMMONHEAP` is set in the environment.

Proposed Text
This will need to be reimplemented since the source of the document has changed a lot since 2015.
~See https://github.com/mpiwg-rma/mpi-standard/commit/fcfa116935376d65f3bb28332d0669df269edf85 for integrated version.~
Old text (MPI 3.0)
This function queries the process-local address for remote memory segments created with `MPI_WIN_ALLOCATE_SHARED`. This function can return different process-local addresses for the same physical memory on different processes. The returned memory can be used for load/store accesses subject to the constraints defined in Section 11.7. This function can only be called with windows of type `MPI_WIN_FLAVOR_SHARED`. If the passed window is not of flavor `MPI_WIN_FLAVOR_SHARED`, the error `MPI_ERR_RMA_FLAVOR` is raised. When rank is `MPI_PROC_NULL`, the pointer, disp_unit, and size returned are the pointer, disp_unit, and size of the memory segment belonging to the lowest rank that specified size > 0. If all processes in the group attached to the window specified size = 0, then the call returns size = 0 and a baseptr as if `MPI_ALLOC_MEM` was called with size = 0.

New text:
This function queries the process-local address for remote memory segments created with `MPI_WIN_ALLOCATE_SHARED`, `MPI_WIN_ALLOCATE`, and `MPI_WIN_CREATE`. This function can return different process-local addresses for the same physical memory on different processes. The returned memory can be used for load/store accesses subject to the constraints defined in Section 11.7. When rank is `MPI_PROC_NULL`, the pointer, disp_unit, and size returned are the pointer, disp_unit, and size of the memory segment belonging to the lowest rank in the shared memory domain that specified size > 0. If all processes in the group attached to the window specified size = 0, then the call returns size = 0 and a baseptr as if `MPI_ALLOC_MEM` was called with size = 0.

Only `MPI_Win_allocate_shared` is required to allocate shared memory. Implementations are permitted, where possible, to do so with `MPI_Win_create` and `MPI_Win_allocate`. For the latter two cases, if shared memory is allocated, the shared memory domain is the communicator resulting from a call to `MPI_COMM_SPLIT_TYPE` with `type=MPI_COMM_TYPE_SHARED` on the communicator created from the group of the window. The user can determine the set of MPI processes for which size might be non-zero using `MPI_COMM_SPLIT_TYPE` with split_type = `MPI_COMM_TYPE_SHARED`; however, just because a rank is a member of this communicator does not mean that load/store access will be possible. When the remote memory segment corresponding to a particular rank cannot be accessed directly, this call returns size = 0 and a baseptr as if `MPI_ALLOC_MEM` was called with size = 0.

`MPI_Win_allocate` is not required to allocate contiguous shared memory and will ignore the info hint to do so.

Related Work
History
This was https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/397