mpi-forum / mpi-issues

Tickets for the MPI Forum
http://www.mpi-forum.org/

Allow MPI_WIN_SHARED_QUERY on created and allocated windows #23

Closed jeffhammond closed 1 year ago

jeffhammond commented 8 years ago

Summary

Extend the functionality of MPI_WIN_SHARED_QUERY to all windows, which will inform the user regarding the MPI shared-memory properties of any window. To what extent this function will return a nontrivial result (i.e. indicate the shared memory has been allocated and is accessible) depends on the implementation. It may be difficult for implementations to use shared memory with MPI_WIN_CREATE, although there are multiple existence proofs.

This change permits MPI shared-memory access on any window, but requires nothing new of implementations: they are now allowed to expose more if possible. Previously, even if an implementation was able to do this, the user had no way to leverage it explicitly.

In the event that the implementation cannot support MPI shared-memory access beyond what MPI-3 defines, the implementation of this change is trivial, because MPI_WIN_SHARED_QUERY will simply tell the user that only the trivial case of local shared-memory access is permitted on windows not allocated by MPI_WIN_ALLOCATE_SHARED.

In order to make it possible for the user to access the shared memory associated with windows allocated by any means, it must be valid to use MPI_WIN_SHARED_QUERY on windows resulting from MPI_WIN_ALLOCATE and MPI_WIN_CREATE, not just MPI_WIN_ALLOCATE_SHARED.
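Under the proposed semantics, a caller could attempt load/store access on any window and fall back to RMA operations when the query returns size = 0. A minimal sketch, assuming the extension proposed here is adopted (the helper name `try_direct_access` is illustrative, not part of any API):

```c
#include <mpi.h>
#include <stddef.h>

/* Try direct load/store access to the window memory of target_rank.
 * Returns the process-local address on success, or NULL if the
 * implementation did not map that rank's memory into this process
 * (i.e. the query returned size == 0). */
void *try_direct_access(MPI_Win win, int target_rank)
{
  MPI_Aint size;
  int disp_unit;
  void *base = NULL;
  /* With this proposal, the query is valid on windows of any flavor,
   * not just MPI_WIN_FLAVOR_SHARED. */
  MPI_Win_shared_query(win, target_rank, &size, &disp_unit, &base);
  return (size > 0) ? base : NULL; /* NULL => fall back to MPI_Put/MPI_Get */
}
```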

Motivation

While MPI_WIN_ALLOCATE_SHARED is sufficient to allocate shared-memory windows, on some systems it is not necessary, i.e. windows allocated by other means may still support MPI shared-memory accesses.

Because it is illegal to determine whether an implementation has allocated shared memory with MPI_WIN_ALLOCATE, if one needs shared-memory access to data associated with a window that spans multiple shared-memory domains - as is required by both the OpenSHMEM and Global Arrays APIs - then one must do the following, which is tedious and prevents the use of certain scalability optimizations supported by MPI_WIN_ALLOCATE.

int XXX_Win_allocate(MPI_Aint size, int disp_unit, MPI_Info info,
                     MPI_Comm comm, void *baseptr, MPI_Win * win)
{
  MPI_Comm shared_comm;
  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0 /* key */,
                      MPI_INFO_NULL, &shared_comm);

  // baseptr follows the MPI convention: a void* that really holds a void**
  MPI_Win shared_win;
  MPI_Win_allocate_shared(size, disp_unit, info, shared_comm, baseptr, &shared_win);
  MPI_Win_create(*(void **)baseptr, size, disp_unit, info, comm, win);

  // do some nonsense to hide the shared memory window handle
  // as an attribute on the user-facing window so I can free it later
  // in XXX_Win_free in order to not leak memory
  MPI_Win_create_keyval(..);
  MPI_Win_set_attr(..);

  MPI_Comm_free(&shared_comm);

  return MPI_SUCCESS;
}

Casper does something along these lines (more complicated) here: https://github.com/pmodels/casper/blob/master/src/user/include/cspu_shmbuf.h.

Because of the current semantics of MPI RMA, it is impossible to support a useful implementation of OpenSHMEM's shmem_ptr, as one example.

Limitations

We exclude MPI_WIN_CREATE_DYNAMIC because it is not possible for MPI_WIN_SHARED_QUERY to provide useful information in the general case where MPI_WIN_ATTACH has been used more than once on the window. A new query function would be required to support this window type.

~One side effect of this is that implementations that use non-contiguous shared-memory allocations for Win_allocate windows will either have to not expose that to users (i.e. do what they do now) or change to using contiguous shared-memory unless the non-contiguous info key is provided. This may have some impact on performance in some cases due to NUMA, but we note that support for NUMA balancing in operating systems like Linux ameliorates this issue. NUMA balancing did not exist when MPI-3 was standardized, so this was a greater concern then.~ (we won't require contig)

Implementation

All MPI implementations can support MPI_Win_allocate by first allocating local memory with MPI_Win_allocate_shared and then calling MPI_Win_create with the resulting buffer as input. However, this precludes scalability optimizations available to MPI_Win_allocate, because no coordination can be done across nodes w.r.t. the underlying buffer allocation. Furthermore, most implementations of MPI_Win_allocate already allocate shared memory internally, but this information is not available to the user since the relevant query function is not permitted to provide it (it is an invalid input to pass such a window).

Historically (it may be deprecated now), MVAPICH provided an info key to allocate shared memory via MPI_Alloc_mem and when this buffer was passed into MPI_Win_create, it would permit shared memory optimizations not otherwise available. There is already text encouraging the use of MPI_Alloc_mem with RMA, so it is not unreasonable to treat this as the important case.
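The MPI_Alloc_mem-based pattern mentioned above looks roughly as follows (the MVAPICH info key is not reproduced here; whether shared memory is actually used is implementation-specific, and the function name is illustrative):

```c
#include <mpi.h>

/* Allocate window memory with MPI_Alloc_mem so the implementation can
 * place it in shared or registered memory, then expose it via
 * MPI_Win_create. An implementation-specific info key may be needed to
 * request shared-memory placement. */
int create_win_from_alloc_mem(MPI_Aint size, int disp_unit, MPI_Info info,
                              MPI_Comm comm, void **base_out, MPI_Win *win)
{
  void *base;
  MPI_Alloc_mem(size, info, &base);
  MPI_Win_create(base, size, disp_unit, info, comm, win);
  *base_out = base; /* free with MPI_Free_mem after MPI_Win_free */
  return MPI_SUCCESS;
}
```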

The following systems permit MPI_Win_create to induce the use of shared memory without requiring MPI_Alloc_mem:

Proposed Text

This will need to be reimplemented since the source of the document has changed a lot since 2015.

~See https://github.com/mpiwg-rma/mpi-standard/commit/fcfa116935376d65f3bb28332d0669df269edf85 for integrated version.~

Old text (MPI 3.0)

This function queries the process-local address for remote memory segments created with MPI_WIN_ALLOCATE_SHARED. This function can return different process-local addresses for the same physical memory on different processes. The returned memory can be used for load/store accesses subject to the constraints defined in Section 11.7. This function can only be called with windows of type MPI_WIN_FLAVOR_SHARED. If the passed window is not of flavor MPI_WIN_FLAVOR_SHARED, the error MPI_ERR_RMA_FLAVOR is raised. When rank is MPI_PROC_NULL, the pointer, disp_unit, and size returned are the pointer, disp_unit, and size of the memory segment belonging to the lowest rank that specified size > 0. If all processes in the group attached to the window specified size = 0, then the call returns size = 0 and a baseptr as if MPI_ALLOC_MEM was called with size = 0.

New text:

This function queries the process-local address for remote memory segments created with MPI_WIN_ALLOCATE_SHARED, MPI_WIN_ALLOCATE, and MPI_WIN_CREATE. This function can return different process-local addresses for the same physical memory on different processes. The returned memory can be used for load/store accesses subject to the constraints defined in Section 11.7. When rank is MPI_PROC_NULL, the pointer, disp_unit, and size returned are the pointer, disp_unit, and size of the memory segment belonging to the lowest rank in the shared memory domain that specified size > 0. If all processes in the group attached to the window specified size = 0, then the call returns size = 0 and a baseptr as if MPI_ALLOC_MEM was called with size = 0.

Only MPI_Win_allocate_shared is required to allocate shared memory. Implementations are permitted, where possible, to do so with MPI_Win_create and MPI_Win_allocate. For the latter two cases, if shared memory is allocated, the shared memory domain is the communicator resulting from a call to MPI_COMM_SPLIT_TYPE with type=MPI_COMM_TYPE_SHARED on the communicator created from the group of the window. The user can determine the set of MPI processes for which size might be non-zero using MPI_COMM_SPLIT_TYPE with split_type = MPI_COMM_TYPE_SHARED; however, just because a rank is a member of this communicator does not mean that load/store access will be possible. When the remote memory segment corresponding to a particular rank cannot be accessed directly, this call returns size = 0 and a baseptr as if MPI_ALLOC_MEM was called with size = 0.
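To make the "might be non-zero" wording concrete: a user would split the window's communicator, then probe each member, treating size = 0 as "no direct access". A sketch under the proposed text (`probe_shared_domain` is a hypothetical helper; `wincomm` is assumed to be the communicator the window was created over):

```c
#include <mpi.h>
#include <stdio.h>

/* Probe which members of the shared-memory domain actually expose
 * their window memory for load/store access under the proposed text. */
void probe_shared_domain(MPI_Win win, MPI_Comm wincomm)
{
  MPI_Comm shm_comm;
  MPI_Comm_split_type(wincomm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &shm_comm);

  int shm_size;
  MPI_Comm_size(shm_comm, &shm_size);

  /* MPI_Win_shared_query takes ranks relative to the window's group,
   * so translate shm_comm ranks to wincomm ranks first. */
  MPI_Group shm_group, win_group;
  MPI_Comm_group(shm_comm, &shm_group);
  MPI_Comm_group(wincomm, &win_group);

  for (int r = 0; r < shm_size; r++) {
    int win_rank;
    MPI_Group_translate_ranks(shm_group, 1, &r, win_group, &win_rank);

    MPI_Aint size;
    int disp_unit;
    void *base = NULL;
    MPI_Win_shared_query(win, win_rank, &size, &disp_unit, &base);
    if (size > 0)
      printf("rank %d: direct load/store possible at %p\n", win_rank, base);
    /* size == 0: membership in shm_comm does not guarantee access. */
  }

  MPI_Group_free(&shm_group);
  MPI_Group_free(&win_group);
  MPI_Comm_free(&shm_comm);
}
```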

MPI_Win_allocate is not required to allocate contiguous shared memory and will ignore the info hint to do so.

Related Work

This was https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/397

jeffhammond commented 8 years ago

~Note that Win_allocate now needs to allocate contiguous shared memory by default but that the user can opt-out as with Win_allocate_shared. Change for that info key to apply to Win_allocate as well.~

Edit: see below.

rsth commented 8 years ago

March 2016 mtg: Jeff will fix the ticket.

devreal commented 2 years ago

This sentence adds quite a bit of uncertainty:

The user can determine the set of MPI processes for which size might be non-zero using MPI_COMM_SPLIT_TYPE with split_type = MPI_COMM_TYPE_SHARED; however, just because a rank is a member of this communicator does not mean that load-store access will be possible.

So even with windows allocated through MPI_Win_allocate_shared it might not work? I assume this is meant to say that querying shared memory addresses may not be supported by every window on every implementation? If so, should we add language to allow implementations to not support shared memory for windows allocated through MPI_WIN_ALLOCATE and MPI_WIN_CREATE but require support for windows from MPI_WIN_ALLOCATE_SHARED?

jeffhammond commented 2 years ago

I have updated the new text to address this.

jeffhammond commented 2 years ago

The motivation for this ticket is that every RMA implementation allocates shared memory in MPI_Win_allocate, yet because MPI_Win_shared_query makes it illegal for me to query it, one has to do the following ridiculous nonsense instead:

int XXX_Win_allocate(MPI_Aint size, int disp_unit, MPI_Info info,
                     MPI_Comm comm, void *baseptr, MPI_Win * win)
{
  MPI_Comm shared_comm;
  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0 /* key */,
                      MPI_INFO_NULL, &shared_comm);

  // baseptr follows the MPI convention: a void* that really holds a void**
  MPI_Win shared_win;
  MPI_Win_allocate_shared(size, disp_unit, info, shared_comm, baseptr, &shared_win);
  MPI_Win_create(*(void **)baseptr, size, disp_unit, info, comm, win);

  // do some nonsense to hide the shared memory window handle
  // as an attribute on the user-facing window so I can free it later
  // in XXX_Win_free in order to not leak memory
  MPI_Win_create_keyval(..);
  MPI_Win_set_attr(..);

  MPI_Comm_free(&shared_comm);

  return MPI_SUCCESS;
}

Casper does something along these lines (more complicated) here: https://github.com/pmodels/casper/blob/master/src/user/include/cspu_shmbuf.h.

devreal commented 2 years ago

Thanks, that's clear now :+1: (at least for me)

devreal commented 2 years ago

I started putting this into a PR. I'm not sure about contiguous shared memory in MPI_Win_allocate. Should it be optional (the opposite of MPI_Win_allocate_shared) or do we just not care about contiguous shared memory from MPI_Win_allocate? We shouldn't make it the default (potential performance regression) but I don't see a good reason not to make it an option.

jeffhammond commented 2 years ago

I agree that we do not want to force WA to have contig as the default, but the problem is, info can be ignored, so no app can rely on their request to get contig. We can solve this by saying that WA is never required to be contiguous, and that the contig info is ignored. This means that there are some use cases that work with WAS but not WA (or WC, which couldn't allocate contiguous memory anyways), but I think that's okay. All existing code works, including the ones involving WAS+contig, and we remove a pointless restriction on WSQ related to WA and WC.

Abbreviations

WSQ = MPI_Win_shared_query
WAS = MPI_Win_allocate_shared
WA = MPI_Win_allocate
WC = MPI_Win_create

mpiforumbot commented 1 year ago

This passed a no-no vote.

| Yes | No | Abstain |
| --- | --- | ------- |
| 30  | 0  | 1       |

mpiforumbot commented 1 year ago

This passed a 1st vote.

| Yes | No | Abstain |
| --- | --- | ------- |
| 29  | 0  | 2       |

wesbland commented 1 year ago

Had no-no reading on 2023-05-02.

wesbland commented 1 year ago

This passed a no-no vote.

| Yes | No | Abstain |
| --- | --- | ------- |
| 29  | 0  | 3       |

wesbland commented 1 year ago

This passed a 2nd vote.

| Yes | No | Abstain |
| --- | --- | ------- |
| 26  | 0  | 6       |