jeffhammond opened this issue 2 years ago
I like the assertions part and the merger of allocate and allocate-shared.
Just some thoughts on the type safety though:
What are the constraints on the provided datatype? And what are the semantics of passing an indexed or vector datatype with non-unit stride? I can see the value of providing contiguous datatypes, and the lack of a count argument suggests that allocating more than one element of some base type would require at least a contiguous type. But types with holes? Would we force the implementation to allocate the full extent, or can it allocate space only for the actually described elements?
My other concern is that creating a window may become rather tedious because I have to create the correct datatypes for my window first. Where now I can do
```c
struct A {
  double x, y;
  int z;
};
struct B {
  A a[2]; // expressed either as two distinct members or as contiguous?
  double c;
};
MPI_Win_allocate(..., sizeof(B)*count, ...);
```
I'd now have to create three MPI types up front
```c
MPI_Type_create_struct([DOUBLE, DOUBLE, INT], &A_type);
MPI_Type_create_struct([A_type, A_type, DOUBLE], &B_type);
MPI_Type_create_contiguous(count, B_type, &win_type);
MPI_Win_allocate_new(win_type, ...);
```
Changing `A` now requires a change in `A_type` too.

Also, how do I now address `z` in `A`? Are we using byte displacements for communication operations?
The displacement unit is wasted cycles on every RMA operation and should be removed. There's no reasonable way to harmonize displacement units and MPI datatypes. An info assertion regarding the underlying datatypes used on the window would be fine.
I'm afraid I don't understand what the benefit of having a datatype attached to a window is in the end. If we end up with byte displacements and arbitrary datatypes in communication calls then why do we want to attach a datatype to the window in the first place?
I really like the assertions for optimisation. Giving the datatype upfront is also an optimisation opportunity. Is this going to be a random-access one-byte window, or are we going to send complex numbers with two doubles?
> I'm afraid I don't understand what the benefit of having a datatype attached to a window is in the end. If we end up with byte displacements and arbitrary datatypes in communication calls then why do we want to attach a datatype to the window in the first place?
Windows with explicit types do not permit byte displacements. All displacements are in integer multiples of the datatype provided in the window constructor.
So only basic types in windows then?
...
> Also, how do I now address `z` in `A`? Are we using byte displacements for communication operations?
In that example, you'd use `MPI_BYTE` in the constructor and do explicit displacements.
The point of having types in the constructor is to enable optimization of common cases with homogeneous data. If you want to use RMA to access members of structs that are themselves members of structs, I'm not sure what optimizations you'd expect.
> So only basic types in windows then?
No. If you always access 1000 doubles, you can construct a window whose elements are an MPI datatype corresponding to 1000 doubles.
But what I really want to enable is optimization of RMA for arrays of built-in datatypes, since that is an incredibly common use case.
> The point of having types in the constructor is to enable optimization of common cases with homogeneous data.
So what is the actual optimization potential here? And how do I make sure that the implementation doesn't make assumptions about me using `MPI_BYTE` when I'm actually using compound types with arbitrary offsets for multi-byte types?
> So what is the actual optimization potential here? And how do I make sure that the implementation doesn't make assumptions about me using `MPI_BYTE` when I'm actually using compound types with arbitrary offsets for multi-byte types?
If you construct a window with type `MPI_DOUBLE`, your implementation knows that all RMA accesses are 8-byte aligned and that every accumulate/atomic operation is on `double` and nothing else, which means it can use hardware that supports this type but not other datatypes, which could not otherwise be excluded if the window were allocated using one of the existing calls.
Atomicity already isn't guaranteed across different basic datatypes, right?
@jdinan That's right, atomicity is only guaranteed for the same basic types.
@jeffhammond Thanks, that makes sense. That sounds like another hint in the puzzle for atomic performance... What if I know I want to use two types, not just one? Wouldn't it be better to express that information as a list of types in an info key?
I don't see how this would change the atomics implementation. You should be able to use the same approach for `double` regardless of any other datatypes that are in use. What am I missing?
Regarding the assertions: I suggest we make it an in/out parameter. The application can ask for features, and the implementation will allocate a window even if it cannot satisfy all of them. It then sets whatever it supports, so the application can check whether it can still work with that; if not, it has to free the window and move on. I think this simplifies the handshake between both sides if the application can work with less than it requested.
You can also have a flag that says whether the other flags are a hard requirement.
I prefer `MPI_Window_assertions input, MPI_Window_assertions &output` and let the user compare `input` and `output`.
Notes: add a count, the full datatype of the element memory representation (including non-contiguous structs etc.), and atomic requirements as a vector of (Op, Type) pairs.
From the discussion at the WG meeting today:
1) What is the relation between window creation datatypes and operation datatypes? The first describes the data layout during creation/allocation, the second describes the unit of access for a single operation. We may want to allow for sparse data types and let the implementation ignore the holes. Access displacements would be in bytes, no explicit displacement units handled by the implementation.
2) We should keep the possibility of using arbitrary user datatypes (as multiples of `MPI_BYTE`) and of using atomic operations on other basic datatypes. The creation type should not limit the operations applied to the memory.
3) Add `same_op_noop_replace` to include replacement. Is there any use for mixing different arithmetic/logical operations?
Notes from the 9/22/2022 WG meeting:
Extend the interface to provide types and ops intended (in) and supported (out) for accumulate operations:
```c
int MPI_Win_allocate_new(
    MPI_Aint size,                  // no displacement unit; all displacements are in bytes
    MPI_Window_assertions in_assertions,
    /* accumulate operations; ignored if the user requests support
       for any op/type combination */
    int num_types_in,
    MPI_Datatype types_to_use_in[num_types_in],
    int num_ops_in,
    MPI_Op ops_to_use_in[num_ops_in],
    int *num_types_out,
    MPI_Datatype types_to_use_out[],
    int *num_ops_out,
    MPI_Op ops_to_use_out[],
    MPI_Count accumulate_max_count, // max number of elements in a single MPI_Accumulate
    /* communicator */
    MPI_Comm comm,
    /* output parameters: MPI signals supported capabilities */
    MPI_Window_assertions *out_assertions,
    void **baseptr_out,
    MPI_Win *win);
```
Further questions/comments:
Need to provide `MPI_CAS` to signal the use of CAS as an atomic operation. Cannot be used anywhere else.

Please add an info key. There will be no `MPI_Win_allocate_new2`.
> Need to provide MPI_CAS to signal the use of CAS as an atomic operation. Cannot be used anywhere else.
Maybe if we name it `MPI_ATOMIC_CAS`, it's more obvious that it isn't a valid reduce op and only works with RMA atomics.
Should we add an info key for future extension?
Yes, every object constructor should have one.
I would like to register a bookmark for #25 which is essentially this ticket, but taken to an extreme. 😁
Motivation
The current window construction routines are not good. The problems include:
- Lack of strict orthogonality: `win_allocate` includes the functionality of `win_allocate_shared`, except that the shared memory associated with the former cannot be queried.
- Lack of type safety: many optimizations cannot happen because windows are just bytes, and the implementation does not know the type used for access until an RMA operation is issued.
- Lack of subsetting of synchronization motifs: no reasonable use case requires all the synchronization motifs, and supporting them all for every window is a burden on the implementation.
- Lack of other subsetting that would improve performance and make high-quality implementations easier.
New signatures
- We condense `win_allocate` and `win_allocate_shared` into one routine, with the expectation that `win_shared_query` will be able to query shared memory associated with all windows (see "relax constraints on MPI_WIN_SHARED_QUERY" for details).
- We condense `win_create` and `win_create_dynamic` into one routine. Users are permitted to attach memory to a `win_create_new` window, assuming the appropriate assertion is specified.
- TODO: new function to query windows that have more than one buffer attached (`win_create_dynamic` and `win_create_new`).

Window assertions
This is a bit field that contains the following entries, which will be implemented as opaque values like other RMA assertions. Assertions here are a statement of intent to use the named feature; if a given assertion is not provided, the implementation does not need to support this feature for the window.
If the implementation cannot support all of the assertions supplied by the user, the window constructor call must fail. The user can thus assume that all successful window constructor calls implement the specified assertions.