jeffhammond opened this issue 2 years ago
I like the assertions part and the merger of allocate and allocate-shared.
Just some thoughts on the type safety though:
What are the constraints on the provided datatype? And what are the semantics of passing an indexed or vector datatype with non-unit stride? I can see the value of providing contiguous datatypes, and the lack of a count argument suggests that allocating more than one element of some base type would require at least a contiguous type. But types with holes? Would we force the implementation to allocate the full extent, or can it allocate space only for the actually described elements?
My other concern is that creating a window may become rather tedious because I have to create the correct datatypes for my window first. Where now I can do
```c
struct A {
  double x, y;
  int z;
};
struct B {
  A a[2]; // expressed either as two distinct members or as contiguous?
  double c;
};
MPI_Win_allocate(..., sizeof(B)*count, ...);
```
I'd now have to create three MPI types up front
```c
MPI_Type_create_struct([DOUBLE, DOUBLE, INT], &A_type);
MPI_Type_create_struct([A_type, A_type, DOUBLE], &B_type);
MPI_Type_create_contiguous(count, B_type, &win_type);
MPI_Win_allocate_new(win_type, ...);
```
Changing `A` now requires a change in `A_type` too.

Also, how do I now address `z` in `A`? Are we using byte displacements for communication operations?
The displacement unit is wasted cycles on every RMA operation and should be removed. There's no reasonable way to harmonize displacement units and MPI datatypes. An info assertion regarding the underlying datatypes used on the window would be fine.
I'm afraid I don't understand what the benefit of having a datatype attached to a window is in the end. If we end up with byte displacements and arbitrary datatypes in communication calls then why do we want to attach a datatype to the window in the first place?
I really like the assertions for optimisation. Giving the datatype upfront is also an optimisation opportunity. Is this going to be a random-access one-byte window, or are we going to send complex numbers with two doubles?
> I'm afraid I don't understand what the benefit of having a datatype attached to a window is in the end. If we end up with byte displacements and arbitrary datatypes in communication calls then why do we want to attach a datatype to the window in the first place?
Windows with explicit types do not permit byte displacements. All displacements are in integer multiples of the datatype provided in the window constructor.
So only basic types in windows then?
...
> Also, how do I now address `z` in `A`? Are we using byte displacements for communication operations?
In that example, you'd use `MPI_BYTE` in the constructor and do explicit displacements.
The point of having types in the constructor is to enable optimization of common cases with homogeneous data. If you want to use RMA to access members of structs that are themselves members of structs, I'm not sure what optimizations you'd expect.
> So only basic types in windows then?
No. If you always access 1000 doubles, you can construct a window whose elements are an MPI datatype corresponding to 1000 doubles.
But what I really want to enable is optimization of RMA for arrays of built-in datatypes, since that is an incredibly common use case.
> The point of having types in the constructor is to enable optimization of common cases with homogeneous data.
So what is the actual optimization potential here? And how do I make sure that the implementation doesn't make assumptions about me using `MPI_BYTE` when I'm actually using compound types with arbitrary offsets for multi-byte types?
> So what is the actual optimization potential here? And how do I make sure that the implementation doesn't make assumptions about me using `MPI_BYTE` when I'm actually using compound types with arbitrary offsets for multi-byte types?
If you construct a window with type `MPI_DOUBLE`, your implementation knows that all RMA accesses are 8-byte aligned and that every accumulate/atomic operation is on `double` and nothing else, which means it can use hardware that supports this type but not other datatypes, which could not otherwise be excluded if the window were allocated using one of the existing calls.
Atomicity already isn't guaranteed across different basic datatypes, right?
@jdinan That's right, atomicity is only guaranteed for the same basic types.
@jeffhammond Thanks, that makes sense. That sounds like another hint in the puzzle for atomic performance... What if I know I want to use two types, not just one? Wouldn't it be better to express that information as a list of types in an info key?
I don't see how this would change the atomics implementation. You should be able to use the same approach for `double` regardless of any other datatypes that are in use. What am I missing?
Regarding the assertions: I suggest we make it an in/out parameter. The application can ask for features, and the implementation will allocate a window even if it cannot satisfy all of them. It then sets whatever it supports, so the application can check whether it can still work with that; if not, it has to free the window and move on. I think this simplifies the handshake between both sides if the application can work with less than it requested.
You can also have a flag that says whether the other flags are a hard requirement.
I prefer `MPI_Window_assertions input, MPI_Window_assertions &output` and let the user compare `input` and `output`.
Notes: add a count, the full datatype of the element memory representation (including non-contiguous structs etc.), and atomic requirements as a vector of (Op, Type) pairs.
From the discussion at the WG meeting today:
1) What is the relation between window creation datatypes and operation datatypes? The first describes the data layout during creation/allocation, the second describes the unit of access for a single operation. We may want to allow for sparse data types and let the implementation ignore the holes. Access displacements would be in bytes, no explicit displacement units handled by the implementation.
2) We should keep the possibility of using arbitrary user datatypes (as multiples of `MPI_BYTE`) and of using atomic operations on other basic datatypes. The creation type should not limit the operations applied to the memory.
3) Add `same_op_noop_replace` to include replacement. Is there any use for mixing different arithmetic/logical operations?
Notes from the 9/22/2022 WG meeting:
Extend the interface to provide types and ops intended (in) and supported (out) for accumulate operations:
```c
int MPI_Win_allocate_new(
    MPI_Aint size,                  // no displacement unit; all displacements are in bytes
    MPI_Window_assertions in_assertions,
    /* accumulate operations; ignored if the user requests support
       for any op/type combination */
    int num_types_in,
    MPI_Datatype types_to_use_in[num_types_in],
    int num_ops_in,
    MPI_Op ops_to_use_in[num_ops_in],
    int *num_types_out,
    MPI_Datatype types_to_use_out[],
    int *num_ops_out,
    MPI_Op ops_to_use_out[],
    MPI_Count accumulate_max_count, // max number of elements in a single MPI_Accumulate
    /* communicator */
    MPI_Comm comm,
    /* output parameters: MPI signals supported capabilities */
    MPI_Window_assertions *out_assertions,
    void **baseptr_out,
    MPI_Win *win);
```
Further questions/comments:
Need to provide `MPI_CAS` to signal the use of CAS as an atomic operation. Cannot be used anywhere else.

Please add an info key. There will be no `MPI_Win_allocate_new2`.
> Need to provide MPI_CAS to signal the use of CAS as an atomic operation. Cannot be used anywhere else.
Maybe if we name it `MPI_ATOMIC_CAS`, it's more obvious that it isn't a valid reduce op and only works with RMA atomics.
Should we add an info key for future extension?
Yes, every object constructor should have one.
I would like to register a bookmark for #25 which is essentially this ticket, but taken to an extreme. 😁
Motivation
The current window construction routines are not good. The problems include:
- Lack of strict orthogonality: `win_allocate` includes the functionality of `win_allocate_shared`, except that the shared memory associated with the former cannot be queried.
- Lack of type safety: many optimizations cannot happen because windows are just bytes, and the implementation does not know the type used for access until an RMA operation is issued.
- Lack of subsetting of synchronization motifs: no reasonable use case requires all the synchronization motifs, and supporting them all for every window is a burden on the implementation.
- Lack of other subsetting that would improve performance and make high-quality implementations easier.
New signatures
- We condense `win_allocate` and `win_allocate_shared` into one routine, with the expectation that `win_shared_query` will be able to query shared memory associated with all windows (see "relax constraints on MPI_WIN_SHARED_QUERY" for details).
- We condense `win_create` and `win_create_dynamic` into one routine. Users are permitted to attach memory to a `win_create_new` window, assuming the appropriate assertion is specified.
- TODO: new function to query windows that have more than one buffer attached (`win_create_dynamic` and `win_create_new`).

Window assertions
This is a bit field that contains the following entries, which will be implemented as opaque values like other RMA assertions. Assertions here are a statement of intent to use the named feature; if a given assertion is not provided, the implementation does not need to support this feature for the window.
If the implementation cannot support all of the assertions supplied by the user, the window constructor call must fail. The user can thus assume that all successful window constructor calls implement the specified assertions.