Closed davidozog closed 5 years ago
While I think the addition of typed collectives is fine, I don't think we should deprecate the existing untyped (i.e., void*) collectives (e.g., shmem_alltoall{32,64}) without also adding team-based untyped equivalents (e.g., shmem_team_alltoall{32,64}). SHMEM needs to provide ways to use data-shuffling collectives over arbitrary types (e.g., struct foo).
At some point, IIRC, I proposed adding shmem_alltoall{8,16} to complete the generality of the API. The resistance, I think, was with respect to whether these were actually needed or merely there for completeness. (It was mostly for completeness.)
I suppose that if shmem_team_{char,schar,uchar}_alltoall exist, one could use them for sending arbitrary C structures without violating C's pointer-conversion requirements, but properly invoking the generic machinery of shmem_team_alltoall would appear strange, as it would require an explicit cast to a pointer to a character type for the destination and source buffers.
@nspark - thanks, this is a good point. Do you have a preference between the two options? For example, whether to keep (i.e., not deprecate) shmem_alltoall{32,64} vs. add shmem_team_alltoall{32,64}?
Sorry, I think I got mixed up a little bit. Are we deprecating all the existing collectives that take pSync arguments? If so, then yes, I think we should deprecate shmem_alltoall{32,64} along with the others. Then, I think we should add shmem_team_alltoall{32,64} to keep the analogous functionality using SHMEM teams.
Yup, that's my understanding based on the current document. I'd also guess that users want the ability to use teams over arbitrary types... so I'll add the shmem_team_{bcast,collect,alltoall}{32,64} routines back, if that sounds good to @gmegan.
I don't believe I have been involved with any discussions regarding adding 8- and 16-bit (and 128-bit?) routines... But I don't see much of an issue adding something like:
shmem_team_alltoall<SIZE>...
where SIZE is one of 8, 16, 32, 64, 128...
We just have to be careful about C's limit on the number of external identifiers in a translation unit. That's 4095, correct?
As to the above comments about what to add and what to leave... let's talk more in the WG today with other people, but from what I am seeing, we are talking about all of the following:
C/C++
shmem_team_TYPENAME_[broadcast|collect|fcollect|alltoall|alltoalls] (team, ...)
shmem_team_[broadcast|collect|fcollect|alltoall|alltoalls]_[8|16|32|64] (team, ...)
Deprecated C/C++:
shmem_[broadcast|collect|fcollect|alltoall|alltoalls]_[8|16|32|64] ( ..., active set params )
C11
shmem_[broadcast|collect|fcollect|alltoall|alltoalls] (team, ... )
shmem_[broadcast|collect|fcollect|alltoall|alltoalls]_[8|16|32|64] (team, ... )
And from Naveen's comments last week about reductions, we are looking at:
C/C++
shmem_team_TYPENAME_reduce ( team, OPCODE, ... )
Deprecated C/C++
shmem_TYPENAME_[and|or|xor|max|min|sum|prod]_to_all ( ..., active set params )
C11
shmem_reduce ( team, OPCODE, ... )
While we are doing this, do we also want 8/16/32/64 reductions?
I mention reductions here, but let's not add them to this PR; I will update reductions in a separate PR.
TODO on this PR: make team the first argument, not the last.
Regarding collective reductions with explicit types and/or operations, I think the space of options spans two axes of choice: (1) whether the element type is named in the function or handled generically, and (2) whether the reduction operation is named in the function or passed as an argument.
These would look like some subset of the following:
void shmem_team_OP_to_all(shmem_team_t team, TYPE *dst, const TYPE *src, size_t nreduce);
void shmem_team_TYPE_OP_to_all(shmem_team_t team, TYPE *dst, const TYPE *src, size_t nreduce);
typedef enum {
    SHMEM_OP_MAX,
    SHMEM_OP_MIN,
    SHMEM_OP_SUM,
    // ...
    SHMEM_OP_XOR,
} shmem_op_t;
void shmem_team_reduce_to_all(shmem_team_t team, TYPE *dst, const TYPE *src, size_t nreduce, shmem_op_t op);
void shmem_team_TYPE_reduce_to_all(shmem_team_t team, TYPE *dst, const TYPE *src, size_t nreduce, shmem_op_t op);
Of the options:
One comment I made on today's call was that implementations could choose to implement shmem_team_reduce_to_all (1B + 2B) as something like:
void shmem_internal_team_reduce_to_all(shmem_team_t team, shmem_type_t type, void *dst, const void *src, size_t nreduce, shmem_op_t op);
typedef enum {
    SHMEM_TYPE_INT,
    SHMEM_TYPE_UINT,
    // ...
} shmem_type_t;

#define TYPE_TO_ENUM(PTR)                  \
    _Generic((PTR),                        \
        int*:          SHMEM_TYPE_INT,     \
        unsigned int*: SHMEM_TYPE_UINT,    \
        /* ... */)

#define shmem_team_reduce_to_all(TEAM, DST, ...) \
    shmem_internal_team_reduce_to_all((TEAM), TYPE_TO_ENUM((DST)), (DST), __VA_ARGS__)
But I don't think we should ever expose such an MPI-like interface as shmem_internal_team_reduce_to_all.
Thanks @nspark for the summary.
I've just created a Doodle for taking a straw poll on this issue. It's located here: https://doodle.com/poll/4v8759ia4hw4fn7y
Please submit a vote, everyone! I would appreciate it if you leave a name/organization on the poll, but inputting "anonymous" is fine as well.
There is also the matter of whether to include 8-bit and/or 16-bit sizes in the untyped collectives (where relevant), and/or whether to include the "mem" variants where the nelems argument is scaled in bytes (similar to shmem_putmem).
As I recall, we discussed 5 reasonable options here as well:
1) Add 8-bit and 16-bit routines, without a mem routine
2) Add 8-bit and 16-bit routines, with a mem routine
3) Add an 8-bit routine only, without a mem routine
4) Add an 8-bit routine only, with a mem routine
5) Don't add either 8- or 16-bit routines, and don't add a mem routine
Here is a Doodle poll for helping to guide this issue in a good direction. Please participate! https://doodle.com/poll/6dretrharavuz54y
Side note: As I mentioned on the last call, the (official?) GitHub polls app may be a little buggy with regard to allowing multiple votes (the issue is documented there), so perhaps Doodle is the better way to go for now.
Sorry that I was not able to join the discussion earlier. My comments are based on the discussion in this PR, so I may have misunderstood something. I have also voted in the polls.
For the 8-bit/16-bit/mem issue, I think completeness is important. As we already have 8/16/mem for RMA, I do not see a reason not to add them (or to add only some of them) for the collective API.
For the reduction API issue, I would prefer to always have typed interfaces + operation-named reductions for performance reasons. A type-generic API can be added for compilers with C11 support. With MPI's type-generic, operation-as-argument API (e.g., MPI_Reduce(..., dtype, op, ...)), we always have to pay extra instruction overhead while the runtime interprets the argument values before issuing the data to the network. For example:
MPI_Reduce(..., dtype, op, ...) {
    if (network reduce does not support <dtype, op>)
        use active-message
    else
        use network hardware reduce
}
I am not sure how much trouble a large number of typed + operation-named functions would cause for user programming. But a runtime implementation that does not care about the instruction overhead can simply map each function to an internal type-generic, operation-as-argument routine.
Min, thank you for the feedback. On the call last week, it came up that there would be overhead from an if-then statement inside the reduction. For some people this seems OK, since the network operation is already so slow that the extra time for the if-then statement would be acceptable.
I am interested that you say this performance overhead is actually a problem in the MPI reductions, since that was our model for adding the SHMEM operation-as-argument reductions. Have you seen worse performance of MPI reductions vs. SHMEM reductions in your implementations that is due to the added complexity of the MPI reductions? Or is this more based on looking at the code and knowing that a branch will always be slower?
@minsii Which network supports datatype-specific collectives? I'm aware of size-specific ones, but datatype-specific is new to me. Please clarify. There still needs to be some conversion from, say, int to 4 bytes.
@gmegan If we go with 1B+2B or even 1A+2B, I'm not sure where the need for the conditional check on the datatypes comes in.
@naveen-rn In MPICH, we have to check the datatype in the collective reduction on Mellanox IB network (either use hcoll or active message-based fallback). Moreover, as the network layer provides a different set of datatypes, we always need a switch to convert MPI_Datatype to the network type, even for just the basic datatypes. Below is the code we use in MPICH; the DTE_* types are defined by hcoll. A similar conversion is also needed for the op parameter.
switch (datatype) {
    case MPI_CHAR:
    case MPI_SIGNED_CHAR:
        return DTE_BYTE;
    case MPI_INT:
        return DTE_INT32;
    ...
    default:
        return UNDEFINED;
}
@gmegan We do not have data for collective reductions. Worse performance has been observed in the RMA routines in message-rate benchmarks because of the same generic-API issue (i.e., extra instructions are needed to interpret the MPI datatype and op parameters in the performance-critical issuing path).
I assume the performance issue in collectives might not be as significant as in RMA on current networks, where hardware collectives are not yet fast enough. But the trend we see is that network transfer speeds are increasing while CPU speeds are not, so eventually all small-message transfers will become CPU-bound. We are trying to minimize CPU instructions in MPI for this reason, but the if-else and switch instructions for handling the MPI dtype-as-argument and op-as-argument API are the part we cannot fully avoid.
@minsii Looking at the instruction set is one part - we need to keep in mind that we might end up increasing the size of the shared library builds - by unnecessarily duplicating the functions, just to handle the simple operation case. Consider the number of algorithms internally supported by the implementation, multiplied by the number of datatypes and operations. If we can reduce the duplication anywhere without hurting performance much, it should be acceptable.
@naveen-rn, I think there is only a concern if we have op-as-argument, where there would be overhead to test whether the op is supported. I don't think we have a case where the datatype would need a conditional test, since it is either in the routine prototype or part of a generic function call.
If I understand correctly, then Min's example problem in OpenSHMEM with 1A + 2B would look something like:
void shmem_team_TYPE_reduce_to_all(shmem_team_t team, TYPE *dst, const TYPE *src, size_t nreduce, shmem_op_t op)
{
    if (network_reduce_TYPE does not support <op>)
        use active-message
    else
        use network hardware reduce
}
In MPICH, we have to check the datatype in the collective reduction on Mellanox IB network (either use hcoll or active message-based fallback).
This is exactly what I was thinking: in either case, 1B+2B or 1A+2B, you won't be performing these conditional switch checks. It should be handled by C11 type conversion. All you would be doing is some kind of macro conversion, e.g., #define SHMEM_INT DTE_INT32.
@gmegan I might be wrong, so please correct me.
If I understand correctly, the issue here is just how to pass in the operations without the conditional check. Instead of an enum, let's just make the operations a void pointer with, say, a new type shmem_reduce_op_t, and let the implementation attach as much information as it needs to the new type.
In that case, we would end up with just one choice: whether we need 1B+2B or 1A+2B, or, simply put, whether all type-specific macros should be exposed to the users. I don't have any strong preference there.
@naveen-rn
Looking at the instruction set is one part - we need to keep in mind that we might end up increasing the size of the shared library builds - by unnecessarily duplicating the functions, just to handle the simple operation case.
I agree with this concern. But anyone who is concerned about binary size can always map the type-generic & op-named API to an internal generic routine, while anyone who wants the best performance can implement each API with separate code. Thus, I think the type-generic & op-named version is more flexible for runtime implementers.
Looking back over this discussion, it seems like the issue is that we want to avoid the case where we are supporting both type-named and op-named functions. In that case, say there are N types and M ops, we end up with M*N functions.
On the one hand, @minsii is advocating for type-generic plus op-named since this will make at most M functions and avoid testing overhead for having to test the op argument.
On the other hand, @naveen-rn is advocating for op-as-argument using a void pointer to encode functionality, possibly avoiding testing overhead, and doesn't mind whether we have type-generic or type-named functions.
My preference would tend to be to require type-named routines in addition to generics, just to have some older C interface to fall back to... But I don't have a very firm reason to argue for this.
Thinking about it, can't we avoid the test overhead for operator-as-argument using generics? We could disallow using the shmem_op_t type to create variables, so it could serve as a generic type argument. That is, you would NOT allow things like:
shmem_reduce_op_t op = SHMEM_REDUCE_ADD;
void shmem_team_int_reduce_to_all(team, dest, src, nreduce, op);
Instead, you would require:
void shmem_team_int_reduce_to_all(team, dest, src, nreduce, SHMEM_REDUCE_ADD)
where
typedef struct { int info; } SHMEM_REDUCE_ADD;
typedef struct { int info; } SHMEM_REDUCE_MAX;

#define shmem_int_reduce_to_all(T,D,S,N,O)               \
    ((_Generic((O),                                      \
        SHMEM_REDUCE_ADD: shmem_team_int_add_to_all,     \
        SHMEM_REDUCE_MAX: shmem_team_int_max_to_all))(T,D,S,N))

void shmem_team_int_add_to_all(shmem_team_t team, TYPE *dst, const TYPE *src, size_t nreduce)
{
    use active messages
}

void shmem_team_int_max_to_all(shmem_team_t team, TYPE *dst, const TYPE *src, size_t nreduce)
{
    use network operation
}
I have often done this kind of thing in C++, but have not tested yet in C11. I think this should work.
This would still lead to the problem where you end up with N*M functions internally. So, the implementation could choose not to do this and just use a test instead.
I think one important distinction is the difference between the number of user-facing functions in the API and the number of internal functions needed to implement the API. For example, a type-generic API that does proper type checking—unlike MPI_Send or an implementation of a type-generic shmem_put that just calls shmem_putmem, for example—will generally require distinct function symbols for each set of compatible types (e.g., shmem_int_foo would suffice for int and int32_t types on many platforms). So, a user-facing API that only provides a type-generic interface will generally require N functions (one for each type), which can still be less than the total number of named functions for a named interface (e.g., shmem_int_foo vs. shmem_int32_foo).
Megan: Looking back over this discussion, it seems like the issue is that we want to avoid the case where we are supporting both type-named and op-named functions.
My initial interest here was in reducing the overall API surface. My overall intent with generic APIs is to reduce the cognitive burden on a user in thinking about which function call to make. Type-generic APIs are a help (big or small), IMO.
Megan: My preference would tend to be to require type-named routines in addition to generics, just to have some older C interface to fall back to... But I don't have a very firm reason to argue for this.
That is the current style of the OpenSHMEM collectives, and it's not a terrible place to be in. If it's causing heartburn, we can always keep with the current style of the API. At some future point, maybe we'll reduce the overall API surface with generic-only calls.
The big problem, IMO, of not including the type-named functions is that the PSHMEM API becomes ambiguous or undefined. For example, if shmem_add_to_all gets mapped to some implementation-defined name for each type (e.g., shmem_internal_{int, long}_reduce_add vs. shmem_{int, long}_add_to_all), then the user doesn't know the appropriate shmem_-prefixed function to "overload" or the pshmem_-prefixed function to call.
Megan: I have often done this kind of thing in C++, but have not tested yet in C11. I think this should work.
Yes, your code is close, but not quite there for C11. (It's actually more like what one would do in C++.) C11 generic selection operates on an expression, not a type, but it's trivial to adapt your example to work.
The big problem, IMO, of not including the type-named functions is that the PSHMEM API becomes ambiguous or undefined. For example, if shmem_add_to_all gets mapped to some implementation-defined name for each type (e.g., shmem_internal_{int, long}_reduce_add vs. shmem_{int, long}_add_to_all), then the user doesn't know the appropriate shmem_-prefixed function to "overload" or the pshmem_-prefixed function to call.
This makes sense. Having fixed symbol names to overload is useful for pshmem, but also for things like library wrapper layers. Also, I believe profiling tools like TAU have an easier time if the functions have fixed names.
I wanted to see what the above generic example looks like when it is working, so I think I fixed it. This compiles and runs as expected for me: https://gitlab.com/gmegan/openshmem-examples/blob/master/shmem_team/shmem_generic_reduce.c
@gmegan, There's a small edit needed to make your generic selection code work. The controlling expression should be an expression, not a type. Here's one way to fix it:
#include <stdio.h>

extern void *shmem_c11_reduce_helper;

typedef struct { int info; } SHMEM_REDUCE_ADD;
typedef struct { int info; } SHMEM_REDUCE_MAX;

#define typename(type)                       \
    _Generic((type*)shmem_c11_reduce_helper, \
        SHMEM_REDUCE_ADD*: "add",            \
        SHMEM_REDUCE_MAX*: "max")

int main(void) {
    printf("%s %s\n",
           typename(SHMEM_REDUCE_ADD),
           typename(SHMEM_REDUCE_MAX));
    return 0;
}
I'm surprised that the compiler doesn't see the SHMEM_REDUCE_* types as being compatible and flag an error (C11 forbids this in _Generic expressions). But it appears to work with both clang and gcc. 🤷♂️
FWIW, I'm not completely comfortable with this API. I think users tend to expect names like SHMEM_REDUCE_ADD to be integer constants and might run into problems with the above definitions.
Also, we will need C functions to support the profiling interfaces, C99 users, and C++.
Closing this PR since issues raised here are resolved in other PRs.
This PR proposes typed collectives for broadcast, collect/fcollect, and alltoall/alltoalls, with corresponding (minimal) examples on SHMEM_TEAM_WORLD. It deprecates the 32/64-bit type constraints on these collectives. There are also a few minor fixes/edits thrown in that I found while reading. @gmegan, please review when you have the chance and let me know what you think. In particular, I imagine we may want examples using collectives over teams other than SHMEM_TEAM_WORLD, but I hesitate to make all these examples more sophisticated than necessary...