openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

Add topology support in UCC #13

Open manjugv opened 3 years ago

manjugv commented 3 years ago

Issue

Information about the node and network topology enables many optimizations, including topology-aware collectives, routing data over fast paths, and reducing the impact of congestion. The current UCC interfaces do not expose any topology information.

Potential solution

Add a topology abstraction to the library and team creation interfaces.

Note

This issue is a placeholder to capture the discussion, proposals, and details related to topology abstraction.

alex--m commented 3 years ago

In addition to the more specific issue #19, I'm sure every collective component would benefit from topology information.

For reference, here's what UCG expects from MPI today:


/**
 * @ingroup UCG_GROUP
 * @brief UCG group member distance.
 *
 * During group creation, the caller can pass information about the distance of
 * each other member of the group. This information may be used to select the
 * best logical topology for collective operations inside UCG.
 */
enum ucg_group_member_distance {
    UCG_GROUP_MEMBER_DISTANCE_SELF   = 0, /* This is the calling member */
    UCG_GROUP_MEMBER_DISTANCE_CACHE  = UCS_MASK(1), /* member shares cache memory */
    /* Reserved for in-socket proximity values */
    UCG_GROUP_MEMBER_DISTANCE_SOCKET = UCS_MASK(3), /* member is on the same socket */
    /* Reserved for in-host proximity values */
    UCG_GROUP_MEMBER_DISTANCE_HOST   = UCS_MASK(4), /* member is on the same host */
    /* Reserved for network proximity values */
    UCG_GROUP_MEMBER_DISTANCE_NET    = UCS_MASK(8) - 2, /* member is on the network */

    UCG_GROUP_MEMBER_DISTANCE_FAULT  = UCS_MASK(8) - 1,
    UCG_GROUP_MEMBER_DISTANCE_LAST   = UCS_MASK(8)
} UCS_S_PACKED;

/**
 * @ingroup UCG_GROUP
 * @brief Creation parameters for the UCG group.
 *
 * The structure defines the parameters that are used during the UCG group
 * @ref ucg_group_create "creation".
 */
typedef struct ucg_group_params {
...
    /*
     * This array contains information about the process placement of different
     * group members, which is used to select the best topology for collectives.
     *
     * For example, for 2 nodes, 3 sockets each, 4 cores per socket, each member
     * should be passed the distance array contents as follows:
     *   1st group member distance array:  011122222222333333333333
     *   2nd group member distance array:  101122222222333333333333
     *   3rd group member distance array:  110122222222333333333333
     *   4th group member distance array:  111022222222333333333333
     *   5th group member distance array:  222201112222333333333333
     *   6th group member distance array:  222210112222333333333333
     *   7th group member distance array:  222211012222333333333333
     *   8th group member distance array:  222211102222333333333333
     *    ...
     *   13th group member distance array: 333333333333011122222222
     *   14th group member distance array: 333333333333101122222222
     *    ...
     */
    enum ucg_group_member_distance *distance;
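
To make the layout in that comment concrete, here is a minimal sketch of how such an array could be generated for a regular machine. It assumes ranks are packed core by core, socket by socket, node by node. The `level` enum and `fill_distances` are hypothetical names for illustration only; the values 0 to 3 mirror the digits in the comment, not the real `ucg_group_member_distance` values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical proximity levels matching the digits in the comment above.
 * These are NOT the actual UCG enum values. */
enum level {
    LEVEL_SELF   = 0, /* the calling member */
    LEVEL_SOCKET = 1, /* same socket */
    LEVEL_HOST   = 2, /* same host, different socket */
    LEVEL_NET    = 3  /* different host */
};

/* Fill out[0 .. nodes*sockets_per_node*cores_per_socket - 1] with the
 * proximity of every member as seen from member `self`.
 * Assumption: ranks are numbered contiguously within a socket, and sockets
 * are numbered contiguously within a node. */
static void fill_distances(size_t self, size_t nodes, size_t sockets_per_node,
                           size_t cores_per_socket, enum level *out)
{
    size_t per_node = sockets_per_node * cores_per_socket;
    size_t total    = nodes * per_node;
    size_t peer;

    assert(out != NULL);
    assert(self < total);

    for (peer = 0; peer < total; peer++) {
        if (peer == self) {
            out[peer] = LEVEL_SELF;
        } else if (peer / cores_per_socket == self / cores_per_socket) {
            out[peer] = LEVEL_SOCKET; /* same global socket index */
        } else if (peer / per_node == self / per_node) {
            out[peer] = LEVEL_HOST;   /* same node, other socket */
        } else {
            out[peer] = LEVEL_NET;
        }
    }
}
```

With 2 nodes, 3 sockets per node and 4 cores per socket this yields 24 entries per member: for member 0, one `SELF`, three `SOCKET`, eight `HOST` and twelve `NET` entries.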

Open MPI's MCA coll component populates that array as follows:

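    /* Note: alloca() allocates on the stack and does not report failure by
     * returning NULL, so the check below is purely defensive. */
    args.distance                 = alloca(args.member_count *
                                           sizeof(*args.distance));
    if (args.distance == NULL) {
        COLL_UCX_ERROR("Failed to allocate memory for %lu local ranks", args.member_count);
        return OMPI_ERROR;
    }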

    /* Generate (temporary) rank-distance array */
    ucg_group_member_index_t rank_idx;
    for (rank_idx = 0; rank_idx < args.member_count; rank_idx++) {
        struct ompi_proc_t *rank_iter =
                (struct ompi_proc_t*)ompi_comm_peer_lookup(comm, rank_idx);
        rank_iter->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_COLL] = NULL;
        if (rank_idx == args.member_index) {
            args.distance[rank_idx] = UCG_GROUP_MEMBER_DISTANCE_SELF;
        } else if (OPAL_PROC_ON_LOCAL_SOCKET(rank_iter->super.proc_flags)) {
            args.distance[rank_idx] = UCG_GROUP_MEMBER_DISTANCE_SOCKET;
        } else if (OPAL_PROC_ON_LOCAL_HOST(rank_iter->super.proc_flags)) {
            args.distance[rank_idx] = UCG_GROUP_MEMBER_DISTANCE_HOST;
        } else {
            args.distance[rank_idx] = UCG_GROUP_MEMBER_DISTANCE_NET;
        }
    }
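
As a rough illustration of how a collective component might consume such an array, the sketch below picks a per-scope "leader" (the lowest-ranked member within a given distance) and counts the members in that scope, which is the kind of information a hierarchical collective needs. The `dist` enum, `local_leader` and `count_within` are hypothetical names, not part of UCG or UCC; the only property assumed from the real enum is that closer members have smaller values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified distance levels; only their ordering matters. */
enum dist { DIST_SELF = 0, DIST_SOCKET = 1, DIST_HOST = 2, DIST_NET = 3 };

/* Return the lowest index whose distance is at most `scope`.
 * The caller's own entry (DIST_SELF) always qualifies, so a well-formed
 * array always yields a leader; `count` is returned otherwise. */
static size_t local_leader(const enum dist *distance, size_t count,
                           enum dist scope)
{
    size_t i;

    assert(distance != NULL);
    for (i = 0; i < count; i++) {
        if (distance[i] <= scope) {
            return i;
        }
    }
    return count;
}

/* Count the members (including the caller) within `scope`. */
static size_t count_within(const enum dist *distance, size_t count,
                           enum dist scope)
{
    size_t i, n = 0;

    assert(distance != NULL);
    for (i = 0; i < count; i++) {
        if (distance[i] <= scope) {
            n++;
        }
    }
    return n;
}
```

Because every member on a host sees the same set of indices within `DIST_HOST`, they all arrive at the same host leader without any extra communication.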