pmix / pmix-standard

PMIx Standard Document
https://pmix.org

Instant on and heterogeneous nodes #374

Open jaidayal opened 2 years ago

jaidayal commented 2 years ago

Overview

We would like to extend the return value of the PMIx_Get call with the PMIX_FABRIC_ENDPT parameter to provide information about the NUMA domain to which a NIC is attached. The current call returns the total number of NICs in the node, in some cases sorted based on the affinity of the rank to its socket and the NICs in that socket. While that is sufficient for homogeneous systems where each node/socket has the same number of NICs, it is not enough for heterogeneous systems where a node/socket can have a different number of NICs, for example because one of them failed.

Example: on a system with multiple NICs, when a rank is pinned to socket 0 as in Figure 1, PMIx_Get() will return NIC0, NIC1, NIC2, and NIC3 (other orders such as NIC1, NIC0, NIC2, NIC3 are also possible, since NIC0 and NIC1 are at the same distance from socket 0). For systems where all the nodes are homogeneous and have the same number of NICs, this is enough. However, for nodes with different numbers of NICs, it is not.

Assume that NIC0 in this example failed. Then, when the rank in the above example calls PMIx_Get(), the returned values are NIC1, NIC2, and NIC3. This information by itself is insufficient to know whether socket 0 (where the rank is pinned) has one or two NICs.

[Figure 1]

Motivation

We are looking for a solution for the cases where NICs on a node fail.

Discussion Items

How should heterogeneous nodes be dealt with for instant on?

rhc54 commented 2 years ago

Hmmm...there are several confused things here, so let's try to sort them out.

NICs are not associated with a NUMA domain - they haven't been for quite some time. So let's instead just focus on distances.

What you are really asking is "which NICs are closest to the package where rank N is executing". PMIx_Get returns the relative distance of each NIC from the location where the specified rank is pinned. This is computed in an arbitrary manner, but the bottom line is that those NICs attached to the PCIe bus of the package where rank N is executing will all have the same distance (typically reported as "1"), while the distance to all NICs attached to other packages will be some greater value.

This has nothing to do with homogeneity - the definition remains the same whether or not every node is identical. In the case where every node has the same number of NICs (and all NICs are alive at startup), the reported inventory will be the same, and so the library will simply compute the relative distances for all procs based on the same relative NIC locations.

If a NIC is not alive at startup, then the inventory for that node will be different. The library then sees the different inventory and uses the inventory for that node to compute relative distances on that node. No problem there.

So it sounds like your question really is: what to do when a NIC fails during operation? This becomes more of an issue for the RTE/RM as the PMIx library has no way to know that a NIC failed somewhere in the allocation. It falls upon the RTE/RM to notify the library of the failure, stating precisely which NIC(s) failed. The library would then have to update the relative NIC distances for any procs on that node, and generate an appropriate event so its local procs could know about the change and take any required action (which is decidedly non-trivial due to the issue of in-flight and potentially lost messages). This last part (updating and generating the event) is not in the current implementation.

Bottom line: I don't see anything here that requires a change to the Standard - this is purely an implementation issue. The current implementation handles this just fine for the "static" case (i.e., where NICs are either alive or dead at startup), but not for the "dynamic" case where a NIC fails during operation. Doing the latter is simple for the PMIx library, but non-trivial for the application - typically, MPI applications simply terminate in such a situation. Regardless, we can discuss it over in the implementation.

RaymondMichael commented 2 years ago

@rhc54 in your third paragraph you say that PMIx returns the relative distance. What field is that?

rhc54 commented 2 years ago

You combine the info from two keys to obtain the full picture.

PMIX_FABRIC_ENDPT provides an array of endpoints:

typedef struct pmix_endpoint {
    char *uuid;
    char *osname;
    pmix_byte_object_t endpt;
} pmix_endpoint_t;

You then combine that with PMIX_DEVICE_DISTANCES:

typedef struct pmix_device_distance {
    char *uuid;
    char *osname;
    pmix_device_type_t type;
    uint16_t mindist;
    uint16_t maxdist;
} pmix_device_distance_t;

The min and max distances are relative values - i.e., they aren't intended to be absolute measures, but instead indicate that "this device is twice as far from you as the other device". We provide two values because threads within the same rank can be pinned to different locations - in practice, that doesn't happen very often and so the two values are usually the same.

jaidayal commented 2 years ago

So effectively, for NICs that can't be reached, the distance to that NIC should be something like -1.

rhc54 commented 2 years ago

If a NIC has failed then it won't be included in the response as it effectively does not exist (i.e., it won't appear in the inventory, or will appear but marked as "down" and therefore excluded from consideration).

jaidayal commented 2 years ago

@RaymondMichael I think effectively, the "inventory" (in the network service or WLM) then has to track if a NIC is down for instant on to work for these cases.

rhc54 commented 2 years ago

> I think effectively, the "inventory" (in the network service or WLM) then has to track if a NIC is down for instant on to work for these cases.

This may not be accurate. For example, in OpenPMIx we have the ability for the host to request that we collect the local inventory. We then have plugins for the various fabric and gpu vendors - each of those checks the local environment to detect the available resources. These are then reported to the host for use.

PRRTE, for instance, uses this at DVM startup to collect the available inventory across the allocation. Note that resources might be "up", but not assigned for use by this allocation (I know that doesn't pertain to your use-case, but not everybody operates that way). Each plugin does this for its corresponding vendor and resource type (e.g., Mellanox NICs, NVIDIA GPUs). The DVM then uses this information to make its resource/endpt assignments and to compute device distances for each process as the process is mapped to a location.

We don't access the network service or the scheduler for this purpose for scalability reasons. PRRTE has to establish the DVM anyway, and that requires a "bootstrap" communication from the daemon on each node back to the DVM controller so the controller knows that the daemon is alive/ready. Easy enough to simply piggyback the inventory on that message. When PRRTE operates in single-shot mode as OMPI's "mpirun", mpirun itself acts as the DVM controller, so OMPI regularly performs this operation (i.e., we know this works!).

Note that PMIx does not include support for Slingshot. HPE has not participated in PMIx so far, and frankly we haven't had any requests for Slingshot support (outside of you 😄 ). We have been distracted for a while with other matters, but should get back to completing support in this regard early next year. If someone wants to follow what we do for the other fabrics, it would be simple enough for them to add Slingshot to the mix.

So it isn't really accurate to say that the network service or scheduler needs to track and report on the NIC's status for instant on to work. I'm sure they do for their own purposes - but PMIx+PRRTE is an example of how to independently accomplish that task.