pmix / pmix-standard

PMIx Standard Document
https://pmix.org

Defining allocation resource types #386

Open rhc54 opened 2 years ago

rhc54 commented 2 years ago

Overview

PMIx includes an API (PMIx_Allocation_request) that allows one to request scheduling operations. This RFC initiates a discussion on how best to describe the resources involved in the requested operation.

Motivation

Dynamic programming models are gaining in popularity. These include workflows, machine learning, and MPI "sessions". One key element in the development of these models is their need to adjust allocations on the fly: adding resources, removing resources, or extending the allocation time of existing resources. PMIx provides a standardized interface through which those requests can be made, and at least a preliminary standardized way of describing the resources. As more groups begin considering these options, it would be good to discuss how well those descriptions meet their needs.

Discussion Items

Schedulers, programming models, and applications can view the system resources in a variety of ways. Some may choose to allocate resources at the node level, disallowing any sharing of nodes between users and applications. Others may work at a more atomistic level, allocating individual CPUs, GPUs, memory regions, and networking resources. PMIx needs to define a set of attributes and qualifiers that can span this range.

The pmix_info_t structure already allows a user to mark the included attribute/directive as "required" or "if supported". Thus, a user could specify resources in whatever manner best fits their worldview, leaving the scheduler to interpret those as needed to fit its own heuristics. For example, a user could request a number of CPUs in an environment that does not support shared node usage, letting the scheduler translate that into a number of nodes (with the obvious caveat that this may result in some unused CPUs).
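To make the CPU-to-node example concrete, here is a minimal sketch of the translation a node-exclusive scheduler might perform. It is not PMIx code: `node_grant_t` and `cpus_to_whole_nodes` are hypothetical names invented for illustration, and the sketch assumes every node has the same number of CPUs, which a real system need not guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type (not part of PMIx): what a node-exclusive
 * scheduler would hand back for a CPU-count request. */
typedef struct {
    uint64_t num_nodes;    /* whole nodes allocated */
    uint64_t unused_cpus;  /* CPUs on those nodes beyond what was requested */
} node_grant_t;

/* Round a CPU-count request up to whole nodes.
 * Assumption: all nodes are identical and expose cpus_per_node CPUs. */
node_grant_t cpus_to_whole_nodes(uint64_t num_cpus, uint64_t cpus_per_node)
{
    node_grant_t grant;

    assert(cpus_per_node > 0);

    /* Ceiling division without risking overflow on num_cpus + cpus_per_node. */
    grant.num_nodes = num_cpus / cpus_per_node
                      + ((num_cpus % cpus_per_node != 0) ? 1 : 0);

    /* Whatever the last node has left over sits idle. */
    grant.unused_cpus = grant.num_nodes * cpus_per_node - num_cpus;
    return grant;
}
```

Under these assumptions, a request for 100 CPUs on 32-CPU nodes would become 4 nodes with 28 CPUs left unused, which is the caveat mentioned above.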

To get this started, I'll begin with some simple and fairly common values. The following are already defined in the Standard:

In addition, there are several attributes associated with fabric resources (omitted here for brevity). We also have the following definitions that could be "reused" here (or we could define a PMIX_ALLOC_xxx version of them):

Since those were defined, PMIx has been extended to include storage attributes, so we may want to add corresponding allocation attributes for storage-related values.

Are there any thoughts on attributes missing from this list?

Wee-Free-Scot commented 2 years ago

Are all nodes assumed to be identical? Are all CPUs assumed to be equal? Is a hyperthread exposed as a CPU?

Is there any way to obtain structural information like num-CPUs-per-node? Or num-GPUs-per-node?

Is there any way to obtain performance information like num_mem_channels & bandwidth-per-mem-channel? Or network-type (IB, Ethernet), network-latency, and network-bandwidth?

rhc54 commented 2 years ago

> Are all nodes assumed to be identical? Are all CPUs assumed to be equal?

The short answer is "no": we need to define appropriate qualifiers to allow the user to specify the type of node/CPU being requested. This gets a little difficult to standardize, which is why we didn't do it initially. Suggestions are welcome.

> Is a hyperthread exposed as a CPU?

We seem to be missing an attribute to specify that option. I thought we had one, but I don't see it. PRRTE translates its command line option into a PRRTE directive, so we probably need to add a PMIx equivalent.

> Is there any way to obtain structural information like num-CPUs-per-node? Or num-GPUs-per-node?

Yes, using PMIx_Query_info, though we would need to define attributes for those values.

> Is there any way to obtain performance information like num_mem_channels & bandwidth-per-mem-channel?

We don't currently have attributes for those values. They are easy to define; it is probably more important to determine where and how one might get that info. We based our fabric attributes on what libfabric and HWLOC expose, but there may be other sources of information on the system. Someone would have to explore.

> Or network-type (IB, Ethernet), network-latency, and network-bandwidth?

These are covered by the PMIX_FABRIC_xxx attributes. You can get pretty detailed info on the fabric (e.g., vendor, type, data rate). I don't see latency on the list. I'm not sure I've ever seen that info provided in the OS info, so it might be hard to obtain.