Open rhc54 opened 2 years ago
Are all nodes assumed to be identical? Are all CPUs assumed to be equal? Is a hyperthread exposed as a CPU?
Is there any way to obtain structural information like num-CPUs-per-node? Or num-GPUs-per-node?
Is there any way to obtain performance information like num_mem_channels & bandwidth-per-mem-channel? Or network-type (IB, Ethernet), network-latency, and network-bandwidth?
Are all nodes assumed to be identical? Are all CPUs assumed to be equal?
Short answer is "no" - we need to define appropriate qualifiers to allow the user to specify the type of node/cpu being requested. This gets a little difficult to standardize, which is why we initially didn't do it - suggestions are welcome.
Is a hyperthread exposed as a CPU?
We seem to be missing an attribute to specify that option - thought we had one, but I don't see it. PRRTE translates its command line option into a PRRTE-directive, so we probably need to add a PMIx equivalent.
Is there any way to obtain structural information like num-CPUs-per-node? Or num-GPUs-per-node?
Yes, using PMIx_Query_info - we would need to define an attribute for those values.
Is there any way to obtain performance information like num_mem_channels & bandwidth-per-mem-channel?
We don't currently have attributes for those values - easy to define, probably more important to determine where/how one might get that info. We based our fabric attributes on what libfabric and HWLOC expose, but there may be other sources of information on the system. Someone would have to explore.
Or network-type (IB, Ethernet), network-latency, and network-bandwidth?
These we have covered with the PMIX_FABRIC_xxx attributes. You can get pretty detailed info on the fabric (e.g., vendor, type, data rate). I don't see latency on the list - I'm not sure I've ever seen that info provided in the OS info, so it might be hard to obtain.
Overview
PMIx includes an API (PMIx_Allocation_request) that allows one to request scheduling operations. This RFC initiates a discussion on how best to describe the resources involved in the requested operation.
Motivation
Dynamic programming models are gaining in popularity. These include workflows, machine learning, and MPI "sessions". One key element in the development of these models is their need to adjust allocations on-the-fly - either adding resources, removing resources, or extending the allocation time of existing resources. PMIx has provided a standardized interface by which those requests can be made, and at least a preliminary standardized way of describing the resources. As more groups begin considering these options, it might be good to discuss how well those descriptions meet their needs.
Discussion Items
Schedulers, programming models, and applications can view the system resources in a variety of ways. Some may choose to allocate resources at the node level, disallowing any sharing of nodes between users and applications. Others may work at a more atomistic level, allocating individual CPUs, GPUs, memory regions, and networking resources. PMIx needs to define a set of attributes and qualifiers that can span this range.
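As a rough illustration of the gap between these two views, the sketch below maps a CPU-level request onto whole nodes, as a scheduler that forbids node sharing would have to. The names node_grant_t and cpus_to_nodes are hypothetical and are not part of PMIx, and the sketch assumes every node has the same CPU count, which the comments on this issue point out is not true in general.

```c
#include <assert.h>

/* Hypothetical result of mapping a CPU-level request onto whole nodes. */
typedef struct {
    unsigned long nodes;        /* whole nodes the scheduler would grant */
    unsigned long unused_cpus;  /* CPUs granted beyond what was requested */
} node_grant_t;

/*
 * Round a CPU request up to whole nodes.
 * Assumption: homogeneous nodes, each with cpus_per_node CPUs.
 */
node_grant_t cpus_to_nodes(unsigned long ncpus, unsigned long cpus_per_node)
{
    node_grant_t grant;

    assert(cpus_per_node > 0);

    /* Ceiling division written so that it cannot overflow. */
    grant.nodes = ncpus / cpus_per_node + (ncpus % cpus_per_node != 0 ? 1 : 0);
    grant.unused_cpus = grant.nodes * cpus_per_node - ncpus;
    return grant;
}
```

For instance, a request for 100 CPUs on nodes with 48 CPUs each would come back as 3 nodes with 44 CPUs left idle.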
The pmix_info_t already allows a user to specify if the included attribute/directive is "required" or "if supported". Thus, a user could specify resources in whatever manner best fits their worldview, leaving the scheduler to interpret those as needed to fit its own heuristic. For example, a user could request a number of CPUs in an environment that does not support shared node usage, letting the scheduler translate that into a number of nodes (with the obvious caveat that this may result in some unused CPUs).
To get the discussion started, I'll begin with some simple and rather common values. The following are already defined in the Standard:
PMIX_ALLOC_REQ_ID: User-provided string identifier for this allocation request which can later be used to query status of the request.
PMIX_ALLOC_ID: A string identifier (provided by the host environment) for the resulting allocation which can later be used to reference the allocated resources in, for example, a call to PMIx_Spawn
PMIX_ALLOC_NUM_NODES: number of nodes
PMIX_ALLOC_NODE_LIST: regex of specific nodes
PMIX_ALLOC_NUM_CPUS: number of cpus (could be used as a qualifier to a given node, or as an overall limit)
PMIX_ALLOC_NUM_CPU_LIST: regex of num cpus for each node
PMIX_ALLOC_CPU_LIST: regex of specific cpus indicating the cpus involved on each node
PMIX_ALLOC_MEM_SIZE: number of Mbytes
PMIX_ALLOC_TIME: time in seconds that the allocation shall remain valid
PMIX_ALLOC_QUEUE: name of scheduler queue being referenced
In addition, there are several attributes associated with fabric resources (won't include them here for brevity). We also have the following definitions that can be "reused" here (or define a PMIX_ALLOC_xxx version of them):
PMIX_NUM_SLOTS: specify a number of computing slots (independent of num cpus)
PMIX_HOST: a comma-delimited list of hosts
PMIX_HOSTFILE: file containing names of hosts
Since those were defined, PMIx has been extended to include storage attributes - so we may want to add corresponding allocation attributes for storage-related values.
Are there any thoughts on missing attributes from this list?
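To make the "required" versus "if supported" distinction concrete, here is a toy model of how a scheduler might screen a request. It deliberately does not use pmix.h: toy_directive_t and first_unmet_requirement are hypothetical stand-ins, every value is simplified to an integer, and the attribute names appear only as strings. A real host environment would likely apply its own policy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for one directive; NOT the real pmix_info_t layout. */
typedef struct {
    const char *key;      /* attribute name, e.g. "PMIX_ALLOC_NUM_CPUS" */
    unsigned long value;  /* simplification: all values are integers here */
    bool required;        /* true = "required", false = "if supported" */
} toy_directive_t;

/* Return true if key appears in the scheduler's list of supported names. */
static bool is_supported(const char *key,
                         const char *const supported[], size_t nsupported)
{
    for (size_t i = 0; i < nsupported; i++) {
        if (strcmp(key, supported[i]) == 0) {
            return true;
        }
    }
    return false;
}

/*
 * Return the index of the first required directive the scheduler cannot
 * honor, or -1 if the request is acceptable. Unsupported directives that
 * are only "if supported" are skipped rather than treated as errors.
 */
int first_unmet_requirement(const toy_directive_t *req, size_t nreq,
                            const char *const supported[], size_t nsupported)
{
    for (size_t i = 0; i < nreq; i++) {
        if (req[i].required &&
            !is_supported(req[i].key, supported, nsupported)) {
            return (int)i;
        }
    }
    return -1;
}
```

Under this model, a scheduler that only understands node counts and time limits would accept a request that marks a memory size as "if supported", and would reject the same request if the memory size were marked "required".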