Using group-level rank in PMIx operations other than PMIx_Fence

darbyShaw commented 2 years ago


Suggested Clarification

The chapter on process sets and groups mentions that the PMIx server creates a tracker that assigns a new group rank based on the relative position in the array of processes. If I create a group of two namespaces, would it be possible to use the group-level rank of a process to query PMIx info? For example, suppose two namespaces have two processes each. If process 1 in namespace 1 wants to know the NODE_LOCAL_RANK of process 2 in namespace 2, would it be possible to query it using PMIx_Get by specifying, for the target process, the group-level rank 4 and the group ID as the namespace? Also, would PMIX_JOB_SIZE return the size of the group if the group ID is used as the namespace? I saw that this doesn't work yet, but is it in the plan to implement group operations in that manner? Or might groups and namespaces not work together, so that qualifiers are needed to store group information instead, as in examples/group_lcl_cid.c?

References

Chapter 13, Section 13.2.1

rhc54 commented 2 years ago

Yes, it is supposed to work that way. Hope to implement it later this year. Could be done sooner if there is a driver for it.

darbyShaw commented 2 years ago

Is this open to contribution?

rhc54 commented 2 years ago

Absolutely! Contributions are much appreciated 😄

darbyShaw commented 2 years ago

I would like to share my contribution, which adds the ability to query attributes for process groups. My changes are here: https://github.com/openpmix/prrte/compare/master...darbyShaw:prrte:master. Is it OK to open a pull request?

rhc54 commented 2 years ago

Sure - please do!

darbyShaw commented 2 years ago

Hello again,

I have identified a few areas in the groups implementation that I would like to clarify. There is some confusion with respect to the group context id, the group name and the group signature.

To proxy PMIx_Get queries for group members, I use the group name in the namespace field of the proc argument at the PMIx server. The server looks up the group by that name in the pmix_server_globals.groups list and maps the request to the actual namespace and rank of the proc. (I wrote a few lines to enable this behaviour.)
As written in the standard, the group name is unique only across the host environment. To inform all host environments of created groups, just as all daemons are informed of new job namespaces, I broadcast the constructed groups to all daemons. However, I do not include the group name in this broadcast, since there might be a conflicting name in some host environment. We can instead identify groups either by the context ID generated by the host node process in the resource manager or by the group signature.
This means that all group queries should include the context ID instead of the group name. To check whether a group already exists during group construction, one must use the group signature, with a requirement on ordering to make the comparison easier.
In that case, the group name doesn't seem to have any benefit.
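To make the lookup described above concrete, here is a minimal sketch, not the actual change in my branch. It assumes the group-level rank is a zero-based index into the ordered member array; all `sketch_` names are hypothetical stand-ins, not PMIx or PRRTE types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_NSLEN 255 /* assumed limit, for illustration only */

/* Simplified stand-in for a process identifier: namespace string + rank. */
typedef struct {
    char nspace[SKETCH_MAX_NSLEN + 1];
    uint32_t rank;
} sketch_proc_t;

/* Hypothetical group record: the group name and its ordered members. */
typedef struct {
    const char *name;
    const sketch_proc_t *members;
    size_t nmembers;
} sketch_group_t;

/* Find the group by name, then treat the group-level rank as a zero-based
 * index into the member array. Returns 0 and fills *out on success, or -1
 * if the group is unknown or the rank is out of range. */
static int sketch_resolve(const sketch_group_t *groups, size_t ngroups,
                          const char *grp_name, uint32_t grp_rank,
                          sketch_proc_t *out)
{
    assert(groups != NULL || ngroups == 0);
    for (size_t i = 0; i < ngroups; i++) {
        if (0 != strcmp(groups[i].name, grp_name)) {
            continue;
        }
        if (grp_rank >= groups[i].nmembers) {
            return -1;
        }
        *out = groups[i].members[grp_rank];
        return 0;
    }
    return -1;
}
```

With members ordered as (nsA, 0), (nsA, 1), (nsB, 0), (nsB, 1), group-level rank 3 resolves to rank 1 in nsB under this zero-based assumption.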

The confusion arises because I would like to be able to form groups of groups and also to query, with PMIx_Get, groups that were created in other host environments. Shouldn't group context IDs, rather than group names, be used in group-related queries, which would make group names superfluous?

Thank you very much.

rhc54 commented 2 years ago

There are several elements in your post that require clarification. However, first let me say that I appreciate your sharing of this use-case! The group concept was motivated by a desire for async formation of loosely-coupled procs, and it is good to see someone extending that concept.

So let's start at the bottom and work our way up the hierarchy. The group "name" was created to provide the host environment (and we'll define that better at the end) with a unique tag for the requested operation. When a process calls PMIx_Group_construct, it instantiates a collective operation across the participating procs that may span multiple nodes. The PMIx client library passes that operation to its local server, which (if the participants span multiple nodes) must then pass it to the host environment for execution.

The problem here is that many such operations can occur at the same time, even involving the same participants listed in the same order - so how do the various host environment daemons know how to correlate the requests? In other words, if daemon B receives a collective request for "group construct", how does it know which of the local "group construct" operations corresponds to it? The group "name" is the solution to that problem - all procs must pass the same string to the API, and the host environment can then use that string to provide the correlation.
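As a rough illustration of that correlation problem, consider the sketch below. It is only a model: the structure, the flattened participant string, and the `sketch_` function names are hypothetical and do not reflect how any particular host environment stores its pending operations.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of a locally pending "group construct" operation. */
typedef struct {
    const char *grp_name;     /* string every participant passed to the API */
    const char *participants; /* flattened participant list, illustration only */
    int local_id;             /* handle of the local operation */
} sketch_pending_op_t;

/* Matching on participants alone: returns how many pending operations
 * have the same participant list. A result above 1 means ambiguity. */
static size_t sketch_count_by_participants(const sketch_pending_op_t *ops,
                                           size_t nops,
                                           const char *participants)
{
    size_t hits = 0;
    for (size_t i = 0; i < nops; i++) {
        if (0 == strcmp(ops[i].participants, participants)) {
            hits++;
        }
    }
    return hits;
}

/* Matching on the group name: returns the local_id of the operation
 * carrying that name, or -1 if none is pending. */
static int sketch_match_by_name(const sketch_pending_op_t *ops, size_t nops,
                                const char *grp_name)
{
    assert(grp_name != NULL);
    for (size_t i = 0; i < nops; i++) {
        if (0 == strcmp(ops[i].grp_name, grp_name)) {
            return ops[i].local_id;
        }
    }
    return -1;
}
```

Two concurrent operations over the same ordered participants are indistinguishable to the first function, while the second picks out exactly one as long as the names differ.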

Why a string? Because the signatures of APIs such as PMIx_Get involve a pmix_proc_t identifier, and that identifier includes a pmix_nspace_t field - which is effectively a string. Thus, we can reuse the group "name" as a pmix_nspace_t to reference procs within the group, which is sometimes a handy thing to do. We cannot use any other format for the group "name" (e.g., an integer) as there is no way to match that to the API signatures.
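A small sketch of that reuse, using a simplified stand-in for the identifier rather than the real pmix_proc_t definition. The struct, the length limit, and `sketch_load_group_proc` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_NSLEN 255 /* assumed limit, for illustration only */

/* Simplified stand-in for a proc identifier: a fixed-size namespace
 * string plus an integer rank. */
typedef struct {
    char nspace[SKETCH_MAX_NSLEN + 1];
    uint32_t rank;
} sketch_proc_t;

/* Place the group name in the namespace field and the group-level rank
 * in the rank field. Returns 0 on success, -1 if the name does not fit. */
static int sketch_load_group_proc(sketch_proc_t *proc, const char *grp_name,
                                  uint32_t grp_rank)
{
    assert(proc != NULL && grp_name != NULL);
    size_t len = strlen(grp_name);
    if (len > SKETCH_MAX_NSLEN) {
        return -1;
    }
    memcpy(proc->nspace, grp_name, len + 1); /* copies the terminator too */
    proc->rank = grp_rank;
    return 0;
}
```

Because the namespace field is a string, a group name fits it without changing the shape of the identifier, which an integer ID would not.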

At a later point in time, someone asked if the host environment could return a globally unique integer "context ID" when completing "group construct". Note that this context ID has no correlation to the group! In fact, a group can (and sometimes does in certain scenarios) request multiple context IDs over the course of its existence. The context ID is nothing more than a guaranteed globally unique integer that can be used for whatever purpose the requester desires. In the original use-case, the ID was assigned as an identifier to an MPI communicator.

Note that replacing the string "group name" with an integer "context ID" doesn't help your scenario. You would still have to guarantee that the integer was unique across the entire operational space (your multiple host environments), which would require a collective operation across that space. Effectively, you are simply shifting the burden of generating the group-unique identifier (whether string or int) from the app to the host environment, which is contrary to what we are trying to accomplish.

Which brings us then to the definition of "host environment". I left this intentionally vague as the scope of that environment depends upon several factors - it is probably easiest to explain via example. Let's say we start our application using mpirun (or its equivalent) and that the mpirun we use operates in isolation - i.e., it has no connections to other mpirun instances and its only interaction with the local resource manager is to read its resource allocation from the environment. This is the typical mode of operation for mpirun today.

In this case, the "host environment" consists solely of mpirun itself as this is the only source of RM-like capabilities available to the application. All PMIx operations run completely within this scope.

If someone adds the ability for multiple mpirun instances to interoperate, then the "host environment" grows to encompass that collection as PMIx operations can now span the various mpirun instances. The requirement for group ID uniqueness expands along with it. Admittedly, this creates more of a challenge to the application, but there are methods one can devise for dealing with it - e.g., I recently provided some for the OMPI community.

The scope of the "host environment" can therefore best be defined as the scope of PMIx operational support. If PMIx operations can cross between mpirun instances, then the "host environment" must include those instances. This can be extended as far upward as you like (e.g., a collection of clusters whose RMs interoperate).

The full definition of "host environment" is broader than what I presented above, but hopefully the above helps reduce the confusion a bit. In terms of the group "name", the burden on the application lies in generating a "name" that is unique across the expected operational scope. If people have difficulty doing so, we could (I suppose) provide an optional utility for that purpose.
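Purely as a sketch of what such a utility might look like (nothing like this exists in PMIx as far as this thread goes; the function name and the name layout are hypothetical), one participant could build a name from values it knows to be distinct within the operational scope and then share that string with the other participants, since all of them must pass the same name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Build "<prefix>-<host>-<pid>-<seq>" into buf. Uniqueness is only as good
 * as the assumption that (host, pid, seq) never repeats across the scope
 * in which the group operates. Returns 0 on success, -1 if buf is too
 * small or formatting fails. */
static int sketch_make_group_name(char *buf, size_t len, const char *prefix,
                                  const char *host, long pid, unsigned seq)
{
    assert(buf != NULL && prefix != NULL && host != NULL);
    int n = snprintf(buf, len, "%s-%s-%ld-%u", prefix, host, pid, seq);
    if (n < 0 || (size_t) n >= len) {
        return -1;
    }
    return 0;
}
```

The caller would pass its own hostname and process ID plus a per-process counter. If several mpirun instances interoperate, the host string would have to be distinct across all of them for the assumption to hold.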

pmix / pmix-standard

Using group-level rank in PMIx operations other than PMIx_Fence #423
