Open SteVwonder opened 4 years ago
We had a pretty long discussion of this topic in our working group this week (IAWG). I'll try to summarize some of the key points. One area of confusion is the point between when the PMIx system invokes the command line provided through PMIx_Spawn or via a command line tool (mpirun, srun, jsrun, etc.) and the point at which PMIx_Init is called. For example, there is an attribute named PMIX_PROC_PID which can be queried to find the PID of another rank using PMIx_Get. This attribute makes some immediate assumptions, such that PMIx "processes" are OS processes and that there is an OS identifier that can be reported. Let's ignore those issues for now, since we can assume this attribute might be optional or can be in that "not-supported" class, and just consider a typical Linux system.
Suppose 2 ranks are created and some process within the rank 0 tree calls PMIx_Init followed by PMIx_Get of PMIX_PROC_PID on rank 1. What happens if no process in the 2nd rank tree has called PMIx_Init, what value should this return? What if a process has called PMIx_Init, what value should this return? Is there a requirement on what calls PMIx_Init? The answer could be that it must be the "thing" the PMIx system invoked (i.e., you can't have wrapper scripts, etc.) or, at the other extreme, that any process on the system (perhaps even on a different node) can take the role of the client. Both of these seem quite extreme to me personally, but that leaves us in the predicament of trying to define a middle ground.
Another interesting question that comes to light with MPI Sessions is the mapping of ranks to OS processes. Can a single OS process take on the role of multiple PMIx ranks? Can multiple OS processes take on the role of a single PMIx rank? The PMIx_Spawn interface tends to favor a view in which you specify an application (a command line) and how many copies of that application to invoke, and it expects that many ranks to be created.
The interface is flexible enough, through the use of the PMIx_Info attributes, to be expandable to launch more or fewer "processes" than the number of ranks to be created. I do not believe the PMIx standard dictates how a process calling PMIx_Init is mapped to a particular rank within a namespace. In practice, I believe this is done through an environment variable placed in the client's environment by the PMIx system that created it.
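As a rough illustration of the environment-variable mapping described above, here is a minimal C sketch. The helper `rank_from_env` is hypothetical and not part of the PMIx API, and the variable name is passed in by the caller because the thread does not name it (OpenPMIx is understood to export `PMIX_RANK`, but treat that as an assumption).

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv() in the usage shown below */
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, NOT part of the PMIx API: read the rank that the
 * launching environment is assumed to have exported in the environment
 * variable `var`. Returns 0 and stores the rank on success, or -1 if the
 * variable is unset or does not hold a plain non-negative integer. */
static int rank_from_env(const char *var, uint32_t *rank)
{
    assert(var != NULL && rank != NULL);

    const char *s = getenv(var);
    if (s == NULL || !isdigit((unsigned char)s[0])) {
        return -1;  /* unset, empty, signed, or otherwise not a rank */
    }

    errno = 0;
    char *end = NULL;
    unsigned long v = strtoul(s, &end, 10);
    if (errno != 0 || *end != '\0' || v > UINT32_MAX) {
        return -1;  /* overflow or trailing garbage */
    }

    *rank = (uint32_t)v;
    return 0;
}
```

For example, after `setenv("PMIX_RANK", "3", 1)`, calling `rank_from_env("PMIX_RANK", &r)` returns 0 and sets `r` to 3. Because the value lives in the environment, any child that inherits the environment would see the same rank.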
Personally, I tend to be of the opinion that we should 1) be fairly narrow in what must be supported, to ease implementation and not discourage implementations based on difficult to implement use cases that are not actually being used; 2) not burn any bridges, but leave open a clear path to support for more complex use cases than a simple 1-to-1 mapping of ranks to the "processes" created by PMIx_Spawn, possibly leaving it up to implementations what they are willing to permit; 3) specifically call out other possible use cases if and when a clear demand for those mechanisms presents itself. For example, documenting a "PMIX" attribute to allow the number of ranks to differ from the number of processes launched by PMIx_Spawn().
> PMIx system invokes the command line provided through PMIx_Spawn
Just to be clear, PMIx never invokes that command line - it only passes it along to the host environment. PMIx isn't a starter or launcher.
> This attribute makes some immediate assumptions, such that PMIx "processes" are OS processes and that there is an OS identifier that can be reported
No - it only provides an attribute by which one can query such an ID, IF one exists. If that particular system doesn't have PIDs, then you'll get a "not found" answer. This is true of every attribute that can return a value.
> What happens if no process in the 2nd rank tree has called PMIx_Init, what value should this return? What if a process has called PMIx_Init, what value should this return? Is there a requirement on what calls PMIx_Init?
Sigh. You don't need to call PMIx_Init to be assigned a PID, and the PMIx server isn't tracking PIDs. It simply passes your request to the host environment, which is responsible for retrieving and returning the PID for the process it defined as "rank 2". This will be the process it started, NOT any child process that was fork/exec'd by the original "rank 2" process, as the host environment doesn't know those child process(es) exist.
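The lookup semantics described above can be pictured with a toy model. This is only a sketch: `host_proc`, `host_lookup_pid`, and the status codes are hypothetical names invented for illustration and do not correspond to any PMIx or host-environment API. The point it shows is that the answer comes from the host's own record of what it started, independent of whether that process ever called PMIx_Init, and that "not found" is a normal outcome.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical status codes for this toy model only. */
enum lookup_status { LOOKUP_OK = 0, LOOKUP_NOT_FOUND = -1 };

/* One entry per process the (hypothetical) host environment started. */
struct host_proc {
    uint32_t rank;    /* rank the host assigned to this process */
    int      has_pid; /* 0 if this system has no PID to report */
    pid_t    pid;     /* PID of the process the host itself started */
};

/* Resolve a rank to the PID the host recorded at launch time. Children
 * later fork/exec'd by that process are unknown to the host, so they
 * never appear in the table. */
static enum lookup_status host_lookup_pid(const struct host_proc *table,
                                          size_t n, uint32_t rank,
                                          pid_t *pid)
{
    assert(pid != NULL);
    assert(table != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (table[i].rank == rank) {
            if (!table[i].has_pid) {
                return LOOKUP_NOT_FOUND; /* no such identifier exists */
            }
            *pid = table[i].pid;
            return LOOKUP_OK;
        }
    }
    return LOOKUP_NOT_FOUND;             /* unknown rank */
}
```

With a table such as `{ {0, 1, 4242}, {1, 0, 0} }`, a lookup of rank 0 yields PID 4242, while rank 1 (no PID available) and any unknown rank both yield the "not found" status.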
> difficult to implement use cases that are not actually being used
I take exception to that - the use case of clients fork/exec'ing children is in use today. Just not by MPI folks. Please don't assume the MPI folks in this "forum" are familiar with what everyone out there is doing.
You folks are WAY overthinking this!! Maybe you should just let this be for now? The people using the rest of the Standard will clarify it for you over time thru their usage. Might be easier to come back and derive tight definitions after a few years.
Adding this reference: https://github.com/pmix/pmix-standard/pull/235#discussion_r475842903
> Just to be clear, PMIx never invokes that command line - it only passes it along to the host environment. PMIx isn't a starter or launcher.
Agreed. I should have been more careful in my wording there.
> No - it only provides an attribute by which one can query such an ID, IF one exists. If that particular system doesn't have PIDs, then you'll get a "not found" answer. This is true of every attribute that can return a value.
Good point. I did mention this during our last meeting that this is probably a case of "provided if possible".
> Sigh. You don't need to call PMIx_Init to be assigned a PID, and the PMIx server isn't tracking PIDs. It simply passes your request to the host environment, which is responsible for retrieving and returning the PID for the process it defined as "rank 2". This will be the process it started, NOT any child process that was fork/exec'd by the original "rank 2" process, as the host environment doesn't know those child process(es) exist.
Your explanation is in line with how we are defining process. I proposed a definition of "whatever the underlying system started in response to the PMIx_Spawn call or other CL tool which presents the functionality of PMIx_Spawn" (not those words exactly, but that idea). So PMIX_PROC_PID is whatever the underlying system wants to use to identify that thing.
> > difficult to implement use cases that are not actually being used
> I take exception to that - the use case of clients fork/exec'ing children is in use today. Just not by MPI folks. Please don't assume the MPI folks in this "forum" are familiar with what everyone out there is doing.
I was referring to the cases of multiple processes all claiming to be the same rank or a single process taking on the role of multiple ranks. Are either of those situations in use today?
> You folks are WAY overthinking this!! Maybe you should just let this be for now? The people using the rest of the Standard will clarify it for you over time thru their usage. Might be easier to come back and derive tight definitions after a few years.
I agree. That is why I am proposing we do a pretty simple (maybe even vague) explanation now and tighten it up in time as the need arises.
> I was referring to the cases of multiple processes all claiming to be the same rank or a single process taking on the role of multiple ranks. Are either of those situations in use today?
The former is - I don't believe the latter is permitted as the host is required to assign a unique rank to each process. There is no mechanism by which a process can assume multiple ranks.
Is there a project or example code that does this that we can track to better understand how folks are using this capability? I'd like to keep track of it to inform this discussion.
One current usage relates to a "swarm intelligence" model that has to operate in an environment that does not support PMIx_Spawn. The model doesn't utilize MPI, so that isn't a factor here, but there is cross-process communication (in this case, over sockets). Basically, each process is looking at its data set and making an independent decision about how to attack it - e.g., how many processes should address that particular problem. It then fork/exec's a number of clones for that purpose. Each clone calls PMIx_Init because it needs access to information available from PMIx and might need to wireup or sync with other processes.
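A minimal POSIX sketch of why such clones end up claiming the same identity: a forked child inherits its parent's environment, so whatever identity the host exported there is visible unchanged in every clone. The function `clones_sharing_identity` is hypothetical, the variable name is supplied by the caller (an assumption, since the thread does not name it), and the place where a real clone would call PMIx_Init is only marked with a comment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical illustration: fork `nclones` children and count how many
 * of them see the same value of environment variable `var` as the
 * parent. Returns that count, or -1 if `var` is unset in the parent. */
static int clones_sharing_identity(const char *var, int nclones)
{
    assert(var != NULL && nclones >= 0);

    const char *mine = getenv(var);
    if (mine == NULL) {
        return -1;
    }

    int forked = 0;
    for (int i = 0; i < nclones; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            break;                       /* stop forking on failure */
        }
        if (pid == 0) {
            /* Clone: a real clone would call PMIx_Init() about here. */
            const char *theirs = getenv(var);
            _exit(theirs != NULL && strcmp(theirs, mine) == 0 ? 0 : 1);
        }
        forked++;
    }

    /* Parent: reap every clone and count those that matched. */
    int matching = 0;
    for (int i = 0; i < forked; i++) {
        int status = 0;
        if (wait(&status) > 0 && WIFEXITED(status)
            && WEXITSTATUS(status) == 0) {
            matching++;
        }
    }
    return matching;
}
```

If the parent was launched with a rank exported in the environment, all clones report the same value, which is the "multiple processes claiming the same rank" situation discussed in this thread.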
OpenPMIx handles this just fine. Where things get a little tricky, however, is when you start digging into collectives like PMIx_Fence. Do you count the local contribution on the basis of a single clone, or do you count the contribution only when all clones of a given rank participate? Currently, OpenPMIx only requires that a single clone participate, and that is fine for what we are doing.
However, one can certainly envision a case where you need all clones to participate, especially in a fence operation to sync their work. I had planned to add an attribute to cover that case, but held off from doing it in v4 because of (a) the complexity of explaining the clone scenario, and (b) time constraints as it would take additional time/effort to write that all up. Still probably worth doing if we can think of some words to convey the concept. I didn't feel a need to deal with the case of multiple clones, but not all clones, participating.
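The two counting rules discussed above can be contrasted with a small model. This is a sketch of the semantics only, not OpenPMIx code: the types and function names are hypothetical, `FENCE_ANY_CLONE` models the behavior described in this thread (one participating clone satisfies the rank), and `FENCE_ALL_CLONES` models the not-yet-defined attribute that would require every clone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policies for counting a rank's fence contribution. */
enum fence_policy {
    FENCE_ANY_CLONE,   /* one clone entering the fence is enough */
    FENCE_ALL_CLONES   /* every clone of the rank must enter */
};

struct rank_state {
    unsigned clones;   /* processes sharing this rank (at least 1) */
    unsigned arrived;  /* how many of them have entered the fence */
};

/* Has this rank contributed to the fence under the given policy? */
static bool rank_has_contributed(const struct rank_state *r,
                                 enum fence_policy policy)
{
    assert(r != NULL && r->clones >= 1 && r->arrived <= r->clones);
    if (policy == FENCE_ANY_CLONE) {
        return r->arrived >= 1;
    }
    return r->arrived == r->clones;
}

/* The fence completes once every rank has contributed. */
static bool fence_complete(const struct rank_state *ranks, size_t n,
                           enum fence_policy policy)
{
    assert(ranks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!rank_has_contributed(&ranks[i], policy)) {
            return false;
        }
    }
    return true;
}
```

For two ranks where rank 0 has three clones with one arrived and rank 1 has two clones with both arrived, the fence is complete under `FENCE_ANY_CLONE` but not under `FENCE_ALL_CLONES`. The intermediate case of several but not all clones is deliberately not modeled, matching the comment above.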
> So PMIX_PROC_PID is whatever the underlying system wants to use to identify that thing.
Sounds right to me - basically, all we care about (I think) is that the ID be something a debugger can use to identify and "attach" to that "process" (i.e., the thing that is executing).
> I agree. That is why I am proposing we do a pretty simple (maybe even vague) explanation now and tighten it up in time as the need arises.
Yeah, I think that makes the most sense. I'm not sure a tight definition is all that critical, and it can become a rather deep black hole of time.
> > Just to be clear, PMIx never invokes that command line - it only passes it along to the host environment. PMIx isn't a starter or launcher.
> Agreed. I should have been more careful in my wording there.
I should have been more careful too, because my statement is actually incorrect. We do allow PMIx_Spawn to directly fork/exec procs in the case of a launcher (i.e., a tool that declares itself to be a launcher, a la "mpiexec"). This was requested by the debugger folks as a way to simplify starting of another launcher - i.e., the debugger would declare itself a "launcher" and then use PMIx_Spawn to start "mpiexec". If the debugger had attached to a PMIx server, then the server would start the "mpiexec" process. If not, then PMIx would just fork/exec "mpiexec" directly. Simplified the debugger code a bunch.
In retrospect, and given my "swarm intelligence" use-case as described above, it could be useful to allow anyone to use that capability. Nothing in the OpenPMIx code particularly cares, so it was somewhat of an artificial constraint. Might be worth chatting about?
> OpenPMIx handles this just fine. Where things get a little tricky, however, is when you start digging into collectives like PMIx_Fence. Do you count the local contribution on the basis of a single clone, or do you count the contribution only when all clones of a given rank participate? Currently, OpenPMIx only requires that a single clone participate, and that is fine for what we are doing.
I didn't realize this works, and it's pretty amazing that it does. Now I want to go try it out just because that's pretty neat. I think the way it handles PMIx_Fence is clean as it is now... we treat the whole group of possible calling entities (threads, OS processes, whatever) as a single entity (a PMIx "process"), and any of those entities can count for the rank's participation in the fence. If the user needs to fence across those individual entities as well, that can be their responsibility, because PMIx doesn't consider or get involved in how each of those calling entities within the rank is implemented. Not saying we can't ever get involved in solving that problem, but it sure is a much clearer model and easier to describe if we treat it as you described it working currently.
> Now I want to go try it out
😆 Well, hopefully I haven't broken it with some of the recent code - I haven't used it in a few months. If so, I'll have to go back and fix it as that is one of my pet "post-retirement" projects!
The swarm intelligence use case is pretty neat and a good one for the dynamics working group to take up and define (FYI @jaidayal). It sounds like some server-side interfaces/semantics might be needed if we wanted to extend it to include collectives. I agree it is probably too much to go after for v4, but it might be something for v5 if there is interest in expanding that use case.
Discussion in #235 on the use of "processes" in Chapter 2:
@schulzm:
@kathrynmohror:
@jjhursey:
Suggested Clarification
@schulzm suggested: