pmix / pmix-standard

PMIx Standard Document
https://pmix.org

Define mission statement and core functionalities #190

Open gvallee opened 5 years ago

gvallee commented 5 years ago

Overview

Having a mission statement, as well as a clear definition of what composes the core of the PMIx standard, would help define what the PMIx standard is, what belongs in it, and what could be added to it.

This issue is not meant to assert that we must have a mission statement and a definition of core functions/attributes, but rather to drive the discussion and ultimately reach consensus within the group.

Motivation

The following questions are periodically raised during discussions:

The goal of this issue is not to answer these questions (we may end up opening separate issues for them); instead, it focuses on defining a mission statement and the core of the PMIx standard, to help drive the future discussions that aim to answer them.

Discussion Items

The following points have been discussed so far to define the mission statement and what should be the core of the PMIx standard.

Mission statement

Having a clear mission statement would be a good step toward ensuring the standard is of the "right size", making it easier for implementors to support PMIx. In other words, having a clear mission statement would help us ensure that the standard only includes what is necessary. Optional features could still be available in implementations and/or ("and/or" needs to be discussed) in the "lower-class" of the standard (see issue #179).

However, it may limit the openness of the community, since each group in the community has a different set of goals and a different mission. A restrictive mission statement would therefore prohibit contributions from new groups. Instead, we may want to keep the current approach of giving implementors the opportunity not to implement certain functions and/or attributes. In other words, implementors are free to decide which functions and attributes they provide. How do we then identify the set of functions and attributes that is required at a given time (e.g., for procurement)? Through use cases (e.g., system X must provide a PMIx implementation supporting MPI wire-up)?

PMIx aims at providing a standardized set of interfaces and attributes
that support application interactions with the system management stack,
including system subsystem-to-subsystem interactions that help enable
application-driven workflow orchestration. Thus, the standard provides an
interface by which the applications can request a particular operation (e.g.,
"position this file into a nearby cache") and an abstracted interface by
which the resource manager can pass that request on to the storage
system. This allows both the application and the SMS to work with
abstracted interfaces, thereby reducing the amount of target-specific
code they have to write/support.

By being dynamic/flexible, it is possible to revisit the statement when new use cases are considered by the standardization group. This is in fact the approach that has been taken so far, and one of the reasons for its success. We may not want to move away from this. What would be the downside of having a very flexible mission statement? Is it only related to the first point, i.e., the size of the standard?

Standard core

Underlying question 1: Do we need a "standard core" that would be required for implementations to provide?

Defining a core of the standard as the set of functions and attributes that are stable and required to be provided by all implementations would allow implementors to know precisely what needs to be implemented, without having to implement the entire standard. However, we cannot require an environment-specific implementation to implement APIs and attributes that have nothing to do with its target market without running the risk of community members simply moving away from the standard altogether. This is a serious issue that prevents us from defining a core set of functionalities and attributes as being required from all implementations. But do we all agree on this? Or is there a need for more discussion?

The notion of core functions and attributes might still be beneficial for identifying, at first, which functions and attributes would be included in a "stable" class (see issue #179). Maybe a good way to think about it is to consider the core as the intersection of all the functions and attributes from use cases that are a priority for the standardization group. Use cases here are not all possible contexts where PMIx is or could be used, but the use cases that the standardization group wants to focus on and for which the group wants to provide a clear and stable standard. However, if we consider the core as the intersection of the functions and attributes for the use cases the standardization group cares about, it might take a long time to define what it is. Is the standardization group willing to wait?

gvallee commented 5 years ago

@jjhursey, @SteVwonder, @dsolt and @spophale could you please have a look and tell me if I captured correctly the discussion we had today? If so, please add a 👍 emoji; if not, let's update the issue to include what is missing or misrepresented. Thanks.

rhc54 commented 5 years ago

Just a couple of quick comments based on initial impressions:

Having a clear mission statement would be a good step toward ensuring the standard is of the "right size". Having a clear mission statement would help us ensure that the standard only includes what is necessary. Optional features could still be available in implementations and/or ("and/or" needs to be discussed) in the "lower-class" of the standard (see issue #179).

I think one needs to be very careful here about a priori restricting the size of a standard. What a group of MPI people consider "necessary" may have no relationship to what another programming model or system management subsystem might feel they need. Requiring people to justify their needs to an unrelated subgroup will be viewed negatively by some people.

Our "mission statement" has accordingly been rather broad: provide a standardized set of interfaces that support application interactions with the system management stack, including system subsystem-to-subsystem interactions that help enable application-driven workflow orchestration. Thus, we provide an interface by which the app can request a particular operation (e.g., "position this file into a nearby cache") and an abstracted interface by which the resource manager can pass that request on to the storage system. This allows both the app and the SMS to work with abstracted interfaces, thereby reducing the amount of target-specific code they have to write/support.

So long as a proposed extension doesn't interfere with some other group, there seems little reason to reject it out-of-hand.

The core of the standard is the set of functions and attributes that are stable and required to be provided by all implementations.

I respectfully must point out that this requirement will not work. You cannot require an environment-specific implementation to implement APIs and attributes that have nothing to do with their target market. They will either reject your standard or ignore the parts they don't want to do. Either way, the result is the same - alienation of that group.

For example, one might think that you could define the "core" to consist of the usual MPI wireup functions - put, commit, fence, and get. However, those functions are of no use to some environments that instead rely on other methods for obtaining their objectives (which may not involve wireup at all). Declaring those implementations to "violate" the standard because they don't include functions of no value to them accomplishes little other than convincing them to migrate to some other standard that is more permissive/flexible.

Note that this is true even within MPI. For example, some implementations have never implemented support for dynamic operations or the more esoteric definitions of recent years. Some don't even provide the APIs in their headers, preferring to have applications fail to compile, while others "stub out" the APIs and simply return "error" if called. Either way, this doesn't mean that they "violate" the standard - it just means that they have implemented the parts their customers care about and documented the parts that they do not support.
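The put/commit/fence/get wireup pattern mentioned above can be sketched abstractly. The following is a toy in-memory Python model of the exchange semantics, not the actual PMIx C API; all class and key names here are illustrative.

```python
# Toy model of the put/commit/fence/get wireup exchange: put() stages data
# locally, commit() marks it ready, fence() is the collective exchange, and
# get() retrieves another rank's data only after the fence.

class ToyKVS:
    """Stands in for the RM-side key-value store shared by all procs."""
    def __init__(self):
        self.store = {}      # (rank, key) -> value, visible after fence
        self.fenced = False

    def fence(self, published):
        # Collective barrier + data exchange: merge committed data.
        self.store.update(published)
        self.fenced = True

class ToyProc:
    def __init__(self, rank, kvs):
        self.rank = rank
        self.kvs = kvs
        self.staged = {}     # put() buffers locally...
        self.committed = {}  # ...commit() marks it ready to exchange

    def put(self, key, value):
        self.staged[key] = value

    def commit(self):
        self.committed.update(self.staged)
        self.staged.clear()

    def fence(self):
        self.kvs.fence({(self.rank, k): v
                        for k, v in self.committed.items()})

    def get(self, rank, key):
        # Remote data is only guaranteed visible after the fence.
        if not self.kvs.fenced:
            raise RuntimeError("get() before fence(): data not exchanged")
        return self.kvs.store[(rank, key)]

kvs = ToyKVS()
p0, p1 = ToyProc(0, kvs), ToyProc(1, kvs)
p0.put("endpoint", "nic0:1234"); p0.commit(); p0.fence()
p1.put("endpoint", "nic1:5678"); p1.commit(); p1.fence()
print(p0.get(1, "endpoint"))   # -> nic1:5678
```

The point of the model is the contract, not the transport: an environment that wires up some other way simply never calls these functions.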


Bottom line: The primary reason for the success we have enjoyed so far has been the "right to not support" plus our willingness to support people who are seeking solutions to their abstraction problems. I would advise careful consideration of that history before making changes to the mission.

gvallee commented 5 years ago

Just a couple of quick comments based on initial impressions:

Having a clear mission statement would be a good step toward ensuring the standard is of the "right size". Having a clear mission statement would help us ensure that the standard only includes what is necessary. Optional features could still be available in implementations and/or ("and/or" needs to be discussed) in the "lower-class" of the standard (see issue #179).

I think one needs to be very careful here about a priori restricting the size of a standard. What a group of MPI people consider "necessary" may have no relationship to what another programming model or system management subsystem might feel they need. Requiring people to justify their needs to an unrelated subgroup will be viewed negatively by some people.

I personally totally agree with you, and I made a very similar point during the call. This is actually why I added a risk bullet for each point, and I will try to update the text to reflect this more precisely.

Our "mission statement" has accordingly been rather broad: provide a standardized set of interfaces that support application interactions with the system management stack, including system subsystem-to-subsystem interactions that help enable application-driven workflow orchestration. Thus, we provide an interface by which the app can request a particular operation (e.g., "position this file into a nearby cache") and an abstracted interface by which the resource manager can pass that request on to the storage system. This allows both the app and the SMS to work with abstracted interfaces, thereby reducing the amount of target-specific code they have to write/support.

And I personally believe we should start with the text you have there. Having a mission statement rubs me the wrong way because, similarly to your previous point, it may be perceived as non-inclusive. That said, I am not opposed to a mission statement either, but the line between a description that helps define the standard and the goals of the community, and text that would be interpreted in a negative way, is very thin, I think.

So long as a proposed extension doesn't interfere with some other group, there seems little reason to reject it out-of-hand.

The core of the standard is the set of functions and attributes that are stable and required to be provided by all implementations.

I respectfully must point out that this requirement will not work. You cannot require an environment-specific implementation to implement APIs and attributes that have nothing to do with their target market. They will either reject your standard or ignore the parts they don't want to do. Either way, the result is the same - alienation of that group.

That is a very valid point, and I will make sure to include it to foster constructive discussions.

For example, one might think that you could define the "core" to consist of the usual MPI wireup functions - put, commit, fence, and get. However, those functions are of no use to some environments who instead rely on other methods for obtaining their objectives (which may not involve wireup at all). Declaring those implementations to "violate" the standard because they don't include functions of no value to them accomplishes little other than convincing them to migrate to some other standard that is more permissive/flexible.

Note that this is true even within MPI. For example, some implementations have never implemented support for dynamic operations or the more esoteric definitions of recent years. Some don't even provide the APIs in their headers, preferring to have applications fail to compile, while others "stub out" the APIs and simply return "error" if called. Either way, this doesn't mean that they "violate" the standard - it just means that they have implemented the parts their customers care about and documented the parts that they do not support.

Bottom line: The primary reason for the success we have enjoyed so far has been the "right to not support" plus our willingness to support people who are seeking solutions to their abstraction problems. I would advise careful consideration of that history before making changes to the mission.

I believe I made a similar point during the call but clearly did not capture it correctly in the text (and will address it). My way of phrasing it is: if the core definition is too strict, why would anyone go through the pain of the standardization process, since the probability of having anything accepted is low? However, and I guess this is the point I did not present clearly, I believe there is clear value in studying the current use cases precisely and looking at their intersection (i.e., the associated functions and attributes). But I also hear you, and I am starting to wonder whether this should be used in the context of the definition of classes. Based on your feedback, I would tend to say that no, it should not... but I clearly need to think more about this and will update the text of the issue to try to capture it.

Thanks for your very valuable feedback!

gvallee commented 5 years ago

@rhc54 I made another pass at the text of the issue. Could you please review it again and tell me if I captured what you meant? I think we are slowly identifying points where we do not have consensus/agreement and my goal is to be able to clearly phrase what they are so they can be openly discussed. Please let me know if you think this is making sense or not.

rhc54 commented 5 years ago

I think you are closing in on it. Perhaps another way to look at this is in philosophical terms. The approach that appears to be underlying the discussion group is one of "top-down determination". In other words, some group will get together and make the decision that these things must be implemented everywhere, while these other things are optional because...well, that's what they think it should be.

The alternative approach employed by PMIx has been based on "market-driven determination". In this approach, the standard simply defines the API signatures and behaviors - i.e., if you are going to support a particular API, then this is what it must look like and how it must behave. We even go so far as to define optional behaviors.

However, we let the market decide what any particular implementation should support. Say you are RM vendor "McFoo" and sell primarily into the financial market segment. That segment might be primarily interested in data analytics with some HPC backend processing. McFoo might, therefore, be more interested in the PMIx Group functions plus allocation management - and less interested in some other aspects of the standard. For some centralized group to declare that McFoo must implement an additional set of functions serves no purpose other than to alienate that vendor - there is no economic justification for them to follow the dictums of the centralized group.

Ultimately, it therefore depends upon what the group is trying to do. If you intend to take the "top-down" approach, then you might be better served to narrow the scope of your charter and focus on high-end (i.e., Top50 or so) MPI as your membership more closely resembles that particular segment. Odds are reasonably good that you can get compliance from implementers in that area.

However, if you want to serve a broader market, then you probably need to more carefully consider the "market-driven" approach as the breadth of needs spanning the various affected groups makes it difficult for any small group to dictate what everyone must do. Instead, it might be worth defining what should happen if someone ports an application or subsystem to an environment that doesn't support a given API or attribute - e.g., should it fail to compile or return an error at runtime? This is something that you probably can get compliance on from a broad group.

As for how to determine what must be supported in an RFP: the HPC community resolved that years ago. Rather than putting it in terms of "core" etc., you simply break the standard down into functional blocks - e.g., "wireup support", "tool support", etc. People can then specify "we want PMIx v4.2 minus the tool support and with these attributes for the Fence function". It is what we have done with MPI for years and it has worked well - no reason to reinvent that wheel here, I should think.
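The "functional blocks" idea above could be sketched roughly as follows. This is a hypothetical Python model of how an RFP requirement ("v4.2 minus tool support, with these Fence attributes") might be checked against what an implementation provides; all block and attribute names are illustrative, not taken from the standard.

```python
# Hypothetical compliance check against a block-structured requirement.
# The requirement names a version, blocks it does NOT need, and per-API
# attributes it does need; excluded blocks are simply ignored.

requirement = {
    "version": "4.2",
    "excluded_blocks": {"tool support"},
    "required_attributes": {"Fence": {"COLLECT_DATA"}},
}

implementation = {
    "version": "4.2",
    "blocks": {"wireup support", "event notification"},
    "attributes": {"Fence": {"COLLECT_DATA", "TIMEOUT"}},
}

def complies(req, impl):
    # Version must match, and every required attribute of every named
    # API must be supported by the implementation.
    if impl["version"] != req["version"]:
        return False
    return all(attrs <= impl["attributes"].get(api, set())
               for api, attrs in req["required_attributes"].items())

print(complies(requirement, implementation))  # -> True
```

The appeal of this shape is that a site never has to argue about a universal "core": it just enumerates the blocks and attributes its workloads need.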

HTH

garlick commented 5 years ago

How do you discourage a proposed addition to the standard from compromising some core value if you don't state what the values are? For example, say bootstrapping a million rank MPI_Init in under 30s is expected from PMIx, and someone proposes an exponential exchange of some sort in the critical path that directly impacts MPI wire-up scalability?

Maybe that's a bad example because such an exchange could be optional, but hopefully the general question is clear? How do you keep the spec focused on delivering (and continuing to deliver) benefits that the founders consider a priority if you allow unchecked organic growth?

I guess the other thing that bothers me about letting a thousand flowers bloom is that standardizing an interface (as a general concept) has to be done carefully, as it is always going to dumb down somebody's implementation. If there's this PMIx thing sitting out there as an attractive nuisance letting anybody "standardize" any interface with a low bar for acceptance, then what you'll wind up with is not well designed interfaces that nicely encompass a problem domain, but just the first thing that came along and went in unopposed. It seems much more desirable if you can first collect the right people for a problem domain and build a consensus before proposing a common interface. The current group of PMIx contributors is not the right people for all the domains you mentioned, but merely stepping out of the way doesn't solve the problem.

rhc54 commented 5 years ago

Maybe that's a bad example because such an exchange could be optional, but hopefully the general question is clear? How do you keep the spec focused on delivering (and continuing to deliver) benefits that the founders consider a priority if you allow unchecked organic growth?

I don't believe anyone is proposing there be "unchecked growth" - there will always be proposals that conflict and have to be resolved. One way is to simply reject them, but another way is for the community to redirect them - e.g., by making them optional or utilize a different API especially for that purpose.

However, the underlying question may be a little different - what I believe you may be asking goes back again to the philosophy behind the "standard". Are you trying to define a set of hard APIs that all implementations must support? If so, then yes - the question of allowing growth is a valid concern.

On the other hand, if you follow the PMIx philosophy, then your concerns don't really come into play. If some group defines a new standard API (or attribute to an existing API) that involves behavior you don't like - then just don't implement it. This is the point I'm trying to get across. Allowing someone to return "not supported" removes the requirement that you implement something that doesn't fit your environment, while allowing others to do the same.
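The "right to not support" philosophy described here implies a particular calling pattern: the caller probes an optional capability and degrades gracefully, rather than treating a missing feature as a standard violation. A minimal Python sketch, in which the status values and the `prefetch_file()` call are hypothetical, not actual PMIx API:

```python
# Graceful degradation when an implementation returns "not supported".

SUCCESS = 0
ERR_NOT_SUPPORTED = -1   # placeholder status code

class MinimalImpl:
    """An environment-specific implementation that skips storage features."""
    def prefetch_file(self, path):
        # This feature has no value in this implementation's market,
        # so it is advertised as unsupported rather than faked.
        return ERR_NOT_SUPPORTED

def stage_input(impl, path):
    rc = impl.prefetch_file(path)
    if rc == ERR_NOT_SUPPORTED:
        # The application still runs; it just skips the optimization.
        return "reading %s directly (prefetch unsupported)" % path
    return "prefetched %s" % path

print(stage_input(MinimalImpl(), "input.dat"))
```

The key property is that the caller's fallback path is part of its design, so a partial implementation is usable rather than "in violation".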

In your example, clearly some subset of the community thought this proposed new idea was of use to them - or else it wouldn't have been approved at all. So long as the impact can be contained to only those who believe the new capability is worth the impact, or have use-cases that don't utilize the impacted areas, then why should those not impacted by the proposal reject it? Again, I emphasize - if the new idea isn't of value to your environment, or is harmful to your environment, then simply don't implement it there...or perhaps only activate it when someone runs a job that benefits from it.

Life doesn't have to be so black/white 😄

I guess the other thing that bothers me about letting a thousand flowers bloom is that standardizing an interface (as a general concept) has to be done carefully, as it is always going to dumb down somebody's implementation.

Correct - for example, if we provide standardized APIs for coordinating storage operations (e.g., burst buffer caching of libraries/files), then someone who has a proprietary library that also performs that function will lose their competitive edge. This is a side effect of promoting application portability as opposed to getting "locked" to a particular vendor based on proprietary APIs. It is part of the shift toward market-driven competition on the implementation of the API.

you'll wind up with is not well designed interfaces that nicely encompass a problem domain, but just the first thing that came along and went in unopposed.

Agreed, but you seem to be assuming bad behavior - i.e., that someone will just throw APIs into the standard, getting through the proposed approval process without at least attempting to involve others interested in that area. If the community is comprised of such people, then no amount of bureaucracy will cure the bad behavior - it is basically dysfunctional.

However, that said, let's imagine we do have a dysfunctional community. If someone really does standardize the wrong interface signature, then adopters will quickly find it doesn't meet their needs and reject it - i.e., nobody will use it. Thus, the market will again correct the community and reject the poor design. Two avenues generally open in such cases: (a) the target adopters who see value in the concept even though they reject the particular definition will get together to propose something better, or (b) the target adopters will see no value in the concept and it will die on the "provisional standard" vine.

So I guess this is a long-winded way of saying that I don't see any lasting harm even in the case of the dysfunctional community. They will make a number of missteps, but the market will correct them and (eventually) the bad behavior should diminish.

garlick commented 5 years ago

So let me understand: unchecked growth of the spec into as yet unidentified problem domains is acceptable as long as

I'm a little unclear on what basis the community decides whether something is in or out. Is it solely conflicts that drive rejection/redirection in your view? It still may be useful to have some sort of mission statement so that somebody doesn't go to the trouble to propose something that is clearly at odds with a core value that the community understands but hasn't stated in the name of inclusiveness.

I wasn't so much thinking of people proposing ill-fitting interfaces as bad actors, more like people or companies operating in their self interest or without foresight. If experts in a particular problem domain aren't well represented in the PMIx community, does the community take the word of one new member that a particular change is the right one? Then when the invisible hand of the market says "meh" and ignores it but the one new member is still there advocating, what happens?

It seems better if you can avoid areas where a critical mass of expertise is not represented in the community, or conversely, state up front what is intended to be included in the spec and make sure you have the representation to do a good job.

And by the way as an implementor, I may not be qualified to decide what I "like" in the spec. Give me a spec and I'll implement it, but you are making my job pretty hard if I have to make a subjective call on each aspect of it. Most likely I just want to play nicely with all components in an ecosystem and I may not have a deep understanding of what they all need, so it's nice to know if something is in the spec, it's important.

rhc54 commented 5 years ago

Perhaps I am being unclear. It was not my intention to object to a mission statement for this standard. My only point was that it needs to be sufficiently broad to encompass its intended scope, and that perhaps some of the concerns over "ill-fitting interfaces" really is a reflection of an internal debate over the intended scope. If the scope is a concern, then perhaps a way of dealing with it is for this group to confine itself to defining a subset of the PMIx definitions that are "standard" for MPI support.

In other words, instead of attempting to constrain PMIx itself, an alternative might be to focus down onto some subset of the PMIx scope. PMIx has already committed to not changing interfaces, so this (using MPI as an example) might largely be an effort to (a) identify the subset of direct application to your goals for MPI, (b) require that all MPI-supporting implementations contain them, and (c) identify which attributes an MPI-supporting implementation must support (possibly adding some that are MPI-specific). I'm not advocating that approach - just pointing out that one could come up with different solutions.

Regardless, I see little basis for concern that someone would "break" an existing API by changing its default behavior to something unacceptable to others. There is still a vetting process in place. The only difference in what we are suggesting here is that I'm advocating that the group retain the ability for other interested parties to propose (and have provisionally accepted) alternative behaviors and extensions that are perceived of value to them so long as they don't negatively impact others. Key to that, of course, is the understanding that nobody has to fully implement every API and behavior - they can implement those that are of benefit/value to their target users.

And by the way as an implementor, I may not be qualified to decide what I "like" in the spec. Give me a spec and I'll implement it, but you are making my job pretty hard if I have to make a subjective call on each aspect of it.

You don't have to make any subjective calls. You and your user community can sit down and see what makes sense for them. I suspect we both know that MPI support will be the first layer to be supported. After that, you can pretty much just wait for someone to come forward with a request that some other aspect of PMIx be supported and evaluate their use-case to decide if it merits your effort.

We've been doing the same thing on the 3rd-party "reference implementation" - if someone comes forward with a requested behavior, we look at it and decide if it merits our time. Depends on who the customer is, how much effort is required, etc. If it makes sense to make the investment, then we proceed - otherwise, we don't.

gvallee commented 5 years ago

I will try to summarize what I read and see if I can steer the discussion forward. I believe we are pinpointing some of the fundamental challenges we are facing in the current discussions:

1. There is a concern that the standard will undergo unchecked, uncontrolled growth. I personally do not believe that (i) it has happened (it all depends on what you care about), and (ii) that the topic of this issue would actually help address that concern: stating a mission statement does not prevent the risk of unchecked growth within the mission statement and within the core functionalities.

2. Do we want to be top-down driven or market driven? PMIx has clearly been market driven so far, and that is one of the main reasons for its success. The top-down approach clearly risks shrinking our current community. However, I believe we can have a hybrid approach: we stay market-driven by being open to new use cases, with a use case approved by the standardization community ultimately becoming a chapter of the standard. Use cases are based on what is needed in a broader sense (market driven), but within a use case/chapter, the group leading the standardization work is free to apply top-down rules. I personally really like that approach: why should someone working only on MPI for extreme-scale computing care about the work done in the context of orchestration for edge computing?

A recurring point across this is, I think, a major current disagreement: do we want to tightly control what goes into the standard, keep it as small as possible, and keep it stable over time? To me, this potentially goes against being inclusive, open, and market-driven.

Based on my understanding, the main motivation for controlling what gets into the standard is to ease new implementations of PMIx. However, I fail to understand how controlling what goes into the standard would actually solve that problem without running the risk of losing a significant part of our community (both contributors and users). Furthermore, I believe the best option to address that issue would be to focus on use cases: if a team only cares about MPI wire-up, only a subset of functions and attributes will be required, and the rest can be ignored.

So my question is: should we focus on writing the standard around use cases rather than in a flat structure as it is now (some discussions are already ongoing on this, but I believe we did not push them far enough)? Is it doable (at the moment, I do not see why not)? And would that address all the concerns regarding the size and stability of the standard or of a chapter/use case? For instance, MPI wire-up should be really small and become stable very quickly, and I believe that would address most of the concerns I am reading here. It definitely addresses the point that, within a use case, everything is not optional. I think it also addresses the risk of potential conflicts: a use case/chapter should not, by definition, have any conflicts.

@rhc54 @garlick what do you think about this? Would this chapter/use case approach be acceptable and address all concerns voiced so far? I may open a separate issue on this.

SteVwonder commented 5 years ago

Thanks @gvallee, @rhc54, and @garlick for pushing this conversation forward. These are all interesting and important points. Couple of thoughts and responses from my end:

Based on my understanding, the main motivation for controlling what gets in the standard is to ease the new implementations of PMIx.

Yeah. I think it is true that this is one motivation, especially from the perspective of someone looking into starting a new implementation of PMIx from the ground up. Currently, there is little guidance in the standard for navigating which interfaces should be implemented, how important each one is, their use cases, etc. The existing work on functionality classes, stability classes, the implementation-agnostic document, and others goes a long way toward improving that situation, but does not solve it completely.

IMO, I think there is a lot of value in having multiple, mostly complete implementations of the PMIx standard (or any standard for that matter). With the current approach of allowing anything to be marked "not supported", it is very likely that a second/third implementation will only implement a subset of the parallel bootstrap and parallel debugging interfaces and then stop. At which point, only the reference implementation will contain support for the other interfaces. I believe that the standards document states the "not supported" bit is a feature of the standard, and if I understand correctly, it sounds like what @rhc54 has started to advocate for in this thread.

As a user, this fragmentation of support for PMIx means that any sufficiently advanced usage of PMIx will only be portable across systems that use the reference implementation. I think an argument can be made that, at this point, a second, limited implementation may actually hurt the PMIx community more than it helps. Fragmented support like that would introduce a lot of complexity and effort for the users of the PMIx standard (analogous to the complexity of Android development when the phone market is fragmented across so many versions and variants of the Android OS). I think this particular issue (how useful/desirable a second/third implementation is, and what we are willing to do to accommodate one) underlies a lot of the differences in opinion (in this thread and the others).

The other concern that I have is that there are a finite number of people involved within the PMIx community, each with 24 hours in a day. From a standardization perspective, each new interface, attribute, and key adds overhead for everyone participating in the GitHub discussions and weekly phone call. From an implementation perspective, each interface/attr/key adds cost to all of the implementations that were not used for the initial prototype of the extension. Finally, from a user's perspective, each interface/attr/key adds length to the document and complexity to understanding the standard. So setting a scope for the standard would allow those limited people with limited resources to focus their efforts.

I think your suggestion, @gvallee, for creating almost a set of "micro-standards" within the larger PMIx standard which are driven by use-cases has value. It helps reduce many of the overheads I call out above, and it allows people to focus on the use-cases that they are interested in. I do not know that it fully addresses my concerns about secondary/tertiary implementations. I will have to contemplate on that a bit more.

garlick commented 5 years ago

Fragmented support like that would introduce a lot of complexity and effort for the users of the PMIx standard (analogous to the complexity of Android development when the phone market is fragmented across so many versions and variants of the Android OS).

Good analogy. If the purpose of a standard/spec is to provide an N-way contract between providers and users, "everything is optional" renders it far less useful than it could be. As a provider, I have to guess what interfaces users may try to use, and as a user I have to guess what interfaces are likely to be provided and either limit myself to that, or write code to fall back to other mechanisms if the standard interface isn't there. It's brittle for everyone. Changes to mitigate the inevitable issues don't get pushed out rapidly since often components are packaged with a long update cycle. I feel like "everything is optional" is a diversionary tactic to avoid arguing about questionable additions to the spec and is a complete non-starter when there is more than one provider.

As @SteVwonder mentioned, this may be mitigated by the proposed functionality and stability classes, if the granularity of "optional" can be increased greatly for each functional class. It seems completely reasonable to me to have a whole class be optional or for a class to contain a few optional interfaces.

But once you accept that a majority of interfaces are mandatory, then you want to scrutinize why they are in there at all, or whether they are well designed. Which is all to the good IMHO, as it results in higher quality interfaces that are more widely usable and implementable.

Anyway, apologies if I got us off track from the original goal of this issue.

gvallee commented 5 years ago

I believe I understand all the concerns voiced so far.

My concern over what was expressed lately is that complexity is a qualitative metric, and I am afraid you will not find consensus within a group of people on when a specific point becomes "too complex." The amount of effort, on the other hand, is a quantitative metric (even if it is usually difficult to evaluate a priori). If possible, I would like to base our decisions on quantitative aspects to avoid having to consider points that some will perceive as personal preferences. I personally appreciate considering qualitative aspects, but I am afraid they would not help us here.

Based on this, I believe it would be beneficial to let the group think about the option of organizing the standard through chapters/use cases and letting each chapter control its size and its approach to standardization. I think we should try to see if that addresses all the concerns raised in this issue. It seems to me it is potentially a compromise for everything I have heard so far.

rhc54 commented 5 years ago

I'm willing to let things go that way and see where it all winds up. My concern with what is being stated here is that it is too proscriptive - it has a very high probability of converging to a situation where the needs of MPI will dominate the "standard", forcing those with different needs to seek other solutions. This runs counter to the PMIx objectives - part of the intent is to provide a unifying infrastructure that enables non-MPI paradigms to operate within HPC environments, and vice-versa.

The other concern that I have is that there are a finite number of people involved within the PMIx community, each with 24 hours in a day. From a standardization perspective, each new interface, attribute, and key adds overhead for everyone participating in the GitHub discussions and weekly phone call. From an implementation perspective, each interface/attr/key adds cost to all of the implementations that were not used for the initial prototype of the extension. Finally, from a user's perspective, each interface/attr/key adds length to the document and complexity to understanding the standard. So setting a scope for the standard would allow those limited people with limited resources to focus their efforts.

I'm afraid I very much disagree. The situation you describe only arises from adopting the "every API and attribute must be implemented" philosophy. If you go down that path, then the RMs and SMS providers out there will reject it for the simple reason that there is no economic justification for any vendor to support something just because it is in a document. Instead, they will either ignore the doc or move forward with their own proprietary interfaces.

Consider MPI as an example. Very few MPIs actually implement the full standard - commercial MPIs, in particular, only implement the features of interest to their customers (i.e., they let the market determine what they support, not the document). If MPI truly attempted to require that they fully implement everything in the "standard", then they would either reject the standard or simply ignore the requirement and proceed as they do today.

Let's be realistic - there is only one organization out there even considering writing an alternative implementation, and that is for a very specific environment. It is not a general purpose 3rd party library and can be expected to only implement those features of interest to their user base. Nobody else has expressed the slightest interest in writing a PMIx library. It is a lot of work and the market does not reward infrastructure - lots of examples out there. Just ask the companies.

Thus, we are not talking about a "standard" here that will spawn multiple general purpose libraries. Outside of that one organization, everyone is using the reference implementation, implementing support for only those features of interest to their target market segments. It may therefore be that a tightly controlled "standard" just isn't the right avenue for something like PMIx. After all, there is a good reason why nobody has succeeded in "standardizing" these interactions in the past!

As I've said many times, there is nothing wrong with being "just a library", and we may wind up there in the end. It's good enough for hwloc, libevent, Python, and numerous other libraries we all use - it might be good enough here too, and there are distinct advantages to that approach (e.g., to the container community).

Not saying it isn't worth trying to work something out - just noting that the notion of a widely used library is something we shouldn't just discard.

SteVwonder commented 5 years ago

Nobody else has expressed the slightest interest in writing a PMIx library. It is a lot of work and the market does not reward infrastructure - lots of examples out there. Just ask the companies.

Thus, we are not talking about a "standard" here that will spawn multiple general purpose libraries. Outside of that one organization, everyone is using the reference implementation, implementing support for only those features of interest to their target market segments. It may therefore be that a tightly controlled "standard" just isn't the right avenue for something like PMIx. After all, there is a good reason why nobody has succeeded in "standardizing" these interactions in the past!

As I've said many times, there is nothing wrong with being "just a library", and we may wind up there in the end. It's good enough for hwloc, libevent, Python, and numerous other libraries we all use - it might be good enough here too, and there are distinct advantages to that approach (e.g., to the container community).

I agree with you. IMO, the reference implementation provides a lot of value by saying to RMs/hosts, basically, "implement this handful of callbacks and you will support PMIx," as opposed to implementing every single client interface and attribute from the ground up. There may still be some server-side interfaces that RMs decide not to use, which would result in some client-side interfaces being unsupported, but in general, a wide swath of the PMIx functionality would be supported across all of the RMs that use the PMIx reference implementation. This enables users to confidently program to a large number of PMIx interfaces with minimal portability concerns. I think a secondary, minimal implementation would fragment the community, cause tons of headaches and heartburn for users, and destroy some of the value that the reference implementation provides currently. That is not to say that a secondary implementation would absolutely be bad for the community, but a minimal one probably would be.

There are other auxiliary benefits to adopting the "library" mentality/moniker. One being that the wording in the standards document can be tightened down to more precisely describe the reference implementation semantics, rather than attempting to encompass the behavior/semantics of all possible implementations. A second being that the PMIx community can continue to iterate very rapidly and deploy a large amount of functionality with minimal burden on host RMs. A third being that the cross-version negotiation becomes more tractable (I think).

That being said, I still think there is value in collecting use-cases, marking which interfaces/attributes are necessary versus optional for each use-case, and then tying that back to the server-side interfaces. That way, users and proposal writers have a clear path from "I want these 4 use-cases to work on our system" to "vendor, that means you should implement these PMIx server-side interfaces".

rhc54 commented 5 years ago

That is not to say that a secondary implementation would absolutely be bad for the community, but a minimal one probably would be.

You make a good point that bears further thought.

There are other auxiliary benefits to adopting the "library" mentality/moniker. One being that the wording in the standards document can be tightened down to more precisely describe the reference implementation semantics, rather than attempting to encompass the behavior/semantics of all possible implementations. A second being that the PMIx community can continue to iterate very rapidly and deploy a large amount of functionality with minimal burden on host RMs. A third being that the cross-version negotiation becomes more tractable (I think).

Yeah, I've been dealing with that all this week as I help another RM to complete their integration. We removed things from the doc because they were too implementation specific - unfortunately, they were subtle points that left "gaps" people are stumbling across. Once I explain the gaps, the lightbulb goes off and progress gets made rather quickly. Somehow need to find the right balance - maybe something non-implementation specific (e.g., the use-cases) combined with a user guide and a developer's guide?

That being said, I still think there is value in collecting use-cases, marking which interfaces/attributes are necessary versus optional for each use-case, and then tying that back to the server-side interfaces. That way, users and proposal writers have a clear path from "I want these 4 use-cases to work on our system" to "vendor, that means you should implement these PMIx server-side interfaces".

Agreed - makes a lot of sense.

rhc54 commented 5 years ago

I think a secondary, minimal implementation would fragment the community, cause tons of headaches and heartburn for users, and destroy some of the value that the reference implementation provides currently. That is not to say that a secondary implementation would absolutely be bad for the community, but a minimal one probably would be.

You make a good point that bears further thought.

Thinking about this a bit, I wonder if we can't make the same "market will decide" argument here? A minimal secondary implementation will, in all likelihood, quickly encounter pressure from users to expand their coverage. This is, I believe, what Jim was concerned about - how do you maintain a minimal implementation when the reference library is out there showing what can be done?

Eventually, one would think that the definition of "minimal" would have to expand to become "minimal acceptable to the implementer's user community". Ultimately, I suspect it begs the question: given that users will drive towards increased scope to track the reference implementation, what precisely is it that justifies the investment in an alternative? Would it be more cost-effective to contribute changes to the reference implementation to support whatever alternative functions you want?

For example, if someone doesn't like the way the PRI handles socket connections, it is one plugin to change it - a far smaller investment than rewriting the entire library. If the functionality one wants to replace isn't currently in a plugin, then one could propose and contribute a refactoring of that code path to put it into a framework/plugin so an alternative can be used.

Organizations will have to make their own choices based on their own cost/benefit criteria. However, I do believe that the market can be relied upon to drive convergence at least at the functional level, and perhaps at the library level as well.

dsolt commented 5 years ago

Good discussion. Overall I like the idea of a mission statement and lean slightly toward the side of trying to be cautiously restrictive about what PMIx covers. I am afraid that the current "unofficial?" charter outlined earlier in this discussion really does very little to maintain a focus. My fear is that almost any request for functionality can be made to fit into PMIx. Can PMIx provide an API to allocate distributed shared memory? Can PMIx extend the existing publish/lookup functionality into a more full-featured distributed database? Can PMIx request the imaging of a node? Can PMIx request the consolidation of log messages? (Yes, I know it already does.) Can PMIx request remote procedure calls? Can PMIx provide access to an agreement protocol? All of these may be of use to an HPC application, but should PMIx be the one API to rule them all?

The two primary risks of a "wide-open" approach are: 1) moving too quickly, resulting in unforeseen problems with interactions between features, low-quality implementations, and partial implementations such that functionality is rarely portable between implementations, frustrating end users; 2) becoming too large, resulting in implementations that do many things but nothing well, a specification that is hard to understand, and a specification that overlaps with other efforts. I am probably more fearful of the pace than I am of the size, though both concern me.

However, I'm not sure how to narrow the focus in a meaningful way. PMIx did not start with a narrow focus (other than its focus on MPI). Its initial functionalities were already quite scattered: 1) get/put/fence, 2) publish/lookup, 3) spawn new processes, 4) abort. Some are process management and some are not, some require PMIx to be the launcher and some do not, some communicate with existing system software and some do not. The one theme I can see around the current PMIx APIs is not actually process management, but parallel and distributed computing. While I think a mission statement around providing distributed process management and services to HPC applications would be appropriate based on where we are today, it would certainly not do much to slow a possibly rapid expansion of new APIs. It may even be broader than what Ralph mentioned early on around "application-driven workflow orchestration", though it captures more of what is already in PMIx.
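For readers less familiar with the first of those functionality areas, the get/put/fence trio is the "wire-up" pattern referenced throughout this thread. The following is a minimal, hedged sketch of that pattern against the PMIx client API; the key name `example.endpoint` and the endpoint string are placeholders for illustration, not standard-defined values, and the error handling is deliberately minimal.

```c
/* Sketch of the classic PMIx wire-up exchange: each process "puts" its
 * own connection information, all processes "fence" (synchronize), and
 * then each process "gets" its peers' information. Requires a PMIx
 * client library and a host RM/server to actually run. */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_value_t value;
    pmix_value_t *val = NULL;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }

    /* put: expose this rank's endpoint (placeholder payload) */
    value.type = PMIX_STRING;
    value.data.string = "tcp://10.0.0.1:1234";  /* hypothetical address */
    PMIx_Put(PMIX_GLOBAL, "example.endpoint", &value);
    PMIx_Commit();

    /* fence: synchronize across all processes in the namespace */
    PMIx_Fence(NULL, 0, NULL, 0);

    /* get: retrieve a peer's endpoint (rank 0 here, for illustration) */
    pmix_proc_t peer = myproc;
    peer.rank = 0;
    if (PMIX_SUCCESS == PMIx_Get(&peer, "example.endpoint", NULL, 0, &val)) {
        printf("peer endpoint: %s\n", val->data.string);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```

The publish/lookup, spawn, and abort areas follow the same shape (PMIx_Publish/PMIx_Lookup, PMIx_Spawn, PMIx_Abort), which is part of why they feel like one library even though, as noted above, they serve rather different purposes.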

jjhursey commented 5 years ago

We had a good conversation about this ticket on the June 7, 2019 teleconf. Some notes are here. I won't try to summarize, but I think we at least level set on the options on the table, and refined some ideas. The conversation should continue on this Issue.

rhc54 commented 5 years ago

@gvallee When would you next be able to attend a WG meeting to discuss this? We think we are ready to circle back to the subject.

jjhursey commented 5 years ago

Per teleconf August 2, 2019