A new split_type value for the MPI_Comm_split_type function

mpiwg-hw-topology / Discussions-and-proposal

4 stars 0 forks source link

A new split_type value for the MPI_Comm_split_type function #1

Closed GuillaumeMercier closed 2 years ago

GuillaumeMercier commented 6 years ago

Currently, only one split_type value for the MPI_comm_split_type function is predefined by MPI: MPI_COMM_TYPE_SHARED.

Furthermore, the Advice to implementors on page 248 says: "Implementations can define their own types, or use the info argument, to assist in creating communicators that help expose platform-specific information to the application".

However, such approach is not portable, since theses new types/hints are implementation-dependent.

Moreover, there is a growing need to access hardware-related information in MPI applications. For instance, an MPI application could optimize its communication pattern by exploiting the underlying memory hierarchy and/or taking into account the possible NUMA effects induced by the physical processors.

To this end, we propose to introduce a new split_type value, MPI_COMM_TYPE_HW_TOPOLOGY that would create new subcommunicators which encompass MPI processes that effectively share physical ressources, such as a node, a socket, a L3 or L2 cache, etc.

NOTE. The current MPI_COMM_TYPE_SHARED value is often seen as a way to create communicators of MPI processes sharing the same physical node but this is not a correct way to get this information. If the underlying hardware features SCI or numaconnect-like technologies, MPI processes can share part of their address spaces even when not located physically on the same node. On the opposite, our proposal would effectively yield subcommunicators (at some point) that do correspond to the expected behaviour of MPI_COMM_TYPE_SHARED. (End of note.)

Advice to users. As most of the hardware is hierarchicaly organized (nodes /CPUs/caches, etc), it is possible to capture this hierarchy by calling the MPI_comm_split_type function with this new split_type value in a "recursive" fashion on each newly subcommunicator created.

The proposed solution does not rely on specific names for such communicators, nor makes any assumptions about the number of levels in the hierarchy of hardware resources to ensure its portability and sustainability. Indeed, if/when new hardware levels are introduced, no changes in the MPI standard are expected. (End of advice to users.)

Advice to implementors. The subcommunicators created should retain the following properties:

In order to avoid the creation of several identical communicators corresponding to redundant levels in the hardware hierarchy (for instance, consider the case of nodes featuring a single socket, a single L3 cache and a single L2 cache. Up to four communicators could be created: one for the node level, one for the socket level, one for the L3 level and one for the L2 level when all levels are in fact the same.), the lowest possible level in the hardawre hierarchy should be considered and intermediate levels should be skipped.
The group of MPI processes supporting a new subcommunicator should be a strict subset of the group supporting the input (parent) communicator. More specifically, a call to: MPI_Comm_compare(comm,newcomm) should return MPI_UNEQUAL
the last valid communicator produced in this fashion may be identical to MPI_COMM_SELF, but not necessarily (e.g. consider the case of several MPI processes bound to the same ressource at some level but not to levels deeper in the hierarchy.)
If no valid communicator can be created, MPI_COMM_NULL should be returned. This specific value can be used to assess if the bottom of the hierarchy has been reached or not.
MPI process binding should be taken into account when creating subcommunicators. Indeed, the deepest level corresponding to a subcommunicator should always correspond to the calling MPI process binding. For instance, if a process is bound to a L3 cache, no information below this level should be returned as the process could use any of the L2 cache below. Any attempt to create a new subcommunicator should return MPI_COMM_NULL.

(End of advice to implementors.)

Draft text of the proposal: here (see pages 26--27)

jeffhammond commented 6 years ago

This is a fine goal but the details are not right.

The implicit recursion through the memory hierarchy is not a good idea. MPI is explicit. Users say what they want. I can also think of plenty of valid hardware designs that don't work with this approach, although I'm not going to describe them.

The strict subset requirement is not appropriate. What happens when I call this function on MPI_COMM_SELF? Is the result MPI_COMM_NULL?

The right way to prescribe hardware topology information is with an info key. https://github.com/open-mpi/ompi/pull/320 has already implemented this, albeit using non-standard types rather than info keys.

The following historical tickets are relevant to this proposal:

https://github.com/mpi-forum/mpi-forum-historic/issues/372 https://github.com/mpi-forum/mpi-forum-historic/issues/297

ejeannot commented 6 years ago

Hi,

The implicit recursion through the memory hierarchy is not a good idea. MPI is explicit. Users say what they want. I can also think of plenty of valid hardware designs that don't work with this approach, although I'm not going to describe them.

The right way to prescribe hardware topology information is with an info key. open-mpi/ompi#320 https://github.com/open-mpi/ompi/pull/320 has already implemented this, albeit using non-standard types rather than info keys.

I think this is exactly what this proposal is trying to avoid. MPI hides the topology and material details. This is a very strong feature of MPI. It enables portability and we must keep this.

The good question is: "what is the right level of abstraction to enable usage of the topology hierarchy within an MPI program?"

If the abstraction is too low level (such as the prescription of the core, cache level, etc.) then, we will be dealing with many corner cases and this will not be portable when these low-level features involve or disappear.

On the contrary their proposal is about a higher level of abstraction that avoids all the above problems, hides the details to MPI developers and enables optimizations when matching the call to a specific platform.

Best regards, —Emmanuel

besnardjb commented 6 years ago

Hi Jeff, All,

@jeffhammond thank you very much for your feedback. In the line of previous answers here are some comments I can provide.

The implicit recursion through the memory hierarchy is not a good idea.

The implicit nature of this topological discovery is on purpose being part of the « Implicit » approach explored in the working group. It is almost certain that split values such as L1, L2, … will never enter the standard at reason being architecture dependent. However, implicitly splitting allows to describe such layout without evoking the possible splitting keys dodging the naming issue. Conversely, if we consider info-keys we end-up with the same issue than the runtime-specific keys which are indeed present in OpenMPI for example. They are useful and does provide the feature but using them makes your code immediately non-portable (e.g. hardware specific). What would be the value of the info-key for example? Yes, it would not be standardized but would be as ignored as the extra values in openMPI and the Info hints for MPI-IO (I mean for the uninitiated).

MPI is explicit. Users say what they want.

It is an interesting idea. However, I personally would not say MPI is always explicit.

A tentative example I can think of is progress: you (may) have to test a request to make it progress in an asynchronous message most people assume that it happens automagically and (may) pay the whole cost in the wait — this is of course implementation and runtime configuration dependent.

Closer to our focus of attention the topology chapter starts with:

@7.1 « A clear distinction must be made between the virtual process topology and the topology of the underlying, physical hardware. The virtual topology can be exploited by the system in the assignment of processes to physical processors if this helps to improve the communication performance on a given machine. How this mapping is done, however, is outside the scope of MPI. »

It is basically an « anything can happen » sentence and it is indeed what is observed very often. The goal of splitting the virtual topology with the new key is to (indeed partially as we miss level names) explicit what is the underlying organization of the system.

I can also think of plenty of valid hardware designs that don't work with this approach, although I'm not going to describe them.

We then have to make a guess here. I can think of several potential configurations such as stacked memories of various types, converged memories with accelerators, Big.LITTLE architectures, NICs capable of doing the computation in main memory (HW MPI) or Converged NICs.

The questions raised to the WG are:

Are these potential configurations hierarchical?
Is the nested communicator abstraction valuable?
More precisely what is a hierarchy?
Shall it include solely the data-scattering or should we expect differentiated ranks?

For the later, differentiation is probably the role of sessions, focussing on a given communicator we just want to express the machine layout.

The strict subset requirement is not appropriate. What happens when I call this function on MPI_COMM_SELF? Is the result MPI_COMM_NULL?

For me, the empty set is a strict subset of any non-empty set. So returning COMM_NULL from COMM_SELF does not break the assumption — maybe I missed what you underlined. Was it in the line of non-hierarchical topologies?

The right way to prescribe hardware topology information is with an info key. open-mpi/ompi#320 has already implemented this, albeit using non-standard types rather than info keys.

If I recall well, we considered it in the WG as not easily standardizable not in the sense of being in MPI but in the sense of being used by actual end-users due to the specificity of the keys (if you consider level/arch-dependent keys). Besides, due to the presence of a split_type enum in the call, a new value/case still has to be added (e.g. use the info) to point to the info-keys explaining the means of splitting (over memory, over a given level, …).

I hope this helps the discussions.

Regards,

Jean-Baptiste.

jeffhammond commented 6 years ago

You claim that you are following the charter of 7.1 to focus on virtual topology not physical hardware topology and yet the keyword contains HW_TOPOLOGY and every example given depends on physical hardware features (L1, L2, L3) that are invisible to software unless one uses performance measurements to detect the physical composition of the memory hierarchy.

I recommend that you go in one of two directions: 1) Focus on virtual topology by defining an abstract metric for distance and proposing features that allow MPI_Comm_split_type to figure out the right answer based upon that metric. For example, you could propose MPI_COMM_TYPE_UNIFORM that would split the output of MPI_COMM_TYPE_SHARED into UMA domains, where UMA is an implementation-defined concept that likely corresponds to a socket, shared LLC or a memory controller. 2) Accept that e.g. cache levels are a universal feature of processors that execute MPI processes and that it is useful to allow users to give less portable requests via info keys to split according to various levels of cache, or similar.

GuillaumeMercier commented 6 years ago

I just saw the emails, not the comments. I better understand what you mean by "following the charter of 7.1". You're referring to Jean-Baptiste's comment. I think he (but I don't want to speak in his stead) wanted to say that right now, were physical topologies are referred to in the standard, no standard behaviour is enforced. The point is we want to allow the user to access hardawre details in a standard fashion. Hence our proposal. And also, mapping/binding of processes are also in the discussion in the frame of this WG. But that's another story.

besnardjb commented 6 years ago

@GuillaumeMercier this is exactly what I meant, sorry for not being clear enough.

@jeffhammond I think what you propose should definitively be considered in the "query" model as Guillaume underlined in his email on the ML.

I sumarized some tentative arguments bellow to fuel-up the discussion:

Currently, MPI_Split_type is mostly a helper function to enable the use of shared-memory windows. My claim would be that there are cases where the topology is more than a matter of memory hierarchy and that it has (and will have) consequences on software (including MPI) way beyond performance.

To back-up my assumption here are some examples of HW topology effects faced by MPI that I can come with:

A - In the WG we discussed the case of the K computer (https://github.com/mpiwg-hw-topology/hw-topology-issues/blob/master/MPI-HW-Topo-WG-fujitsu-extention.pdf) where the virtual topology follows actual topologies with dimensionalities. Programs are developed with this aspect in mind for years and are able to do a perfect embedding of their communication pattern thanks to this. In this case, the virtual topology is a hardware one and the system is programmed in MPI with this in mind;

B - I can think of collocation cases where one would like to have a tool, an in-situ processing, IO nodes, the 'Ocean' in your 'Ocean-Atmosphere' program colocated in a portable and efficient manner, i.e. one per node, one per socket, one per switch, one per accelerator... If you specialize your cores you end-up having a problem of resource negotiation which is (for the moment) in a gray area. Right now, if you run an MPMD program you have to guess which binary you are with respect to other MPI processes (/proc/self/cmdline, argv[0] or a hard-coded color) and you have little control yet (I mean in MPI) relatively to where these processes will spawn. Sessions are addressing this area.

C- If you look at latest many-core processors integrating meshes and/or communication facilities (possible HW MPI via Network on Chip) it seems that MPI will soon face on die layout constraints. Yes, caches are ubiquitous but their role is to « hide » the topology but architectures may choose to adopt a clear die-level topology for scalability and programmability reasons. For example, the SW26010 has Register Level Communications (RLC) at CPE level (https://hpc.sjtu.edu.cn/IPDPSW2017_Benchmarking.pdf) and a Network On Chip between CGs. For me, it seems sensible that similar devices may be used with native MPI calls taking advantage of the hardware. However, you would need to know your neighbors in the virtual topology from hardware constraints to make an informed graph embedding choice.

jeffhammond commented 6 years ago

Currently, MPI_Split_type is mostly a helper function to enable the use of shared-memory windows.

That is its most obvious function, but note that it also identifies the communicator associated with:

a single OS image;
shared local filesystem (obviously a consequence of the first item);
a set of processes that will fail together given certain hardware faults (obviously a consequence of the first item);
ranks that share a fixed set of offload devices (e.g. GPUs);
a specific memory capacity that one or more MPI processes must share;
communication that is likely to bypass the network stack and communication with ranks in it will performance differently than with the other ranks in MPI_COMM_WORLD.

The motivation for https://github.com/mpi-forum/mpi-forum-historic/issues/372 and https://github.com/mpi-forum/mpi-forum-historic/issues/297 was to build on these concepts. Other than splitting the shared-memory communicator into NUMA nodes, I'm not sure how the rest of the proposal fits with this, because L2 and L1 cache topologies are hard to observe and exploit in MPI programs (KNL might be an exception, but only because there is no L3). In most CPUs, L2 is a per-core cache, which means that splitting on this is equivalent to MPI_COMM_SELF unless one is oversubscribing the core.

As for the K torus extensions, I'm quite familiar with the equivalent ones for Blue Gene, but not convinced we need to put these in the MPI standard. These are platform-specific if not machine-specific features and lack the generality of MPI_Cart_create (for good reason). They also work just fine as implementation-defined extensions. One needs to have conditional compilation for them anyways so the non-portability of the symbols is irrelevant. I'm certainly open to a generalization of network topology-aware communicators but I don't think MPI_Comm_split_type is the right way to do that, because it cannot reveal the connectivity associated with a torus to the user. It can give the user a nearest neighbor communicator (https://github.com/mpi-forum/mpi-forum-historic/issues/297) or it can give planes associated with each dimension but the user will have to reconstruct the torus graph from that.

A more useful feature set for network topology is for something like MPI_Cart_get to work on MPI_COMM_WORLD. If MPI_COMM_WORLD is associated with a torus topology, it can say so, else MPI_Cart_get returns an error. A generalization of this would be to have a MPI_Dist_graph_get that would return the distributed graph topology associated with a communicator, where the info key could be used to specify a message size, since the topology of a network relevant to the application may depend on that, although in most cases I'd expect the implementation to return a map of the network.

GuillaumeMercier commented 6 years ago

On 07/30/2018 03:53 PM, Jeff Hammond wrote:

The motivation for mpi-forum/mpi-forum-historic#372 https://github.com/mpi-forum/mpi-forum-historic/issues/372 and mpi-forum/mpi-forum-historic#297 https://github.com/mpi-forum/mpi-forum-historic/issues/297 was to build on these concepts. Other than splitting the shared-memory communicator into NUMA nodes, I'm not sure how the rest of the proposal fits with this, because L2 and L1 cache topologies are hard to observe and exploit in MPI programs (KNL might be an exception, but only because there is no L3). In most CPUs, L2 is a per-core cache, which means that splitting on this is equivalent to |MPI_COMM_SELF| unless one is oversubscribing the core.

Well, we can split into physical nodes, which you can't always obtain when using |MPI_COMM_TYPE_SHARED|. You can also split into NUMANodes (as you mentioned), into packages (or sockets) and obviously into L3.

These are platform-specific if not machine-specific features and lack the generality of |MPI_Cart_create| (for good reason). They also work just fine as implementation-defined extensions. One needs to have conditional compilation for them anyways so the non-portability of the symbols is irrelevant.

We discussed this point specifically during the last meeting in Austin and I agree that some finding a general approach from specific examples of tailor-made interfaces is difficult.

I'm certainly open to a generalization of network topology-aware communicators but I don't think |MPI_Comm_split_type| is the right way to do that, because it cannot reveal the connectivity associated with a torus to the user.

Well, we decided to explore several directions in the WG. One is called the "implicit approach" because it gives the user access to the underlying physical topology without explicit knowledge. The proposed expansion of |MPI_COMM_split_type| falls into this category. However, the extensions, as the ones you' re talking about (e.g. Fujitsu's) is part of what we called the "explicit" approach where users can have a more direct knowledge of the underlying physical topology. The third directions has to do with the mapping/binding of MPI processes.

A more useful feature set for network topology is for something like |MPI_Cart_get| https://www.mpich.org/static/docs/v3.1/www3/MPI_Cart_get.html to work on |MPI_COMM_WORLD|. If |MPI_COMM_WORLD| is associated with a torus topology, it can say so, else |MPI_Cart_get| returns an error.

Once again, this idea was discussed at the last physical meeting in Austin. It is an interesting idea and we will discuss this further in the forthcoming meetings and/or telcos. All minutes of past meetings and telcos are available on the WG github, by the way.

For now, I would like the discussion to remain on track about this first issue.

Guillaume