mpiwg-hw-topology / Discussions-and-proposal


Implicit tree based assumption #2

Open gvallee opened 6 years ago

gvallee commented 6 years ago

Based on the discussions during the past two forum meetings, there is clearly an implicit assumption that the hardware is always represented as a tree. This is not always the case. I would therefore like to raise the following question: is it possible to keep the proposed concepts but assume that we are dealing with a graph instead of a tree?

besnardjb commented 6 years ago

Indeed, the hierarchical splitting methodology clearly yields a tree: each split produces (as per the current definition) a subset of the parent communicator that groups processes topologically. This abstraction could be of use, for example, to gather processes which are locally bound (accelerator and main cores) or, more simply, to do MPMD computation while accounting for topology.

On the graph approach, it does seem important to express clearly how the processes are laid out at a given level. Our current idea is to consider that some communicators may have a default topology, which would be a hardware one.

More precisely, we considered that COMM_WORLD or any communicator generated by SPLIT_TYPE could return a topology from the query function (MPI_TOPO_TEST), and if so, it would be a hardware one. Such a model has the advantage that all the functions of 7.5.5 could be leveraged to express a wide variety of topologies with very limited modification to the standard. However, we still have to explore the more convoluted consequences of this approach.

So in practice, an end user should be able to split COMM_WORLD with type SHARED to retrieve a communicator with an attached topology describing the shared-memory region, possibly as a graph. This also applies to the implicit model, where each split level could be enriched with the relationships between ranks that were seen as belonging to the same level.

As of today, we are considering opening a ticket around this idea in the near future to further explore this graph approach, and any input is welcome. I think the two approaches may eventually coexist in a positive manner.

bgoglin commented 6 years ago

@gvallee I don't see anything in the proposal clearly saying that the hardware is a tree, but the proposal organizes ranks in a hierarchy of communicators, which somehow exposes the hardware as a tree in the end. But you may still apply the same API to a mesh by splitting it into two halves (left and right parts of the mesh), then splitting each half into two quarters (top and bottom), and so on. You build a tree even if the mesh isn't a tree.
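The mesh splitting described above can be sketched in a few lines. This is only an illustration in plain Python, not MPI code and not part of the proposal: the function names `bisect_mesh` and `leaves` and the nested-dict layout are assumptions made for the example. It alternates left/right and top/bottom cuts and records the resulting hierarchy.

```python
def bisect_mesh(x0, x1, y0, y1, depth, split_x=True):
    """Recursively halve the mesh region [x0, x1) x [y0, y1).

    Hypothetical helper: returns a nested dict in which every node lists
    the mesh cells it covers and its (zero or two) child regions.
    """
    cells = [(x, y) for x in range(x0, x1) for y in range(y0, y1)]
    node = {"cells": cells, "children": []}
    if depth == 0 or len(cells) <= 1:
        return node

    can_x = x1 - x0 > 1  # more than one column left to cut
    can_y = y1 - y0 > 1  # more than one row left to cut
    if (split_x and can_x) or not can_y:
        mid = (x0 + x1) // 2  # left / right halves
        halves = [(x0, mid, y0, y1), (mid, x1, y0, y1)]
    else:
        mid = (y0 + y1) // 2  # top / bottom halves
        halves = [(x0, x1, y0, mid), (x0, x1, mid, y1)]

    # Alternate the cut direction at each level of the hierarchy.
    node["children"] = [
        bisect_mesh(a, b, c, d, depth - 1, not split_x) for a, b, c, d in halves
    ]
    return node


def leaves(node):
    """Return the cell lists of all leaf regions, left to right."""
    if not node["children"]:
        return [node["cells"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]


if __name__ == "__main__":
    tree = bisect_mesh(0, 4, 0, 4, depth=2)  # 4x4 mesh, two split levels
    for quarter in leaves(tree):
        print(quarter)
```

Note that cells (1, 0) and (2, 0) are direct neighbors in the mesh but end up in different top-level halves, which is exactly the information a tree cannot express.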

On the implementation side, hwloc indeed wants a tree, but we have had to work around non-tree cases since the beginning. Large NUMA machines aren't trees. For instance: https://image.slidesharecdn.com/informix-iwa-intel-operational-analytics-benchmark-130423013315-phpapp02/95/informix-iwa-operational-analytics-performance-29-638.jpg?cb=1446710346 We use NUMA distances to abstract a tree out of this. On that image, you would get two groups of 4 sockets. On a large SGI UV, you would get groups of 2 sockets (corresponding to blades) and groups of 10 sockets (racks), etc. NVLink or interconnects between cores inside a CPU (Xeon rings or KNL mesh) are basically the same.

In practice, exposing the hardware as a tree is obviously imperfect. So if people are not happy with it, they can still get the original information about the graph (e.g. NUMA distances). Basically, we try to find a good matching tree and keep the non-tree details on the side. MPI/hsplit could use the NUMA graph/distances if hwloc's automatic grouping as a tree doesn't work.
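To make the idea of abstracting a tree level out of NUMA distances more concrete, here is a minimal sketch in plain Python. It is not hwloc's actual grouping algorithm: the function `group_by_distance`, the threshold approach, and the distance values in `example_distances` are all assumptions invented for illustration. Sockets are grouped when they are connected by links whose distance is at or below a threshold, while the full matrix stays available on the side.

```python
def group_by_distance(dist, threshold):
    """Group nodes linked by distances <= threshold (connected components).

    Hypothetical helper, not hwloc's algorithm. `dist` is a square matrix
    of NUMA distances; the result is a list of sorted index groups.
    """
    n = len(dist)
    seen = [False] * n
    groups = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [start]
        group = []
        while stack:
            i = stack.pop()
            group.append(i)
            for j in range(n):
                if not seen[j] and dist[i][j] <= threshold:
                    seen[j] = True
                    stack.append(j)
        groups.append(sorted(group))
    return groups


def example_distances():
    """Invented 8-socket matrix: 10 = local, 16 = same group, 22 = other group."""
    return [
        [10 if i == j else 16 if i // 4 == j // 4 else 22 for j in range(8)]
        for i in range(8)
    ]


if __name__ == "__main__":
    dist = example_distances()
    # One tree level: sockets that are "close" to each other.
    print(group_by_distance(dist, threshold=16))
    # The original matrix is untouched and can still be queried directly.
    print(dist[0][4])
```

With these made-up values, a threshold of 16 gives two groups of 4 sockets, similar to the grouping described for the 8-socket machine above. A caller that finds the groups unhelpful can fall back to the raw matrix.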