open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

MPI_Comm_split_type not creating disjoint subgroups for certain cases #12812

mshanthagit opened 1 month ago

mshanthagit commented 1 month ago

Thank you for taking the time to submit an issue!

Background information

MPI_Comm_split_type creates groups with overlapping ranks in certain cases where ranks are bound to cores that span resource domains (say, L3). For example, consider the following, where ranks 2 and 5 each span two L3 domains (the program calls MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, 0, info, &newcomm); on a Genoa machine with 8 cores per L3):

mpirun -np 8 --map-by ppr:8:numa:pe=3 --report-bindings ./a.out
[electra016:1374129] Rank 0 bound to package[0][core:0-2]
[electra016:1374129] Rank 1 bound to package[0][core:3-5]
[electra016:1374129] Rank 2 bound to package[0][core:6-8]
[electra016:1374129] Rank 3 bound to package[0][core:9-11]
[electra016:1374129] Rank 4 bound to package[0][core:12-14]
[electra016:1374129] Rank 5 bound to package[0][core:15-17]
[electra016:1374129] Rank 6 bound to package[0][core:18-20]
[electra016:1374129] Rank 7 bound to package[0][core:21-23]
Hello --- my rank: 0, my comm_size: 8
Hello --- my rank: 1, my comm_size: 8
Hello --- my rank: 7, my comm_size: 8
Hello --- my rank: 6, my comm_size: 8
Hello --- my rank: 5, my comm_size: 8
Hello --- my rank: 4, my comm_size: 8
Hello --- my rank: 3, my comm_size: 8
Hello --- my rank: 2, my comm_size: 8
From split comm: my rank: 0, my split_comm_size: 3
From split comm: my rank: 2, my split_comm_size: 6
From split comm: my rank: 4, my split_comm_size: 4
From split comm: my rank: 6, my split_comm_size: 3
From split comm: my rank: 1, my split_comm_size: 3
From split comm: my rank: 3, my split_comm_size: 4
From split comm: my rank: 5, my split_comm_size: 6
From split comm: my rank: 7, my split_comm_size: 3

As we can see from the above, only two ranks report a split_comm_size of 6, and the sizes are inconsistent across ranks. Although the program doesn't print the ranks within each communicator, here is what the membership would be:

comm(0): 0, 1, 2
comm(1): 0, 1, 2
comm(2): 0, 1, 2, 3, 4, 5
comm(3): 2, 3, 4, 5
comm(4): 2, 3, 4, 5
comm(5): 2, 3, 4, 5, 6, 7
comm(6): 5, 6, 7
comm(7): 5, 6, 7

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

I tested with 5.0.x and 4.1.6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source (5.0.x)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Details in the background section. Here is an example program:

#include <stdlib.h>
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv) {

    MPI_Init(&argc, &argv);

    int rank, size, newcomm_size;
    int status = 0;

    MPI_Comm newcomm;
    MPI_Info info;

    // Get the number of MPI processes and this process' rank:
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Hello --- my rank: %d, my comm_size: %d\n", rank, size);

    MPI_Info_create(&info);

    // Split MPI_COMM_WORLD by L3-cache locality (OMPI_COMM_TYPE_L3CACHE is an Open MPI extension):
    status = MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, 0, info, &newcomm);

    if (status) {
        printf("Error in comm split %d\n", status);
    }

    MPI_Info_free(&info);

    MPI_Comm_size(newcomm, &newcomm_size);
    printf("From split comm: my rank: %d, my split_comm_size: %d\n", rank, newcomm_size);

    MPI_Finalize();

    return status;
}
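
The membership listed in the background section was reconstructed by hand. To have the reproducer print which MPI_COMM_WORLD ranks ended up in each split communicator, it could be extended with something like the following (a sketch, not part of the original program; it reuses the rank, newcomm, and newcomm_size variables from above):

    /* Translate all ranks of newcomm back to their MPI_COMM_WORLD ranks. */
    MPI_Group world_group, new_group;
    int i, *local_ranks, *world_ranks;

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Comm_group(newcomm, &new_group);

    local_ranks = malloc(newcomm_size * sizeof(int));
    world_ranks = malloc(newcomm_size * sizeof(int));
    for (i = 0; i < newcomm_size; i++) local_ranks[i] = i;

    MPI_Group_translate_ranks(new_group, newcomm_size, local_ranks,
                              world_group, world_ranks);

    printf("comm(%d):", rank);
    for (i = 0; i < newcomm_size; i++) printf(" %d", world_ranks[i]);
    printf("\n");

    MPI_Group_free(&new_group);
    MPI_Group_free(&world_group);
    free(local_ranks);
    free(world_ranks);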

====================


shell$ mpirun -np 8 --map-by ppr:8:numa:pe=3 --report-bindings ./a.out
(on a Genoa machine with 3 CCDs per NUMA domain and 8 cores per CCD)
bosilca commented 1 month ago

MPI_Comm_split_type creates disjoint communicators, which means the sum of the sizes of the sub-communicators should be 8 (the size of your MPI_COMM_WORLD).

Please provide the output of hwloc-ls to see how the cores are numbered on the node.
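
As a quick sanity check of that disjointness property, the reproducer could be extended with something along these lines (a sketch; it reuses the rank and newcomm variables from the program above and verifies that the sub-communicator sizes, counted once per sub-communicator, add up to the world size):

    int sub_rank, sub_size, world_size, contrib, sum;

    MPI_Comm_rank(newcomm, &sub_rank);
    MPI_Comm_size(newcomm, &sub_size);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Only the leader of each sub-communicator contributes its size; for a
     * disjoint partition the contributions must add up to the world size. */
    contrib = (sub_rank == 0) ? sub_size : 0;
    MPI_Allreduce(&contrib, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0 && sum != world_size) {
        printf("split is not a disjoint partition: sizes sum to %d, expected %d\n",
               sum, world_size);
    }

With the membership reconstructed above, this check would fail (the contributions sum to 3 instead of 8).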

edgargabriel commented 1 month ago

Based on the output, it looks like rank 2 (which is bound to cores spanning two L3 domains, since cores 0-7 are on the first L3 domain, cores 8-15 on the second, etc.) created a communicator that includes all ranks bound to either the first or the second L3 domain (which is why it ends up with 6 ranks). Rank 0, on the other hand, is bound only to cores on the first L3 domain, so its communicator includes only the ranks bound to that L3 domain.

edgargabriel commented 1 month ago
$ hwloc-ls
  Package L#0
    NUMANode L#0 (P#0 125GB)
    L3 L#0 (32MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#48)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#49)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#50)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#51)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#52)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#53)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#54)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#55)
    L3 L#1 (32MB)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#56)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#57)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#58)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#59)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#60)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#61)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#62)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#63)
    L3 L#2 (32MB)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#64)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#65)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#66)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#67)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
        PU L#40 (P#20)
        PU L#41 (P#68)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
        PU L#42 (P#21)
        PU L#43 (P#69)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
        PU L#44 (P#22)
        PU L#45 (P#70)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#71)
...
  Package L#1
    NUMANode L#1 (P#1 126GB)
    L3 L#3 (32MB)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
        PU L#48 (P#24)
        PU L#49 (P#72)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
        PU L#50 (P#25)
        PU L#51 (P#73)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
        PU L#52 (P#26)
        PU L#53 (P#74)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
        PU L#54 (P#27)
        PU L#55 (P#75)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
        PU L#56 (P#28)
        PU L#57 (P#76)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
        PU L#58 (P#29)
        PU L#59 (P#77)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
        PU L#60 (P#30)
        PU L#61 (P#78)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
        PU L#62 (P#31)
        PU L#63 (P#79)
    L3 L#4 (32MB)
      L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
        PU L#64 (P#32)
        PU L#65 (P#80)
      L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
        PU L#66 (P#33)
        PU L#67 (P#81)
      L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
        PU L#68 (P#34)
        PU L#69 (P#82)
      L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
        PU L#70 (P#35)
        PU L#71 (P#83)
      L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36
        PU L#72 (P#36)
        PU L#73 (P#84)
      L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37
        PU L#74 (P#37)
        PU L#75 (P#85)
      L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
        PU L#76 (P#38)
        PU L#77 (P#86)
      L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
        PU L#78 (P#39)
        PU L#79 (P#87)
    L3 L#5 (32MB)
      L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40
        PU L#80 (P#40)
        PU L#81 (P#88)
      L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41
        PU L#82 (P#41)
        PU L#83 (P#89)
      L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42
        PU L#84 (P#42)
        PU L#85 (P#90)
      L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43
        PU L#86 (P#43)
        PU L#87 (P#91)
      L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44
        PU L#88 (P#44)
        PU L#89 (P#92)
      L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45
        PU L#90 (P#45)
        PU L#91 (P#93)
      L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46
        PU L#92 (P#46)
        PU L#93 (P#94)
      L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47
        PU L#94 (P#47)
        PU L#95 (P#95)
bosilca commented 1 month ago

Funny, that's kind of what I suspected based on the published documentation of the Genoa architecture, but with the topology output things are much clearer.

OMPI's current split-by-type code uses topology masks (combined with an OR operation), which explains why ranks mapped across multiple domains think they belong to a larger group. This also means OMPI only supports symmetric cases correctly, a sensible approach from my perspective.

Indeed, taking into account that the split must create disjoint communicators, what should we expect from a non-symmetric case like the one here? One potential outcome could be to eliminate the processes spanning multiple domains from the split operation and return something like this:

comm(0): 0, 1
comm(1): 0, 1
comm(2): MPI_COMM_SELF
comm(3): 3, 4
comm(4): 3, 4
comm(5): MPI_COMM_SELF
comm(6): 6, 7
comm(7): 6, 7

In this case it is not obvious to users why some processes (here 2 and 5) are not part of a larger set.

mshanthagit commented 1 month ago

Is it still possible to create disjoint communicators for non-symmetric cases, with each process assigned to a specific communicator based on some rule, say "I only belong to the first domain (in some ordering)"?

edgargabriel commented 1 month ago

@mshanthagit that is also what I was about to suggest :-)

@bosilca I agree that there is no clear, good solution in this scenario, so the primary goal has to be that the result is consistent across processes. That being said, having processes sit all by themselves in a communicator is probably not desirable.

Would it make sense to have a rule along these lines: if a process is part of multiple L3 domains, Comm_split_type is applied as if the process were only part of the first domain it belongs to?

bosilca commented 1 month ago

What is the first domain? One could argue that a process should belong to the location where most of its resources are bound, or, conversely, to the location where fewer of its resources are bound.

I quickly looked into the code base, and we use this API extensively in the collective modules. We need to find a solution that would make sense for our internal collectives as well.

edgargabriel commented 1 month ago

What is the first domain? One could argue that a process should belong to the location where most of its resources are bound, or, conversely, to the location where fewer of its resources are bound.

If we can easily figure out where most of its resources are bound to, sure. But otherwise, I would define the 'first domain' as the domain of the first core that it is bound to.

I quickly looked into the code base, and we use this API extensively in the collective modules. We need to find a solution that would make sense for our internal collectives as well.

yes, that's how we found the issue, from the usage in a collective component :-)
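
For reference, the "first domain" rule could also be prototyped at the user level with a plain MPI_Comm_split, using as the color the logical index of the first L3 object that intersects the calling process' binding. This is only a sketch of the proposed rule, not the Open MPI implementation, and it assumes the application can use hwloc (>= 2.0) directly:

    #include <hwloc.h>
    #include "mpi.h"

    /* Color = logical index of the first L3 cache whose cpuset intersects
     * this process' current binding ("first domain" rule sketch). */
    static int first_l3_color(void) {
        hwloc_topology_t topo;
        hwloc_bitmap_t binding = hwloc_bitmap_alloc();
        int color = -1;

        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);
        hwloc_get_cpubind(topo, binding, HWLOC_CPUBIND_PROCESS);

        int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
        for (int i = 0; i < n; i++) {
            hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, i);
            if (hwloc_bitmap_intersects(binding, l3->cpuset)) {
                color = i;   /* first (lowest logical index) matching L3 wins */
                break;
            }
        }

        hwloc_bitmap_free(binding);
        hwloc_topology_destroy(topo);
        return color;
    }

    /* usage: MPI_Comm_split(MPI_COMM_WORLD, first_l3_color(), 0, &newcomm); */

With the bindings reported above, this rule would presumably yield colors 0,0,0,1,1,1,2,2, i.e. the 3/3/2 grouping {0,1,2}, {3,4,5}, {6,7} per NUMA domain.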

bosilca commented 1 month ago

The simple fix, aka strict matching, is relatively easy: change hwloc_bitmap_intersects to hwloc_bitmap_isequal in opal_hwloc_compute_relative_locality (opal/mca/hwloc/base/hwloc_base_util.c:638). The more complex matching that @edgargabriel proposed will require a complete rewrite of opal_hwloc_compute_relative_locality and a different way of storing the locality in PMIx and opal_proc_t.
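
The actual locality computation compares per-process masks inside opal_hwloc_compute_relative_locality; the following standalone sketch (compile with -lhwloc) only illustrates, on cpusets reconstructed from this report, why an intersection-based test considers rank 2 local to two L3 domains while a strict containment/equality-style test does not:

    #include <stdio.h>
    #include <hwloc.h>

    int main(void) {
        /* Cpusets reconstructed from the report: L3 #0 covers cores 0-7,
         * rank 0 is bound to cores 0-2, rank 2 to cores 6-8. */
        hwloc_bitmap_t l3_0  = hwloc_bitmap_alloc();
        hwloc_bitmap_t rank0 = hwloc_bitmap_alloc();
        hwloc_bitmap_t rank2 = hwloc_bitmap_alloc();
        hwloc_bitmap_set_range(l3_0, 0, 7);
        hwloc_bitmap_set_range(rank0, 0, 2);
        hwloc_bitmap_set_range(rank2, 6, 8);

        /* Loose test: any shared core counts, so rank 2 "matches" L3 #0 --
         * and, by the same test, L3 #1 (cores 8-15) as well. */
        printf("rank2 intersects L3#0: %d\n", hwloc_bitmap_intersects(rank2, l3_0)); /* 1 */

        /* Strict test: the binding must lie entirely within one L3. */
        printf("rank0 inside L3#0:     %d\n", hwloc_bitmap_isincluded(rank0, l3_0)); /* 1 */
        printf("rank2 inside L3#0:     %d\n", hwloc_bitmap_isincluded(rank2, l3_0)); /* 0 */

        hwloc_bitmap_free(l3_0);
        hwloc_bitmap_free(rank0);
        hwloc_bitmap_free(rank2);
        return 0;
    }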

mshanthagit commented 1 month ago

@bosilca what does strict matching do? Say in the example above, how does the split happen?

bosilca commented 1 month ago

Strict matching means exactly the same binding at a given level. I gave the outcome for the example provided here a few comments above. On Genoa, with the bindings provided by the user (-np 8 --map-by ppr:8:numa:pe=3), the split by type for L3 locality will return:

comm(0): 0, 1
comm(1): 0, 1
comm(2): MPI_COMM_SELF
comm(3): 3, 4
comm(4): 3, 4
comm(5): MPI_COMM_SELF
comm(6): 6, 7
comm(7): 6, 7
edgargabriel commented 1 month ago

@bosilca this is not consistent: comm(2) cannot think that it is alone in its communicator while comm(0) and comm(1) think that rank 2 is part of their subgroup

mshanthagit commented 1 month ago

@bosilca I think it could lead to incorrect behavior (pardon me if I am wrong). Say I do an allreduce with the new comm after split, what's the behavior? What will rank 2 have?

Ignore, I thought I saw 2 in other comms

bosilca commented 1 month ago

@mshanthagit as @edgargabriel noticed, my example was incorrect. It should now be fixed: ranks 2 and 5 will each be alone in their own communicator.

edgargabriel commented 1 month ago

Maybe we should do a 2-step process: first step is to add the simple solution that @bosilca suggested, and backport it to 5.0.x and 4.1.x. There is clearly value in this solution, since it fixes an inconsistency.

If somebody has the cycles to work on my suggested approach, we can still do that at a later stage, maybe for 6.0. (There is a good chance that it might not happen, though, in which case we at least have the first fix.)

mshanthagit commented 1 month ago

Just thinking out loud, the solution @bosilca suggested creates 5 communicators whereas one would expect 3 (as there are 3 L3 domains). Will there be any side effects?

bosilca commented 1 month ago

Honestly, I think that users binding processes as in the example here (overlapping several domains) deserve what they get, and any split is fine as long as it is consistent across the board. The strict mode has the advantage of being a one-liner.

edgargabriel commented 1 month ago

I don't necessarily disagree with you @bosilca, I would just caution that sometimes these decisions are not driven by MPI requirements. In this particular instance, compute performance was significantly better than with --map-by ppr:8:numa:pe=2 or --map-by ppr:8:numa:pe=4, and that is what drove the decision; we just have to find a way to make the best of it on the MPI side.

rhc54 commented 1 month ago

Or maybe a different mapping pattern? Not sure exactly what you are trying to achieve, but it seems like mapping to L3 is something we have already enabled, so I'm a tad confused.

edgargabriel commented 1 month ago

I am happy to take any help @rhc54, I couldn't find a way to map 8 processes onto a group of 3 CCDs, that is, the first three ranks to the first L3 domain, the next three ranks to the second L3 domain, and the last two ranks to the third L3 domain, given that the node has a second package with another 3 L3 domains where we would like to repeat the same pattern. I tried mapping by L3 domain, but it didn't do what we wanted, not because the mapping didn't work correctly, but because I couldn't find a syntax to express this slight imbalance of 3/3/2 ranks per L3 domain.

bosilca commented 1 month ago

There is only so much we can do automatically in the MPI library. For everything else, the users can fall back to either a manual MPI_Comm_split or to a guided split. However, for the more precise MPI_COMM_TYPE_HW_GUIDED split the MPI standard clarifies:

The MPI processes in the group associated with the output communicator newcomm utilize that specific hardware resource type instance, and no other instance of the same hardware resource type.
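
For completeness, a guided split looks roughly like the following. This is a sketch that assumes an MPI-4-capable library: MPI_COMM_TYPE_HW_GUIDED and the "mpi_hw_resource_type" info key come from MPI-4, only "mpi_shared_memory" is mandated as a value by the standard, and the "L3Cache" string is an assumed, implementation-dependent value here:

    MPI_Info info;
    MPI_Comm newcomm;

    MPI_Info_create(&info);
    /* Name the hardware resource type to split on (MPI-4 guided split).
     * "L3Cache" is an assumed, implementation-dependent value. */
    MPI_Info_set(info, "mpi_hw_resource_type", "L3Cache");

    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_HW_GUIDED, 0, info, &newcomm);

    MPI_Info_free(&info);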

edgargabriel commented 1 month ago

@bosilca does that mean that a process that does not fulfill this criterion (i.e., utilizes that specific hardware resource type instance and no other instance of the same hardware resource type) should actually not be part of any resulting communicator, e.g. get MPI_COMM_NULL?

rhc54 commented 1 month ago

I am happy to take any help @rhc54, I couldn't find a way to map 8 processes onto a group of 3 CCDs, that is, the first three ranks to the first L3 domain, the next three ranks to the second L3 domain, and the last two ranks to the third L3 domain, given that the node has a second package with another 3 L3 domains where we would like to repeat the same pattern. I tried mapping by L3 domain, but it didn't do what we wanted, not because the mapping didn't work correctly, but because I couldn't find a syntax to express this slight imbalance of 3/3/2 ranks per L3 domain.

Yeah, we do that frequently - can you post your topology so I can give you the correct cmd line?

edgargabriel commented 1 month ago

@rhc54 the output of hwloc-ls is listed above in the ticket; is that what you are looking for, or do you need additional information?

rhc54 commented 1 month ago

I need the actual topology file output - the XML output - so I can use it as input to PRRTE.

bosilca commented 1 month ago

@bosilca does that mean that a process that does not fulfill this criterion (i.e., utilizes that specific hardware resource type instance and no other instance of the same hardware resource type) should actually not be part of any resulting communicator, e.g. get MPI_COMM_NULL?

Yes, we can return MPI_COMM_NULL instead of MPI_COMM_SELF for the processes that are not strictly bound to a single instance of the specified resource.
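
If that becomes the behavior, user code calling the split would need to handle a null result explicitly; a sketch of the caller-side handling:

    MPI_Comm l3comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, 0, MPI_INFO_NULL, &l3comm);

    if (l3comm == MPI_COMM_NULL) {
        /* This process is not strictly contained in a single L3 domain;
         * fall back to a self-communicator (or skip the L3-local path). */
        MPI_Comm_dup(MPI_COMM_SELF, &l3comm);
    }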

edgargabriel commented 1 month ago

I need the actual topology file output - the XML output - so I can use it as input to PRRTE.

@rhc54 I pinged you on Slack for that, thank you!

rhc54 commented 1 week ago

@edgargabriel Sent you a note over the weekend to verify the fix - committed upstream.