mshanthagit opened this issue 1 month ago
MPI_Comm_split_type creates disjoint communicators, which means the sum of the sizes of the sub-communicators should be 8 (the size of your MPI_COMM_WORLD).
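As an aside, that disjointness invariant is easy to check programmatically. The sketch below is not from the issue; it reuses the Open MPI-specific OMPI_COMM_TYPE_L3CACHE constant from the reporter's program and has the local rank 0 of each sub-communicator contribute its size, so the sum over MPI_COMM_WORLD must equal the world size whenever the split really produces a partition.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int wsize;
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);

    /* Same split as in the reported program (OMPI_COMM_TYPE_L3CACHE is the
     * Open MPI extension used in this issue). */
    MPI_Comm split;
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, 0,
                        MPI_INFO_NULL, &split);

    int srank, ssize;
    MPI_Comm_rank(split, &srank);
    MPI_Comm_size(split, &ssize);

    /* Only the local rank 0 of each sub-communicator contributes its size;
     * if the split partitions MPI_COMM_WORLD, the sum equals the world size. */
    int contribution = (0 == srank) ? ssize : 0;
    int total = 0;
    MPI_Allreduce(&contribution, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (0 == srank && total != wsize) {
        printf("sub-communicators are not a partition: sum %d != %d\n",
               total, wsize);
    }

    MPI_Comm_free(&split);
    MPI_Finalize();
    return 0;
}
```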
Please provide the output of hwloc-ls so we can see how the cores are numbered on the node.
Based on the output, it looks like rank 2, which is bound to cores spanning two L3 domains (cores 0-7 are on the first L3 domain, cores 8-15 on the second, and so on), created a communicator that includes all ranks bound to either the first or the second L3 domain, which is why it ends up with 6 ranks. Rank 0, on the other hand, is bound only to cores on the first L3 domain, so its communicator includes only the ranks bound to that domain.
$ hwloc-ls
Package L#0
NUMANode L#0 (P#0 125GB)
L3 L#0 (32MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#48)
L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#49)
L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#50)
L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#51)
L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#52)
L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#53)
L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#54)
L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#55)
L3 L#1 (32MB)
L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#56)
L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#57)
L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#58)
L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#59)
L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#60)
L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#61)
L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
PU L#28 (P#14)
PU L#29 (P#62)
L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
PU L#30 (P#15)
PU L#31 (P#63)
L3 L#2 (32MB)
L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
PU L#32 (P#16)
PU L#33 (P#64)
L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
PU L#34 (P#17)
PU L#35 (P#65)
L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
PU L#36 (P#18)
PU L#37 (P#66)
L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
PU L#38 (P#19)
PU L#39 (P#67)
L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
PU L#40 (P#20)
PU L#41 (P#68)
L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
PU L#42 (P#21)
PU L#43 (P#69)
L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
PU L#44 (P#22)
PU L#45 (P#70)
L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
PU L#46 (P#23)
PU L#47 (P#71)
...
Package L#1
NUMANode L#1 (P#1 126GB)
L3 L#3 (32MB)
L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
PU L#48 (P#24)
PU L#49 (P#72)
L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
PU L#50 (P#25)
PU L#51 (P#73)
L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
PU L#52 (P#26)
PU L#53 (P#74)
L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
PU L#54 (P#27)
PU L#55 (P#75)
L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
PU L#56 (P#28)
PU L#57 (P#76)
L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
PU L#58 (P#29)
PU L#59 (P#77)
L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
PU L#60 (P#30)
PU L#61 (P#78)
L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
PU L#62 (P#31)
PU L#63 (P#79)
L3 L#4 (32MB)
L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
PU L#64 (P#32)
PU L#65 (P#80)
L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
PU L#66 (P#33)
PU L#67 (P#81)
L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
PU L#68 (P#34)
PU L#69 (P#82)
L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
PU L#70 (P#35)
PU L#71 (P#83)
L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36
PU L#72 (P#36)
PU L#73 (P#84)
L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37
PU L#74 (P#37)
PU L#75 (P#85)
L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
PU L#76 (P#38)
PU L#77 (P#86)
L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
PU L#78 (P#39)
PU L#79 (P#87)
L3 L#5 (32MB)
L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40
PU L#80 (P#40)
PU L#81 (P#88)
L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41
PU L#82 (P#41)
PU L#83 (P#89)
L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42
PU L#84 (P#42)
PU L#85 (P#90)
L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43
PU L#86 (P#43)
PU L#87 (P#91)
L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44
PU L#88 (P#44)
PU L#89 (P#92)
L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45
PU L#90 (P#45)
PU L#91 (P#93)
L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46
PU L#92 (P#46)
PU L#93 (P#94)
L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47
PU L#94 (P#47)
PU L#95 (P#95)
Funny, that's kind of what I suspected based on the published documentation of the Genoa architecture, but with the topology output things are much clearer.
OMPI's current split-by-type code uses topology masks (combined with an OR operation), which explains why ranks mapped across multiple levels think they belong to a larger group. This also means OMPI can only support symmetric cases, a sensible approach from my perspective.
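For illustration only (this is not the actual Open MPI code, and the helper name is hypothetical), the OR-based mask construction described above might look roughly like this: a rank's L3-level mask is the union of the cpusets of every L3 cache its binding touches, so a rank like rank 2 (cores 6-8) ends up with a mask covering both the first and the second L3 domain and therefore intersects every rank bound in either one.

```c
#include <hwloc.h>

/* Build an L3-level mask for a process: the union (OR) of the cpusets of
 * every L3 cache its binding touches.  Sketch only. */
static hwloc_bitmap_t l3_level_mask(hwloc_topology_t topo,
                                    hwloc_const_cpuset_t binding)
{
    hwloc_bitmap_t mask = hwloc_bitmap_alloc();
    int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
    for (int i = 0; i < n; i++) {
        hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, i);
        if (hwloc_bitmap_intersects(binding, l3->cpuset)) {
            /* rank 2 (cores 6-8) touches L3#0 and L3#1, so its mask ends up
             * covering cores 0-15 and overlaps every rank bound there */
            hwloc_bitmap_or(mask, mask, l3->cpuset);
        }
    }
    return mask;
}
```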
Indeed, taking into account that the split creates disjoint communicators, what should we expect from a non-symmetric case like the one here? One potential outcome could be to eliminate the processes spanning multiple domains from the split operation and return something like this:
comm(0): 0, 1
comm(1): 0, 1
comm(2): MPI_COMM_SELF
comm(3): 3, 4
comm(4): 3, 4
comm(5): MPI_COMM_SELF
comm(6): 6, 7
comm(7): 6, 7
In this case it is not obvious to users why some processes (here 2 and 5) are not part of a larger set.
Is it still possible to create disjoint communicators for non-symmetric cases, with each process assigned to a specific communicator based on some rule, say "I only belong to the first domain (in some ordering)"?
@mshanthagit that is also what I was about to suggest :-)
@bosilca I agree that there is no clear and good solution in this scenario, so the primary goal has to be that it is consistent across processes. That being said, having processes sit by themselves in a communicator is probably not desirable.
Would it make sense to have a rule along these lines: if a process is part of multiple L3 domains, the Comm_split_type is applied as if the process were only part of the first domain it belongs to?
What is the first domain? One could argue that a process should belong to the location where most of its resources are bound. Or to the location where fewer of its resources are bound?
I quickly looked into the code base, and we use this API extensively in the collective modules. We need to find a solution that would make sense for our internal collectives as well.
What is the first domain? One could argue that a process should belong to the location where most of its resources are bound. Or to the location where fewer of its resources are bound?
If we can easily figure out where most of its resources are bound, sure. Otherwise, I would define the 'first domain' as the domain of the first core that the process is bound to.
I quickly looked into the code base, and we use this API extensively in the collective modules. We need to find a solution that would make sense for our internal collectives as well.
Yes, that's how we found the issue, from its usage in a collective component :-)
The simple fix, aka strict matching, is relatively easy: change hwloc_bitmap_intersects to hwloc_bitmap_isequal in opal_hwloc_compute_relative_locality (opal/mca/hwloc/base/hwloc_base_util.c:638). The more complex matching that @edgargabriel proposed would require a complete rewrite of opal_hwloc_compute_relative_locality and a different storage of the locality in PMIx and opal_proc_t.
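A rough sketch of the semantic difference between the two tests, assuming each process's L3-level locality is available as an hwloc bitmap (for instance the union mask sketched earlier). The real logic lives in opal_hwloc_compute_relative_locality and differs in its details; this only illustrates what the one-line change buys.

```c
#include <hwloc.h>
#include <stdbool.h>

/* Current behavior: any overlap of the two L3-level masks marks the peers
 * as sharing an L3, so a rank spanning two L3 domains matches everyone
 * bound in either domain. */
static bool share_l3_overlap(hwloc_const_bitmap_t mask1,
                             hwloc_const_bitmap_t mask2)
{
    return hwloc_bitmap_intersects(mask1, mask2) != 0;
}

/* Strict matching: the two masks must be identical, so rank 2 (whose mask
 * covers two L3 domains) matches nobody and ends up alone, while ranks 0
 * and 1 (both covering only L3#0) still match each other. */
static bool share_l3_strict(hwloc_const_bitmap_t mask1,
                            hwloc_const_bitmap_t mask2)
{
    return hwloc_bitmap_isequal(mask1, mask2) != 0;
}
```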
@bosilca what does strict matching do? Say in the example above, how does the split happen?
Strictly the same binding at a specific level. I gave the outcome for the example provided here a few comments above. On Genoa, with the bindings provided by the user (-np 8 --map-by ppr:8:numa:pe=3), the split by type for L3 locality will return:
comm(0): 0, 1
comm(1): 0, 1
comm(2): MPI_COMM_SELF
comm(3): 3, 4
comm(4): 3, 4
comm(5): MPI_COMM_SELF
comm(6): 6, 7
comm(7): 6, 7
@bosilca this is not consistent: comm(2) cannot think that it is alone in its communicator while comm(0) and comm(1) think that rank 2 is part of their subgroup
@bosilca I think it could lead to incorrect behavior (pardon me if I am wrong). Say I do an allreduce with the new comm after split, what's the behavior? What will rank 2 have?
Ignore, I thought I saw 2 in other comms
@mshanthagit as @edgargabriel noticed, my example was incorrect. It should now be fixed: ranks 2 and 5 will each be alone in their own communicator.
Maybe we should do a two-step process: the first step is to add the simple solution that @bosilca suggested and backport it to 5.0.x and 4.1.x. There is clearly value in this solution, since it fixes an inconsistency.
If somebody has the cycles to work on my suggested approach, we can still do that at a later stage, maybe for 6.0. (There is a good chance that it might not happen, though, in which case we at least have the first fix.)
Just thinking out loud, the solution @bosilca suggested creates 5 communicators whereas one would expect 3 (as there are 3 L3 domains). Will there be any side effects?
Honestly, I think that users binding processes as in the example here (overlapping several domains) deserve what they get, and any split type is fine as long as it is consistent across the board. The strict mode has the advantage of being a one-liner.
I don't necessarily disagree with you @bosilca, I would just caution that sometimes these decisions are not dominated by MPI requirements. In this particular instance, the decision was driven by compute performance, which was significantly better than with --map-by ppr:8:numa:pe=2 or --map-by ppr:8:numa:pe=4, and we just have to find a way to make the best out of that on the MPI side.
Or maybe a different mapping pattern? Not sure exactly what you are trying to achieve, but it seems like mapping to L3 is something we have already enabled, so I'm a tad confused.
I am happy to take any help @rhc54, I couldn't find a way to map 8 processes onto a group of 3 CCDs, that is the first three ranks to the first L3 domain, the next three ranks to the second L3 domain, and the last two ranks to the third L3 domain, given that the node has a second package with another 3 L3 domains and we would like to repeat the same pattern there. I tried mapping by L3 domains, but it didn't do what we wanted, not because the mapping didn't work correctly, but because I couldn't find a syntax to express this slight imbalance of 3/3/2 ranks per L3 domain.
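As an aside, while the mapping syntax question is being sorted out, the intended 3/3/2-per-package grouping could also be built by hand with MPI_Comm_split. The sketch below is hypothetical and assumes the ppr:8:numa mapping from this issue, i.e. ranks numbered consecutively within each NUMA domain; that assumption is not guaranteed by the thread.

```c
#include <mpi.h>

/* Manually split into 3/3/2 groups per NUMA domain, assuming 8 consecutive
 * ranks per NUMA domain as produced by "--map-by ppr:8:numa:pe=3". */
static void split_3_3_2(MPI_Comm comm, MPI_Comm *l3_comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    const int ranks_per_numa = 8;        /* from the ppr:8:numa mapping */
    int numa  = rank / ranks_per_numa;   /* which NUMA domain this rank is in */
    int local = rank % ranks_per_numa;   /* position within that NUMA domain */

    /* first three ranks -> group 0, next three -> group 1, last two -> group 2 */
    int group = (local < 3) ? 0 : (local < 6) ? 1 : 2;
    int color = numa * 3 + group;        /* unique color per (numa, group) pair */

    MPI_Comm_split(comm, color, rank, l3_comm);
}
```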
There is only so much we can do automatically in the MPI library. For everything else, users can fall back to either a manual MPI_Comm_split or to a guided split. However, for the more precise MPI_COMM_TYPE_HW_GUIDED split, the MPI standard clarifies:
The MPI processes in the group associated with the output communicator newcomm utilize that specific hardware resource type instance, and no other instance of the same hardware resource type.
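For reference, a guided split might look like the sketch below (MPI-4.0). The info key "mpi_hw_resource_type" is defined by the standard; the value used here to name the L3 level ("L3Cache") is implementation-defined and only an assumption, so check what your MPI reports before relying on it.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* "mpi_hw_resource_type" is the standard key; the value naming the L3
     * level is implementation-defined and assumed here. */
    MPI_Info_set(info, "mpi_hw_resource_type", "L3Cache");

    MPI_Comm l3_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_HW_GUIDED, 0,
                        info, &l3_comm);

    if (MPI_COMM_NULL != l3_comm) {
        int lrank, lsize;
        MPI_Comm_rank(l3_comm, &lrank);
        MPI_Comm_size(l3_comm, &lsize);
        printf("guided split: local rank %d of %d\n", lrank, lsize);
        MPI_Comm_free(&l3_comm);
    } else {
        /* per the quoted wording, a rank not tied to a single resource
         * instance could legitimately end up here */
    }

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```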
@bosilca does that mean that a process that does not fulfill this criterion (i.e., utilize that specific hardware resource type instance, and no other instance of the same hardware resource type) should actually not be part of any resulting communicator, i.e., receive MPI_COMM_NULL?
I am happy to take any help @rhc54, I couldn't find a way to map 8 processes onto a group of 3 CCDs, that is the first three ranks to the first L3 domain, the next three ranks to the second L3 domain, and the last two ranks to the third L3 domain, given that the node has a second package with another 3 L3 domains and we would like to repeat the same pattern there. I tried mapping by L3 domains, but it didn't do what we wanted, not because the mapping didn't work correctly, but because I couldn't find a syntax to express this slight imbalance of 3/3/2 ranks per L3 domain.
Yeah, we do that frequently - can you post your topology so I can give you the correct cmd line?
@rhc54 the output of hwloc-ls is listed above in the ticket; is that what you are looking for, or do you need additional information?
I need the actual topology file output - the XML output - so I can use it as input to PRRTE.
@bosilca does that mean that a process that does not fulfill this criterion (i.e., utilize that specific hardware resource type instance, and no other instance of the same hardware resource type) should actually not be part of any resulting communicator, i.e., receive MPI_COMM_NULL?
Yes, we can return MPI_COMM_NULL instead of MPI_COMM_SELF for the processes that are not strictly bound to a single instance of the specified resource.
I need the actual topology file output - the XML output - so I can use it as input to PRRTE.
@rhc54 I pinged you on Slack for that, thank you!
@edgargabriel Sent you a note over the weekend to verify the fix - committed upstream.
Thank you for taking the time to submit an issue!
Background information
MPI_Comm_split_type is creating groups with overlapping ranks in certain cases where ranks are bound to cores across resource domains (say, L3). For example, consider the following case, where ranks 2 and 5 each span two L3 domains (the program calls MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, 0, info, &newcomm); on a Genoa machine with 8 cores per L3):
mpirun -np 8 --map-by ppr:8:numa:pe=3 --report-bindings ./a.out
[electra016:1374129] Rank 0 bound to package[0][core:0-2]
[electra016:1374129] Rank 1 bound to package[0][core:3-5]
[electra016:1374129] Rank 2 bound to package[0][core:6-8]
[electra016:1374129] Rank 3 bound to package[0][core:9-11]
[electra016:1374129] Rank 4 bound to package[0][core:12-14]
[electra016:1374129] Rank 5 bound to package[0][core:15-17]
[electra016:1374129] Rank 6 bound to package[0][core:18-20]
[electra016:1374129] Rank 7 bound to package[0][core:21-23]
Hello --- my rank: 0, my comm_size: 8
Hello --- my rank: 1, my comm_size: 8
Hello --- my rank: 7, my comm_size: 8
Hello --- my rank: 6, my comm_size: 8
Hello --- my rank: 5, my comm_size: 8
Hello --- my rank: 4, my comm_size: 8
Hello --- my rank: 3, my comm_size: 8
Hello --- my rank: 2, my comm_size: 8
From split comm: my rank: 0, my split_comm_size: 3
From split comm: my rank: 2, my split_comm_size: 6
From split comm: my rank: 4, my split_comm_size: 4
From split comm: my rank: 6, my split_comm_size: 3
From split comm: my rank: 1, my split_comm_size: 3
From split comm: my rank: 3, my split_comm_size: 4
From split comm: my rank: 5, my split_comm_size: 6
From split comm: my rank: 7, my split_comm_size: 3
As we can see from the above, there are only two ranks with comm_size 6! Although the program doesn't print the ranks within each communicator, here is what they would be:
comm(0): 0, 1, 2
comm(1): 0, 1, 2
comm(2): 0, 1, 2, 3, 4, 5
comm(3): 2, 3, 4, 5
comm(4): 2, 3, 4, 5
comm(5): 2, 3, 4, 5, 6, 7
comm(6): 5, 6, 7
comm(7): 5, 6, 7
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
I tested with 5.0.x and 4.1.6
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From source (5.0.x)
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
Details in the background section. Here is an example program:
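The reporter's program is not attached here; a minimal reproducer consistent with the output shown in the background section might look like the following (a reconstruction, not the original code):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello --- my rank: %d, my comm_size: %d\n", rank, size);

    MPI_Info info;
    MPI_Info_create(&info);

    /* split by L3 locality using the Open MPI extension from the report */
    MPI_Comm newcomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, 0, info, &newcomm);

    int split_size;
    MPI_Comm_size(newcomm, &split_size);
    printf("From split comm: my rank: %d, my split_comm_size: %d\n", rank, split_size);

    MPI_Comm_free(&newcomm);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```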