Closed: hunsa closed this issue 6 years ago.
Hello, I'm not sure if this will help but here we go.
You can check your process binding with the mpirun flag `--report-bindings`.
Open MPI process binding defaults can be found on Open MPI website.
You can enforce process binding by using a rankfile, or with the built-in flag `--bind-to [core,socket,none]` (I believe the default is none, which is why the processes are not bound). Combined with `--map-by [node,core,socket,numa]`, you should be able to dictate any kind of binding you want.
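For example, a rankfile pinning four ranks to specific sockets and cores on two hypothetical hosts (`nodeA`, `nodeB` here are placeholders; the `slot=<socket>:<core>` syntax is described in the mpirun man page) could look like:

```
rank 0=nodeA slot=0:0
rank 1=nodeA slot=1:0
rank 2=nodeB slot=0:0
rank 3=nodeB slot=1:0
```

and would be passed to mpirun via `--rankfile <file>`.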
Also, if you are using OPA, it might be better to use the ofi MTL, as it is maintained by Intel; you should get better performance there. To force Open MPI to use the ofi MTL, add `-mca mtl ofi` to the run command.
[hydra01:89124] [rank=0] openib: using port hfi1_1:1
[hydra01:89124] [rank=0] openib: using port hfi1_0:1
[hydra01:89125] [rank=1] openib: using port hfi1_1:1
[hydra01:89125] [rank=1] openib: using port hfi1_0:1
[hydra01:89126] [rank=2] openib: using port hfi1_1:1
[hydra01:89126] [rank=2] openib: using port hfi1_0:1
I'm not sure how to read these lines. Since you are not forcing the openib btl, this execution should actually be running on the PSM2 MTL, which has the highest default priority for Omni-Path. I would suggest adding `-x PSM2_IDENTIFY=1 -mca mtl_base_verbose 10`.
With that said, libpsm2 has its own built-in mechanism to choose the hfi/port to use. See section 9.0 of https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_Fabric_Host_Software_UG_H76470_v9_0.pdf
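As a side note, the openib port lines quoted above can be summarized per rank with a small script. This is just a sketch; the regex matches the exact line format shown in this thread:

```python
import re

# Parse lines like "[hydra01:89124] [rank=0] openib: using port hfi1_1:1"
# (the format shown in the output above) into a rank -> ports mapping.
LINE_RE = re.compile(r"\[rank=(\d+)\] openib: using port (\w+):(\d+)")

def ports_by_rank(lines):
    out = {}
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            rank, dev, port = int(m.group(1)), m.group(2), int(m.group(3))
            out.setdefault(rank, []).append((dev, port))
    return out

log = [
    "[hydra01:89124] [rank=0] openib: using port hfi1_1:1",
    "[hydra01:89124] [rank=0] openib: using port hfi1_0:1",
    "[hydra01:89125] [rank=1] openib: using port hfi1_1:1",
]
print(ports_by_rank(log))
# {0: [('hfi1_1', 1), ('hfi1_0', 1)], 1: [('hfi1_1', 1)]}
```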
Thanks thananon for pointing out `--mca mtl ofi`.
However, when using ofi I run into the same problem: it does bind the MCW rank, but when computing the distances it fails with `Process is not bound`.
(Btw, the mpirun documentation says: `--bind-to <foo>`: bind processes to the specified object; defaults to core.)
mpirun --mca mtl ofi --mca btl_base_verbose 5 --mca ess_base_verbose 5 -np 32 --hostfile ~/tmp/machinefile_36 ./mpi/collective/osu_allreduce -f -m 1:2048
[hydra01:265012] MCW rank 1 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265014] MCW rank 3 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265017] MCW rank 5 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265013] MCW rank 2 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:265015] MCW rank 4 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:265011] MCW rank 0 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:265019] MCW rank 6 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:265021] MCW rank 7 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265012] Checking distance from this process to device=hfi1_1
[hydra01:265012] Process is not bound: distance to device is 0.000000
[hydra01:265012] Checking distance from this process to device=hfi1_0
[hydra01:265012] Process is not bound: distance to device is 0.000000
[hydra01:265015] Checking distance from this process to device=hfi1_1
...
Interestingly, if I force processes not to be bound, I get this:
mpirun --bind-to none --mca mtl ofi --mca btl_base_verbose 5 --mca ess_base_verbose 5 -np 32 --hostfile ~/tmp/machinefile_36 ./mpi/collective/osu_allreduce -f -m 1:2048
[hydra01:265434] MCW rank 0 is not bound (or bound to all available processors)
[hydra01:265434] Checking distance from this process to device=hfi1_1
[hydra01:265434] hwloc_distances->nbobjs=2
[hydra01:265434] hwloc_distances->latency[0]=1.000000
[hydra01:265434] hwloc_distances->latency[1]=2.100000
[hydra01:265434] hwloc_distances->latency[2]=2.100000
[hydra01:265434] hwloc_distances->latency[3]=1.000000
[hydra01:265434] ibv_obj->logical_index=1
[hydra01:265434] Process is bound: distance to device is 1.000000
[hydra01:265434] Checking distance from this process to device=hfi1_0
[hydra01:265434] hwloc_distances->nbobjs=2
[hydra01:265434] hwloc_distances->latency[0]=1.000000
[hydra01:265434] hwloc_distances->latency[1]=2.100000
[hydra01:265434] hwloc_distances->latency[2]=2.100000
[hydra01:265434] hwloc_distances->latency[3]=1.000000
[hydra01:265434] ibv_obj->logical_index=0
[hydra01:265434] Process is bound: distance to device is 1.000000
[hydra01:265434] [rank=0] openib: using port hfi1_1:1
[hydra01:265434] [rank=0] openib: using port hfi1_0:1
Now the `Process is bound` message appears, but the distances to the hfis are the same.
My problem is that I have no idea what the expected behavior in my case should be.
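For context on those numbers: the "distance" the openib btl prints comes from hwloc's NUMA latency matrix, which is dumped in the output above. With `nbobjs=2`, the flattened matrix is indexed as `latency[i*nbobjs + j]` for a process on NUMA node i reaching a device attached to node j. A rough sketch of that lookup, using the values from the log (my own illustration, not Open MPI's actual code):

```python
# hwloc reports a flattened nbobjs x nbobjs latency matrix; the values
# below are taken from the hwloc_distances dump in the log above.
nbobjs = 2
latency = [1.0, 2.1, 2.1, 1.0]

def device_distance(proc_numa, dev_numa):
    """Latency from the process's NUMA node to the device's NUMA node."""
    return latency[proc_numa * nbobjs + dev_numa]

# Per the log, hfi1_0 has logical_index 0 and hfi1_1 has logical_index 1.
print(device_distance(0, 0))  # local device  -> 1.0
print(device_distance(0, 1))  # remote device -> 2.1
```

An unbound process has no single NUMA node to start from, which is consistent with the `distance ... 0.000000` lines when binding information is missing.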
Thank you Matias, `-x PSM2_IDENTIFY=1` helped to actually use PSM2. That's a very good point.
Even after looking into the OPA user guide (as suggested), I am still not sure which MPI process is now using which hfi/port, or is that completely transparent to the MPI layer?
[hydra01:265012] MCW rank 1 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265014] MCW rank 3 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265017] MCW rank 5 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265013] MCW rank 2 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:265015] MCW rank 4 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:265011] MCW rank 0 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
From this result, it seems that the processes are bound by socket (0,2,4,... on one and 1,3,5,... on the other). I'm not familiar with this output. Is this from `--report-bindings`? (I would suggest you use that flag, as it gives you a visual.)
As for the binding default, I think @rhc54 might be able to answer this better than me. (maybe we need to update the doc?)
Now, selecting the right hfi for each MPI process is entirely up to the component. So for MTL/PSM2 and MTL/OFI, @matcabral might be able to help.
Hi @hunsa,
Yes, the device selection is transparent to MPI, since it is done by libpsm2. The PSM2_TRACEMASK env variable can dump useful info, but it may be a lot. Please try `-x PSM2_TRACEMASK=0x0002`; you will see lines like `...psmi_ep_open_device: [12262]use unit 0 port 1`, which show the PID and the hfi device and port. You will have to map PID to rank, since libpsm2 doesn't know about ranks. I think `--report-bindings` does show the PID and rank, right?
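That PID-to-rank join can be done mechanically with something like the sketch below (the regexes simply match the two line formats shown in this thread; the sample lines are taken from the outputs posted later):

```python
import re

# Join PSM2 trace lines (PID -> unit) with MCW binding lines (PID -> rank).
psm2_re = re.compile(r"psmi_ep_open_device: \[(\d+)\]use unit (\d+) port (\d+)")
mcw_re = re.compile(r"\[\w+:(\d+)\] MCW rank (\d+) bound")

def rank_to_unit(psm2_lines, mcw_lines):
    pid_unit = {int(m.group(1)): int(m.group(2))
                for m in map(psm2_re.search, psm2_lines) if m}
    pid_rank = {int(m.group(1)): int(m.group(2))
                for m in map(mcw_re.search, mcw_lines) if m}
    # Keep only PIDs that appear in both outputs.
    return {rank: pid_unit[pid] for pid, rank in pid_rank.items()
            if pid in pid_unit}

psm2 = ["hydra01.23064psmi_ep_open_device: [23064]use unit 0 port 1",
        "hydra01.23091psmi_ep_open_device: [23091]use unit 1 port 1"]
mcw = ["[hydra01:23064] MCW rank 0 bound to NM0:SK0:L30:L20:L10:CR0:HT0-1"]
print(rank_to_unit(psm2, mcw))  # {0: 0}
```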
@hunsa does this solve your problem?
I will close this issue if there is no more response this week.
Hi @thananon and @matcabral ,
I see much more clearly now. So, what works is the following. (I should note that I have 2 sockets per compute node, each comprising 16 cores.) That being said, if I run
mpirun -x PSM2_TRACEMASK=0x0002 --mca mtl ofi --mca btl_base_verbose 5 --mca ess_base_verbose 5 -np 17 --hostfile ~/tmp/machinefile_36 ./mpi/collective/osu_allreduce -f -m 1:1000000
[hydra01:22836] MCW rank 4 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:22834] MCW rank 2 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:22832] MCW rank 0 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:22835] MCW rank 3 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:22837] MCW rank 5 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:22833] MCW rank 1 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:22839] MCW rank 6 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:22836] Checking distance from this process to device=hfi1_1
[hydra01:22836] Process is not bound: distance to device is 0.000000
[hydra01:22836] Checking distance from this process to device=hfi1_0
[hydra01:22836] Process is not bound: distance to device is 0.000000
[hydra01:22832] Checking distance from this process to device=hfi1_1
[hydra01:22832] Process is not bound: distance to device is 0.000000
[hydra01:22832] Checking distance from this process to device=hfi1_0
[hydra01:22832] Process is not bound: distance to device is 0.000000
hydra01.22849psmi_ep_open_device: [22849]use unit 1 port 1
hydra01.22851psmi_ep_open_device: [22851]use unit 0 port 1
hydra01.22853psmi_ep_open_device: [22853]use unit 1 port 1
hydra01.22855psmi_ep_open_device: [22855]use unit 0 port 1
hydra01.22857psmi_ep_open_device: [22857]use unit 1 port 1
hydra01.22859psmi_ep_open_device: [22859]use unit 0 port 1
I do not get my desired behavior, as processes are not properly bound to cores. (The default binding and mapping strategies are `--bind-to core` and `--map-by socket`.)
In this case, I expected the processes to be bound to a specific core on alternating sockets. But that's not the case.
However, if I use `--map-by core`, I get this:
mpirun --map-by core -x PSM2_TRACEMASK=0x0002 --mca mtl ofi --mca btl_base_verbose 5 --mca ess_base_verbose 5 -np 17 --hostfile ~/tmp/machinefile_36 ./mpi/collective/osu_allreduce -f -m 1:1000000
[hydra01:23066] MCW rank 2 bound to NM0:SK0:L30:L22:L12:CR2:HT4-5
[hydra01:23067] MCW rank 3 bound to NM0:SK0:L30:L23:L13:CR3:HT6-7
[hydra01:23065] MCW rank 1 bound to NM0:SK0:L30:L21:L11:CR1:HT2-3
[hydra01:23068] MCW rank 4 bound to NM0:SK0:L30:L24:L14:CR4:HT8-9
[hydra01:23069] MCW rank 5 bound to NM0:SK0:L30:L25:L15:CR5:HT10-11
[hydra01:23064] MCW rank 0 bound to NM0:SK0:L30:L20:L10:CR0:HT0-1
[hydra01:23071] MCW rank 6 bound to NM0:SK0:L30:L26:L16:CR6:HT12-13
[hydra01:23073] MCW rank 7 bound to NM0:SK0:L30:L27:L17:CR7:HT14-15
[hydra01:23075] MCW rank 8 bound to NM0:SK0:L30:L28:L18:CR8:HT16-17
[hydra01:23064] Checking distance from this process to device=hfi1_1
[hydra01:23064] Process is not bound: distance to device is 0.000000
[hydra01:23064] Checking distance from this process to device=hfi1_0
[hydra01:23064] Process is not bound: distance to device is 0.000000
[hydra01:23069] Checking distance from this process to device=hfi1_1
[hydra01:23069] Process is not bound: distance to device is 0.000000
[hydra01:23069] Checking distance from this process to device=hfi1_0
[hydra01:23069] Process is not bound: distance to device is 0.000000
[hydra01:23067] Checking distance from this process to device=hfi1_1
...
hydra01.23085psmi_ep_open_device: [23085]use unit 0 port 1
hydra01.23069psmi_ep_open_device: [23069]use unit 0 port 1
hydra01.23066psmi_ep_open_device: [23066]use unit 0 port 1
hydra01.23077psmi_ep_open_device: [23077]use unit 0 port 1
hydra01.23079psmi_ep_open_device: [23079]use unit 0 port 1
hydra01.23083psmi_ep_open_device: [23083]use unit 0 port 1
hydra01.23073psmi_ep_open_device: [23073]use unit 0 port 1
hydra01.23064psmi_ep_open_device: [23064]use unit 0 port 1
hydra01.23075psmi_ep_open_device: [23075]use unit 0 port 1
hydra01.23091psmi_ep_open_device: [23091]use unit 1 port 1
hydra01.23089psmi_ep_open_device: [23089]use unit 0 port 1
hydra01.23087psmi_ep_open_device: [23087]use unit 0 port 1
which is exactly as desired. The first 16 processes are pinned to the first socket and use unit 0, and the remaining one uses unit 1. Perfect. I can live with this for now.
BUT, there are two questions remaining:
1) Why is `--map-by socket` in combination with `--bind-to core` not binding processes to individual cores?
2) Why, although ranks are obviously bound (e.g., `MCW rank 2 bound to...`), do we still see messages like
[hydra01:23064] Checking distance from this process to device=hfi1_1
[hydra01:23064] Process is not bound: distance to device is 0.000000
?
So, if I were not using the PSM2 device, I would be in trouble.
Thanks for your help
[hydra01:23064] Checking distance from this process to device=hfi1_1
[hydra01:23064] Process is not bound: distance to device is 0.000000
I think you are getting these messages from the openib btl, which is actually not being used; therefore, they are misleading. Since you are using an MTL, `-mca btl_base_verbose 5` may not be very useful here. Try `-mca mtl_base_verbose xyz` for more relevant info.
Btw, note that you are using the ofi MTL to run on psm2, which is a valid use case. However, you can also use PSM2 directly, without going through ofi (libfabric), by specifying `-mca mtl psm2`. This, btw, is selected by default if you pass no parameters and are running on Omni-Path.
Why is --map-by socket in combination with --bind-to core not binding processes to individual cores?
I believe the default bindings you read about are wrongly documented. Personally, I always explicitly specify what I want and use `--report-bindings` to verify it.
I would suggest you run your app again, but this time explicitly specify `--map-by socket --bind-to core`. I'm pretty sure Open MPI will do exactly that (the result should be different from what you posted here). If not, we have a bug. Please report back.
In your second result, you did specify `--map-by core`, so you got your intended result.
Why although ranks are obviously bound (e.g., MCW rank 2 bound to...), do we still see messages like
I'm not well equipped to answer this question. It might be an interaction with PMIx binding that does not report back to opal (hence the cpuset not being set). I'm pretty sure the bindings from `--report-bindings` are correct, but the printout in the openib btl is just not getting updated.
Also, as Matias said, try not to use `-mca btl_base_verbose xyz` now that you are using an MTL. I think it is okay in this case, but it might be misleading, as you are printing information from a component you are not using. (In this case, you printed output from the openib btl initialization.)
@jsquyres I am out of my depth. Maybe you can shed some light on this?
FWIW, I have been using other vendors' NICs, and some of them handle NIC selection by distance very well.
@thananon Here it is with `--map-by socket --bind-to core`:
mpirun --map-by socket --bind-to core --mca ess_base_verbose 5 -np 17 --hostfile ~/tmp/machinefile_36 ./mpi/collective/osu_allreduce -f -m 1:1000000
[hydra01:23669] MCW rank 3 bound to NM0:SK0:L30:L23:L13:CR3:HT6-7
[hydra01:23668] MCW rank 2 bound to NM0:SK0:L30:L22:L12:CR2:HT4-5
[hydra01:23670] MCW rank 4 bound to NM0:SK0:L30:L24:L14:CR4:HT8-9
[hydra01:23671] MCW rank 5 bound to NM0:SK0:L30:L25:L15:CR5:HT10-11
[hydra01:23666] MCW rank 0 bound to NM0:SK0:L30:L20:L10:CR0:HT0-1
[hydra01:23667] MCW rank 1 bound to NM0:SK0:L30:L21:L11:CR1:HT2-3
[hydra01:23674] MCW rank 6 bound to NM0:SK0:L30:L26:L16:CR6:HT12-13
[hydra01:23676] MCW rank 7 bound to NM0:SK0:L30:L27:L17:CR7:HT14-15
[hydra01:23677] MCW rank 8 bound to NM0:SK0:L30:L28:L18:CR8:HT16-17
..
You were right.
I get it now (after consulting `man mpirun` one more time): for np > 2, we get `--bind-to socket` and `--map-by socket`. That's what I got.
So this default of `bind-to core` is somewhat misleading, as it is not really used; the np rule applies most of the time.
But good, we can close this now. I have a solution and know what's going on.
Thanks
Alright. Glad we can help!
I have a multi-rail setup with 2 OPA NICs and I am running Open MPI 3.0.1.
AFAIK, Open MPI should select the NIC to be used by a process depending on where the process resides. Thus, I wanted to check which NIC has actually been selected for each rank. The problem is that I get the following message: "Process is not bound: distance to device is 0.000000". (The full output is attached at the end.)
In `ompi/opal/mca/btl/openib/btl_openib_component.c`, I checked the value in question; it evaluates to NULL in my case. So, I checked `orte/mca/ess/base/ess_base_fns.c`, because that is the place where the `cpuset` should be set. Here, I enter the conditional (line 60) and return with `ORTE_SUCCESS`. In this code path, no `cpuset` is obtained, and I then obviously fail in `btl_openib_component.c`.
So, I would like to know what I am doing wrong, or how the process binding is supposed to work in my case.
I'd very much appreciate your help.
Here is the output (I omit the OSU benchmark output, as it provides no valuable information for this problem):