open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Process is not bound #5091

Closed. hunsa closed this issue 6 years ago.

hunsa commented 6 years ago

I have a multi-rail setup with 2 OPA NICs and I am running Open MPI 3.0.1.

AFAIK, Open MPI should select the NIC to be used by a process depending on where the process resides. Thus, I wanted to check which NIC has actually been selected for each rank. The problem is that I get the following message: "Process is not bound: distance to device is 0.000000". (The full output is attached at the end.)

In ompi/opal/mca/btl/openib/btl_openib_component.c, I checked this condition:

if (opal_process_info.cpuset) {

In my case, opal_process_info.cpuset is NULL, so this branch is not taken. I therefore checked orte/mca/ess/base/ess_base_fns.c, because that is where the cpuset should be set.

There, I enter the conditional at line 60

if (NULL != getenv(OPAL_MCA_PREFIX"orte_bound_at_launch")) {

and return with ORTE_SUCCESS

In this code path, no cpuset is obtained, so the check in btl_openib_component.c subsequently fails.
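
As an independent sanity check (a minimal sketch, assuming a Linux /proc filesystem; the hostfile is the one from my command below), one can also ask the kernel directly which CPUs each launched process is allowed to run on:

# Print, for every rank, the kernel's view of the CPUs it may run on.
# An unbound process reports the full CPU list (e.g. 0-63), a bound one a subset.
mpirun -np 4 --hostfile ~/tmp/machinefile4 \
    sh -c 'echo "$(hostname) pid $$: $(grep Cpus_allowed_list /proc/self/status)"'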

So, I would like to know what I am doing wrong or how the process binding is supposed to work in my case.

I'd very much appreciate your help.

Here is the output (I omit the OSU benchmark output, as it provides no useful information for this problem):

$ ompi_info |grep Open
                 Package: Open MPI hunold@hydra Distribution
                Open MPI: 3.0.1
  Open MPI repo revision: v3.0.1
   Open MPI release date: Mar 29, 2018
                Open RTE: 3.0.1
  Open RTE repo revision: v3.0.1
   Open RTE release date: Mar 29, 2018

$ ompi_info |grep hwloc
               MCA hwloc: hwloc1117 (MCA v2.1.0, API v2.0.0, Component v3.0.1)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v3.0.1)

$ mpirun --mca btl_base_verbose 5 --mca ess_base_verbose 5  -np 4 --hostfile ~/tmp/machinefile4  ./mpi/collective/osu_gather
[hydra01:89124] MCW rank 0 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:89125] MCW rank 1 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:89126] MCW rank 2 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:89127] MCW rank 3 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:89124] Checking distance from this process to device=hfi1_1
[hydra01:89124] Process is not bound: distance to device is 0.000000
[hydra01:89124] Checking distance from this process to device=hfi1_0
[hydra01:89124] Process is not bound: distance to device is 0.000000
[hydra01:89125] Checking distance from this process to device=hfi1_1
[hydra01:89125] Process is not bound: distance to device is 0.000000
[hydra01:89125] Checking distance from this process to device=hfi1_0
[hydra01:89125] Process is not bound: distance to device is 0.000000
[hydra01:89126] Checking distance from this process to device=hfi1_1
[hydra01:89126] Process is not bound: distance to device is 0.000000
[hydra01:89126] Checking distance from this process to device=hfi1_0
[hydra01:89126] Process is not bound: distance to device is 0.000000
[hydra01:89124] [rank=0] openib: using port hfi1_1:1
[hydra01:89124] [rank=0] openib: using port hfi1_0:1
[hydra01:89125] [rank=1] openib: using port hfi1_1:1
[hydra01:89125] [rank=1] openib: using port hfi1_0:1
[hydra01:89126] [rank=2] openib: using port hfi1_1:1
[hydra01:89126] [rank=2] openib: using port hfi1_0:1
[hydra01:89127] Checking distance from this process to device=hfi1_1
[hydra01:89127] Process is not bound: distance to device is 0.000000
[hydra01:89127] Checking distance from this process to device=hfi1_0
[hydra01:89127] Process is not bound: distance to device is 0.000000
[hydra01:89127] [rank=3] openib: using port hfi1_1:1
[hydra01:89127] [rank=3] openib: using port hfi1_0:1
thananon commented 6 years ago

Hello, I'm not sure if this will help but here we go.

You can check your process binding with the mpirun flag --report-bindings.

The Open MPI process-binding defaults can be found on the Open MPI website.

You can enforce process binding with a rankfile or with the built-in flag --bind-to [core,socket,none] (I believe the default is none, which is why your processes are not bound).

Combined with --map-by [node,core,socket,numa], you should be able to dictate any binding you want.
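
For example, something like this (an untested sketch reusing the hostfile and benchmark from your command above) should map ranks round-robin across sockets, bind each rank to a single core, and print the resulting bindings:

# explicit mapping and binding, plus a printout to verify them
mpirun --report-bindings --map-by socket --bind-to core \
    -np 4 --hostfile ~/tmp/machinefile4 ./mpi/collective/osu_gather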

thananon commented 6 years ago

Also, if you are using OPA, it might be better to use the ofi MTL, as it is maintained by Intel; you should get better performance there. To force Open MPI to use the ofi MTL, pass -mca mtl ofi at runtime.
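
For example (a sketch based on your earlier command line):

# explicitly select the ofi MTL; everything else stays the same
mpirun --mca mtl ofi -np 4 --hostfile ~/tmp/machinefile4 ./mpi/collective/osu_gather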

matcabral commented 6 years ago

[hydra01:89124] [rank=0] openib: using port hfi1_1:1
[hydra01:89124] [rank=0] openib: using port hfi1_0:1
[hydra01:89125] [rank=1] openib: using port hfi1_1:1
[hydra01:89125] [rank=1] openib: using port hfi1_0:1
[hydra01:89126] [rank=2] openib: using port hfi1_1:1
[hydra01:89126] [rank=2] openib: using port hfi1_0:1

I'm not sure how to read these lines. Since you are not forcing the openib BTL, this execution should actually be running on the PSM2 MTL, which has the highest default priority for Omni-Path. I would suggest adding -x PSM2_IDENTIFY=1 -mca mtl_base_verbose 10.

That said, libpsm2 has its own built-in mechanism for choosing the hfi/port to use. See section 9.0 of https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_Fabric_Host_Software_UG_H76470_v9_0.pdf
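
For example, something like this (a sketch of the command line from your report with the suggested options added):

# PSM2_IDENTIFY=1 asks libpsm2 to print identification info at startup,
# and mtl_base_verbose shows which MTL was actually selected
mpirun -x PSM2_IDENTIFY=1 --mca mtl_base_verbose 10 \
    -np 4 --hostfile ~/tmp/machinefile4 ./mpi/collective/osu_gather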

hunsa commented 6 years ago

PART 1

Thanks @thananon for pointing out --mca mtl ofi. However, when using ofi I run into the same problem: the MCW rank is reported as bound, but when the distances are computed it still fails with "Process is not bound". (By the way, the mpirun documentation says: --bind-to <foo> Bind processes to the specified object, defaults to core.)

mpirun --mca mtl ofi  --mca btl_base_verbose 5 --mca ess_base_verbose 5 -np 32  --hostfile ~/tmp/machinefile_36  ./mpi/collective/osu_allreduce -f -m 1:2048
[hydra01:265012] MCW rank 1 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265014] MCW rank 3 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265017] MCW rank 5 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265013] MCW rank 2 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:265015] MCW rank 4 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:265011] MCW rank 0 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:265019] MCW rank 6 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:265021] MCW rank 7 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265012] Checking distance from this process to device=hfi1_1
[hydra01:265012] Process is not bound: distance to device is 0.000000
[hydra01:265012] Checking distance from this process to device=hfi1_0
[hydra01:265012] Process is not bound: distance to device is 0.000000
[hydra01:265015] Checking distance from this process to device=hfi1_1
...

Interestingly, if I force processes not to be bound, I get this

mpirun --bind-to none --mca mtl ofi  --mca btl_base_verbose 5 --mca ess_base_verbose 5 -np 32  --hostfile ~/tmp/machinefile_36  ./mpi/collective/osu_allreduce -f -m 1:2048
[hydra01:265434] MCW rank 0 is not bound (or bound to all available processors)
[hydra01:265434] Checking distance from this process to device=hfi1_1
[hydra01:265434] hwloc_distances->nbobjs=2
[hydra01:265434] hwloc_distances->latency[0]=1.000000
[hydra01:265434] hwloc_distances->latency[1]=2.100000
[hydra01:265434] hwloc_distances->latency[2]=2.100000
[hydra01:265434] hwloc_distances->latency[3]=1.000000
[hydra01:265434] ibv_obj->logical_index=1
[hydra01:265434] Process is bound: distance to device is 1.000000
[hydra01:265434] Checking distance from this process to device=hfi1_0
[hydra01:265434] hwloc_distances->nbobjs=2
[hydra01:265434] hwloc_distances->latency[0]=1.000000
[hydra01:265434] hwloc_distances->latency[1]=2.100000
[hydra01:265434] hwloc_distances->latency[2]=2.100000
[hydra01:265434] hwloc_distances->latency[3]=1.000000
[hydra01:265434] ibv_obj->logical_index=0
[hydra01:265434] Process is bound: distance to device is 1.000000
[hydra01:265434] [rank=0] openib: using port hfi1_1:1
[hydra01:265434] [rank=0] openib: using port hfi1_0:1

Now the process is reported as bound, but the distances to the two hfis are identical.

My problem is that I have no idea what the expected behavior in my case should be.

PART 2

Thank you Matias, -x PSM2_IDENTIFY=1 helped me verify that PSM2 is actually being used. That's a very good point. Even after looking into the OPA user guide (as suggested), I am still not sure which MPI process is now using which hfi/port, or is that completely transparent to the MPI layer?

thananon commented 6 years ago
[hydra01:265012] MCW rank 1 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265014] MCW rank 3 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265017] MCW rank 5 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:265013] MCW rank 2 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:265015] MCW rank 4 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:265011] MCW rank 0 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31

From this result, it seems that the processes are bound by socket (0, 2, 4, ... on one socket and 1, 3, 5, ... on the other). I'm not familiar with this output, though. Is it from --report-bindings? (I would suggest using that flag, as it gives you a visual representation.)

As for the binding default, I think @rhc54 might be able to answer this better than me. (maybe we need to update the doc?)

As for selecting the right hfi for each MPI process, that is entirely up to the component, so for MTL/PSM2 and MTL/OFI, @matcabral might be able to help.

matcabral commented 6 years ago

Hi @hunsa, yes, the device selection is transparent to MPI, since it is done by libpsm2. The PSM2_TRACEMASK environment variable can dump useful info, but the output can be a lot. Please try -x PSM2_TRACEMASK=0x0002; you will see lines like ...psmi_ep_open_device: [12262]use unit 0 port 1, which show the PID and the hfi device and port. You will have to map PIDs to ranks yourself, since libpsm2 doesn't know about ranks. I think --report-bindings does show the PID and rank, right?
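
For example, something like this (a sketch based on the command lines above) lets you correlate the psmi_ep_open_device lines with ranks via the PID:

# --report-bindings prints "[host:PID] MCW rank N bound to ..." lines,
# while PSM2_TRACEMASK=0x0002 makes libpsm2 print "[PID]use unit X port Y"
mpirun --report-bindings -x PSM2_TRACEMASK=0x0002 \
    -np 4 --hostfile ~/tmp/machinefile4 ./mpi/collective/osu_gather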

thananon commented 6 years ago

@hunsa does this solve your problem?

I will close this issue if there is no more response this week.

hunsa commented 6 years ago

Hi @thananon and @matcabral ,

Things are much clearer now. Here is what works. (I should note that I have 2 sockets per compute node, each with 16 cores.) That said, if I run

mpirun -x PSM2_TRACEMASK=0x0002 --mca mtl ofi  --mca btl_base_verbose 5 --mca ess_base_verbose 5 -np 17  --hostfile ~/tmp/machinefile_36  ./mpi/collective/osu_allreduce -f -m 1:1000000
[hydra01:22836] MCW rank 4 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:22834] MCW rank 2 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:22832] MCW rank 0 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:22835] MCW rank 3 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:22837] MCW rank 5 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:22833] MCW rank 1 bound to NM1:SK1:L31:L216-31:L116-31:CR16-31:HT32-63
[hydra01:22839] MCW rank 6 bound to NM0:SK0:L30:L20-15:L10-15:CR0-15:HT0-31
[hydra01:22836] Checking distance from this process to device=hfi1_1
[hydra01:22836] Process is not bound: distance to device is 0.000000
[hydra01:22836] Checking distance from this process to device=hfi1_0
[hydra01:22836] Process is not bound: distance to device is 0.000000
[hydra01:22832] Checking distance from this process to device=hfi1_1
[hydra01:22832] Process is not bound: distance to device is 0.000000
[hydra01:22832] Checking distance from this process to device=hfi1_0
[hydra01:22832] Process is not bound: distance to device is 0.000000

hydra01.22849psmi_ep_open_device: [22849]use unit 1 port 1
hydra01.22851psmi_ep_open_device: [22851]use unit 0 port 1
hydra01.22853psmi_ep_open_device: [22853]use unit 1 port 1
hydra01.22855psmi_ep_open_device: [22855]use unit 0 port 1
hydra01.22857psmi_ep_open_device: [22857]use unit 1 port 1
hydra01.22859psmi_ep_open_device: [22859]use unit 0 port 1

I do not get my desired behavior, as the processes are not bound to individual cores. (The default binding and mapping strategies are supposedly --bind-to core and --map-by socket.) In this case, I expected the processes to be bound to specific cores on alternating sockets, but that is not what happens.

However, if I use --map-by core, I get this:

mpirun  --map-by core -x PSM2_TRACEMASK=0x0002 --mca mtl ofi  --mca btl_base_verbose 5 --mca ess_base_verbose 5 -np 17  --hostfile ~/tmp/machinefile_36  ./mpi/collective/osu_allreduce -f -m 1:1000000

[hydra01:23066] MCW rank 2 bound to NM0:SK0:L30:L22:L12:CR2:HT4-5
[hydra01:23067] MCW rank 3 bound to NM0:SK0:L30:L23:L13:CR3:HT6-7
[hydra01:23065] MCW rank 1 bound to NM0:SK0:L30:L21:L11:CR1:HT2-3
[hydra01:23068] MCW rank 4 bound to NM0:SK0:L30:L24:L14:CR4:HT8-9
[hydra01:23069] MCW rank 5 bound to NM0:SK0:L30:L25:L15:CR5:HT10-11
[hydra01:23064] MCW rank 0 bound to NM0:SK0:L30:L20:L10:CR0:HT0-1
[hydra01:23071] MCW rank 6 bound to NM0:SK0:L30:L26:L16:CR6:HT12-13
[hydra01:23073] MCW rank 7 bound to NM0:SK0:L30:L27:L17:CR7:HT14-15
[hydra01:23075] MCW rank 8 bound to NM0:SK0:L30:L28:L18:CR8:HT16-17
[hydra01:23064] Checking distance from this process to device=hfi1_1
[hydra01:23064] Process is not bound: distance to device is 0.000000
[hydra01:23064] Checking distance from this process to device=hfi1_0
[hydra01:23064] Process is not bound: distance to device is 0.000000
[hydra01:23069] Checking distance from this process to device=hfi1_1
[hydra01:23069] Process is not bound: distance to device is 0.000000
[hydra01:23069] Checking distance from this process to device=hfi1_0
[hydra01:23069] Process is not bound: distance to device is 0.000000
[hydra01:23067] Checking distance from this process to device=hfi1_1
...

hydra01.23085psmi_ep_open_device: [23085]use unit 0 port 1
hydra01.23069psmi_ep_open_device: [23069]use unit 0 port 1
hydra01.23066psmi_ep_open_device: [23066]use unit 0 port 1
hydra01.23077psmi_ep_open_device: [23077]use unit 0 port 1
hydra01.23079psmi_ep_open_device: [23079]use unit 0 port 1
hydra01.23083psmi_ep_open_device: [23083]use unit 0 port 1
hydra01.23073psmi_ep_open_device: [23073]use unit 0 port 1
hydra01.23064psmi_ep_open_device: [23064]use unit 0 port 1
hydra01.23075psmi_ep_open_device: [23075]use unit 0 port 1
hydra01.23091psmi_ep_open_device: [23091]use unit 1 port 1
hydra01.23089psmi_ep_open_device: [23089]use unit 0 port 1
hydra01.23087psmi_ep_open_device: [23087]use unit 0 port 1

which is exactly as desired: the first 16 processes are pinned to the first socket and use unit 0, and the remaining one uses unit 1. Perfect. I can live with this for now.

BUT, two questions remain:

1) Why does --map-by socket in combination with --bind-to core not bind processes to individual cores?

2) Why, although ranks are obviously bound (e.g., MCW rank 2 bound to ...), do we still see messages like

[hydra01:23064] Checking distance from this process to device=hfi1_1
[hydra01:23064] Process is not bound: distance to device is 0.000000

?

So, if I were not using the PSM2 device, I would be in trouble.

Thanks for your help

matcabral commented 6 years ago

[hydra01:23064] Checking distance from this process to device=hfi1_1
[hydra01:23064] Process is not bound: distance to device is 0.000000

I think you are getting these messages from the openib BTL, which is actually not being used, so they are misleading. Since you are using an MTL, -mca btl_base_verbose 5 may not be very useful here; try -mca mtl_base_verbose <level> for more relevant info. By the way, note that you are using the ofi MTL to run on psm2, which is a valid use case. However, you can also use psm2 directly, without going through ofi (libfabric), by specifying -mca mtl psm2, which is selected by default if you pass no parameters and are running on Omni-Path.
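
For example (a sketch reusing your command line):

# force the psm2 MTL directly (no ofi/libfabric layer) and use MTL-level verbosity
mpirun --mca mtl psm2 --mca mtl_base_verbose 10 \
    -np 17 --hostfile ~/tmp/machinefile_36 ./mpi/collective/osu_allreduce -f -m 1:1000000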

thananon commented 6 years ago

Why is --map-by socket in combination with --bind-to core not binding processes to individual cores?

I believe the default bindings you read about are incorrectly documented. Personally, I always explicitly specify what I want and then use --report-bindings to verify it.

I would suggest you run your app again, but this time explicitly specify --map-by socket --bind-to core. I'm pretty sure Open MPI will do exactly that (the result should be different from what you posted here). If not, we have a bug. Please report back.
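
Something along these lines (a sketch of your command with the explicit flags and --report-bindings added):

# explicit mapping/binding plus a printout of the resulting bindings
mpirun --map-by socket --bind-to core --report-bindings \
    -np 17 --hostfile ~/tmp/machinefile_36 ./mpi/collective/osu_allreduce -f -m 1:1000000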

On your second run, you did specify --map-by core, so you got your intended result.

Why although ranks are obviously bound (e.g., MCW rank 2 bound to...), do we still see messages like

I'm not well equipped to answer this question. It might be the interaction with PMIx binding, which does not report back to opal (hence the cpuset is not set). I'm pretty sure the bindings from --report-bindings are correct; the printout in the openib BTL is just not getting updated.

Also, as Matias said, try not to use -mca btl_base_verbose now that you are using an MTL. I think it is OK in this case, but it might be misleading, as you are printing information from a component that you are not using. (In this case, you printed output from the openib BTL initialization.)

@jsquyres I am out of my depth. Maybe you can shed some light on this?

FWIW, I have been using other vendors' NICs, and some of them handle NIC selection by distance very well.

hunsa commented 6 years ago

@thananon

Here it is with --map-by socket --bind-to core:

mpirun --map-by socket --bind-to core --mca ess_base_verbose 5 -np 17  --hostfile ~/tmp/machinefile_36  ./mpi/collective/osu_allreduce -f -m 1:1000000
[hydra01:23669] MCW rank 3 bound to NM0:SK0:L30:L23:L13:CR3:HT6-7
[hydra01:23668] MCW rank 2 bound to NM0:SK0:L30:L22:L12:CR2:HT4-5
[hydra01:23670] MCW rank 4 bound to NM0:SK0:L30:L24:L14:CR4:HT8-9
[hydra01:23671] MCW rank 5 bound to NM0:SK0:L30:L25:L15:CR5:HT10-11
[hydra01:23666] MCW rank 0 bound to NM0:SK0:L30:L20:L10:CR0:HT0-1
[hydra01:23667] MCW rank 1 bound to NM0:SK0:L30:L21:L11:CR1:HT2-3
[hydra01:23674] MCW rank 6 bound to NM0:SK0:L30:L26:L16:CR6:HT12-13
[hydra01:23676] MCW rank 7 bound to NM0:SK0:L30:L27:L17:CR7:HT14-15
[hydra01:23677] MCW rank 8 bound to NM0:SK0:L30:L28:L18:CR8:HT16-17
..

You were right.

I get it now (after consulting man mpirun one more time). For np > 2, the defaults are --bind-to socket and --map-by socket, and that is exactly what I got. So the documented default of bind-to core is somewhat misleading, as it is rarely in effect: the np rule applies most of the time. But good, we can close this now. I have a solution and know what's going on.

Thanks

thananon commented 6 years ago

Alright. Glad we can help!