icl-utk-edu / cluster

OpenMPI doesn't work when docker is running #1

Open · G-Ragghianti opened this issue 1 year ago

G-Ragghianti commented 1 year ago

Problem: When a docker container is running, simple OpenMPI jobs cannot run using the tcp interface. For example, a broadcast test will hang.

Steps to reproduce:

$ spack install osu-micro-benchmarks ^openmpi~rsh fabrics=ucx
$ spack load osu-micro-benchmarks
$ mpirun -n 2 osu_bcast

# OSU MPI Broadcast Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       4.32
2                       4.32
4                       4.36
8                       4.30
16                      4.30
32                      4.32
64                      4.33
128                     4.10
256                     4.30
512                     5.72
1024                    5.81
2048                    6.07
4096                    5.74
8192                    6.67
16384                   7.74
32768                  13.65
<hangs>

Expected result:

$ mpirun -n 2 --mca oob_base_verbose 100 osu_bcast

# OSU MPI Broadcast Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       3.26
2                       4.05
4                       4.40
8                       7.55
16                      5.53
32                      5.53
64                      4.06
128                     4.49
256                     6.37
512                     7.11
1024                    5.92
2048                    7.26
4096                    6.74
8192                    8.74
16384                  10.93
32768                  14.40
65536                  33.09
131072                 48.18
262144                 70.30
524288                118.22
1048576               200.32

Verbose output:

[histamine0:1785348] mca: base: components_register: registering framework oob components
[histamine0:1785348] mca: base: components_register: found loaded component tcp
[histamine0:1785348] mca: base: components_register: component tcp register function successful
[histamine0:1785348] mca: base: components_open: opening oob components
[histamine0:1785348] mca: base: components_open: found loaded component tcp
[histamine0:1785348] mca: base: components_open: component tcp open function successful
[histamine0:1785348] mca:oob:select: checking available component tcp
[histamine0:1785348] mca:oob:select: Querying component [tcp]
[histamine0:1785348] oob:tcp: component_available called
[histamine0:1785348] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[histamine0:1785348] [[3819,0],0] oob:tcp:init rejecting loopback interface lo
[histamine0:1785348] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[histamine0:1785348] [[3819,0],0] oob:tcp:init adding 10.0.0.49 to our list of V4 connections
[histamine0:1785348] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[histamine0:1785348] [[3819,0],0] oob:tcp:init adding 172.17.0.1 to our list of V4 connections
[histamine0:1785348] [[3819,0],0] TCP STARTUP
[histamine0:1785348] [[3819,0],0] attempting to bind to IPv4 port 0
[histamine0:1785348] [[3819,0],0] assigned IPv4 port 36725
[histamine0:1785348] mca:oob:select: Adding component to end
[histamine0:1785348] mca:oob:select: Found 1 active transports
[histamine0:1785348] [[3819,0],0]: get transports
[histamine0:1785348] [[3819,0],0]:get transports for component tcp

# OSU MPI Broadcast Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       4.45
2                       4.61
4                       4.66
8                       4.63
16                      4.02
32                      4.06
64                      4.07
128                     4.10
256                     4.13
512                     5.82
1024                    5.92
2048                    6.27
4096                    5.98
8192                    6.69
16384                   7.57
32768                  14.08
<hangs>

G-Ragghianti commented 1 year ago

It appears that this occurs because openmpi tries to use the virtual network interface that is set up for the docker container. This is the interface with IP 172.17.0.1 in the verbose log. It is not clear what we should do to avoid this.
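For reference, the offending bridge can be confirmed on the host with a standard ip(8) query (docker0 and 172.17.0.0/16 are Docker's default bridge name and subnet; a customized daemon could use different ones):

$ ip -4 addr show docker0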

@bosilca @abouteiller @mgates3

bosilca commented 1 year ago

To prevent OMPI from using a specific IP interface you can do --mca btl_tcp_if_exclude 172.17.0.0/16 or use the explicit interface name --mca btl_tcp_if_exclude docker0.
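For example, a minimal sketch of the run command (note that overriding btl_tcp_if_exclude replaces Open MPI's default exclude list, so the loopback should be listed explicitly; docker0 is assumed to be the default Docker bridge name):

$ mpirun -n 2 --mca btl_tcp_if_exclude lo,docker0 osu_bcast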

G-Ragghianti commented 1 year ago

Yes, but I'm assuming that you want OpenMPI to work without all the users of our systems having to know this and always having to run with this option?

abouteiller commented 1 year ago

You can set this in the Open MPI MCA parameter file for the installation, $OMPI_PREFIX/etc/openmpi-mca-params.conf:

btl_tcp_if_exclude=docker0,virbr0

The disadvantage is that this needs to be done for every install and will not carry over to user-compiled Open MPI. That is also the advantage (having implicit settings carry over to user installs can be confusing).
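A sketch of how that could be applied to the spack-built install (spack location -i prints the install prefix; the exact spec to pass it depends on how many openmpi installs exist):

$ echo 'btl_tcp_if_exclude = lo,docker0,virbr0' >> $(spack location -i openmpi)/etc/openmpi-mca-params.conf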


bosilca commented 1 year ago

Indeed, there is what I want and then there is what is possible. Is there a consistent way to identify the interfaces created by Docker, or interfaces that are virtual and cannot be used for data exchanges? Unfortunately the answer is no, so either the users/sysadmins provide the correct configuration (either a user or system-wide MCA param file), or we remain reliant on the system timeout. (By the way, the execution did not deadlock; it is just waiting for the timeout to signal that the interface cannot be used, and the default timeout is extremely long.)
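For the per-user option, a sketch (assuming ~/.openmpi/mca-params.conf, the per-user MCA parameter file that Open MPI reads in addition to the system-wide one, so it also applies to user-compiled builds):

$ mkdir -p ~/.openmpi
$ echo 'btl_tcp_if_exclude = lo,docker0' >> ~/.openmpi/mca-params.conf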

G-Ragghianti commented 1 year ago

Yes, disabling the docker0 interface avoids the problem. I will have to think about the best way to set this. Manually editing the conf file inside the spack openmpi install directory would not be very clean, but it looks like it doesn't look anywhere else for the conf file.

Also, I'm confused about why openmpi isn't using vader/sm. Even if I set "--mca btl self,vader" it doesn't work correctly (it doesn't run osu_bcast):

[guyot:342029] mca: base: components_register: registering framework oob components
[guyot:342029] mca: base: components_register: found loaded component tcp
[guyot:342029] mca: base: components_register: component tcp register function successful
[guyot:342029] mca: base: components_open: opening oob components
[guyot:342029] mca: base: components_open: found loaded component tcp
[guyot:342029] mca: base: components_open: component tcp open function successful
[guyot:342029] mca:oob:select: checking available component tcp
[guyot:342029] mca:oob:select: Querying component [tcp]
[guyot:342029] oob:tcp: component_available called
[guyot:342029] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[guyot:342029] [[45588,0],0] oob:tcp:init rejecting loopback interface lo
[guyot:342029] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
[guyot:342029] [[45588,0],0] oob:tcp:init adding 10.0.0.151 to our list of V4 connections
[guyot:342029] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[guyot:342029] [[45588,0],0] oob:tcp:init adding 160.36.131.188 to our list of V4 connections
[guyot:342029] WORKING INTERFACE 4 KERNEL INDEX 8 FAMILY: V4
[guyot:342029] [[45588,0],0] oob:tcp:init adding 172.17.0.1 to our list of V4 connections
[guyot:342029] [[45588,0],0] TCP STARTUP
[guyot:342029] [[45588,0],0] attempting to bind to IPv4 port 0
[guyot:342029] [[45588,0],0] assigned IPv4 port 59527
[guyot:342029] mca:oob:select: Adding component to end
[guyot:342029] mca:oob:select: Found 1 active transports
[guyot:342029] [[45588,0],0]: get transports
[guyot:342029] [[45588,0],0]:get transports for component tcp
[guyot:342029] [[45588,0],0] TCP SHUTDOWN
[guyot:342029] [[45588,0],0] TCP SHUTDOWN done
[guyot:342029] mca: base: close: component tcp closed
[guyot:342029] mca: base: close: unloading component tcp

bosilca commented 1 year ago

All these output messages are from PMIx, not from OMPI, so based on these we cannot conclude whether vader/sm was used or not. Use --mca pml_base_verbose 10 to see which PML is used and what it loads.
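For example, a sketch combining the earlier BTL restriction with the PML verbosity:

$ mpirun -n 2 --mca btl self,vader --mca pml_base_verbose 10 osu_bcast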

G-Ragghianti commented 1 year ago

OK:

[guyot:738468] mca: base: components_register: registering framework pml components
[guyot:738468] mca: base: components_register: found loaded component cm
[guyot:738468] mca: base: components_register: component cm register function successful
[guyot:738468] mca: base: components_register: found loaded component ob1
[guyot:738467] mca: base: components_register: registering framework pml components
[guyot:738467] mca: base: components_register: found loaded component cm
[guyot:738467] mca: base: components_register: component cm register function successful
[guyot:738467] mca: base: components_register: found loaded component ob1
[guyot:738467] mca: base: components_register: component ob1 register function successful
[guyot:738467] mca: base: components_register: found loaded component ucx
[guyot:738468] mca: base: components_register: component ob1 register function successful
[guyot:738468] mca: base: components_register: found loaded component ucx
[guyot:738467] mca: base: components_register: component ucx register function successful
[guyot:738468] mca: base: components_register: component ucx register function successful
[guyot:738467] mca: base: components_register: found loaded component v
[guyot:738468] mca: base: components_register: found loaded component v
[guyot:738468] mca: base: components_register: component v register function successful
[guyot:738467] mca: base: components_register: component v register function successful
[guyot:738468] mca: base: components_open: opening pml components
[guyot:738468] mca: base: components_open: found loaded component cm
[guyot:738467] mca: base: components_open: opening pml components
[guyot:738467] mca: base: components_open: found loaded component cm
[guyot:738467] mca: base: close: component cm closed
[guyot:738467] mca: base: close: unloading component cm
[guyot:738468] mca: base: close: component cm closed
[guyot:738468] mca: base: close: unloading component cm
[guyot:738468] mca: base: components_open: found loaded component ob1
[guyot:738467] mca: base: components_open: found loaded component ob1
[guyot:738468] mca: base: components_open: component ob1 open function successful
[guyot:738467] mca: base: components_open: component ob1 open function successful
[guyot:738467] mca: base: components_open: found loaded component ucx
[guyot:738468] mca: base: components_open: found loaded component ucx
[guyot:738467] mca: base: components_open: component ucx open function successful
[guyot:738467] mca: base: components_open: found loaded component v
[guyot:738467] mca: base: components_open: component v open function successful
[guyot:738468] mca: base: components_open: component ucx open function successful
[guyot:738468] mca: base: components_open: found loaded component v
[guyot:738468] mca: base: components_open: component v open function successful
[guyot:738467] select: initializing pml component ob1
[guyot:738467] select: init returned priority 20
[guyot:738467] select: initializing pml component ucx
[guyot:738468] select: initializing pml component ob1
[guyot:738468] select: init returned priority 20
[guyot:738468] select: initializing pml component ucx
[guyot:738467] select: init returned failure for component ucx
[guyot:738467] select: component v not in the include list
[guyot:738467] selected ob1 best priority 20
[guyot:738467] select: component ob1 selected
[guyot:738468] select: init returned failure for component ucx
[guyot:738468] select: component v not in the include list
[guyot:738468] selected ob1 best priority 20
[guyot:738468] select: component ob1 selected
[guyot:738467] mca: base: close: component ucx closed
[guyot:738467] mca: base: close: unloading component ucx
[guyot:738467] mca: base: close: component v closed
[guyot:738467] mca: base: close: unloading component v
[guyot:738468] mca: base: close: component ucx closed
[guyot:738468] mca: base: close: unloading component ucx
[guyot:738468] mca: base: close: component v closed
[guyot:738468] mca: base: close: unloading component v
[guyot:738467] check:select: PML check not necessary on self
[guyot:738468] check:select: checking my pml ob1 against process [[52872,1],0] pml ob1

bosilca commented 1 year ago

OB1 is selected, so all BTLs should be up and running, unless you specifically excluded them (with --mca btl ^something). If you want more details, you can use --mca btl_base_verbose 10 to see exactly which BTLs are loaded and what they do during initialization. However, being loaded does not mean a BTL will be used; that depends on the application's communication pattern.
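A sketch of such a run, forcing ob1 with only the self and shared-memory BTLs so the BTL selection and endpoint setup show up in the verbose output:

$ mpirun -n 2 --mca pml ob1 --mca btl self,vader --mca btl_base_verbose 10 osu_bcast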

abouteiller commented 6 months ago

We should investigate upgrading UCX to the latest release and Open MPI to 5.0.2; that may have resolved these problems.

G-Ragghianti commented 6 months ago

I have scheduled a rebuild of the module that will be placed in a new location (date code 2024-03-01).

G-Ragghianti commented 6 months ago

I'm building a new software module set with the latest openmpi@5.0.2 and ucx@1.15, but the relevant UCX changes are scheduled for 1.16.

G-Ragghianti commented 6 months ago

There is a problem with updating to openmpi@5 on our newer systems. The systems use pmix@3.2.3 (required by slurm), but this pmix version is incompatible with openmpi version 5. It would be possible to use an "internal" pmix in openmpi, but I don't know if it will still work with slurm. Ideas?
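A hedged sketch of the two options as spack specs (the internal-pmix variant name is an assumption about the current openmpi package; spack info openmpi shows what the installed package version actually exposes):

$ spack install openmpi@5.0.2 +internal-pmix schedulers=slurm fabrics=ucx                # bundled pmix
$ spack install openmpi@5.0.2 ~internal-pmix schedulers=slurm fabrics=ucx ^pmix@3.2.3    # external pmix (incompatible here)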

G-Ragghianti commented 6 months ago

Using openmpi's internal pmix, this is available to test on login.icl.utk.edu:

export MODULEPATH=/apps/spacks/2024-03-05/share/spack/modules/linux-rocky9-x86_64
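To try it (the exact module name is an assumption; module avail will list what the new set actually provides):

$ module avail openmpi
$ module load openmpi    # pick the 5.0.2 entry shown by module avail
$ mpirun -n 2 osu_bcast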

abouteiller commented 3 months ago

Using that Open MPI, things work as expected except for the following warning message:

52: A requested component was not found, or was unable to be opened.  This
52: means that this component is either not installed or is unable to be
52: used on your system (e.g., sometimes this means that shared libraries
52: that the component requires are unable to be found/loaded).  Note that
52: PMIx stopped checking at the first component that it did not find.
52:
52: Host:      leconte
52: Framework: psec
52: Component: munge
52: --------------------------------------------------------------------------

This can be resolved by installing the munge package (from the slurm installation RPMs; it doesn't get installed automatically in the client image when installing slurm, but it should be).
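A sketch of that fix on an EL9-style client image (munge and munge-libs are the standard EL package names and may differ if munge was built from the slurm source RPMs):

$ dnf install -y munge munge-libs
# the cluster-wide /etc/munge/munge.key must also be present on the client before starting the service
$ systemctl enable --now munge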