open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Random freezes of Infiniband #10432

Closed: robertsawko closed this issue 2 years ago

robertsawko commented 2 years ago

Hello, I would appreciate some advice on the following issue.

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 4.1.1 and UCX 1.12.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Open MPI and UCX were installed from source:

./contrib/configure-release \
    --prefix=/lustre/scafellpike/local/apps/hierarchy//compiler/gcc/6.5/ucx/1.12.1 \
    --enable-mt \
    --with-knem=${KNEM_DIR}
./configure \
  --prefix=/lustre/scafellpike/local/apps/hierarchy/compiler/gcc/6.5/openmpi/4.1.1-ucx \
  --enable-shared --disable-static \
  --enable-mpi-fortran=usempi \
  --disable-libompitrace \
  --enable-wrapper-rpath \
  --with-lsf=${LSF_LIBDIR%%linux*} \
  --with-lsf-libdir=${LSF_LIBDIR} \
  --with-knem=${KNEM_DIR} \
  --without-mxm \
  --with-ucx=/lustre/scafellpike/local/apps/hierarchy/compiler/gcc/6.5/ucx/1.12.1 \
  --without-verbs \
  --without-cuda \
  && make -j32
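As a quick check that the build actually picked up UCX, something like the following can be run against the resulting install (a sketch only; ompi_info ships with Open MPI, and the grep is just there to trim the listing):

# confirm the UCX PML and related components were compiled in
ompi_info | grep -i ucx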

Please describe the system on which you are running


Details of the problem

I am having issues at the MPI initialisation stage. As a sanity check, I started running the Intel MPI Benchmarks:

mpirun ./IMB-MPI1 Sendrecv

The code simply freezes when we reach the actual benchmark. Forcing TCP makes it work, which makes me think it's either a hardware problem or still some issue in my setup.

mpirun --mca btl tcp,self ./IMB-MPI1 Sendrecv

I've used OMPI_MCA_pml_ucx_verbose=100, following a similar problem I was having before. The invocation was roughly as sketched below, and the output for just two processes follows it:
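(A minimal sketch only, assuming the IMB-MPI1 binary and a plain two-rank run; the exact job-script wrapping may differ.)

# export the UCX PML verbosity to both ranks and run the Sendrecv benchmark
mpirun -np 2 -x OMPI_MCA_pml_ucx_verbose=100 ./IMB-MPI1 Sendrecv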

[sqg1cintr17.bullx:51367] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../..]
[sqg1cintr22.bullx:09180] MCW rank 1 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../..]
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.12.1
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 posix/memory: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 sysv/memory: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 self/memory0: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ens1f0: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ens1f1: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/lo: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 rc_verbs/mlx5_0:1: matched transport list but not device list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 rc_mlx5/mlx5_0:1: matched transport list but not device list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 dc_mlx5/mlx5_0:1: matched transport list but not device list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 ud_verbs/mlx5_0:1: matched transport list but not device list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 ud_mlx5/mlx5_0:1: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 cma/memory: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 knem/memory: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:311 support level is transports only
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:289 mca_pml_ucx_init
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:114 Pack remote worker address, size 249
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:114 Pack local worker address, size 414
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:351 created ucp context 0x2249f90, worker 0x7fd174074010
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx_component.c:129 returning priority 19
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:367 mca_pml_ucx_cleanup
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.12.1
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 posix/memory: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 sysv/memory: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 self/memory0: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ens1f1: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ens1f0: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/lo: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 rc_verbs/mlx5_0:1: matched transport list but not device list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 rc_mlx5/mlx5_0:1: matched transport list but not device list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 dc_mlx5/mlx5_0:1: matched transport list but not device list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 ud_verbs/mlx5_0:1: matched transport list but not device list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 ud_mlx5/mlx5_0:1: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 cma/memory: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 knem/memory: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:311 support level is transports only
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:289 mca_pml_ucx_init
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:114 Pack remote worker address, size 249
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:114 Pack local worker address, size 414
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:351 created ucp context 0x2738040, worker 0x7fc570031010
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx_component.c:129 returning priority 19
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:367 mca_pml_ucx_cleanup
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
jsquyres commented 2 years ago

@open-mpi/ucx FYI

janjust commented 2 years ago

@robertsawko Which IB device is present on your system?

robertsawko commented 2 years ago

Thanks for responding so quickly!

Is this what you are asking?

ibstat
CA 'mlx5_0'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.24.1000
        Hardware version: 0
        Node GUID: 0x248a07030091fde0
        System image GUID: 0x248a07030091fde0
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 53
                LMC: 0
                SM lid: 17
                Capability mask: 0x2651e848
                Port GUID: 0x248a07030091fde0
                Link layer: InfiniBand

ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         12.24.1000
        node_guid:                      248a:0703:0091:fde0
        sys_image_guid:                 248a:0703:0091:fde0
        vendor_id:                      0x02c9
        vendor_part_id:                 4115
        hw_ver:                         0x0
        board_id:                       MT_2180110032
        phys_port_cnt:                  1
        Device ports:
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 17
                        port_lid:               53
                        port_lmc:               0x00
                        link_layer:             InfiniBand
janjust commented 2 years ago

Thanks, what happens if you specify -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ?

What do you mean by random freezes? Does it happen sporadically, or does it simply not get past the first send/recv message? If the above runtime parameters don't help, what's the backtrace on both ranks when it freezes?
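For reference, one way to grab those backtraces on a hung run (a sketch, assuming gdb is available on the compute nodes and IMB-MPI1 is the stuck binary):

# on each node, attach to the hung rank and dump every thread's stack
pid=$(pgrep -f IMB-MPI1 | head -n1)
gdb -p "$pid" -batch -ex "thread apply all bt"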

bosilca commented 2 years ago

@janjust is right; according to your logs, the UCX PML disqualifies itself because the list of transports was empty.
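A quick way to cross-check what UCX itself detects on a compute node is ucx_info from the UCX install (a sketch; the grep only trims the listing):

# list the transports and devices UCX sees locally
ucx_info -d | grep -E 'Transport|Device'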

yosefe commented 2 years ago

@robertsawko what is the output of ls -l /sys/class/infiniband/mlx5_0/device/driver? Also, can you please try the latest v4.1.x branch? Perhaps f38878e9ce7e4c164a392258d1505544a57666a2 fixes the issue.
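In case it helps, a rough sketch of building the v4.1.x branch from git (the prefix and --with-ucx path are placeholders for the ones used earlier; a git checkout also needs autogen.pl and reasonably recent autotools):

git clone --branch v4.1.x https://github.com/open-mpi/ompi.git
cd ompi
./autogen.pl
./configure --prefix=<install-prefix> --with-ucx=<ucx-prefix> && make -j32 && make install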

robertsawko commented 2 years ago

Hi! Thanks again to everyone for their commitment and for responding over the weekend too.

@yosefe

ls -l /sys/class/infiniband/mlx5_0/device/driver
lrwxrwxrwx. 1 root root 0 May  8 22:08 /sys/class/infiniband/mlx5_0/device/driver -> ../../../../bus/pci/drivers/mlx5_core

Also, I am using 4.1.1 stable, but I am happy to recompile with the commit you specified.

@janjust you are right, when I specify:

mpirun -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./IMB-MPI1 Sendrecv

the benchmark runs like a sprint runner in the last 10 m of the final day of an Olympic competition, with a fighting chance of breaking a world record... Sorry. So is that something I need to specify? Maybe include it in the Lmod file? Why is that list empty?
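(For the record, one way the setting could be pinned in the meantime, a sketch only, whether in a wrapper script or a module-managed environment:)

# make UCX pick the IB device by default for every run
export UCX_NET_DEVICES=mlx5_0:1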

robertsawko commented 2 years ago

@yosefe, I can confirm that the problem is indeed fixed with 4.1.x - I no longer need to specify the variable, and Sendrecv produces the numbers I expect from our InfiniBand. Many thanks for pointing this out.