bertiethorpe opened this issue 2 months ago
Hi @bertiethorpe,
In the attached eth0.txt log file there is no evidence of UCX connection establishment, and the environment variable UCX_NET_DEVICES is not propagated to the config parser, unlike in the mlxlog.txt file.
Therefore we suggest running
ucx_info -e -u t -P inter
with various UCX_NET_DEVICES settings and checking whether the devices used are the ones you expect.

@bertiethorpe, can you pls run with UCX_NET_DEVICES=eth0 and also add -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 -mca pml ucx to mpirun?
Also, what were the configure flags for OpenMPI?
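For reference, a minimal form of the run requested above might look like the following (a sketch: the process count and benchmark binary are taken from the reproducer further down, and -x is used only to export the variable to the remote ranks, which is an assumption about how the job is launched):

mpirun -np 2 -x UCX_NET_DEVICES=eth0 \
    -mca pml ucx -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 \
    IMB-MPI1 pingpong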
It seems OpenMPI is not using UCX component when UCX_NET_DEVICES=eth0, due to a higher priority of OpenMPI's btl/openib component, which is also using RDMA.
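One way to test that diagnosis (a sketch using standard Open MPI MCA selection syntax; whether these components are present in this particular build is not confirmed in the thread) is to force the UCX PML, or to exclude the openib BTL so it cannot be picked as a fallback:

mpirun -np 2 -mca pml ucx IMB-MPI1 pingpong        # fails loudly if pml/ucx cannot be used
mpirun -np 2 -mca btl ^openib IMB-MPI1 pingpong    # removes btl/openib from the selection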
Some more information:
Run ucx_info -e -u t -P inter with various UCX_NET_DEVICES and check whether the used devices are the ones you expect.
ucx_info -e -u t -P inter
#
# UCP endpoint
#
# peer: <no debug data>
# lane[0]: 8:rc_mlx5/mlx5_0:1.0 md[4] -> md[4]/ib/sysdev[255] rma_bw#0 am am_bw#0
# lane[1]: 3:tcp/eth1.0 md[1] -> md[1]/tcp/sysdev[255] rma_bw#1 wireup
#
# tag_send: 0..<egr/short>..227..<egr/bcopy>..263060..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..227..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..227..<egr/bcopy>..263060..<rndv>..(inf)
#
# rma_bw: mds [1] [4]
#
# rma: mds rndv_rkey_size 19
#
UCX_NET_DEVICES=eth0 ucx_info -e -u t -P inter
#
# UCP endpoint
#
# peer: <no debug data>
# lane[0]: 1:tcp/eth0.0 md[1] -> md[1]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup
#
# tag_send: 0..<egr/short>..8185..<egr/zcopy>..20424..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..8185..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..8185..<egr/zcopy>..20424..<rndv>..(inf)
#
# rma_bw: mds [1]
#
# rma: mds rndv_rkey_size 10
#
UCX_NET_DEVICES=eth1 ucx_info -e -u t -P inter
#
# UCP endpoint
#
# peer: <no debug data>
# lane[0]: 1:tcp/eth1.0 md[1] -> md[1]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup
#
# tag_send: 0..<egr/short>..8185..<egr/zcopy>..19505..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..8185..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..8185..<egr/zcopy>..19505..<rndv>..(inf)
#
# rma_bw: mds [1]
#
# rma: mds rndv_rkey_size 10
#
Are these expected? I would expect the mlx device to go with eth1, because they're on the same NIC.
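One way to check which Ethernet interface actually backs mlx5_0 (a sketch; ibdev2netdev ships with Mellanox OFED and the sysfs path is standard Linux, neither is confirmed in this thread):

ibdev2netdev                                  # prints e.g. "mlx5_0 port 1 ==> eth1 (Up)"
ls /sys/class/infiniband/mlx5_0/device/net/   # netdev(s) sharing the PCI device with mlx5_0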
can you pls run with UCX_NET_DEVICES=eth0 and also add -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 -mca pml ucx to mpirun?
Can you pls configure OpenMPI with --with-platform=contrib/platform/mellanox/optimized? It will force using UCX also with TCP transports.
Alternatively, you can add -mca pml_ucx_tls any -mca pml_ucx_devices any to mpirun.
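Spelled out, the two options might look like this (a sketch; the install prefix and the benchmark invocation are illustrative, not from the thread):

# Option 1: rebuild Open MPI with the Mellanox platform file
./configure --with-platform=contrib/platform/mellanox/optimized --prefix=$HOME/ompi-ucx
make -j install

# Option 2: keep the current build and relax UCX PML selection at run time
mpirun -np 2 -x UCX_NET_DEVICES=eth0 \
    -mca pml ucx -mca pml_ucx_tls any -mca pml_ucx_devices any \
    IMB-MPI1 pingpong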
So that seems to have done the trick. I'm now getting the latency I expected.
It seems OpenMPI is not using UCX component when UCX_NET_DEVICES=eth0, due to a higher priority of OpenMPI's btl/openib component, which is also using RDMA.
Where can you see this in the logs? Forgive my ignorance, but I can't actually see that the btl openib component is available at all. Was it removed in v4.1.x?
ompi_info | grep btl
MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)
This is all I see.
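If it helps with the priority question, the available components and their priority parameters can be dumped directly (standard ompi_info options; exactly which parameters appear depends on how this build was configured):

ompi_info --all | grep -i priority          # priority MCA parameters of all loaded components
ompi_info --param btl all --level 9         # every BTL parameter, including priorities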
Describe the bug
Setting UCX_NET_DEVICES to target only TCP devices when RoCE is available seems to be ignored in favour of some fallback.
I'm running a 2-node IMB-MPI1 PingPong to benchmark RoCE against regular TCP ethernet.
Setting UCX_NET_DEVICES=all or mlx5_0:1 gives the optimal performance and uses RDMA as expected. Setting UCX_NET_DEVICES=eth0, eth1, or anything else still appears to use RoCE, at only a slightly longer latency. As per the docs, with UCX_NET_DEVICES set to one of the TCP devices I should expect TCP-like latencies of ~15us, but I am seeing something closer to RoCE performance, with latencies of ~2.1us.
Stranger still, the latency when specifically targeting mlx5_0:1 or all is different (lower, ~1.6us), so it looks like the fallback is not all when setting eth0 etc. Is this behaviour determined somewhere else, or accounted for in some way?
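A quick way to see which transports and devices UCX itself considers usable under a given setting (ucx_info -d is a standard UCX diagnostic; the grep is just for brevity):

UCX_NET_DEVICES=eth0 ucx_info -d | grep -E "Transport|Device"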
Steps to Reproduce
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.out
#SBATCH --exclusive
#SBATCH --partition=standard
module load gnu12 openmpi4 imb
export UCX_NET_DEVICES=mlx5_0:1
echo SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST
echo SLURM_JOB_ID: $SLURM_JOB_ID
echo UCX_NET_DEVICES: $UCX_NET_DEVICES
export UCX_LOG_LEVEL=data
srun --mpi=pmi2 IMB-MPI1 pingpong # doesn't work in ohpc v2.1
mpirun IMB-MPI1 pingpong -iter_policy off
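For a genuine TCP baseline to compare against, one option (a sketch, not from the thread) is to bypass UCX entirely by forcing the ob1 PML with the tcp/self/vader BTLs and restricting TCP to eth0:

mpirun -np 2 -mca pml ob1 -mca btl tcp,self,vader \
    -mca btl_tcp_if_include eth0 \
    IMB-MPI1 pingpong -iter_policy off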