bertiethorpe opened this issue 2 months ago
Hi @bertiethorpe,
In the attached eth0.txt log file there is no evidence of UCX connection establishment, and the environment variable UCX_NET_DEVICES is not propagated to the config parser, unlike in the mlxlog.txt file.
Therefore we suggest running
ucx_info -e -u t -P inter
with various UCX_NET_DEVICES settings and checking whether the devices used are the ones you expect.

@bertiethorpe, can you pls run with UCX_NET_DEVICES=eth0 and also add -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 -mca pml ucx to mpirun?
Also, what were the configure flags for OpenMPI?
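For reference, a minimal form of the run requested above might look like the following (a sketch: the process count and benchmark binary are taken from the reproducer further down, and -x is used only to export the variable to the remote ranks, which is an assumption about how the job is launched):

mpirun -np 2 -x UCX_NET_DEVICES=eth0 \
    -mca pml ucx -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 \
    IMB-MPI1 pingpong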
It seems OpenMPI is not using UCX component when UCX_NET_DEVICES=eth0, due to a higher priority of OpenMPI's btl/openib component, which is also using RDMA.
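One way to test that diagnosis (a sketch using standard Open MPI MCA selection syntax; whether these components are present in this particular build is not confirmed in the thread) is to force the UCX PML, or to exclude the openib BTL so it cannot be picked as a fallback:

mpirun -np 2 -mca pml ucx IMB-MPI1 pingpong        # fails loudly if pml/ucx cannot be used
mpirun -np 2 -mca btl ^openib IMB-MPI1 pingpong    # removes btl/openib from the selection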
Some more information:
Run ucx_info -e -u t -P inter with various UCX_NET_DEVICES and check whether the used devices are the ones you expect.
ucx_info -e -u t -P inter
#
# UCP endpoint
#
# peer: <no debug data>
# lane[0]: 8:rc_mlx5/mlx5_0:1.0 md[4] -> md[4]/ib/sysdev[255] rma_bw#0 am am_bw#0
# lane[1]: 3:tcp/eth1.0 md[1] -> md[1]/tcp/sysdev[255] rma_bw#1 wireup
#
# tag_send: 0..<egr/short>..227..<egr/bcopy>..263060..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..227..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..227..<egr/bcopy>..263060..<rndv>..(inf)
#
# rma_bw: mds [1] [4]
#
# rma: mds rndv_rkey_size 19
#
UCX_NET_DEVICES=eth0 ucx_info -e -u t -P inter
#
# UCP endpoint
#
# peer: <no debug data>
# lane[0]: 1:tcp/eth0.0 md[1] -> md[1]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup
#
# tag_send: 0..<egr/short>..8185..<egr/zcopy>..20424..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..8185..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..8185..<egr/zcopy>..20424..<rndv>..(inf)
#
# rma_bw: mds [1]
#
# rma: mds rndv_rkey_size 10
#
UCX_NET_DEVICES=eth1 ucx_info -e -u t -P inter
#
# UCP endpoint
#
# peer: <no debug data>
# lane[0]: 1:tcp/eth1.0 md[1] -> md[1]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup
#
# tag_send: 0..<egr/short>..8185..<egr/zcopy>..19505..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..8185..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..8185..<egr/zcopy>..19505..<rndv>..(inf)
#
# rma_bw: mds [1]
#
# rma: mds rndv_rkey_size 10
#
Are these expected? I would expect the mlx device to go with eth1, because they're on the same NIC.
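One way to check which Ethernet interface actually backs mlx5_0 (a sketch; ibdev2netdev ships with Mellanox OFED and the sysfs path is standard Linux, neither is confirmed in this thread):

ibdev2netdev                                  # prints e.g. "mlx5_0 port 1 ==> eth1 (Up)"
ls /sys/class/infiniband/mlx5_0/device/net/   # netdev(s) sharing the PCI device with mlx5_0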
can you pls run with UCX_NET_DEVICES=eth0 and also add -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 -mca pml ucx to mpirun?
Can you pls configure OpenMPI with --with-platform=contrib/platform/mellanox/optimized? It will force using UCX also with TCP transports.
Alternatively, you can add -mca pml_ucx_tls any -mca pml_ucx_devices any to mpirun.
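Spelled out, the two options might look like this (a sketch; the install prefix and the benchmark invocation are illustrative, not from the thread):

# Option 1: rebuild Open MPI with the Mellanox platform file
./configure --with-platform=contrib/platform/mellanox/optimized --prefix=$HOME/ompi-ucx
make -j install

# Option 2: keep the current build and relax UCX PML selection at run time
mpirun -np 2 -x UCX_NET_DEVICES=eth0 \
    -mca pml ucx -mca pml_ucx_tls any -mca pml_ucx_devices any \
    IMB-MPI1 pingpong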
So that seems to have done the trick. I'm now getting the latency I expected.
It seems OpenMPI is not using UCX component when UCX_NET_DEVICES=eth0, due to a higher priority of OpenMPI's btl/openib component, which is also using RDMA.
Where can you see this in the logs? Forgive my ignorance, but I can't actually see that the btl openib component is available at all. Was it removed in v4.1.x?
ompi_info | grep btl
MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)
This is all I see.
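If it helps with the priority question, the available components and their priority parameters can be dumped directly (standard ompi_info options; exactly which parameters appear depends on how this build was configured):

ompi_info --all | grep -i priority          # priority MCA parameters of all loaded components
ompi_info --param btl all --level 9         # every BTL parameter, including priorities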
Describe the bug
Setting UCX_NET_DEVICES to target only TCP devices when RoCE is available seems to be ignored in favour of some fallback.
I'm running a 2-node IMB-MPI1 PingPong to benchmark RoCE against regular TCP ethernet.
Setting UCX_NET_DEVICES=all or mlx5_0:1 gives the optimal performance and uses RDMA as expected. Setting UCX_NET_DEVICES=eth0, eth1, or anything else still appears to use RoCE, at only a slightly longer latency. As per the docs, with UCX_NET_DEVICES set to one of the TCP devices I should expect TCP-like latencies of ~15us, but I am seeing something closer to RoCE performance, with latencies of ~2.1us.
Stranger still, the latency when specifically targeting mlx5_0:1 or all is different (lower, ~1.6us), so it looks like the fallback is not all when setting eth0 etc. Is this behaviour determined somewhere else, or accounted for in some way?
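A quick way to see which transports and devices UCX itself considers usable under a given setting (ucx_info -d is a standard UCX diagnostic; the grep is just for brevity):

UCX_NET_DEVICES=eth0 ucx_info -d | grep -E "Transport|Device"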
Steps to Reproduce
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.out
#SBATCH --exclusive
#SBATCH --partition=standard
module load gnu12 openmpi4 imb
export UCX_NET_DEVICES=mlx5_0:1
echo SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST
echo SLURM_JOB_ID: $SLURM_JOB_ID
echo UCX_NET_DEVICES: $UCX_NET_DEVICES
export UCX_LOG_LEVEL=data
srun --mpi=pmi2 IMB-MPI1 pingpong # doesn't work in ohpc v2.1
mpirun IMB-MPI1 pingpong -iter_policy off
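For a genuine TCP baseline to compare against, one option (a sketch, not from the thread) is to bypass UCX entirely by forcing the ob1 PML with the tcp/self/vader BTLs and restricting TCP to eth0:

mpirun -np 2 -mca pml ob1 -mca btl tcp,self,vader \
    -mca btl_tcp_if_include eth0 \
    IMB-MPI1 pingpong -iter_policy off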