Open gregfi opened 2 years ago
@gregfi can you pls build UCX with debug info and run
UCX_LOG_LEVEL=data ucx_info -d -t dc_mlx5
and upload the output?
@gregfi UCX no longer supports MLNX_OFED 4.9. pls see https://github.com/openucx/ucx#supported-transports I'd advise to install latest MLNX_OFED 5.7 on the system.
So if I have a system that still needs to support mlx4
hardware, which version of UCX would you advise using?
it should use the UCX version that comes with MLNX_OFED 4.9 (AFAIR it's 1.8) for the system with mlx5 it's recommended to use latest MLNX_OFED 5.x
when I do that (use UCX 1.8.0 / OpenMPI 4.0.3rc4), I get:
ib_mlx5_dv.c:223 UCX ERROR mlx5dv_devx_qp_modify(502) failed, syndrome 0: Invalid argument
UCX_LOG_LEVEL=data ucx_info -d
attached.
@gregfi can you pls follow the instructions on https://github.com/openucx/ucx/wiki/How-to-get-FW-syndrome-when-using-DEVX to get the FW syndrome from dmesg?
Here's the dmesg.log that we captured: dmesg.log
My admins tell me that two ethernet interfaces (eth4
and eth6
) are "teamed" together to team0
to provide (in theory) 200G throughput:
hpap13n1:/tmp # ifconfig | tail -8
team0 Link encap:Ethernet HWaddr 88:E9:A4:1C:FE:74
inet addr:10.179.33.218 Bcast:10.179.33.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:7028393 errors:0 dropped:34 overruns:0 frame:0
TX packets:7939350 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5405896968 (5155.4 Mb) TX bytes:4180773337 (3987.0 Mb)
Does that seem connected to the failure here?
@gregfi are you sure echo 'func mlx5_cmd_check +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
was done ? was it done before running ucx_info
?
seems the errors in dmesg are not exactly what i was looking for, but it could be related.
The dmesg output was produced after re-executing the ring
test, not ucx_info
. Let me see if my admin changed it back already...
Here's what I get with reproducer=UCX_LOG_LEVEL=data ucx_info -d
:
@gregfi could you please add some more dynamic debug
echo 1 | sudo tee /sys/module/mlx5_core/parameters/debug_mask
echo 'func dump_buf +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
echo 'func dump_command +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
echo 'func mlx5_cmd_check +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
and re-execute the reproducer
@gregfi ping
@gregfi this debug prints a lot of data on your setup so the dmesg buffer can hold only ~20sec of the history could you please try to capture the dmesg output immediately after the reproducer so the line
RST2INIT_QP(0x502) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0xc7051)
still will be in the buffer. something like
dmesg -C ; UCX_LOG_LEVEL=data ucx_info -d ; dmesg -c > dmesg.log
@gregfi can you try latest UCX master branch with MLNX_OFED 5.x? perhaps #8677 could fix the issue
I compiled the latest master branch. I haven't re-compiled OpenMPI and tried the original reproducer, but ucx_info -d
still produces:
[1667590544.077395] [hpap13n1:36263:0] dc_mlx5.c:509 UCX ERROR ibv_modify_qp(DCI, RTR) failed : Invalid argument
@gregfi can you try to run some IB test, e.g ib_send_lat
with DC? Check whether it works and provide the output.
@gregfi ping
Describe the bug
Running
ring_c
yields output like the following on our "next gen" systems:The same installation works fine on
mlx4
hardware. I tried running withUCX_TLS=sm,self,rc,ud,tcp
(disablingdc
), but then I get timeout errors.Steps to Reproduce
mpirun -np 128 --mca pml ucx ./ring_c
Setup and versions
uname -a
output:Linux hpap13n1 4.12.14-122.106-default #1 SMP Mon Jan 3 22:02:42 UTC 2022 (cca30be) x86_64 x86_64 x86_64 GNU/Linux
rdma-core
fromMLNX_OFED_SRC-4.9-5.1.0.0
ibstat
:Additional information (depending on the issue)
ucx_info -d
: