openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org
Other
1.16k stars 427 forks source link

UCX ERROR ibv_modify_qp(DCI, RTR) failed : Invalid argument #8604

Open gregfi opened 2 years ago

gregfi commented 2 years ago

Describe the bug

Running ring_c yields output like the following on our "next gen" systems:

[1665103135.306110] [hpap13n2:7391 :0]         dc_mlx5.c:456  UCX  ERROR ibv_modify_qp(DCI, RTR) failed : Invalid argument
[hpap13n2:07330] ../../../../../openmpi-4.1.4/ompi/mca/pml/ucx/pml_ucx.c:309  Error: Failed to create UCP worker

The same installation works fine on mlx4 hardware. I tried running with UCX_TLS=sm,self,rc,ud,tcp (disabling dc), but then I get timeout errors.

Steps to Reproduce

mpirun -np 128 --mca pml ucx ./ring_c

Setup and versions

Additional information (depending on the issue)

yosefe commented 2 years ago

@gregfi can you pls build UCX with debug info and run UCX_LOG_LEVEL=data ucx_info -d -t dc_mlx5 and upload the output?

gregfi commented 2 years ago

debug.txt

yosefe commented 2 years ago

@gregfi UCX no longer supports MLNX_OFED 4.9. pls see https://github.com/openucx/ucx#supported-transports I'd advise to install latest MLNX_OFED 5.7 on the system.

gregfi commented 2 years ago

So if I have a system that still needs to support mlx4 hardware, which version of UCX would you advise using?

yosefe commented 2 years ago

it should use the UCX version that comes with MLNX_OFED 4.9 (AFAIR it's 1.8) for the system with mlx5 it's recommended to use latest MLNX_OFED 5.x

gregfi commented 2 years ago

when I do that (use UCX 1.8.0 / OpenMPI 4.0.3rc4), I get:

ib_mlx5_dv.c:223 UCX ERROR mlx5dv_devx_qp_modify(502) failed, syndrome 0: Invalid argument

UCX_LOG_LEVEL=data ucx_info -d attached.

ucx_info_1.8.0.txt

yosefe commented 2 years ago

@gregfi can you pls follow the instructions on https://github.com/openucx/ucx/wiki/How-to-get-FW-syndrome-when-using-DEVX to get the FW syndrome from dmesg?

gregfi commented 2 years ago

Here's the dmesg.log that we captured: dmesg.log

My admins tell me that two ethernet interfaces (eth4 and eth6) are "teamed" together to team0 to provide (in theory) 200G throughput:

hpap13n1:/tmp # ifconfig | tail -8
team0     Link encap:Ethernet  HWaddr 88:E9:A4:1C:FE:74  
          inet addr:10.179.33.218  Bcast:10.179.33.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:7028393 errors:0 dropped:34 overruns:0 frame:0
          TX packets:7939350 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5405896968 (5155.4 Mb)  TX bytes:4180773337 (3987.0 Mb)

Does that seem connected to the failure here?

yosefe commented 2 years ago

@gregfi are you sure echo 'func mlx5_cmd_check +p' | sudo tee /sys/kernel/debug/dynamic_debug/control was done ? was it done before running ucx_info? seems the errors in dmesg are not exactly what i was looking for, but it could be related.

gregfi commented 2 years ago

The dmesg output was produced after re-executing the ring test, not ucx_info. Let me see if my admin changed it back already...

gregfi commented 2 years ago

Here's what I get with reproducer=UCX_LOG_LEVEL=data ucx_info -d:

dmesg.2.log

Artemy-Mellanox commented 2 years ago

@gregfi could you please add some more dynamic debug

echo 1 | sudo tee /sys/module/mlx5_core/parameters/debug_mask
echo 'func dump_buf +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
echo 'func dump_command +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
echo 'func mlx5_cmd_check +p' | sudo tee /sys/kernel/debug/dynamic_debug/control

and re-execute the reproducer

yosefe commented 2 years ago

@gregfi ping

gregfi commented 2 years ago

Attached.

ucx_info.txt

dmesg.3.log

Artemy-Mellanox commented 2 years ago

@gregfi this debug prints a lot of data on your setup so the dmesg buffer can hold only ~20sec of the history could you please try to capture the dmesg output immediately after the reproducer so the line

RST2INIT_QP(0x502) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0xc7051)

still will be in the buffer. something like

dmesg -C ; UCX_LOG_LEVEL=data ucx_info -d ; dmesg -c > dmesg.log
gregfi commented 2 years ago

dmesg.4.log

yosefe commented 2 years ago

@gregfi can you try latest UCX master branch with MLNX_OFED 5.x? perhaps #8677 could fix the issue

gregfi commented 2 years ago

I compiled the latest master branch. I haven't re-compiled OpenMPI and tried the original reproducer, but ucx_info -d still produces:

[1667590544.077395] [hpap13n1:36263:0] dc_mlx5.c:509 UCX ERROR ibv_modify_qp(DCI, RTR) failed : Invalid argument

gleon99 commented 2 years ago

@gregfi can you try to run some IB test, e.g ib_send_lat with DC? Check whether it works and provide the output.

yosefe commented 1 year ago

@gregfi ping