Open richardnixonshead opened 10 months ago
@richardnixonshead is there any mlx5 error in dmesg?
@yosefe No. Nothing in dmesg or /var/log/messages
I'm seeing this as well on RHEL 9 with OpenMPI 4.1.1. This sometimes fixes the warnings:
export OMPI_MCA_btl=^openib
Oddly, putting the variable in a module file does not work; I have to export it.
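For anyone trying the workaround above, here is a minimal sketch of setting the variable and confirming it actually reached the environment that mpirun will inherit. The same check can be run after loading a module to see whether the module set it at all. The `--mca btl ^openib` form in the comment is the usual command-line way to pass an MCA parameter to Open MPI; whether it also silences this warning has not been confirmed in this thread.

```shell
# Exclude the openib BTL for this shell session and anything launched from it.
# Quoted so the leading ^ is always taken literally.
export OMPI_MCA_btl='^openib'

# Confirm the variable is really in the environment (prints OMPI_MCA_btl=^openib).
env | grep '^OMPI_MCA_btl='

# Per-run alternative, assuming mpirun and ./mpi-pingpong are available:
# mpirun --mca btl ^openib -np 2 ./mpi-pingpong
```

If the `env | grep` line prints nothing after a module load, the module did not export the variable into the current shell.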
We seem to have fixed most/all of them by upgrading the Mellanox firmware. We've started rolling out 16_35_3502 and it seems to have stopped
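To compare firmware across nodes, something like the following sketch may help. It assumes `ibv_devinfo` prints an `hca_id:` line followed by a `fw_ver:` line for each device; the helper name `list_fw_versions` is made up for illustration, and the version may be printed in a dotted form rather than the underscore form quoted above.

```shell
# Hypothetical helper: read ibv_devinfo output on stdin and print
# "<device> <firmware version>" for each device found.
list_fw_versions() {
    awk '$1 == "hca_id:" { dev = $2 }
         $1 == "fw_ver:" { print dev, $2 }'
}

# Usage on a node with rdma-core tools installed:
# ibv_devinfo | list_fw_versions
```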
Describe the bug
When running a job through OpenMPI and UCX, the warning/error "Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument" shows up in the output.
It doesn't happen every time.
[scrosby@spartan-bm035 OpenMPI]$ mpirun -np 2 ./mpi-pingpong
spartan-bm035.hpc.unimelb.edu.au:rank0.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument
spartan-bm035.hpc.unimelb.edu.au:rank1.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument
spartan-bm035.hpc.unimelb.edu.au:rank1.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument
spartan-bm035.hpc.unimelb.edu.au:rank1.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument
spartan-bm035.hpc.unimelb.edu.au:rank0.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument
spartan-bm035.hpc.unimelb.edu.au:rank0.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument
Hello from 1 of 2
Hello from 0 of 2
Timer accuracy of ~0.188000 usecs

16384 bytes took 26 usec (1269.881 MB/sec)
32768 bytes took 39 usec (1692.300 MB/sec)
65536 bytes took 66 usec (1996.253 MB/sec)
131072 bytes took 130 usec (2020.533 MB/sec)
262144 bytes took 336 usec (1560.780 MB/sec)
524288 bytes took 299 usec (3512.594 MB/sec)
1048576 bytes took 467 usec (4493.143 MB/sec)

Asynchronous ping-pong
16384 bytes took 8 usec (4124.355 MB/sec)
32768 bytes took 9 usec (6980.827 MB/sec)
65536 bytes took 30 usec (4420.939 MB/sec)
131072 bytes took 111 usec (2353.135 MB/sec)
262144 bytes took 54 usec (9758.371 MB/sec)
524288 bytes took 82 usec (12848.779 MB/sec)
1048576 bytes took 171 usec (12228.577 MB/sec)

Bi-directional asynchronous ping-pong
16384 bytes took 7 usec (4946.860 MB/sec)
32768 bytes took 38 usec (1711.748 MB/sec)
65536 bytes took 38 usec (3450.989 MB/sec)
131072 bytes took 50 usec (5225.531 MB/sec)
262144 bytes took 155 usec (3389.304 MB/sec)
524288 bytes took 165 usec (6345.661 MB/sec)
1048576 bytes took 290 usec (7230.711 MB/sec)
If I rerun it a couple of times immediately after a failure, the command seems to run without the warning.
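Because the warning is intermittent, it may help to measure how often it shows up rather than rely on single runs. The sketch below counts how many of N runs of a command print the message; the function name `count_warning_runs` and the `PATTERN` variable are made up for illustration, and the example invocation assumes `mpirun` and `./mpi-pingpong` are available as in the report.

```shell
# Text to look for in the combined stdout/stderr of each run.
PATTERN='Failed to modify UD QP to INIT'

# Hypothetical helper: run a command N times and print the number of
# runs whose output contained PATTERN.
count_warning_runs() {
    runs=$1; shift
    hits=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Capture stdout and stderr; ignore the command's exit status.
        out=$("$@" 2>&1) || true
        case $out in
            *"$PATTERN"*) hits=$((hits + 1)) ;;
        esac
        i=$((i + 1))
    done
    echo "$hits"
}

# Example: how many of 20 back-to-back runs show the warning?
# count_warning_runs 20 mpirun -np 2 ./mpi-pingpong
```

Running this before and after a change (the exported MCA variable, a firmware upgrade) gives a rough rate to compare instead of a single pass or fail.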
Steps to Reproduce
$ ucx_info -v
Setup and versions
$ ibv_devinfo -vv
Additional information (depending on the issue)