openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument warning showing in output #9468

Open richardnixonshead opened 10 months ago

richardnixonshead commented 10 months ago

Describe the bug

When running a job through OpenMPI and UCX, a warning/error of Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument shows up in the output.

It doesn't happen every time.

[scrosby@spartan-bm035 OpenMPI]$ mpirun -np 2 ./mpi-pingpong
spartan-bm035.hpc.unimelb.edu.au:rank0.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument
spartan-bm035.hpc.unimelb.edu.au:rank1.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument
spartan-bm035.hpc.unimelb.edu.au:rank1.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument
spartan-bm035.hpc.unimelb.edu.au:rank1.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument
spartan-bm035.hpc.unimelb.edu.au:rank0.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument
spartan-bm035.hpc.unimelb.edu.au:rank0.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument
Hello from 1 of 2
Hello from 0 of 2
Timer accuracy of ~0.188000 usecs

      8 bytes took         6 usec (   2.789 MB/sec)
     16 bytes took         3 usec (   9.762 MB/sec)
     32 bytes took         1 usec ( 118.519 MB/sec)
     64 bytes took         0 usec ( 288.288 MB/sec)
    128 bytes took        21 usec (  12.388 MB/sec)
    256 bytes took         3 usec ( 169.480 MB/sec)
    512 bytes took        14 usec (  75.611 MB/sec)
   1024 bytes took         9 usec ( 236.162 MB/sec)
   2048 bytes took        10 usec ( 397.709 MB/sec)
   4096 bytes took        11 usec ( 770.214 MB/sec)
   8192 bytes took        14 usec (1184.328 MB/sec)
  16384 bytes took        26 usec (1269.881 MB/sec)
  32768 bytes took        39 usec (1692.300 MB/sec)
  65536 bytes took        66 usec (1996.253 MB/sec)
 131072 bytes took       130 usec (2020.533 MB/sec)
 262144 bytes took       336 usec (1560.780 MB/sec)
 524288 bytes took       299 usec (3512.594 MB/sec)
1048576 bytes took       467 usec (4493.143 MB/sec)

Asynchronous ping-pong

      8 bytes took         1 usec (  12.251 MB/sec)
     16 bytes took         0 usec (  79.208 MB/sec)
     32 bytes took         1 usec (  99.533 MB/sec)
     64 bytes took         1 usec ( 215.852 MB/sec)
    128 bytes took         2 usec ( 147.891 MB/sec)
    256 bytes took         7 usec (  73.903 MB/sec)
    512 bytes took         1 usec ( 947.271 MB/sec)
   1024 bytes took        13 usec ( 159.950 MB/sec)
   2048 bytes took         2 usec (2442.457 MB/sec)
   4096 bytes took         8 usec (1034.343 MB/sec)
   8192 bytes took         7 usec (2272.084 MB/sec)
  16384 bytes took         8 usec (4124.355 MB/sec)
  32768 bytes took         9 usec (6980.827 MB/sec)
  65536 bytes took        30 usec (4420.939 MB/sec)
 131072 bytes took       111 usec (2353.135 MB/sec)
 262144 bytes took        54 usec (9758.371 MB/sec)
 524288 bytes took        82 usec (12848.779 MB/sec)
1048576 bytes took       171 usec (12228.577 MB/sec)

Bi-directional asynchronous ping-pong

      8 bytes took         1 usec (  30.476 MB/sec)
     16 bytes took         0 usec (  71.910 MB/sec)
     32 bytes took         0 usec ( 132.505 MB/sec)
     64 bytes took         1 usec ( 228.980 MB/sec)
    128 bytes took         1 usec ( 247.582 MB/sec)
    256 bytes took         1 usec ( 455.922 MB/sec)
    512 bytes took         1 usec ( 767.616 MB/sec)
   1024 bytes took         1 usec (1684.211 MB/sec)
   2048 bytes took         2 usec (2142.259 MB/sec)
   4096 bytes took         2 usec (3457.999 MB/sec)
   8192 bytes took         4 usec (4397.209 MB/sec)
  16384 bytes took         7 usec (4946.860 MB/sec)
  32768 bytes took        38 usec (1711.748 MB/sec)
  65536 bytes took        38 usec (3450.989 MB/sec)
 131072 bytes took        50 usec (5225.531 MB/sec)
 262144 bytes took       155 usec (3389.304 MB/sec)
 524288 bytes took       165 usec (6345.661 MB/sec)
1048576 bytes took       290 usec (7230.711 MB/sec)

If I rerun the command a couple of times immediately after a failure, it seems to run without the warning.
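Because the warning is intermittent, it can help to count how often it appears per run. The following is a minimal sketch, not part of the original report: `count_ud_qp_warnings` is a hypothetical helper name, and it only matches the warning text quoted above.

```shell
# Count the lines on stdin that contain the UD QP warning from this issue.
# grep -c exits non-zero when there are no matches, so "|| true" keeps the
# helper usable under "set -e"; the count (possibly 0) is still printed.
count_ud_qp_warnings() {
    grep -c 'Failed to modify UD QP to INIT' || true
}

# Example with two captured output lines, one of which is the warning.
printf '%s\n' \
    'host:rank0.mpi-pingpong: Failed to modify UD QP to INIT on mlx5_bond_0: Invalid argument' \
    'Hello from 0 of 2' | count_ud_qp_warnings
```

On a real system the helper could be fed from the job itself, for example `mpirun -np 2 ./mpi-pingpong 2>&1 | count_ud_qp_warnings`, repeated a few times to see how often the count is non-zero.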

Steps to Reproduce

Setup and versions

yosefe commented 10 months ago

@richardnixonshead is there any mlx5 error in dmesg?

richardnixonshead commented 10 months ago

@yosefe No. Nothing in dmesg or /var/log/messages

SomePersonSomeWhereInTheWorld commented 6 months ago

I'm seeing this as well on RHEL 9 with OpenMPI 4.1.1. This sometimes fixes the warnings: export OMPI_MCA_btl=^openib

Oddly, putting the variable in a module file does not work; I have to export it.
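A minimal shell sketch of that workaround, for reference. The export is the one quoted above; the check with `env` and the `run_pingpong` wrapper are illustrative additions, and the `--mca btl ^openib` command-line form is Open MPI's per-run equivalent of the environment variable, which nobody in this thread has tested.

```shell
# Exclude the openib BTL for every mpirun started from this shell.
# The value is quoted so the shell never treats ^ specially.
export OMPI_MCA_btl='^openib'

# Confirm the variable is actually in the environment mpirun will inherit.
env | grep '^OMPI_MCA_btl='

# Hypothetical wrapper: passes the same exclusion for a single run
# instead of relying on the exported variable. Defined here, not invoked.
run_pingpong() {
    mpirun --mca btl '^openib' -np 2 ./mpi-pingpong
}
```

If the `env` line prints nothing after loading a module file, the variable never reached the environment, which would match the behaviour described above.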

richardnixonshead commented 6 months ago

We seem to have fixed most or all of them by upgrading the Mellanox firmware. We've started rolling out 16_35_3502 and the warnings seem to have stopped.
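To check which nodes already run the newer firmware, something like the sketch below may help. It assumes `ibv_devinfo` from rdma-core is installed and that the underscore form 16_35_3502 corresponds to the dotted `fw_ver` string 16.35.3502 that `ibv_devinfo` prints; `fw_version` and `normalize_fw` are hypothetical helper names.

```shell
# Print the firmware version ibv_devinfo reports for one RDMA device.
# Prints nothing if ibv_devinfo is missing or the device is not found.
fw_version() {
    command -v ibv_devinfo >/dev/null 2>&1 || return 0
    ibv_devinfo -d "$1" 2>/dev/null | awk '/fw_ver:/ {print $2}'
}

# Assumption: 16_35_3502 and 16.35.3502 name the same release, so
# convert underscores to dots before comparing.
normalize_fw() {
    printf '%s\n' "$1" | tr '_' '.'
}

target=$(normalize_fw 16_35_3502)
current=$(fw_version mlx5_bond_0)

if [ -z "$current" ]; then
    echo "could not read firmware version for mlx5_bond_0"
elif [ "$current" = "$target" ]; then
    echo "mlx5_bond_0 is on $target"
else
    echo "mlx5_bond_0 reports $current, target is $target"
fi
```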