open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

A problem about how to set btl_openib_allow_ib MCA parameter to true #12779

Closed: LiuJiao0408 closed this issue 2 weeks ago

LiuJiao0408 commented 2 weeks ago

Hi all! When I run a job with Open MPI, I get the warnings below. Someone has mentioned this issue before, but I couldn't find a solution. Can anyone help? This is the output:


By default, for Open MPI 4.0 and later, infiniband ports on a device are not used by default. The intent is to use UCX for these devices. You can override this policy by setting the btl_openib_allow_ib MCA parameter to true.

Local host: cu25
Local adapter: hfi1_0
Local port: 1



WARNING: There was an error initializing an OpenFabrics device.

Local host: cu25
Local device: hfi1_0

[cu25][[13139,1],1][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[13139,1],4]
[cu25][[13139,1],3][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[13139,1],13]

WARNING: Open MPI failed to TCP connect to a peer MPI process. This should not happen.

Your Open MPI job may now hang or fail.

Local host: cu25
PID: 310733
Message: connect() to 192.168.122.1:1045 failed
Error: Operation now in progress (115)

[cu25][[13139,1],0][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[13139,1],2]
[cu26][[13139,1],18][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[13139,1],24]
[cu26][[13139,1],20][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[13139,1],24]

WARNING: Open MPI accepted a TCP connection from what appears to be a another Open MPI process but cannot find a corresponding process entry for that peer.

This attempted connection will be ignored; your MPI job may or may not continue properly.

Local host: cu26
PID: 85674

[cu25:310720] 35 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[cu25:310720] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cu25:310720] 35 more processes have sent help message help-mpi-btl-openib.txt / error in device init

jsquyres commented 2 weeks ago

This documentation will probably be helpful: https://docs.open-mpi.org/en/v5.0.x/mca.html#setting-mca-parameter-values
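
As a rough sketch of what that page describes, an MCA parameter such as btl_openib_allow_ib can be set through an environment variable, on the mpirun command line, or in a per-user parameter file. The application name ./my_app and the process count below are placeholders, not taken from this issue, and the parameter only applies to Open MPI builds that still include the openib BTL (the 4.x series).

```shell
# Option 1: environment variable. Open MPI reads variables named OMPI_MCA_<parameter>.
export OMPI_MCA_btl_openib_allow_ib=true

# Option 2: mpirun command line (./my_app and -np 4 are placeholders).
# mpirun --mca btl_openib_allow_ib true -np 4 ./my_app

# Option 3: per-user parameter file, one "name = value" entry per line.
# mkdir -p "$HOME/.openmpi"
# echo "btl_openib_allow_ib = true" >> "$HOME/.openmpi/mca-params.conf"
```

Options 2 and 3 are left commented out so the snippet has no side effects when pasted into a shell; any one of the three mechanisms should be enough on its own.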

LiuJiao0408 commented 2 weeks ago

Thank you so much!