Closed by LiuJiao0408 2 weeks ago
This documentation will probably be helpful: https://docs.open-mpi.org/en/v5.0.x/mca.html#setting-mca-parameter-values
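As a rough illustration of what that page describes, MCA parameters can be given either on the `mpirun` command line as `--mca <name> <value>` or through environment variables named `OMPI_MCA_<name>`. The sketch below only builds both forms for the `btl_openib_allow_ib` parameter named in the warning; it does not launch anything. The helper names, the process count, and `./my_app` are hypothetical, and whether setting this parameter is the right choice for this particular adapter is not established here.

```python
import os
import shlex


def mca_argv(params):
    """Build `--mca <name> <value>` arguments for mpirun from a dict."""
    argv = []
    for name, value in params.items():
        argv += ["--mca", name, str(value)]
    return argv


def mca_env(params):
    """Return a copy of the environment with OMPI_MCA_<name>=<value> added."""
    env = dict(os.environ)
    for name, value in params.items():
        env[f"OMPI_MCA_{name}"] = str(value)
    return env


# The parameter named in the warning text; the value is the one it suggests.
params = {"btl_openib_allow_ib": "true"}

# Form 1: command-line arguments (./my_app and -np 4 are placeholders).
cmd = ["mpirun", *mca_argv(params), "-np", "4", "./my_app"]
print(shlex.join(cmd))

# Form 2: environment variables, to be passed to the launcher's environment.
env = mca_env(params)
print("OMPI_MCA_btl_openib_allow_ib=" + env["OMPI_MCA_btl_openib_allow_ib"])
```

Either form could then be handed to `subprocess.run(cmd, env=env)`; using only one of the two is enough.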
Thank you so much!
Hi all! When I run with Open MPI, I hit the problem below. Someone mentioned this issue before, but I couldn't find a solution. Can anyone help? This is the output:
By default, for Open MPI 4.0 and later, infiniband ports on a device are not used by default. The intent is to use UCX for these devices. You can override this policy by setting the btl_openib_allow_ib MCA parameter to true.
Local host: cu25
Local adapter: hfi1_0
Local port: 1
WARNING: There was an error initializing an OpenFabrics device.
Local host: cu25
Local device: hfi1_0
[cu25][[13139,1],1][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[13139,1],4]
[cu25][[13139,1],3][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[13139,1],13]
WARNING: Open MPI failed to TCP connect to a peer MPI process. This should not happen.
Your Open MPI job may now hang or fail.
Local host: cu25
PID: 310733
Message: connect() to 192.168.122.1:1045 failed
Error: Operation now in progress (115)
[cu25][[13139,1],0][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[13139,1],2]
[cu26][[13139,1],18][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[13139,1],24]
[cu26][[13139,1],20][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[13139,1],24]
WARNING: Open MPI accepted a TCP connection from what appears to be a another Open MPI process but cannot find a corresponding process entry for that peer.
This attempted connection will be ignored; your MPI job may or may not continue properly.
Local host: cu26
PID: 85674
[cu25:310720] 35 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[cu25:310720] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cu25:310720] 35 more processes have sent help message help-mpi-btl-openib.txt / error in device init