[join-worker-1][[60663,1],153][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on locaed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(197) failed: Bad file descriptor (9)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(199) failed: Bad file descriptor (9)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(203) failed: Bad file descriptor (9)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(204) failed: Bad file descriptor (9)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(214) failed: Bad file descriptor (9)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(210) failed: Bad file descriptor (9)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(22) failed: Connection reset by peer (104)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1:18649] *** Process received signal ***
[join-worker-1:18649] Signal: Segmentation fault (11)
[join-worker-1:18649] Signal code: Address not mapped (1)
[join-worker-1:18649] Failing at address: (nil)
[join-worker-1:18649] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f967caa6090]
[join-worker-1:18649] *** End of error message ***
[join-launcher:00001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[join-launcher:00001] 24 more processes have sent help message help-mpi-btl-tcp.txt / socket flag fail
[join-launcher:00001] 93 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up
[join-launcher:00001] 11 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
Hello all,
I was setting up a large-scale MPI job on an EKS-EFA cluster. I used x2idn.32xlarge instances as follows:
- I followed the EFA cluster setup here: https://docs.aws.amazon.com/eks/latest/userguide/node-efa.html as my job requires high-performance networking.
- I set the worker replicas to the number of nodes.
- SlotsPerWorker is equal to cores_per_node - 2.
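To make the sizing concrete, here is a minimal sketch of how the total slot count follows from the settings above. It assumes 128 cores per node (taken from the 128-core single-node run); the function name and the reserved_cores parameter are only illustrative and are not part of my job spec.

```python
def total_slots(num_workers: int, cores_per_node: int = 128, reserved_cores: int = 2) -> int:
    """Total MPI slots when every worker exposes cores_per_node - reserved_cores slots.

    cores_per_node=128 is an assumption based on the single-node run;
    reserved_cores=2 mirrors the "cores_per_node - 2" rule used for SlotsPerWorker.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    slots_per_worker = cores_per_node - reserved_cores
    return num_workers * slots_per_worker


if __name__ == "__main__":
    # Print the slot count for each worker count tried so far (1 to 4 nodes).
    for workers in range(1, 5):
        print(f"{workers} worker(s): {total_slots(workers)} slots")
```

Under these assumptions, each additional worker adds the same number of slots, so the 3- and 4-worker runs differ from the working ones only in scale.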
My job ran fine with 128 cores (1 node) and 256 cores (2 nodes). When I increase the number of workers to 3 or 4, I get the error shown in the log above at the launcher level.
Could you please help me figure out whether I am doing something wrong or missing something in my setup? I cannot test this frequently, as the failure only happens on AWS and this specific setup is expensive to run.
Thank you.