kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

MPI Job fails on EKS with > 2 instances of 128 core per instance #594

Closed. AymenFJA closed this issue 9 months ago.

AymenFJA commented 9 months ago

Hello all,

I was setting up a large-scale MPI job on an EKS cluster with EFA. I used x2idn.32xlarge instances, as follows:

  1. Each instance has 128 cores (total 512 cores)
  2. Each instance has 2048 GB of memory (total 8192 GB)
  3. I followed the EFA cluster setup described here: https://docs.aws.amazon.com/eks/latest/userguide/node-efa.html, as my job requires high-performance networking.
  4. I deploy a single worker per node, so I scale the number of worker replicas with the number of nodes.
  5. slotsPerWorker is set to cores_per_node - 2, i.e. 126 (see the note after the example below).
  6. Example of my MPIJob setup:
    apiVersion: kubeflow.org/v2beta1
    kind: MPIJob
    metadata:
      name: join-operation
    spec:
      slotsPerWorker: 126
      runPolicy:
        cleanPodPolicy: Running # Running or All
      sshAuthMountPath: /home/mpiuser/.ssh
      mpiReplicaSpecs:
        Launcher:
          replicas: 1
          restartPolicy: OnFailure
          template:
            spec:
              containers:
              - image: mpi_workload_image
                name: h-launcher
                securityContext:
                  runAsUser: 1000
                command:
                  - mpirun
                  - --allow-run-as-root
                  - -np
                  - "378"
                  - -x
                  - LD_LIBRARY_PATH
                  - /usr/ENV/bin/python3
                args:
                  - /xxxx/xxxx/scaling.py
                  - -n 35000000
                resources:
                  limits:
                    cpu: 1
                    memory: 1Gi
        Worker:
          replicas: 3 # number of nodes (one worker per node; set this to the cluster's node count to dedicate the whole cluster to MPI)
          template:
            spec:
              containers:
              - image: mpi_workload_image
                name: h-worker
                securityContext:
                  runAsUser: 1000
                command:
                - /usr/sbin/sshd
                args:
                - -De
                - -f
                - /home/mpiuser/.sshd_config
                resources:
                  requests:
                    cpu: 126
                    memory: 1950Gi
                  limits:
                    cpu: 126
                    memory: 1950Gi
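
For reference, the -np value tracks replicas x slotsPerWorker: the 3-worker run above uses 3 x 126 = 378 ranks. A minimal sketch of the fields I change when going to 4 nodes (the "504" is extrapolated from the same pattern, 4 x 126; everything omitted stays exactly as in the example above):

    spec:
      slotsPerWorker: 126        # unchanged: cores_per_node - 2
      mpiReplicaSpecs:
        Launcher:
          template:
            spec:
              containers:
              - command:
                  - mpirun
                  - -np
                  - "504"        # 4 workers x 126 slots (extrapolated)
                  # ... rest of the launcher command as above
        Worker:
          replicas: 4            # one worker per node, 4 nodes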

My job was working fine with 128 cores (1 node) and 256 cores (2 nodes). When I increase the number of workers to 3 or 4, I get the following error at the launcher level:

[join-worker-1][[60663,1],153][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(197) failed: Bad file descriptor (9)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(199) failed: Bad file descriptor (9)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(203) failed: Bad file descriptor (9)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(204) failed: Bad file descriptor (9)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(214) failed: Bad file descriptor (9)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(210) failed: Bad file descriptor (9)
[join-worker-1][[60663,1],164][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(22) failed: Connection reset by peer (104)
[join-worker-1][[60663,1],164][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[join-worker-1:18649] *** Process received signal ***
[join-worker-1:18649] Signal: Segmentation fault (11)
[join-worker-1:18649] Signal code: Address not mapped (1)
[join-worker-1:18649] Failing at address: (nil)
[join-worker-1:18649] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f967caa6090]
[join-worker-1:18649] *** End of error message ***
[join-launcher:00001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[join-launcher:00001] 24 more processes have sent help message help-mpi-btl-tcp.txt / socket flag fail
[join-launcher:00001] 93 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up
[join-launcher:00001] 11 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
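
The launcher output suggests setting the MCA parameter orte_base_help_aggregate to 0 to see the suppressed help/error messages. A sketch of how that could be added to the launcher command above (only the --mca orte_base_help_aggregate 0 arguments are new, the rest is unchanged):

    command:
      - mpirun
      - --mca
      - orte_base_help_aggregate
      - "0"                      # print every help/error message instead of aggregating
      - --allow-run-as-root
      - -np
      - "378"
      - -x
      - LD_LIBRARY_PATH
      - /usr/ENV/bin/python3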

Can you please help me figure out whether I am doing something wrong or missing something in my setup? I cannot test very frequently, as the failure only happens on AWS and this specific setup is expensive.

Thank you.

alculquicondor commented 9 months ago

A segmentation fault in the application is completely outside the scope of the mpi-operator.

AymenFJA commented 9 months ago

Thanks.