kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

tensorflow-benchmark worker Authentication: Permissions 0640 for '/root/.ssh/id_rsa' are too open. #429

Open celiawa opened 2 years ago

celiawa commented 2 years ago

Hi, I'm trying to run the example job in /examples/v1/tensorflow-benchmarks.yaml, but the launcher pod fails with the error below. Could you please take a look?

[epwiann@node-10-210-152-99 kubeflow]$ k turing001 -n ping-wang logs tensorflow-benchmarks-launcher-62t7l
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/tensorflow-benchmarks-launcher-62t7l. Please use `kubectl.kubernetes.io/default-container` instead
Warning: Permanently added 'tensorflow-benchmarks-worker-1.tensorflow-benchmarks-worker,10.42.138.147' (ECDSA) to the list of known hosts.
Warning: Permanently added 'tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker,10.42.91.237' (ECDSA) to the list of known hosts.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0640 for '/root/.ssh/id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "/root/.ssh/id_rsa": bad permissions
Permission denied, please try again.
Permission denied, please try again.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0640 for '/root/.ssh/id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "/root/.ssh/id_rsa": bad permissions
root@tensorflow-benchmarks-worker-1.tensorflow-benchmarks-worker: Permission denied (publickey,password).
Permission denied, please try again.
Permission denied, please try again.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
root@tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker: Permission denied (publickey,password).
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   tensorflow-benchmarks-launcher
  target node:  tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
$ kubectl get po -n mynamespace |grep benchmar
tensorflow-benchmarks-launcher-62t7l               1/2     CrashLoopBackOff   5          4m34s
tensorflow-benchmarks-worker-0                     2/2     Running            0          4m35s
tensorflow-benchmarks-worker-1                     2/2     Running            0          4m35s

My mpioperator/tensorflow-benchmarks image is shown below; it is fairly recent ("2 weeks ago").

# docker images|grep  tensorflow
mpioperator/tensorflow-benchmarks                                                  latest                           840932631c4c        2 weeks ago         9.72GB

I've also checked that this tensorflow-benchmarks image has the following configuration: https://github.com/kubeflow/mpi-operator/blob/master/examples/tensorflow-benchmarks/Dockerfile#L3-L7

$ k turing001 -n ping-wang exec -it tensorflow-benchmarks-worker-0 sh
# ls -al
total 12
drwxr-sr-x 2 root 1337 100 Sep 29 06:27 .
drwxrwsrwt 3 root 1337 140 Sep 29 06:27 ..
-rw-r----- 1 root 1337 253 Sep 29 06:27 authorized_keys
-rw-r----- 1 root 1337 365 Sep 29 06:27 id_rsa
-rw-r----- 1 root 1337 253 Sep 29 06:27 id_rsa.pub
# cat /etc/ssh/ssh_config

# This is the ssh client system-wide configuration file.  See
# ssh_config(5) for more information.  This file provides defaults for
# users, and the values can be changed in per-user configuration files
# or on the command line.

# Configuration data is parsed as follows:
#  1. command line options
#  2. user-specific file
#  3. system-wide file
# Any configuration value is only changed the first time it is set.
# Thus, host-specific definitions should be at the beginning of the
# configuration file, and defaults at the end.

# Site-wide defaults for some commonly used options.  For a comprehensive
# list of available options, their meanings and defaults, please see the
# ssh_config(5) man page.

Host *
#   ForwardAgent no
#   ForwardX11 no
#   ForwardX11Trusted yes
#   PasswordAuthentication yes
#   HostbasedAuthentication no
#   GSSAPIAuthentication no
#   GSSAPIDelegateCredentials no
#   GSSAPIKeyExchange no
#   GSSAPITrustDNS no
#   BatchMode no
#   CheckHostIP yes
#   AddressFamily any
#   ConnectTimeout 0
#   IdentityFile ~/.ssh/id_rsa
#   IdentityFile ~/.ssh/id_dsa
#   IdentityFile ~/.ssh/id_ecdsa
#   IdentityFile ~/.ssh/id_ed25519
#   Port 22
#   Protocol 2
#   Ciphers aes128-ctr,aes192-ctr,aes256-ctr,aes128-cbc,3des-cbc
#   MACs hmac-md5,hmac-sha1,umac-64@openssh.com
#   EscapeChar ~
#   Tunnel no
#   TunnelDevice any:any
#   PermitLocalCommand no
#   VisualHostKey no
#   ProxyCommand ssh -q -W %h:%p gateway.example.com
#   RekeyLimit 1G 1h
    SendEnv LANG LC_*
    HashKnownHosts yes
    GSSAPIAuthentication yes
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
alculquicondor commented 2 years ago

This log line:

Permissions 0640 for '/root/.ssh/id_rsa' are too open.

Implies that somehow this didn't run:

https://github.com/kubeflow/mpi-operator/blob/c5c0c3ef99ec9de948600766988ea7134d3d2af6/v2/pkg/controller/mpi_job_controller.go#L1526

Can you show the output of kubectl get -o yaml pod tensorflow-benchmarks-launcher-62t7l? I wonder if the volumes were set up correctly.

celiawa commented 2 years ago

Hi @alculquicondor, thanks for taking the time to look into this issue.

I think I've found the cause: it may be related to the Istio sidecar. If I deploy /examples/v1/tensorflow-benchmarks.yaml in a namespace without sidecar.istio.io/inject: "true", the launcher pod runs without problems.

If it is deployed in a namespace with automatic Istio sidecar injection, the launcher pod fails with this error.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 10.42.198.115/32
    cni.projectcalico.org/podIPs: 10.42.198.115/32
    kubectl.kubernetes.io/default-logs-container: tensorflow-benchmarks
    prometheus.io/path: /stats/prometheus
    prometheus.io/port: "15020"
    prometheus.io/scrape: "true"
    sidecar.istio.io/status: '{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null}'
  creationTimestamp: "2021-09-30T01:05:12Z"
  generateName: tensorflow-benchmarks-launcher-
  labels:
    controller-uid: 83a477d9-dab2-4cc9-a91b-e5b4ac946cc7
    istio.io/rev: default
    job-name: tensorflow-benchmarks-launcher
    security.istio.io/tlsMode: istio
    service.istio.io/canonical-name: tensorflow-benchmarks-launcher
    service.istio.io/canonical-revision: latest
    training.kubeflow.org/job-name: tensorflow-benchmarks
    training.kubeflow.org/job-role: launcher
    training.kubeflow.org/operator-name: mpi-operator
  name: tensorflow-benchmarks-launcher-5jb5n
  namespace: ping-wang
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: tensorflow-benchmarks-launcher
    uid: 83a477d9-dab2-4cc9-a91b-e5b4ac946cc7
  resourceVersion: "249187576"
  selfLink: /api/v1/namespaces/ping-wang/pods/tensorflow-benchmarks-launcher-5jb5n
  uid: 007af6b3-fda2-485d-8f83-f0d829732764
spec:
  containers:
  - command:
    - mpirun
    - --allow-run-as-root
    - -np
    - "2"
    - -bind-to
    - none
    - -map-by
    - slot
    - -x
    - NCCL_DEBUG=INFO
    - -x
    - LD_LIBRARY_PATH
    - -x
    - PATH
    - -mca
    - pml
    - ob1
    - -mca
    - btl
    - ^openib
    - python
    - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
    - --model=resnet101
    - --batch_size=64
    - --variable_update=horovod
    env:
    - name: K_MPI_JOB_ROLE
      value: launcher
    - name: OMPI_MCA_orte_keep_fqdn_hostnames
      value: "true"
    - name: OMPI_MCA_orte_default_hostfile
      value: /etc/mpi/hostfile
    - name: OMPI_MCA_plm_rsh_args
      value: -o ConnectionAttempts=10
    - name: OMPI_MCA_orte_set_default_slots
      value: "1"
    - name: NVIDIA_VISIBLE_DEVICES
    - name: NVIDIA_DRIVER_CAPABILITIES
    image: mpioperator/tensorflow-benchmarks:latest
    imagePullPolicy: Always
    name: tensorflow-benchmarks
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /root/.ssh
      name: ssh-auth
    - mountPath: /etc/mpi
      name: mpi-job-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-jsmkb
      readOnly: true
  - args:
    - proxy
    - sidecar
    - --domain
    - $(POD_NAMESPACE).svc.cluster.local
    - --serviceCluster
    - tensorflow-benchmarks-launcher.ping-wang
    - --proxyLogLevel=warning
    - --proxyComponentLogLevel=misc:error
    - --log_output_level=default:info
    - --concurrency
    - "2"
    env:
    - name: JWT_POLICY
      value: third-party-jwt
    - name: PILOT_CERT_PROVIDER
      value: istiod
    - name: CA_ADDR
      value: istiod.istio-system.svc:15012
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: INSTANCE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: SERVICE_ACCOUNT
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.serviceAccountName
    - name: HOST_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: CANONICAL_SERVICE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['service.istio.io/canonical-name']
    - name: CANONICAL_REVISION
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['service.istio.io/canonical-revision']
    - name: PROXY_CONFIG
      value: |
        {"tracing":{}}
    - name: ISTIO_META_POD_PORTS
      value: |-
        [
        ]
    - name: ISTIO_META_APP_CONTAINERS
      value: tensorflow-benchmarks
    - name: ISTIO_META_CLUSTER_ID
      value: Kubernetes
    - name: ISTIO_META_INTERCEPTION_MODE
      value: REDIRECT
    - name: ISTIO_META_WORKLOAD_NAME
      value: tensorflow-benchmarks-launcher
    - name: ISTIO_META_OWNER
      value: kubernetes://apis/batch/v1/namespaces/ping-wang/jobs/tensorflow-benchmarks-launcher
    - name: ISTIO_META_MESH_ID
      value: cluster.local
    - name: TRUST_DOMAIN
      value: cluster.local
    image: gcr.io/istio-release/proxyv2:1.9.6
    imagePullPolicy: Always
    name: istio-proxy
    ports:
    - containerPort: 15090
      name: http-envoy-prom
      protocol: TCP
    readinessProbe:
      failureThreshold: 30
      httpGet:
        path: /healthz/ready
        port: 15021
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 2
      successThreshold: 1
      timeoutSeconds: 3
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 40Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1337
      runAsNonRoot: true
      runAsUser: 1337
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/istio
      name: istiod-ca-cert
    - mountPath: /var/lib/istio/data
      name: istio-data
    - mountPath: /etc/istio/proxy
      name: istio-envoy
    - mountPath: /var/run/secrets/tokens
      name: istio-token
    - mountPath: /etc/istio/pod
      name: istio-podinfo
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-jsmkb
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: tensorflow-benchmarks-launcher
  initContainers:
  - args:
    - istio-iptables
    - -p
    - "15001"
    - -z
    - "15006"
    - -u
    - "1337"
    - -m
    - REDIRECT
    - -i
    - '*'
    - -x
    - ""
    - -b
    - '*'
    - -d
    - 15090,15021,15020
    image: gcr.io/istio-release/proxyv2:1.9.6
    imagePullPolicy: Always
    name: istio-init
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 40Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_ADMIN
        - NET_RAW
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: false
      runAsGroup: 0
      runAsNonRoot: false
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-jsmkb
      readOnly: true
  nodeName: node-10-120-220-137
  priority: 0
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1337
  serviceAccount: default
  serviceAccountName: default
  subdomain: tensorflow-benchmarks-worker
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir:
      medium: Memory
    name: istio-envoy
  - emptyDir: {}
    name: istio-data
  - downwardAPI:
      defaultMode: 420
      items:
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels
        path: labels
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.annotations
        path: annotations
      - path: cpu-limit
        resourceFieldRef:
          containerName: istio-proxy
          divisor: 1m
          resource: limits.cpu
      - path: cpu-request
        resourceFieldRef:
          containerName: istio-proxy
          divisor: 1m
          resource: requests.cpu
    name: istio-podinfo
  - name: istio-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: istio-ca
          expirationSeconds: 43200
          path: istio-token
  - configMap:
      defaultMode: 420
      name: istio-ca-root-cert
    name: istiod-ca-cert
  - name: ssh-auth
    secret:
      defaultMode: 384
      items:
      - key: ssh-privatekey
        path: id_rsa
      - key: ssh-publickey
        path: id_rsa.pub
      - key: ssh-publickey
        path: authorized_keys
      secretName: tensorflow-benchmarks-ssh
  - configMap:
      defaultMode: 420
      items:
      - key: hostfile
        mode: 292
        path: hostfile
      - key: discover_hosts.sh
        mode: 365
        path: discover_hosts.sh
      name: tensorflow-benchmarks-config
    name: mpi-job-config
  - name: default-token-jsmkb
    secret:
      defaultMode: 420
      secretName: default-token-jsmkb
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-30T01:05:16Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-30T01:05:13Z"
    message: 'containers with unready status: [tensorflow-benchmarks]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-30T01:05:13Z"
    message: 'containers with unready status: [tensorflow-benchmarks]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-30T01:05:13Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://ab162a1bfd50198374be638378a697307e3408662795d42c8fd1fbbd6fb828d9
    image: docker-ecr001.rnd.gic.ericsson.se/istio/proxyv2:1.9.6
    imageID: docker-pullable://docker-ecr001.rnd.gic.ericsson.se/istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
    lastState: {}
    name: istio-proxy
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-09-30T01:05:17Z"
  - containerID: docker://44003d40af0b468db0d35cee8986d39197efe592328c8c885f7b1e6d2b3b91bc
    image: mpioperator/tensorflow-benchmarks:latest
    imageID: docker-pullable://mpioperator/tensorflow-benchmarks@sha256:476eb9df7a348a722f3c4e5e15e6c6f3fe9ed29749e8be98ac9447df3b1b5a54
    lastState:
      terminated:
        containerID: docker://44003d40af0b468db0d35cee8986d39197efe592328c8c885f7b1e6d2b3b91bc
        exitCode: 255
        finishedAt: "2021-09-30T01:06:43Z"
        reason: Error
        startedAt: "2021-09-30T01:06:43Z"
    name: tensorflow-benchmarks
    ready: false
    restartCount: 4
    started: false
    state:
      waiting:
        message: back-off 1m20s restarting failed container=tensorflow-benchmarks
          pod=tensorflow-benchmarks-launcher-5jb5n_ping-wang(007af6b3-fda2-485d-8f83-f0d829732764)
        reason: CrashLoopBackOff
  hostIP: 10.120.220.137
  initContainerStatuses:
  - containerID: docker://8cf9f8f27f0df9f0f704b16b735eabef6ddf24f7dd8f8b2eaa824986f5d46fba
    image: docker-ecr001.rnd.gic.ericsson.se/istio/proxyv2:1.9.6
    imageID: docker-pullable://docker-ecr001.rnd.gic.ericsson.se/istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
    lastState: {}
    name: istio-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: docker://8cf9f8f27f0df9f0f704b16b735eabef6ddf24f7dd8f8b2eaa824986f5d46fba
        exitCode: 0
        finishedAt: "2021-09-30T01:05:15Z"
        reason: Completed
        startedAt: "2021-09-30T01:05:15Z"
  phase: Running
  podIP: 10.42.198.115
  podIPs:
  - ip: 10.42.198.115
  qosClass: Burstable
  startTime: "2021-09-30T01:05:13Z"
alculquicondor commented 2 years ago

I wonder if this has something to do with the security context:

  securityContext:
    fsGroup: 1337

Perhaps that's changing the permissions of the volume mounted at /root/.ssh.

Can you try running a different sample? https://github.com/kubeflow/mpi-operator/blob/master/examples/pi/pi.yaml

This sample runs as non-root, which is a good thing in general. It may turn out that there is no way to support running as root when using Istio.

I haven't tested running tensorflow-benchmarks as non-root. You may need to add the .ssh_config file the way we do here: https://github.com/kubeflow/mpi-operator/blob/master/examples/base/Dockerfile#L28
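
If the fsGroup-relaxed permissions are indeed the blocker, a possible workaround (just an untested sketch; the /tmp/ssh path and the use of plm_rsh_args here are assumptions, not anything the operator ships) would be to copy the mounted key out of the fsGroup-managed volume and restore mode 0600 before starting mpirun:

# Untested sketch: the secret volume at /root/.ssh comes up 0640 under fsGroup,
# so copy the key to a private directory and tighten it before mpirun runs.
command:
- /bin/sh
- -c
- |
  mkdir -p /tmp/ssh &&
  cp /root/.ssh/id_rsa /tmp/ssh/id_rsa &&
  chmod 0600 /tmp/ssh/id_rsa &&
  exec mpirun --allow-run-as-root -np 2 \
    -mca plm_rsh_args "-i /tmp/ssh/id_rsa -o ConnectionAttempts=10" \
    python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet101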

alculquicondor commented 2 years ago

The root cause seems to be the fsGroup indeed. Here is the upstream issue: kubernetes/kubernetes#57923

And there is a proposal for a fix that hasn't started yet: kubernetes/enhancements#2605
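
To make the mechanism concrete: the ssh-auth secret is already mounted with defaultMode: 384 (octal 0600), but a pod-level fsGroup makes the kubelet chown the files to that group and add group read, which is exactly the 0640 in the error. A minimal reproduction sketch (pod name hypothetical; behavior as described in the upstream issue):

# Repro sketch for kubernetes/kubernetes#57923: with fsGroup set, secret
# files show up as 0640 root:1337 even though defaultMode asks for 0600.
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-demo          # hypothetical name
spec:
  securityContext:
    fsGroup: 1337             # the gid istio injection adds
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "ls -l /keys; sleep 3600"]
    volumeMounts:
    - mountPath: /keys
      name: ssh-auth
  volumes:
  - name: ssh-auth
    secret:
      secretName: tensorflow-benchmarks-ssh
      defaultMode: 384        # 0600 requested; observed as 0640 under fsGroup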

alculquicondor commented 2 years ago

@celiawa, could you try the other sample?

alculquicondor commented 2 years ago

/retitle Can't run as root when using istio

celiawa commented 2 years ago

Hi @alculquicondor, sorry for the late response; I was on holiday for the last few days. I tried the sample https://github.com/kubeflow/mpi-operator/blob/master/examples/pi/pi.yaml in the Istio-injected namespace. The launcher still fails:

$  k turing001 get po -n mynamespace
NAME                                                              READY   STATUS             RESTARTS   AGE
pi-launcher-lb8bl                                                 1/2     CrashLoopBackOff   5          4m59s
pi-worker-0                                                       2/2     Running            0          5m
pi-worker-1                                                       2/2     Running            0          5m
$  k turing001 -n mynamespace logs pi-launcher-lb8bl
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/pi-launcher-lb8bl. Please use `kubectl.kubernetes.io/default-container` instead
Warning: Permanently added 'pi-worker-0.pi-worker,10.42.116.83' (ECDSA) to the list of known hosts.
Warning: Permanently added 'pi-worker-1.pi-worker,10.42.88.141' (ECDSA) to the list of known hosts.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   pi-launcher
  target node:  pi-worker-1.pi-worker

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
alculquicondor commented 2 years ago

Can you provide more information, like the worker logs and the YAMLs for the launcher and workers?

alculquicondor commented 2 years ago

Also consider adding the environment variable OMPI_MCA_orte_debug with the value "true" to the container.
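
For example, in the launcher container spec:

# Enables verbose ORTE daemon debugging output:
env:
- name: OMPI_MCA_orte_debug
  value: "true"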

celiawa commented 2 years ago

Hi, please see the info below. The logs were captured with OMPI_MCA_orte_debug set to "true".

YAML for the launcher:

$ k turing001 -n mynamespace get po pi-launcher-n5p27 -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 10.42.198.107/32
    cni.projectcalico.org/podIPs: 10.42.198.107/32
    kubectl.kubernetes.io/default-logs-container: mpi-launcher
    prometheus.io/path: /stats/prometheus
    prometheus.io/port: "15020"
    prometheus.io/scrape: "true"
    sidecar.istio.io/status: '{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null}'
  creationTimestamp: "2021-10-09T05:32:14Z"
  generateName: pi-launcher-
  labels:
    controller-uid: 4230ce63-b837-4fb0-a852-2bd77c90f080
    istio.io/rev: default
    job-name: pi-launcher
    security.istio.io/tlsMode: istio
    service.istio.io/canonical-name: pi-launcher
    service.istio.io/canonical-revision: latest
    training.kubeflow.org/job-name: pi
    training.kubeflow.org/job-role: launcher
    training.kubeflow.org/operator-name: mpi-operator
  name: pi-launcher-n5p27
  namespace: ping-wang
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: pi-launcher
    uid: 4230ce63-b837-4fb0-a852-2bd77c90f080
  resourceVersion: "270665573"
  selfLink: /api/v1/namespaces/ping-wang/pods/pi-launcher-n5p27
  uid: f5373e7f-4a2e-4f23-9005-f1fdcb80aa86
spec:
  containers:
  - args:
    - -n
    - "2"
    - /home/mpiuser/pi
    command:
    - mpirun
    env:
    - name: OMPI_MCA_orte_debug
      value: "true"
    - name: K_MPI_JOB_ROLE
      value: launcher
    - name: OMPI_MCA_orte_keep_fqdn_hostnames
      value: "true"
    - name: OMPI_MCA_orte_default_hostfile
      value: /etc/mpi/hostfile
    - name: OMPI_MCA_plm_rsh_args
      value: -o ConnectionAttempts=10
    - name: OMPI_MCA_orte_set_default_slots
      value: "1"
    - name: NVIDIA_VISIBLE_DEVICES
    - name: NVIDIA_DRIVER_CAPABILITIES
    image: mpioperator/mpi-pi
    imagePullPolicy: Always
    name: mpi-launcher
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: "1"
        memory: 1Gi
    securityContext:
      runAsUser: 1000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/mpiuser/.ssh
      name: ssh-auth
    - mountPath: /etc/mpi
      name: mpi-job-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-jsmkb
      readOnly: true
  - args:
    - proxy
    - sidecar
    - --domain
    - $(POD_NAMESPACE).svc.cluster.local
    - --serviceCluster
    - pi-launcher.ping-wang
    - --proxyLogLevel=warning
    - --proxyComponentLogLevel=misc:error
    - --log_output_level=default:info
    - --concurrency
    - "2"
    env:
    - name: JWT_POLICY
      value: third-party-jwt
    - name: PILOT_CERT_PROVIDER
      value: istiod
    - name: CA_ADDR
      value: istiod.istio-system.svc:15012
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: INSTANCE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: SERVICE_ACCOUNT
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.serviceAccountName
    - name: HOST_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: CANONICAL_SERVICE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['service.istio.io/canonical-name']
    - name: CANONICAL_REVISION
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['service.istio.io/canonical-revision']
    - name: PROXY_CONFIG
      value: |
        {"tracing":{}}
    - name: ISTIO_META_POD_PORTS
      value: |-
        [
        ]
    - name: ISTIO_META_APP_CONTAINERS
      value: mpi-launcher
    - name: ISTIO_META_CLUSTER_ID
      value: Kubernetes
    - name: ISTIO_META_INTERCEPTION_MODE
      value: REDIRECT
    - name: ISTIO_META_WORKLOAD_NAME
      value: pi-launcher
    - name: ISTIO_META_OWNER
      value: kubernetes://apis/batch/v1/namespaces/ping-wang/jobs/pi-launcher
    - name: ISTIO_META_MESH_ID
      value: cluster.local
    - name: TRUST_DOMAIN
      value: cluster.local
    image: gcr.io/istio-release/proxyv2:1.9.6
    imagePullPolicy: Always
    name: istio-proxy
    ports:
    - containerPort: 15090
      name: http-envoy-prom
      protocol: TCP
    readinessProbe:
      failureThreshold: 30
      httpGet:
        path: /healthz/ready
        port: 15021
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 2
      successThreshold: 1
      timeoutSeconds: 3
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 40Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1337
      runAsNonRoot: true
      runAsUser: 1337
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/istio
      name: istiod-ca-cert
    - mountPath: /var/lib/istio/data
      name: istio-data
    - mountPath: /etc/istio/proxy
      name: istio-envoy
    - mountPath: /var/run/secrets/tokens
      name: istio-token
    - mountPath: /etc/istio/pod
      name: istio-podinfo
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-jsmkb
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: pi-launcher
  initContainers:
  - args:
    - istio-iptables
    - -p
    - "15001"
    - -z
    - "15006"
    - -u
    - "1337"
    - -m
    - REDIRECT
    - -i
    - '*'
    - -x
    - ""
    - -b
    - '*'
    - -d
    - 15090,15021,15020
    image: gcr.io/istio-release/proxyv2:1.9.6
    imagePullPolicy: Always
    name: istio-init
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 40Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_ADMIN
        - NET_RAW
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: false
      runAsGroup: 0
      runAsNonRoot: false
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-jsmkb
      readOnly: true
  nodeName: node-10-120-220-137
  priority: 0
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1337
  serviceAccount: default
  serviceAccountName: default
  subdomain: pi-worker
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir:
      medium: Memory
    name: istio-envoy
  - emptyDir: {}
    name: istio-data
  - downwardAPI:
      defaultMode: 420
      items:
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels
        path: labels
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.annotations
        path: annotations
      - path: cpu-limit
        resourceFieldRef:
          containerName: istio-proxy
          divisor: 1m
          resource: limits.cpu
      - path: cpu-request
        resourceFieldRef:
          containerName: istio-proxy
          divisor: 1m
          resource: requests.cpu
    name: istio-podinfo
  - name: istio-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: istio-ca
          expirationSeconds: 43200
          path: istio-token
  - configMap:
      defaultMode: 420
      name: istio-ca-root-cert
    name: istiod-ca-cert
  - name: ssh-auth
    secret:
      defaultMode: 420
      items:
      - key: ssh-privatekey
        path: id_rsa
      - key: ssh-publickey
        path: id_rsa.pub
      - key: ssh-publickey
        path: authorized_keys
      secretName: pi-ssh
  - configMap:
      defaultMode: 420
      items:
      - key: hostfile
        mode: 292
        path: hostfile
      - key: discover_hosts.sh
        mode: 365
        path: discover_hosts.sh
      name: pi-config
    name: mpi-job-config
  - name: default-token-jsmkb
    secret:
      defaultMode: 420
      secretName: default-token-jsmkb
YAML for the worker:

$  k turing001 -n mynamespace get po pi-worker-0 -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 10.42.116.68/32
    cni.projectcalico.org/podIPs: 10.42.116.68/32
    kubectl.kubernetes.io/default-logs-container: mpi-worker
    prometheus.io/path: /stats/prometheus
    prometheus.io/port: "15020"
    prometheus.io/scrape: "true"
    sidecar.istio.io/status: '{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null}'
  creationTimestamp: "2021-10-09T05:40:51Z"
  labels:
    istio.io/rev: default
    security.istio.io/tlsMode: istio
    service.istio.io/canonical-name: pi-worker-0
    service.istio.io/canonical-revision: latest
    training.kubeflow.org/job-name: pi
    training.kubeflow.org/job-role: worker
    training.kubeflow.org/operator-name: mpi-operator
    training.kubeflow.org/replica-index: "0"
  name: pi-worker-0
  namespace: ping-wang
  ownerReferences:
  - apiVersion: kubeflow.org/v2beta1
    blockOwnerDeletion: true
    controller: true
    kind: MPIJob
    name: pi
    uid: 88b20aa4-49ce-4625-985a-4c5cf2dabb91
  resourceVersion: "270678724"
  selfLink: /api/v1/namespaces/ping-wang/pods/pi-worker-0
  uid: cd1f2e0c-7208-46aa-9719-3197a94b9ae6
spec:
  containers:
  - args:
    - -De
    - -f
    - /home/mpiuser/.sshd_config
    command:
    - /usr/sbin/sshd
    env:
    - name: OMPI_MCA_orte_debug
      value: "true"
    - name: K_MPI_JOB_ROLE
      value: worker
    image: mpioperator/mpi-pi
    imagePullPolicy: Always
    name: mpi-worker
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: "1"
        memory: 1Gi
    securityContext:
      runAsUser: 1000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/mpiuser/.ssh
      name: ssh-auth
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-jsmkb
      readOnly: true
  - args:
    - proxy
    - sidecar
    - --domain
    - $(POD_NAMESPACE).svc.cluster.local
    - --serviceCluster
    - pi-worker-0.ping-wang
    - --proxyLogLevel=warning
    - --proxyComponentLogLevel=misc:error
    - --log_output_level=default:info
    - --concurrency
    - "2"
    env:
    - name: JWT_POLICY
      value: third-party-jwt
    - name: PILOT_CERT_PROVIDER
      value: istiod
    - name: CA_ADDR
      value: istiod.istio-system.svc:15012
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: INSTANCE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: SERVICE_ACCOUNT
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.serviceAccountName
    - name: HOST_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: CANONICAL_SERVICE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['service.istio.io/canonical-name']
    - name: CANONICAL_REVISION
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['service.istio.io/canonical-revision']
    - name: PROXY_CONFIG
      value: |
        {"tracing":{}}
    - name: ISTIO_META_POD_PORTS
      value: |-
        [
        ]
    - name: ISTIO_META_APP_CONTAINERS
      value: mpi-worker
    - name: ISTIO_META_CLUSTER_ID
      value: Kubernetes
    - name: ISTIO_META_INTERCEPTION_MODE
      value: REDIRECT
    - name: ISTIO_META_WORKLOAD_NAME
      value: pi-worker-0
    - name: ISTIO_META_OWNER
      value: kubernetes://apis/v1/namespaces/ping-wang/pods/pi-worker-0
    - name: ISTIO_META_MESH_ID
      value: cluster.local
    - name: TRUST_DOMAIN
      value: cluster.local
    image: gcr.io/istio-release/proxyv2:1.9.6
    imagePullPolicy: Always
    name: istio-proxy
    ports:
    - containerPort: 15090
      name: http-envoy-prom
      protocol: TCP
    readinessProbe:
      failureThreshold: 30
      httpGet:
        path: /healthz/ready
        port: 15021
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 2
      successThreshold: 1
      timeoutSeconds: 3
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 40Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1337
      runAsNonRoot: true
      runAsUser: 1337
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/istio
      name: istiod-ca-cert
    - mountPath: /var/lib/istio/data
      name: istio-data
    - mountPath: /etc/istio/proxy
      name: istio-envoy
    - mountPath: /var/run/secrets/tokens
      name: istio-token
    - mountPath: /etc/istio/pod
      name: istio-podinfo
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-jsmkb
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: pi-worker-0
  initContainers:
  - args:
    - istio-iptables
    - -p
    - "15001"
    - -z
    - "15006"
    - -u
    - "1337"
    - -m
    - REDIRECT
    - -i
    - '*'
    - -x
    - ""
    - -b
    - '*'
    - -d
    - 15090,15021,15020
    image: gcr.io/istio-release/proxyv2:1.9.6
    imagePullPolicy: Always
    name: istio-init
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 40Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_ADMIN
        - NET_RAW
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: false
      runAsGroup: 0
      runAsNonRoot: false
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-jsmkb
      readOnly: true
  nodeName: node-10-120-220-132
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1337
  serviceAccount: default
  serviceAccountName: default
  subdomain: pi-worker
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir:
      medium: Memory
    name: istio-envoy
  - emptyDir: {}
    name: istio-data
  - downwardAPI:
      defaultMode: 420
      items:
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels
        path: labels
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.annotations
        path: annotations
      - path: cpu-limit
        resourceFieldRef:
          containerName: istio-proxy
          divisor: 1m
          resource: limits.cpu
      - path: cpu-request
        resourceFieldRef:
          containerName: istio-proxy
          divisor: 1m
          resource: requests.cpu
    name: istio-podinfo
  - name: istio-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: istio-ca
          expirationSeconds: 43200
          path: istio-token
  - configMap:
      defaultMode: 420
      name: istio-ca-root-cert
    name: istiod-ca-cert
  - name: ssh-auth
    secret:
      defaultMode: 420
      items:
      - key: ssh-privatekey
        path: id_rsa
      - key: ssh-publickey
        path: id_rsa.pub
      - key: ssh-publickey
        path: authorized_keys
      secretName: pi-ssh
  - name: default-token-jsmkb
    secret:
      defaultMode: 420
      secretName: default-token-jsmkb
Launcher logs:

k turing001 -n mynamespace logs pi-launcher-7b75v
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/pi-launcher-7b75v. Please use `kubectl.kubernetes.io/default-container` instead
[pi-launcher:00001] procdir: /tmp/ompi.pi-launcher.1000/pid.1/0/0
[pi-launcher:00001] jobdir: /tmp/ompi.pi-launcher.1000/pid.1/0
[pi-launcher:00001] top: /tmp/ompi.pi-launcher.1000/pid.1
[pi-launcher:00001] top: /tmp/ompi.pi-launcher.1000
[pi-launcher:00001] tmp: /tmp
[pi-launcher:00001] sess_dir_cleanup: job session dir does not exist
[pi-launcher:00001] sess_dir_cleanup: top session dir does not exist
[pi-launcher:00001] procdir: /tmp/ompi.pi-launcher.1000/pid.1/0/0
[pi-launcher:00001] jobdir: /tmp/ompi.pi-launcher.1000/pid.1/0
[pi-launcher:00001] top: /tmp/ompi.pi-launcher.1000/pid.1
[pi-launcher:00001] top: /tmp/ompi.pi-launcher.1000
[pi-launcher:00001] tmp: /tmp
Warning: Permanently added 'pi-worker-1.pi-worker,10.42.130.76' (ECDSA) to the list of known hosts.
Warning: Permanently added 'pi-worker-0.pi-worker,10.42.116.68' (ECDSA) to the list of known hosts.
[pi-worker-1:00026] procdir: /tmp/ompi.pi-worker-1.1000/jf.49480/0/2
[pi-worker-1:00026] jobdir: /tmp/ompi.pi-worker-1.1000/jf.49480/0
[pi-worker-1:00026] top: /tmp/ompi.pi-worker-1.1000/jf.49480
[pi-worker-1:00026] top: /tmp/ompi.pi-worker-1.1000
[pi-worker-1:00026] tmp: /tmp
[pi-worker-1:00026] sess_dir_cleanup: job session dir does not exist
[pi-worker-1:00026] sess_dir_cleanup: top session dir does not exist
[pi-worker-1:00026] procdir: /tmp/ompi.pi-worker-1.1000/jf.49480/0/2
[pi-worker-1:00026] jobdir: /tmp/ompi.pi-worker-1.1000/jf.49480/0
[pi-worker-1:00026] top: /tmp/ompi.pi-worker-1.1000/jf.49480
[pi-worker-1:00026] top: /tmp/ompi.pi-worker-1.1000
[pi-worker-1:00026] tmp: /tmp
[pi-worker-0:00027] procdir: /tmp/ompi.pi-worker-0.1000/jf.49480/0/1
[pi-worker-0:00027] jobdir: /tmp/ompi.pi-worker-0.1000/jf.49480/0
[pi-worker-0:00027] top: /tmp/ompi.pi-worker-0.1000/jf.49480
[pi-worker-0:00027] top: /tmp/ompi.pi-worker-0.1000
[pi-worker-0:00027] tmp: /tmp
[pi-worker-0:00027] sess_dir_cleanup: job session dir does not exist
[pi-worker-0:00027] sess_dir_cleanup: top session dir does not exist
[pi-worker-0:00027] procdir: /tmp/ompi.pi-worker-0.1000/jf.49480/0/1
[pi-worker-0:00027] jobdir: /tmp/ompi.pi-worker-0.1000/jf.49480/0
[pi-worker-0:00027] top: /tmp/ompi.pi-worker-0.1000/jf.49480
[pi-worker-0:00027] top: /tmp/ompi.pi-worker-0.1000
[pi-worker-0:00027] tmp: /tmp
[pi-worker-1:00026] sess_dir_finalize: proc session dir does not exist
[pi-worker-1:00026] sess_dir_finalize: job session dir does not exist
[pi-worker-1:00026] sess_dir_finalize: jobfam session dir does not exist
[pi-worker-1:00026] sess_dir_finalize: jobfam session dir does not exist
[pi-worker-1:00026] sess_dir_finalize: top session dir does not exist
[pi-worker-1:00026] sess_dir_cleanup: job session dir does not exist
[pi-worker-1:00026] sess_dir_cleanup: top session dir does not exist
[pi-worker-1:00026] sess_dir_cleanup: job session dir does not exist
[pi-worker-1:00026] sess_dir_cleanup: top session dir does not exist
exiting with status 1
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   pi-launcher
  target node:  pi-worker-0.pi-worker

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[pi-launcher:00001] Job UNKNOWN has launched
[pi-launcher:00001] [[49480,0],0] Releasing job data for [49480,1]
[pi-launcher:00001] sess_dir_finalize: proc session dir does not exist
[pi-launcher:00001] sess_dir_finalize: job session dir does not exist
[pi-launcher:00001] sess_dir_finalize: jobfam session dir does not exist
[pi-launcher:00001] sess_dir_finalize: jobfam session dir does not exist
[pi-launcher:00001] sess_dir_finalize: top session dir does not exist
[pi-launcher:00001] sess_dir_cleanup: job session dir does not exist
[pi-launcher:00001] sess_dir_cleanup: top session dir does not exist
[pi-launcher:00001] [[49480,0],0] Releasing job data for [49480,0]
[pi-launcher:00001] sess_dir_cleanup: job session dir does not exist
[pi-launcher:00001] sess_dir_cleanup: top session dir does not exist
exiting with status 1
Worker logs:

$ k turing001 -n mynamespace logs pi-worker-0
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/pi-worker-0. Please use `kubectl.kubernetes.io/default-container` instead
Server listening on 0.0.0.0 port 22.
Server listening on :: port 22.
Accepted publickey for mpiuser from 10.42.198.120 port 33588 ssh2: ECDSA SHA256:wvQe8CVOl3BQ+7ig4VHapc8rLl6GNIWCiZKxJkFvPNU
Received disconnect from 10.42.198.120 port 33588:11: disconnected by user
Disconnected from user mpiuser 10.42.198.120 port 33588
Accepted publickey for mpiuser from 10.42.198.120 port 35306 ssh2: ECDSA SHA256:wvQe8CVOl3BQ+7ig4VHapc8rLl6GNIWCiZKxJkFvPNU
Received disconnect from 10.42.198.120 port 35306:11: disconnected by user
Disconnected from user mpiuser 10.42.198.120 port 35306
Accepted publickey for mpiuser from 10.42.198.120 port 38536 ssh2: ECDSA SHA256:wvQe8CVOl3BQ+7ig4VHapc8rLl6GNIWCiZKxJkFvPNU
Accepted publickey for mpiuser from 10.42.198.120 port 43876 ssh2: ECDSA SHA256:wvQe8CVOl3BQ+7ig4VHapc8rLl6GNIWCiZKxJkFvPNU
celiawa commented 2 years ago

By the way, the sample https://github.com/kubeflow/mpi-operator/blob/master/examples/pi/pi.yaml works well in a namespace without Istio injection.
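
A hedged alternative to a separate namespace, assuming Istio's standard per-pod opt-out annotation, would be to disable injection just for the job's pods:

# Sketch (assumes Istio's per-pod opt-out annotation): keep the MPIJob pods
# out of the mesh even in a namespace with automatic injection enabled.
spec:
  mpiReplicaSpecs:
    Launcher:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
    Worker:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"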

alculquicondor commented 2 years ago

From the worker logs, it seems that SSH connections are accepted but then immediately disconnected.

Maybe the ORTE protocol is failing to establish over an Istio service? Perhaps TCP needs to be authorized or something; I'm not at all familiar with Istio.

@xhejtman you were also running over istio. Did you find issues similar to this? cc @ahg-g for ideas around istio networking.

xhejtman commented 2 years ago

@xhejtman you were also running over istio. Did you find issues similar to this?

No, but mainly because I do not run as root; I avoid it as much as possible. Istio had an issue with init containers, which is not the case here. But I can give it a try on our cluster.

alculquicondor commented 2 years ago

@celiawa's latest attempt (https://github.com/kubeflow/mpi-operator/issues/429#issuecomment-939233000) runs as non-root but still fails. So I suspect there is some other configuration in the Istio proxies that is preventing communication over the ORTE protocol.

xhejtman commented 2 years ago

@alculquicondor I can confirm that the tensorflow benchmark does not work if run in an Istio-enabled namespace. I ended up with: ssh_exchange_identification: Connection closed by remote host

It seems the Istio proxy is causing this, as I saw that it changes peer IPs.

Actually, the problem is that Istio creates an HTTP proxy:

telnet tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker 2222
Trying 10.42.5.15...
Connected to tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker.xhejtman.svc.cluster.local.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.1 426 Upgrade Required
date: Tue, 12 Oct 2021 21:07:45 GMT
server: istio-envoy
connection: close
content-length: 0

Connection closed by foreign host.

which obviously will never work with SSH. I'm not sure how to instruct Istio to create a generic TCP proxy instead of an HTTP one.
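
One avenue worth trying (a sketch based on Istio's documented protocol selection; the worker Service shape here is assumed, not copied from the operator) would be to declare the SSH port as plain TCP on the worker Service, via a tcp- name prefix or appProtocol:

# Sketch: Istio infers the protocol from the port name prefix (or from
# appProtocol on newer Kubernetes); "tcp-" should keep Envoy in raw TCP mode.
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-benchmarks-worker     # assumed headless worker Service
spec:
  clusterIP: None
  selector:
    training.kubeflow.org/job-name: tensorflow-benchmarks   # assumed selector
  ports:
  - name: tcp-ssh
    port: 2222
    appProtocol: tcp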

alculquicondor commented 2 years ago

That makes sense. Although in the case of this comment https://github.com/kubeflow/mpi-operator/issues/429#issuecomment-939233000, it looks like SSH works (because of the line Accepted publickey for mpiuser from 10.42.198.120 port 33588 ssh2:) but then fails afterwards, maybe at the ORTE protocol level.

Also, @xhejtman, does mpioperator/mpi-pi work for you under Istio, or do you see the same problem? We made some changes there based on the debugging you provided. Or was that just about the security policy?

If no images work, I think this is an Istio misconfiguration problem, outside of our scope.

But another question is: why do you want Istio for HPC? It would just slow down the job.

xhejtman commented 2 years ago

But another question is: why do you want istio for HPC? It would just slow down the job.

Just a note here: this is a bit of a catch-22, as the mpi-operator is used in Kubeflow (and also hosted by Kubeflow). Kubeflow itself requires Istio to be installed; it does not work without Istio (at least to my knowledge). That is the reason many of us use Istio.

The first problem with mpi-pi was that it used an init container that accessed the network, which does not work in an Istio-enabled namespace. That is already fixed in mpi-operator. But there is a new problem with the Istio HTTP proxy. Maybe some annotation can fix it (see the sketch below)? Perhaps traffic.sidecar.istio.io/excludeInboundPorts: [2222], but maybe excludeOutboundPorts is needed as well. I will try to debug it a bit more. I tried to set the port protocol to TCP explicitly, but no luck.
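
A sketch of that annotation approach (annotation names per Istio's traffic-capture docs; untested in this thread), applied to the launcher and worker pod templates:

# Sketch: exclude the SSH port from sidecar iptables interception entirely.
metadata:
  annotations:
    traffic.sidecar.istio.io/excludeInboundPorts: "2222"
    traffic.sidecar.istio.io/excludeOutboundPorts: "2222"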

alculquicondor commented 2 years ago

@terrytangyuan is it possible to install Kubeflow without Istio? Is that documented somewhere?

terrytangyuan commented 2 years ago

@alculquicondor I think other users have reported Istio issues before; I'd search the Kubeflow issues instead. Most of the existing MPI users use the standalone MPI Operator installation.

alculquicondor commented 2 years ago

Now that we are merging this operator into the training operator, we should note in the docs that MPI jobs only run properly in namespaces without Istio injection.

terrytangyuan commented 2 years ago

Yes, it would definitely be good to add more docs on this. Let's leave this open to track that.

mChowdhury-91 commented 2 years ago

@alculquicondor I'm facing the same issue. Is there any solution to the permission issue?

alculquicondor commented 2 years ago

On Istio? I don't think anyone has found a workaround.