volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

ORTE does not know how to route a message to the specified daemon #2226

Closed kongjibai closed 1 year ago

kongjibai commented 2 years ago

When I use Volcano to start a Horovod TF job, the lm-horovod-job-master-0 pod errors out; after restarting 3 times it reaches Running status, because by then it has "Permanently added 'lm-horovod-job-worker-0.lm-horovod-job,10.10.10.10' (ECDSA) to the list of known hosts." The output is below. Has anyone met or solved a problem like this? I use Volcano 1.5.1, Horovod 0.24.3, TF 1.15, Open MPI 4.0.0, CUDA 10.0, and an Ubuntu 18.04 Docker image on K8s.

checkpoints  data  test.py  tf_mnist_lm.py  torch_mnist_lm.py
ssh: Could not resolve hostname lm-horovod-job-worker-5.lm-horovod-job: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   lm-horovod-job-master-0
  target node:  lm-horovod-job-worker-0

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[lm-horovod-job-master-0:00014] 6 more processes have sent help message help-errmgr-base.txt / no-path
[lm-horovod-job-master-0:00014] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
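The "Could not resolve hostname" line above suggests the worker pods (or the DNS entries behind the svc plugin's hostnames) were not ready when mpiexec launched. A minimal diagnostic sketch, assuming the hostfile format of the job below: wait until every hostname in the worker list resolves before starting mpiexec. A stand-in hostfile is created here so the snippet runs anywhere; in the real job you would point HOSTFILE at /home/etc-volcano/volcano/worker.host instead.

```shell
# Stand-in for the svc plugin's worker.host file (one hostname per line);
# localhost is used here only so the sketch is self-contained.
HOSTFILE=$(mktemp)
printf 'localhost\nlocalhost\n' > "$HOSTFILE"

# Poll up to 30 times: count hostnames that do not yet resolve.
unresolved=1
for i in $(seq 1 30); do
  unresolved=0
  while read -r host; do
    getent hosts "$host" > /dev/null || unresolved=$((unresolved + 1))
  done < "$HOSTFILE"
  [ "$unresolved" -eq 0 ] && break   # all workers resolvable: safe to mpiexec
  sleep 5
done
echo "unresolved after wait: $unresolved"
```

Running a loop like this in the master's command, just before mpiexec, avoids launching ORTE daemons against hostnames that DNS cannot serve yet.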
hwdef commented 2 years ago

Please post the yaml file of the job

kongjibai commented 2 years ago

Please post the yaml file of the job

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  # name: lm-hvd-job-tf-mnist
  name: lm-horovod-job
  # namespace: vc-horovod-test
  # namespace: default
  labels:
    "volcano.sh/job-type": Horovod
spec:
  minAvailable: 9
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  WORKER_HOST=`cat /home/etc-volcano/volcano/worker.host | tr "\n" ","`;
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  cd /home/vc-hvd-test;
                  mpiexec --allow-run-as-root --host ${WORKER_HOST} --mca routed=direct -np 8 python torch_mnist_lm.py;
              image: vc-hvd-test:v1.0
              name: master
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:                       
              - name: vc-hvd-home
                mountPath: /home
              resources:
                requests:
                  cpu: "500m"
                  memory: "1024Mi"
                limits:
                  cpu: "500m"
                  memory: "1024Mi"
          volumes:
          - name: vc-hvd-home
            persistentVolumeClaim:                 
              claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: vc-hvd-test:v1.0
              name: worker
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:                         
              - name: vc-hvd-home
                mountPath: /home
              resources:
                limits:
                  nvidia.com/gpu: 1                 
          volumes:
          - name: vc-hvd-home
            persistentVolumeClaim:                 
              claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
---
hwdef commented 2 years ago

Please try delaying the start of the master.

https://github.com/volcano-sh/volcano/blob/2cfce7a1305e4ad6d3dcb1a11bf3dc528aee0701/example/task-start-dependency/mpi.yaml#L34

You can try using dependsOn.
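Based on the linked example, a sketch of what that could look like for this job (task names taken from the YAML above; `dependsOn` sits at the same indentation level as the task's `name` and `replicas`):

```yaml
tasks:
  - replicas: 1
    name: master
    dependsOn:        # master task waits for the worker task
      name:
        - worker
  - replicas: 8
    name: worker
```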

kongjibai commented 2 years ago

Please try delaying the start of the master.

https://github.com/volcano-sh/volcano/blob/2cfce7a1305e4ad6d3dcb1a11bf3dc528aee0701/example/task-start-dependency/mpi.yaml#L34

You can try using dependsOn.

I have tried using dependsOn, but there is no master pod and the worker pods stay in Pending status. After I deleted the job and applied it again, kubectl get pod returned "No resources found in default namespace." I can only clean up Volcano and reinstall it; if I then use dependsOn and apply a job again, the master and worker pods behave the same as before. Below is the .yaml file.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  # name: lm-hvd-job-tf-mnist
  name: lm-horovod-job
  # namespace: vc-horovod-test
  # namespace: default
  labels:
    "volcano.sh/job-type": Horovod
spec:
  minAvailable: 9
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  WORKER_HOST=`cat /home/etc-volcano/volcano/worker.host | tr "\n" ","`;
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  cd /home/vc-hvd-test;
                  mpiexec --allow-run-as-root --host ${WORKER_HOST} --mca routed=direct -np 8 python torch_mnist_lm.py;
              image: vc-hvd-test:v1.0
              name: master
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:                       
              - name: vc-hvd-home
                mountPath: /home
              resources:
                requests:
                  cpu: "500m"
                  memory: "1024Mi"
                limits:
                  cpu: "500m"
                  memory: "1024Mi"
          volumes:
          - name: vc-hvd-home
            persistentVolumeClaim:                 
              claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
      dependsOn:
        name:
          - "worker"
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: vc-hvd-test:v1.0
              name: worker
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:                         
              - name: vc-hvd-home
                mountPath: /home
              resources:
                limits:
                  nvidia.com/gpu: 1                 
          volumes:
          - name: vc-hvd-home
            persistentVolumeClaim:                 
              claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
---
hwdef commented 2 years ago

It is speculated that this conflicts with the gang plugin. You can disable the gang plugin, or set minAvailable == worker.replicas
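For the second option, with 1 master and 8 workers, that would mean lowering minAvailable so gang scheduling only waits for the worker pods rather than all 9 pods (which deadlocks once the master depends on the workers):

```yaml
spec:
  minAvailable: 8   # == worker replicas; was 9 (master + 8 workers)
  schedulerName: volcano
```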

Thor-wl commented 2 years ago

@hwdef Perhaps we should think more about MPI requirements and provide more test results for the mpi plugin

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] commented 1 year ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] commented 1 year ago

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗