kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

SSH issue when trying to deploy horovod mnist example #435

Open ramakrishnamamidi opened 3 years ago

ramakrishnamamidi commented 3 years ago

Following is the mpi-operator configuration file I am trying to deploy on our Kubernetes cluster:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-mnist
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: horovod/horovod:latest
            name: mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /examples/tensorflow2/tensorflow2_mnist.py
            # resources:
            #   limits:
            #     cpu: 1
            #     memory: 2Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: horovod/horovod:latest
            name: mpi-worker
            # resources:
            #   limits:
            #     cpu: 2
            #     memory: 4Gi
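I deploy it with a plain kubectl apply and watch the pods come up (the file name is just what I saved the spec as):

kubectl apply -f tensorflow-mnist-mpijob.yaml
kubectl get pods -w | grep tensorflow-mnist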

Following is the error I am getting in the launcher pod:

Failed to add the host to the list of known hosts (/root/.ssh/known_hosts).
Failed to add the host to the list of known hosts (/root/.ssh/known_hosts).
Permission denied, please try again.
Permission denied, please try again.
root@tensorflow-mnist-worker-1.tensorflow-mnist-worker: Permission denied (publickey,password).
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   tensorflow-mnist-launcher
  target node:  tensorflow-mnist-worker-0.tensorflow-mnist-worker

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
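For debugging, passwordless SSH can also be tested by hand from the launcher pod (the pod and Service names below are the ones from the log above):

kubectl exec -it tensorflow-mnist-launcher -- ssh tensorflow-mnist-worker-0.tensorflow-mnist-worker hostname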

What am I missing? I am unable to find anything in the Horovod docs or the mpi-operator docs.

kubectl version:
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:28:09Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

mpi-operator version: 0.3.0

As per the official mpi-operator docs, the config in examples/v2beta1/tensorflow-benchmarks.yaml is for TF v1.14. The config above targets TF v2.5.0, based on the latest Horovod Docker image. Can anyone tell me what step I am missing?

alculquicondor commented 3 years ago

The upstream Horovod image horovod/horovod:latest doesn't support the operator directly.

You can use this Dockerfile instead: https://github.com/kubeflow/mpi-operator/blob/master/examples/horovod/Dockerfile

alculquicondor commented 3 years ago

You can use this file as a reference for the configuration the image needs: https://github.com/kubeflow/mpi-operator/blob/master/examples/base/Dockerfile

ramakrishnamamidi commented 3 years ago

OK, so I modified the Docker image as follows:

FROM horovod/horovod:latest

# skip known_hosts bookkeeping, and relax sshd's StrictModes check so the
# SSH key files mounted by the operator are accepted despite their permissions
RUN echo "UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config
CMD ["/bin/bash"]
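Built and pushed it along these lines (the tag matches the config below):

docker build -t ramakrishna1592/mnist-horovod:v1 .
docker push ramakrishna1592/mnist-horovod:v1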

And used the MPIJob config below:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-mnist
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: ramakrishna1592/mnist-horovod:v1
            name: mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /horovod/examples/tensorflow2/tensorflow2_mnist.py
            # resources:
            #   limits:
            #     cpu: 1
            #     memory: 2Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: ramakrishna1592/mnist-horovod:v1
            name: mpi-worker
            # resources:
            #   limits:
            #     cpu: 2
            #     memory: 4Gi

It worked and completed execution in 15m.

A couple of things I noticed: after mpirun executes, the launcher pod goes into a CrashLoopBackOff state (screenshot attached). Following are the logs of the pod (screenshot attached).

After some time it moves back into Running state (screenshot attached).

Is this an issue with the config? Is there any parameter I need to set?

alculquicondor commented 3 years ago

Those errors are expected/acceptable; mpi-operator handles the retries for you. The thing is that, depending on your k8s installation, the networking (DNS names) might take some time to set up.
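If you want to verify, you can check that a worker hostname resolves from the launcher once everything is up; for example (the names are the ones from this job, and getent ships in the Ubuntu-based Horovod image):

kubectl exec -it tensorflow-mnist-launcher -- getent hosts tensorflow-mnist-worker-0.tensorflow-mnist-worker

Once that resolves, a retried launcher attempt should get past the SSH step.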

alculquicondor commented 3 years ago

@terrytangyuan I think we should have our own fork of the Horovod images here: https://hub.docker.com/u/mpioperator

We just need the changes that @ramakrishnamamidi identified:

FROM horovod/horovod:latest

RUN echo "UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config
CMD ["/bin/bash"]

Can you create a repo for it?

alculquicondor commented 3 years ago

Although it would be good to also add the configuration needed to run as non-root, similar to this: https://github.com/kubeflow/mpi-operator/commit/fee9913c6c5ee657871cf8967ec7e8d773666ea5#diff-be50a3cb50e4eb471c7337dba6036a840f2cadb8faf1ab15c421e682dafd9842
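Roughly, that means adding a pod-level securityContext to both replica templates, plus an image whose SSH setup accepts a non-root login. A minimal sketch (the UID is illustrative):

      template:
        spec:
          securityContext:
            runAsUser: 1000
          containers:
          - ...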

alculquicondor commented 3 years ago

Actually, isn't the above what we have as the tensorflow benchmarks image?

NettrixTobin commented 2 years ago

Hi @alculquicondor, I had the same problem with the default YAML:

kubectl apply -f examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml

kubectl logs -f tensorflow-benchmarks-imagenet-launcher-tnsxb

(screenshot of the log output attached)

How do I set up password-free login between the containers? Is it necessary to rebuild the images using the Dockerfile and replace the images in tensorflow-benchmarks.yaml?

alculquicondor commented 2 years ago

Can you confirm which image these pods are using? If I remember correctly, the images on Docker Hub were built using the Dockerfile in the repo.

NettrixTobin commented 2 years ago

Thanks for your reply. I am using the default image:

containers:
- image: mpioperator/tensorflow-benchmarks:latest

Maybe I solved the problem after I got the calico-node pods into a ready state. Before:

kube-system          calico-node-6c9mx                          0/1     Running   0                  10s
kube-system          calico-node-dbndq                          0/1     Running   0                  10s
kube-system          calico-node-qw6vv                          0/1     Running   0                  10s

After:

kube-system          calico-node-h2pnc                          1/1     Running   0                  11m
kube-system          calico-node-npzn6                          1/1     Running   0                  11m
kube-system          calico-node-smwpb                          1/1     Running   0                  11m
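For reference, a quick way to watch the CNI pods come up (the label is the one Calico's standard manifests put on the calico-node DaemonSet):

kubectl get pods -n kube-system -l k8s-app=calico-node -w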

But I got a new error, as below (screenshot attached).

alculquicondor commented 2 years ago

That error looks like a problem in the application, which is outside of the scope of the operator. Did you install GPU drivers?

NettrixTobin commented 2 years ago

> That error looks like a problem in the application, which is outside of the scope of the operator. Did you install GPU drivers?

Thanks, my GPU driver version is 510.47.03. I solved this problem by updating the TF and CUDA versions in the image.
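For anyone hitting the same thing, the rebuild amounts to swapping the base image for one with a newer TF/CUDA while keeping the SSH tweaks from earlier in this thread. A sketch, with an illustrative tag (check Docker Hub for one whose CUDA version matches your driver; the 510.x driver series supports CUDA 11.6):

FROM horovod/horovod:0.24.3

RUN echo "UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config
CMD ["/bin/bash"]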

alculquicondor commented 2 years ago

In the horovod image or the GPU daemonset?

If the horovod image, maybe it's worth upgrading our patched image.