kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

[Question] v2 versus v1alpha2 #443

Closed aalugore-ai closed 2 years ago

aalugore-ai commented 2 years ago

Hello,

I am attempting to port some custom functionality from my previously used v1alpha2 code/fork into v2, since I see that v1alpha2 is being deprecated.

I have read through the code, but I am struggling to identify all of the functional differences introduced in v2 beyond the workers being converted from StatefulSets to plain pods and the addition of OpenSSH support.

Could someone explain the main architectural differences introduced in v2 and their implications, or perhaps point me to issues or pull requests that describe the functionality? If I successfully port over my custom logic (for the sake of brevity, let's just say it's a no-op), would I be able to point my cluster at the v2 version of mpi-operator and have it "just work"? Or are there configurations I need to take into account besides "pointing" to v2?

terrytangyuan commented 2 years ago

https://github.com/kubeflow/mpi-operator/blob/master/proposals/scalable-robust-operator.md

Perhaps it would be good to have an upgrade guide.

/cc @alculquicondor

alculquicondor commented 2 years ago

Your images need some compatible SSH configuration. You can find a base image here: https://github.com/kubeflow/mpi-operator/tree/master/examples/base

An upgrade guide would be great; I'm happy to review a PR to the README.

aalugore-ai commented 2 years ago

Is there any plan to put out an upgrade guide? I attempted the upgrade myself and tried a few things, but was unsuccessful.

alculquicondor commented 2 years ago

I unfortunately don't have time to do it. Does this help? https://github.com/kubeflow/mpi-operator/pull/416/files#diff-a7f96af68d899e069e558ca372a2a3976a62761ce47e1e90a4f2f281f47690eeL72

aalugore-ai commented 2 years ago

Greetings! I hope you had a wonderful holiday season :) In lieu of an upgrade guide, I decided to just attempt the upgrade and see what happens. I hope that by doing this I can ask you more specific, smaller-scoped questions as I fiddle my way through it. I am currently hitting an error when I use Helm to deploy the mpi-operator. I am using this chart with some minor, innocuous changes so that the Docker images are pulled from my own private registry.

For reference, I am using an mpi-operator image built from the tag kubeflow/mpi-operator@v0.3.0.

I was able to successfully create this CRD in my environment: https://github.com/kubeflow/mpi-operator/blob/a566d1d18046fe7f756b0ec7004693c898d589a0/deploy/v2beta1/mpi-operator.yaml, so the v2beta1 API is available in my environment.

I attempted to deploy with the following command: `helm install --set rbac.clusterResources.create=false --set crd.create=false --create-namespace -n mpi-operator-v2 hack/helm/mpi-operator --generate-name`

The mpi-operator comes up and is shown as running in Kubernetes, but when I look at the logs, I see this:

<Normal mpi-operator logging>
...

I0106 20:39:54.407102       1 mpi_job_controller.go:325] Setting up event handlers
I0106 20:39:54.407159       1 mpi_job_controller.go:384] Starting MPIJob controller
I0106 20:39:54.407164       1 mpi_job_controller.go:387] Waiting for informer caches to sync
E0106 20:39:54.408165       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.9/tools/cache/reflector.go:156: Failed to watch *v1.Secret: failed to list *v1.Secret: secrets is forbidden: User "system:serviceaccount:mpi-operator-v2:mpi-operator" cannot list resource "secrets" in API group "" in the namespace "mpi-operator-v2"
E0106 20:39:54.408237       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.9/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User "system:serviceaccount:mpi-operator-v2:mpi-operator" cannot list resource "services" in API group "" in the namespace "mpi-operator-v2"
E0106 20:39:55.262512       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.9/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User "system:serviceaccount:mpi-operator-v2:mpi-operator" cannot list resource "services" in API group "" in the namespace "mpi-operator-v2"
E0106 20:39:55.548654       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.9/tools/cache/reflector.go:156: Failed to watch *v1.Secret: failed to list *v1.Secret: secrets is forbidden: User "system:serviceaccount:mpi-operator-v2:mpi-operator" cannot list resource "secrets" in API group "" in the namespace "mpi-operator-v2"

...
<The above errors repeat indefinitely>

The design doc says the service account has been removed in v2, so I'm wondering if I missed a step or did something wrong. Please let me know if you need any other information from me to debug, or if I'm doing something obviously incorrect.

alculquicondor commented 2 years ago

What was removed was the need for the controller to create service accounts.

However, the mpi-operator itself still requires permissions that can be given through a service account. Perhaps it would help to have a look at the cluster role: https://github.com/kubeflow/mpi-operator/blob/master/manifests/base/cluster-role.yaml
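For orientation, the RBAC involved looks roughly like the following. This is a heavily trimmed sketch, not the actual manifest: the resource and verb lists are illustrative (they cover the secrets/services errors in your logs), and the linked cluster-role.yaml is the authoritative source. The name and namespace are taken from the service account in your error messages.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mpi-operator            # the account the operator pod runs as
  namespace: mpi-operator-v2
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mpi-operator
rules:
  # core resources the controller watches and creates
  # (note the forbidden "secrets" and "services" in the logs above)
  - apiGroups: [""]
    resources: ["pods", "secrets", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
  # the MPIJob CRD itself
  - apiGroups: ["kubeflow.org"]
    resources: ["mpijobs", "mpijobs/status", "mpijobs/finalizers"]
    verbs: ["get", "list", "watch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mpi-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mpi-operator
subjects:
  - kind: ServiceAccount
    name: mpi-operator
    namespace: mpi-operator-v2
```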

Also, there is an open PR for the helm update, but I haven't had a chance to review it yet: #447

aalugore-ai commented 2 years ago

Hello again!

A quick thank you for all your help, @alculquicondor; each interaction we've had has moved me forward, so I appreciate your time!

I've been able to stand up the v2 mpi-operator to the point where it is waiting for a job. However, when I attempt to start a simple "hello_world" job (which runs a simple two-layer model using Horovod to distribute the work), I see the launchers fail repeatedly until the backoff limit is reached. The mpi-operator logs look pretty normal, but when I dump the logs from one of the failed launchers, I see this:

alugo@head_node:~$ k2 logs v2-mlops-hello-world-launcher-4xwwt
ssh: Could not resolve hostname v2-mlops-hello-world-worker-1.v2-mlops-hello-world-worker: Name or service not known
ssh: Could not resolve hostname v2-mlops-hello-world-worker-0.v2-mlops-hello-world-worker: Name or service not known
ssh: Could not resolve hostname v2-mlops-hello-world-worker-2.v2-mlops-hello-world-worker: Name or service not known
ssh: Could not resolve hostname v2-mlops-hello-world-worker-3.v2-mlops-hello-world-worker: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   v2-mlops-hello-world-launcher
  target node:  v2-mlops-hello-world-worker-1.v2-mlops-hello-world-worker

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[v2-mlops-hello-world-launcher:00001] 2 more processes have sent help message help-errmgr-base.txt / no-path
[v2-mlops-hello-world-launcher:00001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I have a few hypotheses about what could be wrong, but first I think it's important to state that I am currently running Kubernetes v1.19.7 with no possibility of an upgrade at the moment. I'm wondering whether the inability to use Indexed Jobs is causing problems.

The design doc states that "plain pods" can still be used. Is there some flag or config needed to make sure this path is taken? Is Kubernetes v1.19.7 compatible with the v2 mpi-operator?

One last piece of info: a recent PR added a bunch of SSH configuration to an example here: https://github.com/kubeflow/mpi-operator/pull/428/files. I looked at the Dockerfiles used to build my worker container (I do not own them), and they currently do not contain the StrictHostKeyChecking/StrictModes/Port etc. sed instructions. Could that be contributing to the problem? If so, how many of the commands in that Dockerfile would I need to add to my worker container?

aalugore-ai commented 2 years ago

I also wanted to add my MPIJob spec for your reference, in case it helps:

kind: MPIJob
metadata:
  name: v2-mlops-hello-world
spec:
  slotsPerWorker: 4
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          containers:
            - image: <MY TF IMAGE>
              imagePullPolicy: Always
              name: tensorflow-launcher
              env:
                - name: PYTHONPATH
                  value: "<edited out>"

              command:
                [
                  "mpirun",
                  "--allow-run-as-root",
                  "-map-by",
                  "ppr:2:socket",
                  "--bind-to",
                  "socket",
                  "--report-bindings",
                  "--tag-output",
                  "-npersocket",
                  "2",
                  "-x",
                  "PATH",
                  "-x",
                  "PYTHONPATH",
                  "python3",
                  "< command >"
                ]
              resources:
                requests:
                  cpu: "100m"
    Worker:
      replicas: 4
      template:
        spec:
          hostNetwork: true
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          containers:
            - image: <MY TF IMAGE>
              imagePullPolicy: Always
              name: tensorflow-worker
              command: ["bash", "-c"]
              args:
                - sleep infinity;
              env:
                - name: PYTHONPATH
                  value: "<edited out>"
              securityContext:
                capabilities:
                  add:
                    - SYS_RAWIO
                    - SYS_PTRACE
              resources:
                limits:
                  <MyGpuResourceLabel>: 4
                  hugepages-2Mi: "1800Mi"
                  cpu: "108"

I'm not sure whether there is some "new" way MPIJobs need to be defined.

alculquicondor commented 2 years ago

> I'm wondering whether the inability to use Indexed Jobs is causing problems.

The v2 controller doesn't use Indexed Jobs, although I would like us to adopt them in the future.

> I looked at the Dockerfiles used to build my worker container (I do not own them), and they currently do not contain the StrictHostKeyChecking/StrictModes/Port etc. sed instructions. Could that be contributing to the problem? If so, how many of the commands in that Dockerfile would I need to add to my worker container?

It is possible that the worker's sshd is not allowing the connection due to a misconfiguration. You can gather more information about this if you have a look at the worker's logs (kubectl logs should help you with that).

BUT! You are not actually running sshd in your workers; you are running sleep. I suspect you were looking at the instructions for the old controller. Maybe this sample helps: https://github.com/kubeflow/mpi-operator/blob/master/examples/pi/pi.yaml

I see you are trying to use hostNetwork, so you might need to change the port that sshd runs on. This sample has such a change: https://github.com/kubeflow/mpi-operator/blob/master/examples/base/Dockerfile
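To make that concrete, a minimal v2beta1 MPIJob in the spirit of the pi sample looks roughly like this. The image, workload, and counts are placeholders; the key point is that the Worker containers have no command, so the operator can inject the sshd entrypoint:

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: hello-world-v2                              # placeholder name
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: <MY TF IMAGE>                  # placeholder
              command: ["mpirun"]
              args: ["-n", "4", "python3", "train.py"]   # placeholder workload
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: worker
              image: <MY TF IMAGE>                  # placeholder
              # no command/args here: the operator starts sshd in the workers
```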

aalugore-ai commented 2 years ago

So I actually have to run sshd from my mpijob yaml?

Do I need to do all the mpiuser stuff?

alculquicondor commented 2 years ago

If you run as root, the mpi-operator takes care of that. Unfortunately, if you run as non-root, the challenging part is providing the location of a PID file that the user has permission to use: https://github.com/kubeflow/mpi-operator/blob/master/examples/base/sshd_config. Maybe there's an alternative way of doing this that I haven't thought of. Regardless, it requires a proper sshd config in the image.

aalugore-ai commented 2 years ago

Okay, so in my worker container, I am indeed running as root.

alculquicondor commented 2 years ago

Then you can probably remove the command and let the mpi-operator set it for you.
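In other words, the Worker entry from the spec you posted would shrink to something like this (a sketch, untested; everything not shown stays as you had it):

```yaml
    Worker:
      replicas: 4
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          containers:
            - image: <MY TF IMAGE>
              imagePullPolicy: Always
              name: tensorflow-worker
              # command/args removed: mpi-operator sets the sshd entrypoint
              env:
                - name: PYTHONPATH
                  value: "<edited out>"
```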

aalugore-ai commented 2 years ago

Okay, so I tried a bunch of things.

alculquicondor commented 2 years ago

I think it's possible that this:

> ssh: Could not resolve hostname v2-mlops-hello-world-worker-1.v2-mlops-hello-world-worker: Name or service not known

is a red herring. Maybe the hostnames are not reachable at first, but OpenMPI retries, and eventually it connects and a password is required.

It would be more useful to take a look at the workers' logs to see why they are rejecting the connections.

kilicbaran commented 2 years ago

Hello! I had an issue similar to the one in https://github.com/kubeflow/mpi-operator/issues/443#issuecomment-1016755568. I migrated from v1alpha2 to v2beta1, and I can only run an MPIJob successfully exactly once: the first run completes, but after I delete the job and run it again, I get the same ssh: Could not resolve hostname error. When I run the MPIJob, I sometimes get the ssh: Could not resolve hostname error and sometimes not. I don't remember having such an issue with v1alpha2.

When I connect to the launcher with kubectl exec -it, I can ssh to the workers without any error. When I connect to the workers, I can ssh to the other workers without any error.

I don't know whether this is caused by the mpi-operator or by the underlying setup, such as Kubernetes itself. For reference, I tried this with kind with 2 worker nodes.

alculquicondor commented 2 years ago

Previous versions don't use SSH, so you wouldn't see connection errors.

SSH errors are expected, as networking takes some time to set up. But again, OpenMPI implements retries, so it should eventually connect, unless you are missing some ssh/sshd configuration parameters.

kilicbaran commented 2 years ago

Thanks @alculquicondor. I think the problem was that I was using a base image different from the example base image, and it contained an old version of OpenMPI that somehow didn't retry. I switched to an image with a newer version, which solved my problem.

alculquicondor commented 2 years ago

That's interesting; can you share those errors and the parameters? Do you think it's possible that others might run into the same issues and need these parameters?

kilicbaran commented 2 years ago

I experimented a little bit more. It looks like I was just impatient and deleted the job without letting it restart enough times. My MPIJob worked after 3 restarts, without setting any MCA parameters. Sorry for misleading you, @alculquicondor and others.

alculquicondor commented 2 years ago

You can use restartPolicy: OnFailure (or remove the line, as that's the default) so that you don't end up with failed Pods left behind.
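In the spec posted earlier, that would mean the Launcher entry starts roughly like this (a sketch):

```yaml
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: OnFailure   # or drop this line entirely; OnFailure is the default
```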

alculquicondor commented 2 years ago

/close

google-oss-prow[bot] commented 2 years ago

@alculquicondor: Closing this issue.

In response to [this](https://github.com/kubeflow/mpi-operator/issues/443#issuecomment-1021350641):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.