Closed by aalugore-ai 2 years ago
https://github.com/kubeflow/mpi-operator/blob/master/proposals/scalable-robust-operator.md
Perhaps it would be great to have an upgrading guide.
/cc @alculquicondor
Your images need some compatible SSH configuration. You can find a base image here: https://github.com/kubeflow/mpi-operator/tree/master/examples/base
An upgrade guide would be great, I'm happy to review a PR to the README.
Is there any plan to put out an upgrade guide? I attempted performing the upgrade myself and tried a few things but was unsuccessful.
I unfortunately don't have time to do it. Does this help? https://github.com/kubeflow/mpi-operator/pull/416/files#diff-a7f96af68d899e069e558ca372a2a3976a62761ce47e1e90a4f2f281f47690eeL72
Greetings! I hope you had a wonderful holiday season :) So in lieu of an upgrade guide, I decided to just try performing the upgrade and see what happens. I hope by doing this I can ask you more specific, smaller-scoped questions as I fiddle my way through this. I'm currently hitting an error when I use helm to deploy the mpi-operator. I am using this chart with some minor, innocuous changes to pull the docker images from my own private repo.
For reference, I am using a built mpi-operator image based on the tag kubeflow/mpi-operator@v0.3.0.
I was able to successfully get this CRD created in my env: https://github.com/kubeflow/mpi-operator/blob/a566d1d18046fe7f756b0ec7004693c898d589a0/deploy/v2beta1/mpi-operator.yaml so the v2beta1 api is stored in my env.
I attempted to deploy with the following command:
```
helm install --set rbac.clusterResources.create=false --set crd.create=false --create-namespace -n mpi-operator-v2 hack/helm/mpi-operator --generate-name
```
The mpi-operator comes up and is shown as running in kubernetes, but when I look at the logs, I see this:
```
<Normal mpi-operator logging>
...
I0106 20:39:54.407102 1 mpi_job_controller.go:325] Setting up event handlers
I0106 20:39:54.407159 1 mpi_job_controller.go:384] Starting MPIJob controller
I0106 20:39:54.407164 1 mpi_job_controller.go:387] Waiting for informer caches to sync
E0106 20:39:54.408165 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.9/tools/cache/reflector.go:156: Failed to watch *v1.Secret: failed to list *v1.Secret: secrets is forbidden: User "system:serviceaccount:mpi-operator-v2:mpi-operator" cannot list resource "secrets" in API group "" in the namespace "mpi-operator-v2"
E0106 20:39:54.408237 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.9/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User "system:serviceaccount:mpi-operator-v2:mpi-operator" cannot list resource "services" in API group "" in the namespace "mpi-operator-v2"
E0106 20:39:55.262512 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.9/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User "system:serviceaccount:mpi-operator-v2:mpi-operator" cannot list resource "services" in API group "" in the namespace "mpi-operator-v2"
E0106 20:39:55.548654 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.9/tools/cache/reflector.go:156: Failed to watch *v1.Secret: failed to list *v1.Secret: secrets is forbidden: User "system:serviceaccount:mpi-operator-v2:mpi-operator" cannot list resource "secrets" in API group "" in the namespace "mpi-operator-v2"
...
<The above errors repeat indefinitely>
```
In the design docs, it says the service account has been removed for v2, so I'm wondering if I missed a step or did something wrong. Please let me know if you need any other information from me to debug or if I'm doing something obviously incorrect.
What was removed was the need for the controller to create service accounts.
However, the mpi-operator itself still requires permissions that can be given through a service account. Perhaps it would help to have a look at the cluster role: https://github.com/kubeflow/mpi-operator/blob/master/manifests/base/cluster-role.yaml
Also, there is an open PR for the helm update, but I haven't had a chance to review it yet: #447
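For anyone hitting the same `forbidden` errors: since the chart above was installed with `rbac.clusterResources.create=false`, the cluster-scoped RBAC objects were most likely never created. A minimal sketch of the kind of binding the controller's service account needs (the resource names and rule list here are illustrative; the authoritative rules are in the cluster-role.yaml linked above):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mpi-operator          # illustrative name
rules:
  - apiGroups: [""]
    resources: ["secrets", "services", "configmaps", "pods"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
  # ...plus the kubeflow.org and remaining rules from the linked cluster-role.yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mpi-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mpi-operator
subjects:
  - kind: ServiceAccount
    name: mpi-operator        # the account named in the errors above
    namespace: mpi-operator-v2
```

The `list`/`watch` verbs on `secrets` and `services` are exactly what the reflector errors above complain about.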
Hello again!
Quick thank you for all your help @alculquicondor, each interaction we've had has moved me forward so I appreciate your time!
I've been able to stand up the v2 mpi-operator to the point where it is waiting for a job. However, when I attempt to start a simple "hello_world" job (which runs a simple 2-layer model using horovod to distribute work), the launcher pods fail repeatedly until the backoff limit is reached. The mpi-operator logs seem pretty normal, but when I dump logs from one of the failed launchers I see this:
```
alugo@head_node:~$ k2 logs v2-mlops-hello-world-launcher-4xwwt
ssh: Could not resolve hostname v2-mlops-hello-world-worker-1.v2-mlops-hello-world-worker: Name or service not known
ssh: Could not resolve hostname v2-mlops-hello-world-worker-0.v2-mlops-hello-world-worker: Name or service not known
ssh: Could not resolve hostname v2-mlops-hello-world-worker-2.v2-mlops-hello-world-worker: Name or service not known
ssh: Could not resolve hostname v2-mlops-hello-world-worker-3.v2-mlops-hello-world-worker: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
  my node: v2-mlops-hello-world-launcher
  target node: v2-mlops-hello-world-worker-1.v2-mlops-hello-world-worker
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[v2-mlops-hello-world-launcher:00001] 2 more processes have sent help message help-errmgr-base.txt / no-path
[v2-mlops-hello-world-launcher:00001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
```
I hypothesize a few things that could be wrong, but first I think it's important to state that I am currently running k8s version v1.19.7 with no possibility of an upgrade at the moment. I'm wondering if the inability to use Indexed Jobs is causing problems.
In the design doc, it states that "plain pods" could still be used. Is there some flag or config needed in order to make sure this path is taken? Is k8s version v1.19.7 compatible with v2 mpi-operator?
Last piece of info I want to give you. In a recent PR, there was a bunch of ssh config stuff added to an example here: https://github.com/kubeflow/mpi-operator/pull/428/files
I looked at the Dockerfiles used to generate my worker container (I do not own them). They currently do not contain the StrictHostKeyChecking/StrictModes/Port etc. sed instructions. Could that be contributing to the problem? If so, how many of the commands in this Dockerfile would I need to add to my worker container?
I also wanted to add my MpiJobSpec for your reference in case it helps:
```yaml
kind: MPIJob
metadata:
  name: v2-mlops-hello-world
spec:
  slotsPerWorker: 4
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          containers:
            - image: <MY TF IMAGE>
              imagePullPolicy: Always
              name: tensorflow-launcher
              env:
                - name: PYTHONPATH
                  value: "<edited out>"
              command:
                [
                  "mpirun",
                  "--allow-run-as-root",
                  "-map-by",
                  "ppr:2:socket",
                  "--bind-to",
                  "socket",
                  "--report-bindings",
                  "--tag-output",
                  "-npersocket",
                  "2",
                  "-x",
                  "PATH",
                  "-x",
                  "PYTHONPATH",
                  "python3",
                  "< command >"
                ]
              resources:
                requests:
                  cpu: "100m"
    Worker:
      replicas: 4
      template:
        spec:
          hostNetwork: true
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          containers:
            - image: <MY TF IMAGE>
              imagePullPolicy: Always
              name: tensorflow-worker
              command: ["bash", "-c"]
              args:
                - sleep infinity;
              env:
                - name: PYTHONPATH
                  value: "<edited out>"
              securityContext:
                capabilities:
                  add:
                    - SYS_RAWIO
                    - SYS_PTRACE
              resources:
                limits:
                  <MyGpuResourceLabel>: 4
                  hugepages-2Mi: "1800Mi"
                  cpu: "108"
```
Not sure if there is some "new" way mpijobs need to be defined.
> I'm wondering if the lack of ability to use Indexed Jobs is causing problems.
The v2 controller doesn't use Indexed Jobs, although it's my desire that we do that in the future.
> I looked at the Dockerfiles used to generate my worker container (I do not own them). They currently do not contain these StrictHostKeyChecking/StrictModes/Port etc sed instructions. Could that be contributing to the problem? If so, how much of the commands in this dockerfile would I need to add to my worker container?
It is possible that the worker's sshd is not allowing the connection due to a misconfiguration. You can gather more information about this if you have a look at the worker's logs (`kubectl logs` should help you with that).
BUT! You are not actually running sshd in your workers, you are running `sleep`. I suspect you were looking at the instructions for the old controller. Maybe this sample helps: https://github.com/kubeflow/mpi-operator/blob/master/examples/pi/pi.yaml
I see you are trying to use `hostNetwork`, so you might need to change the port that sshd runs on. This sample has such a change: https://github.com/kubeflow/mpi-operator/blob/master/examples/base/Dockerfile
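In other words, in v2 the worker containers are expected to run sshd rather than a placeholder command. A minimal sketch of what the Worker replica spec can look like when running as root (adapted from the idea in the pi.yaml sample linked above; the image name is a placeholder and the image must still contain a compatible SSH setup):

```yaml
Worker:
  replicas: 4
  template:
    spec:
      containers:
        - image: <MY TF IMAGE>      # placeholder; needs OpenSSH installed
          imagePullPolicy: Always
          name: tensorflow-worker
          # Note: no command/args here. Per the discussion below, when the
          # container runs as root the operator can set the sshd command itself,
          # instead of "sleep infinity" which never accepts SSH connections.
```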
So I actually have to run sshd from my mpijob yaml?
Do I need to do all the mpiuser stuff?
If you run as root, the mpi-operator takes care of that. Unfortunately, if you run as non-root, the challenging part is providing the location of a PID file that the user has permissions to use (see https://github.com/kubeflow/mpi-operator/blob/master/examples/base/sshd_config). Maybe there's an alternative way of doing this that I couldn't think of. Regardless, it requires a proper sshd config in the image.
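For illustration, a rough sketch of the kind of sshd_config settings a non-root setup needs (the paths and user name here are assumptions for the example; refer to the linked sshd_config for the authoritative version):

```
# Run sshd on a non-privileged port so a non-root user can bind it
Port 2222
# Host key and PID file in locations the non-root user can read/write
HostKey /home/mpiuser/.ssh/id_rsa
PidFile /home/mpiuser/sshd.pid
# Relax ownership/permission checks on the key directory, since the
# keys are typically projected from a Kubernetes Secret
StrictModes no
```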
Okay, so in my worker container, I am indeed running as root.
Then you can probably remove the command and let the mpi-operator set it for you.
Okay, so I tried a bunch of things, including adding `sshAuthMountpath: /root/.ssh` to my mpijob yaml.
In all cases I got the same error I sent here: https://github.com/kubeflow/mpi-operator/issues/443#issuecomment-1012520106
What's interesting is the failure line "ssh: Could not resolve hostname v2-mlops-hello-world-worker-1.v2-mlops-hello-world-worker: Name or service not known"
I see this in `kubectl logs launcher-pod`. But when I exec into the worker and run `ssh v2-mlops-hello-world-worker-1.v2-mlops-hello-world-worker`, the host is found but it prompts me for a password:
```
root@worker0:~# ssh v2-mlops-hello-world-worker-1.v2-mlops-hello-world-worker
Warning: Permanently added 'v2-mlops-hello-world-worker-1.v2-mlops-hello-world-worker,<worker ip>' (ECDSA) to the list of known hosts.
root@v2-mlops-hello-world-worker-1.v2-mlops-hello-world-worker's password:
```
It's strange to me that the launcher's logs say it cannot resolve the host, while inside the worker the hostname is found but SSH prompts for a password.
I think it's possible that this:
`ssh: Could not resolve hostname v2-mlops-hello-world-worker-1.v2-mlops-hello-world-worker: Name or service not known`
is a red herring. Maybe initially the hostnames are not reachable, but OpenMPI retries and eventually it connects and a password is required.
It's more useful if you take a look at the worker's logs to see why they are rejecting the connections.
Hello! I had an issue similar to the one in https://github.com/kubeflow/mpi-operator/issues/443#issuecomment-1016755568. I migrated from v1alpha2 to v2beta1. When I run an MPIJob, sometimes I get an `ssh: Could not resolve hostname` error and sometimes not; the first run completes successfully, but after I delete the job and run it again, the error can reappear. I don't remember having such an issue with v1alpha2.
When I connect to the launcher with `kubectl exec -it`, I can ssh to the workers without any error. When I connect to the workers, I can ssh to other workers without any error.
I don't know whether it is because of mpi-operator or any underlying setup like Kubernetes. For reference, I tried this with kind with 2 worker nodes.
Previous versions don't use SSH, so you wouldn't see connection errors.
SSH errors are expected, as networking takes some time to set up. But again, OpenMPI implements retries, so it should eventually connect, unless you are missing some ssh/sshd configuration parameters.
Thanks @alculquicondor. I think the problem was that I was using a base image different from the example base image, and my image contained an old version of OpenMPI that somehow didn't retry. I switched to an image with a newer version, and that solved my problem.
That's interesting, can you share those errors and the parameters? Do you think it's possible that others might run into the same issues and require these parameters?
I experimented a little bit more. It looks like I was just impatient and I deleted the job without letting it restart enough. My MPI job worked after 3 restarts without setting any MCA parameters. Sorry for misleading you @alculquicondor and others.
You can use `restartPolicy: OnFailure` (or remove the line, as that's the default), so that you don't end up with failed Pods lying around.
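Concretely, in the MPIJob spec that setting lives on the launcher replica spec, e.g. (a minimal fragment; other fields elided):

```yaml
mpiReplicaSpecs:
  Launcher:
    replicas: 1
    restartPolicy: OnFailure   # or omit the line entirely; this is the default
```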
/close
@alculquicondor: Closing this issue.
Hello,
I am attempting to port some custom functionality from my previously used v1alpha2 code/fork into v2 as I notice v1alpha2 is being deprecated.
I have read through the code but am struggling to identify all the functional differences in v2, apart from worker StatefulSets being converted to plain Pods, and OpenSSH support.
Could someone explain the main architectural differences introduced in v2 and their implications? Or perhaps point me to issues or pull requests that describe the functionality? If I successfully port over my custom logic (for the sake of brevity, let's just say it's a no-op), would I be able to point my cluster at the v2 version of mpi-operator and have it "just work"? Or are there configurations I need to take into account besides "pointing" to v2?