Closed xhejtman closed 3 years ago
It seems that the problem is that MPIJob cannot run in istio enabled namespace. In pure namespace, it runs OK. Is this expected? mpi-operator is currently incompatible with istio?
The problem with istio namespace is, that an istio proxy sidecar is injected. If the istio proxy is not running, no pod can reach network, including init containers. However, istio proxy does not start till all init containers finish. mpi operator runs kubectl-delivery init container that needs to access kube api. This is not possible till istio proxy starts. However, the istio proxy does not start until init container finishes. Ergo deadlock. This is famous bug in istio: https://github.com/istio/istio/issues/11130. Situation could be solved, if kubectl-delivery init container forcibly starts before istio-validation init container. Not sure, if this is possible.
Can you add a label to disable injection?
I could but I guess it is the same as a new namespace without istio, isn't it? The question is whether the mpi operator can work in istio enabled namespaces or not. If not, maybe it is worth of mention it in README as whole kubeflow framework actually needs istio ;)
/cc @andreyvelich I remember that we encounter this problem before in Katib.
We disabled istio sidecar for Katib Trials because we are using external source to download dataset: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/mxnet-mnist/mnist.py#L44. I am not sure if MPI Operator can work with istio enabled.
What do you think @alculquicondor @terrytangyuan ?
I haven't tried it with Istio before so I would recommend adding a note on README regarding this unless others have successful experience with using them together.
The v2 controller doesn't have init containers that access the network. Please consider giving it a try. I'm currently working on documenting it, but it's already fully functional.
You can build it with
make RELEASE_VERSION=dev CONTROLLER_VERSION=v2 images dev_manifest
Then you can deploy it with
kubectl apply -k manifests/overlays/dev
Ok, thanks, I will give it a try. Does it use ssh or does it work without root? I am interested in the latter one. Not sure whether to create a new issue or just mention it here, but it seems that if MPI job fails (e.g. due to istio problem) and you delete it and recreate then no pods ever get created on the second try. I need to rename the job.
Yes, it uses ssh and it works without root, but you need to make sure the image supports it. See the example in examples/pi
.
Please open a separate issue with precise steps, wait times, log lines, etc.
Unfortunately, it requires root for init-ssh container. If you have restrictive PSP, it fails. I try to fix it so it is able to run on PSP enabled cluster and provide patch if interested.
edit: strange, it should set everything correctly but for some reason, it does not.
this is launcher definition:
spec:
containers:
- args:
- -n
- "2"
- /home/mpiuser/pi
command:
- mpirun
env:
- name: K_MPI_JOB_ROLE
value: launcher
- name: OMPI_MCA_orte_keep_fqdn_hostnames
value: "true"
- name: OMPI_MCA_orte_default_hostfile
value: /etc/mpi/hostfile
- name: OMPI_MCA_plm_rsh_args
value: -o ConnectionAttempts=10
- name: OMPI_MCA_orte_set_default_slots
value: "1"
- name: NVIDIA_VISIBLE_DEVICES
- name: NVIDIA_DRIVER_CAPABILITIES
image: kubeflow/mpi-pi
imagePullPolicy: Always
name: mpi-launcher
resources:
limits:
cpu: "1"
memory: 1Gi
requests:
cpu: "1"
memory: 1Gi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsUser: 1000
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /home/mpiuser/.ssh
name: ssh-home
- mountPath: /etc/mpi
name: mpi-job-config
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-dkn7g
readOnly: true
dnsPolicy: ClusterFirst
...
initContainers:
- args:
- -c
- cp -RL /mnt/ssh/* /mnt/home-ssh && chmod 700 /mnt/home-ssh && chmod 600 /mnt/home-ssh/*
&& chown 1000 -R /mnt/home-ssh
command:
- /bin/sh
image: alpine:3.14
imagePullPolicy: IfNotPresent
name: init-ssh
resources: {}
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsNonRoot: true
and I believe, this code should set runasuser for init-ssh container as well, but apparently, it does not.
// The init script sets the permissions of the ssh folder in the user's home
// directory. The ownership is set based on the security context of the
// launcher's first container.
launcherSecurityCtx := job.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeLauncher].Template.Spec.Containers[0].SecurityContext
initScript := "" +
"cp -RL /mnt/ssh/* /mnt/home-ssh && " +
"chmod 700 /mnt/home-ssh && " +
"chmod 600 /mnt/home-ssh/*"
if launcherSecurityCtx != nil && launcherSecurityCtx.RunAsUser != nil {
initScript += fmt.Sprintf(" && chown %d -R /mnt/home-ssh", *launcherSecurityCtx.RunAsUser)
}
ah, nope, my bad. It is not implemented for init-ssh container.
Unfortunately, the init container needs root permissions to be able to change the owner of the volumes. This is some kubernetes limitation. One thing we could do is to have an option to disable the init-container, but then your image should know how to grab the ssh keys from /mnt/ssh
and put them in the HOME directory.
This was actually my initial approach, but it put burden in the author of the image to know how to configure the $HOME/.ssh
folder.
Would this be acceptable to you? I'm happy to review a change like that.
What about using a PVC for home? Those init containers are good as you do not need to make special initialization in the MPI containers. PVCs for home are already working OK.
So perhaps to add an option for home storage instead of emptyDir and allow init containers to run as user?
Do you mean having /home
as a Volume or /home/mpiuser
? I tried the latter but ssh
complained about directory permissions.
I don't think it being a PVC or emptyDir would make a difference.
You are probably right. However, I found how to mitigate this problem. Just set StrictModes no
into /etc/ssh/sshd_config. For non-root, also listen on, e.g., port 2222, set /etc/ssh/ssh_config to connect on port 2222 instead of 22, and PI example works even on PSP enabled clusters. Also do not set CAP on /usr/sbin/sshd in such a case.
The only thing I needed in operator was to put away chmod 0700 /mnt/home-ssh and chown /mnt/home-ssh from the init container. If this could be parametrized somehow, the operator supports running without root access.
Oh, I didn't know about StrictModes no
. Does this disable the check for both /home/mpiuser
and /home/mpiuser/.ssh
or just the first?
I think you should be able to change the port to 2222 and the ssh_config on your own images. I prefer to keep the sample simple. The operator doesn't need to know about the port.
The only thing I needed in operator was to put away chmod 0700 /mnt/home-ssh and chown /mnt/home-ssh from the init container. If this could be parametrized somehow, the operator supports running without root access.
It sounds like you have a working prototype. Can you send a PR?
I think we can simply remove those lines and recommend users to set StrictModes no
.
Since the workloads run in a container, there is no need to "secure" the home directory and ssh keys further.
I use emptyDir for ~/.ssh now with terrible access rights 0777 but it just works with StrictModes no
.
yeah the port 2222 is good perhaps for non-root example.
Prototype is currently like this:
initScript := "" +
"cp -RL /mnt/ssh/* /mnt/home-ssh && " +
// "chmod 700 /mnt/home-ssh &&" +
"chmod 600 /mnt/home-ssh/*"
/*if launcherSecurityCtx != nil && launcherSecurityCtx.RunAsUser != nil {
initScript += fmt.Sprintf(" && chown %d -R /mnt/home-ssh", *launcherSecurityCtx.RunAsUser)
}*/
but if you are ok just remove them, ok, no other change is needed.
So, you are happy to remove those lines, I can provide at least working PI - non root example.
Can we go further and remove the initContainer altogether?
The only reason we needed it was because of the tight file permission requirements. If we can get rid of those checks with StrictModes no
, can't we simply mount the Secret in the workload container and point sshd_config to the keys there?
/retitle v2 initContainer can't run as non-root
That didn't work, @xhejtman can you retitle the issue as suggested above?
We could if it is ok that ~/.ssh/known_hosts is not writtable. Mounted secrets are readonly.
I think we would have to mount the secret somewhere else, for example /opt/ssh-creds
. Then have a sshd_config
that points to the keys in there. And we also need .ssh/config
that points to the public key.
or you just can use UserKnownHostsFile to point .known-hosts to a different location.. or maybe a bit safer solution, pre-generate hostkeys and populate /etc/ssh/ssh_known_hosts. Not sure, if this is easy to do currently.
Generating the known_hosts would be impossible, because we don't know the Pod IPs before they start.
Having a different UserKnownHostsFile
sounds good. But the problem I see with mounting the .ssh
folder directly is anything that the image had in there is lost, like a .ssh/config
file. But maybe that's fine, because there are alternatives (like just editing /etc/ssh/ssh_config).
So it comes down to what is easier to use. Probably directly mounting the Volume in $HOME/.ssh
? I guess if users don't want this behavior and want to manage the directory some other way, they can simply change .spec.sshAuthMountPath
. Right?
Yes you are right, letting user decide where to mnount ssh keys is the best option. Whether they mount it to ~/.ssh or not is up to them.
So if you release a new version of the operator, I can test it and provide PI example without root.
I'm still working on it. After removing the init container, the non-root use case works, but root doesn't. This is the error from the launcher:
Permissions 0644 for '/root/.ssh/id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Are you aware of away to disable that check?
Are you mounting a secret? I think you can set mode for individual files: defaultMode: 400, just checked, for ssh connect, access rights 0777 on .ssh do not matter.
Yes, I'm mounting the secret directly. However, I cannot use 0400
for non-root. It has to be at least 0440
. I can detect if there is any security context with runAsUser
, but I would be great if I didn't have to do that.
Hello,
I use kubeflow 1.3.0, mpi-operator latest from Jul 5 2021. I tried MPI Job example horovod, however, launcher deployment has error in kubectl-delivery pod: Failed to list *v1.Pod: Get https://10.43.0.1:443/api/v1/namespaces/adrian/pods?limit=500&resourceVersion=0: dial tcp 10.43.0.1:443: connect: connection refused
I tested, the pod is unable to reach any outer node or external internet on ports 80 or 443. I tested those two workers, they are OK, they can reach anything, one of the workers is running on the same physical node as launcher and worker can reach network, launcher cannot. Launcher can ping interneter (e.g. 8.8.8.8).
Is it some istio misconfiguration problem? I tried to forcibly start launcher on a different node, result is the same. Thanks for help.