kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

v2 initContainer can't run as non-root #407

Closed: xhejtman closed this issue 3 years ago

xhejtman commented 3 years ago

Hello,

I am using Kubeflow 1.3.0 with the latest mpi-operator (from Jul 5 2021). I tried the Horovod MPIJob example; however, the launcher deployment fails in the kubectl-delivery pod: Failed to list *v1.Pod: Get https://10.43.0.1:443/api/v1/namespaces/adrian/pods?limit=500&resourceVersion=0: dial tcp 10.43.0.1:443: connect: connection refused

I tested the pod: it is unable to reach any other node or the external internet on ports 80 or 443. The two workers are fine and can reach anything; one of them runs on the same physical node as the launcher, yet the worker can reach the network while the launcher cannot. The launcher can ping the internet (e.g., 8.8.8.8).

Is this an Istio misconfiguration problem? I tried forcibly starting the launcher on a different node; the result is the same. Thanks for the help.

xhejtman commented 3 years ago

It seems the problem is that an MPIJob cannot run in an Istio-enabled namespace. In a plain namespace, it runs fine. Is this expected? Is the mpi-operator currently incompatible with Istio?

xhejtman commented 3 years ago

The problem in an Istio-enabled namespace is that an istio-proxy sidecar is injected. Until the istio-proxy is running, no pod can reach the network, including its init containers. However, the istio-proxy does not start until all init containers finish. The mpi-operator runs a kubectl-delivery init container that needs to access the Kubernetes API, which is not possible until the istio-proxy starts; but the istio-proxy does not start until the init container finishes. Ergo: deadlock. This is a famous Istio bug: https://github.com/istio/istio/issues/11130. The situation could be resolved if the kubectl-delivery init container could be forced to start before the istio-validation init container. Not sure if that is possible.

gaocegege commented 3 years ago

Can you add a label to disable injection?
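
For reference, Istio supports disabling injection per Pod with an annotation on the pod template. A minimal sketch against the MPIJob spec (surrounding fields omitted):

mpiReplicaSpecs:
  Launcher:
    template:
      metadata:
        annotations:
          sidecar.istio.io/inject: "false"  # tell Istio not to inject the proxy sidecar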

xhejtman commented 3 years ago

I could, but I guess that is the same as a new namespace without Istio, isn't it? The question is whether the mpi-operator can work in Istio-enabled namespaces or not. If not, maybe it is worth mentioning in the README, as the whole Kubeflow framework actually needs Istio ;)

gaocegege commented 3 years ago

/cc @andreyvelich I remember that we encountered this problem before in Katib.

andreyvelich commented 3 years ago

We disabled the Istio sidecar for Katib Trials because we use an external source to download the dataset: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/mxnet-mnist/mnist.py#L44. I am not sure if the MPI Operator can work with Istio enabled.

What do you think @alculquicondor @terrytangyuan ?

terrytangyuan commented 3 years ago

I haven't tried it with Istio before, so I would recommend adding a note to the README about this, unless others have had success using them together.

alculquicondor commented 3 years ago

The v2 controller doesn't have init containers that access the network. Please consider giving it a try. I'm currently working on documenting it, but it's already fully functional.

You can build it with

make RELEASE_VERSION=dev CONTROLLER_VERSION=v2 images dev_manifest

Then you can deploy it with

kubectl apply -k manifests/overlays/dev
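
For anyone trying the v2 controller, a minimal MPIJob against the v2beta1 API could look roughly like this (a sketch modeled on the launcher spec quoted later in this thread; the image and program path are illustrative, see examples/pi for the real ones):

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: mpi-launcher
            image: kubeflow/mpi-pi  # illustrative image name
            command:
            - mpirun
            args:
            - -n
            - "2"
            - /home/mpiuser/pi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: mpi-worker
            image: kubeflow/mpi-pi
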
xhejtman commented 3 years ago

OK, thanks, I will give it a try. Does it use ssh, or does it work without root? I am interested in the latter. Not sure whether to create a new issue or just mention it here, but it seems that if an MPIJob fails (e.g., due to the Istio problem) and you delete and recreate it, no pods ever get created on the second try; I need to rename the job.

alculquicondor commented 3 years ago

Yes, it uses ssh and it works without root, but you need to make sure the image supports it. See the example in examples/pi.

Please open a separate issue with precise steps, wait times, log lines, etc.

xhejtman commented 3 years ago

Unfortunately, it requires root for the init-ssh container. If you have a restrictive PSP, it fails. I will try to fix it so that it can run on a PSP-enabled cluster, and I can provide a patch if there is interest.

edit: strange, it should set everything correctly, but for some reason it does not.

xhejtman commented 3 years ago

This is the launcher definition:

spec:
  containers:
  - args:
    - -n
    - "2"
    - /home/mpiuser/pi
    command:
    - mpirun
    env:
    - name: K_MPI_JOB_ROLE
      value: launcher
    - name: OMPI_MCA_orte_keep_fqdn_hostnames
      value: "true"
    - name: OMPI_MCA_orte_default_hostfile
      value: /etc/mpi/hostfile
    - name: OMPI_MCA_plm_rsh_args
      value: -o ConnectionAttempts=10
    - name: OMPI_MCA_orte_set_default_slots
      value: "1"
    - name: NVIDIA_VISIBLE_DEVICES
    - name: NVIDIA_DRIVER_CAPABILITIES
    image: kubeflow/mpi-pi
    imagePullPolicy: Always
    name: mpi-launcher
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: "1"
        memory: 1Gi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      runAsUser: 1000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/mpiuser/.ssh
      name: ssh-home
    - mountPath: /etc/mpi
      name: mpi-job-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-dkn7g
      readOnly: true
  dnsPolicy: ClusterFirst
  ...
  initContainers:
  - args:
    - -c
    - cp -RL /mnt/ssh/* /mnt/home-ssh && chmod 700 /mnt/home-ssh && chmod 600 /mnt/home-ssh/*
      && chown 1000 -R /mnt/home-ssh
    command:
    - /bin/sh
    image: alpine:3.14
    imagePullPolicy: IfNotPresent
    name: init-ssh
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      runAsNonRoot: true

and I believe this code should set runAsUser for the init-ssh container as well, but apparently it does not.

        // The init script sets the permissions of the ssh folder in the user's home
        // directory. The ownership is set based on the security context of the
        // launcher's first container.
        launcherSecurityCtx := job.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeLauncher].Template.Spec.Containers[0].SecurityContext
        initScript := "" +
                "cp -RL /mnt/ssh/* /mnt/home-ssh && " +
                "chmod 700 /mnt/home-ssh && " +
                "chmod 600 /mnt/home-ssh/*"
        if launcherSecurityCtx != nil && launcherSecurityCtx.RunAsUser != nil {
                initScript += fmt.Sprintf(" && chown %d -R /mnt/home-ssh", *launcherSecurityCtx.RunAsUser)
        }
xhejtman commented 3 years ago

Ah, nope, my bad. It is not implemented for the init-ssh container.

alculquicondor commented 3 years ago

Unfortunately, the init container needs root permissions to be able to change the owner of the volumes; this is a Kubernetes limitation. One thing we could do is add an option to disable the init container, but then your image would need to know how to grab the ssh keys from /mnt/ssh and put them in the HOME directory.

This was actually my initial approach, but it put the burden on the author of the image to know how to configure the $HOME/.ssh folder.

Would this be acceptable to you? I'm happy to review a change like that.

xhejtman commented 3 years ago

What about using a PVC for home? The init containers are good because you do not need any special initialization in the MPI containers. PVCs for home already work fine.

So perhaps add an option for home storage instead of emptyDir, and allow the init containers to run as the user?

alculquicondor commented 3 years ago

Do you mean having /home as a Volume or /home/mpiuser? I tried the latter but ssh complained about directory permissions.

I don't think it being a PVC or emptyDir would make a difference.

xhejtman commented 3 years ago

You are probably right. However, I found a way to mitigate this problem: just set StrictModes no in /etc/ssh/sshd_config. For non-root, also listen on, e.g., port 2222, set /etc/ssh/ssh_config to connect on port 2222 instead of 22, and the pi example works even on PSP-enabled clusters. Also, do not set a capability on /usr/sbin/sshd in that case.
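
For reference, a minimal sketch of such an sshd_config, wrapped here in a hypothetical ConfigMap so it could be mounted over /etc/ssh/sshd_config (it can just as well be baked into the image):

apiVersion: v1
kind: ConfigMap
metadata:
  name: sshd-config  # hypothetical name
data:
  sshd_config: |
    Port 2222        # unprivileged port, so sshd can run without root
    StrictModes no   # skip the strict ownership/permission checks on ~/.ssh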

The only thing I needed in the operator was to remove the chmod 0700 /mnt/home-ssh and the chown of /mnt/home-ssh from the init container. If this could be parametrized somehow, the operator would support running without root access.

alculquicondor commented 3 years ago

Oh, I didn't know about StrictModes no. Does this disable the check for both /home/mpiuser and /home/mpiuser/.ssh or just the first?

I think you should be able to change the port to 2222 and the ssh_config on your own images. I prefer to keep the sample simple. The operator doesn't need to know about the port.

> The only thing I needed in the operator was to remove the chmod 0700 /mnt/home-ssh and the chown of /mnt/home-ssh from the init container. If this could be parametrized somehow, the operator would support running without root access.

It sounds like you have a working prototype. Can you send a PR?

alculquicondor commented 3 years ago

I think we can simply remove those lines and recommend that users set StrictModes no.

Since the workloads run in a container, there is no need to "secure" the home directory and ssh keys further.

xhejtman commented 3 years ago

I use an emptyDir for ~/.ssh now, with terrible access rights (0777), but it just works with StrictModes no.

Yeah, port 2222 is perhaps good for a non-root example.

The prototype currently looks like this:

        initScript := "" +
                "cp -RL /mnt/ssh/* /mnt/home-ssh && " +
                // "chmod 700 /mnt/home-ssh &&" +
                "chmod 600 /mnt/home-ssh/*"
        /*if launcherSecurityCtx != nil && launcherSecurityCtx.RunAsUser != nil {
                initScript += fmt.Sprintf(" && chown %d -R /mnt/home-ssh", *launcherSecurityCtx.RunAsUser)
        }*/

But if you are OK with just removing those lines, then no other change is needed.

So, if you are happy to remove those lines, I can provide at least a working non-root pi example.

alculquicondor commented 3 years ago

Can we go further and remove the initContainer altogether?

The only reason we needed it was because of the tight file permission requirements. If we can get rid of those checks with StrictModes no, can't we simply mount the Secret in the workload container and point sshd_config to the keys there?

alculquicondor commented 3 years ago

/retitle v2 initContainer can't run as non-root

alculquicondor commented 3 years ago

That didn't work. @xhejtman, can you retitle the issue as suggested above?

xhejtman commented 3 years ago

We could, if it is OK that ~/.ssh/known_hosts is not writable. Mounted secrets are read-only.

alculquicondor commented 3 years ago

I think we would have to mount the secret somewhere else, for example /opt/ssh-creds, and then have an sshd_config that points to the keys in there. We also need a .ssh/config that points to the key.
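
A sketch of the mount in question (names are hypothetical; the sshd_config and .ssh/config would then reference files under /opt/ssh-creds):

containers:
- name: mpi-worker
  volumeMounts:
  - name: ssh-auth             # the Secret holding the SSH key pair
    mountPath: /opt/ssh-creds
    readOnly: true             # Secret mounts are read-only anyway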

xhejtman commented 3 years ago

Or you can just use UserKnownHostsFile to point known_hosts to a different location. Or, maybe a bit safer: pre-generate host keys and populate /etc/ssh/ssh_known_hosts. Not sure if that is easy to do at the moment.
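
For example, the client-side config in the hypothetical ConfigMap sketched earlier could gain (any writable path works):

  ssh_config: |
    UserKnownHostsFile /tmp/known_hosts  # keep known_hosts out of the read-only mount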

alculquicondor commented 3 years ago

Generating the known_hosts would be impossible, because we don't know the Pod IPs before they start.

Having a different UserKnownHostsFile sounds good. But the problem I see with mounting the .ssh folder directly is that anything the image had in there is lost, like a .ssh/config file. Maybe that's fine, though, because there are alternatives (like just editing /etc/ssh/ssh_config).

So it comes down to what is easier to use. Probably directly mounting the Volume in $HOME/.ssh? I guess if users don't want this behavior and want to manage the directory some other way, they can simply change .spec.sshAuthMountPath. Right?

xhejtman commented 3 years ago

Yes, you are right: letting the user decide where to mount the ssh keys is the best option. Whether they mount them to ~/.ssh or not is up to them.

So once you release a new version of the operator, I can test it and provide a pi example without root.

alculquicondor commented 3 years ago

I'm still working on it. After removing the init container, the non-root use case works, but root doesn't. This is the error from the launcher:

Permissions 0644 for '/root/.ssh/id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.

Are you aware of a way to disable that check?

xhejtman commented 3 years ago

Are you mounting a secret? I think you can set the mode of the mounted files with defaultMode: 400. I just checked: for an ssh connection, access rights of 0777 on .ssh do not matter.

alculquicondor commented 3 years ago

Yes, I'm mounting the secret directly. However, I cannot use 0400 for non-root; it has to be at least 0440. I can detect whether there is a security context with runAsUser, but it would be great if I didn't have to do that.
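
For illustration, the relevant part of the Pod spec (a sketch; the Secret name is hypothetical):

volumes:
- name: ssh-auth
  secret:
    secretName: pi-ssh   # hypothetical name of the MPIJob's SSH key Secret
    defaultMode: 0440    # 0400 works for root only; non-root needs group read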