kubeflow / manifests

A repository for Kustomize manifests
Apache License 2.0

Add proper PSPs to enforce security and safety for Kubeflow on Kubernetes #2014

Closed juliusvonkohout closed 1 year ago

juliusvonkohout commented 3 years ago

Related to https://github.com/kubeflow/manifests/pull/1756 @yanniszark @DavidSpek and https://github.com/kubeflow/manifests/issues/1984 @sunnythepatel

Currently there are no PodSecurityPolicies or SecurityContextConstraints to enforce security within Kubeflow. I would like to change that and put the necessary energy into pull requests. I have been using the following on my cluster for months to run everything as non-root, including a rootless istio-cni. It also works for pipelines with the k8sapi or the new emissary executor https://github.com/kubeflow/pipelines/issues/5718 @Bobgy

I need your feedback on the following solution. If you are satisfied, I will create a pull request.

In the main kustomization.yaml (see the attached kustomize_istio.zip and kustomize_addons_psp_scc.zip):

# PSPs and SCCs
- PSP
- SCC
- PSP_SCC_clusterrole

# We install istio with CNI and other configurations
#- ../manifests/common/istio-1-9/istio-crds/base
#- ../manifests/common/istio-1-9/istio-namespace/base
#- ../manifests/common/istio-1-9/istio-install/base
- istio/openshift
#- istio/kubernetes
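For reference, a "restricted"-style PSP along these lines might look like the following minimal sketch. The resource name `kubeflow-restricted` is illustrative, not necessarily the exact contents of the attached zips:

```yaml
# Illustrative sketch of a restricted PodSecurityPolicy; the actual
# policy shipped in the attached zips may differ in detail.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: kubeflow-restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL                        # block all capabilities, incl. NET_ADMIN/NET_RAW
  runAsUser:
    rule: MustRunAsNonRoot       # everything runs as non-root
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  fsGroup:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  volumes:                       # no hostPath
    - configMap
    - emptyDir
    - projected
    - secret
    - downwardAPI
    - persistentVolumeClaim
```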
sunnythepatel commented 3 years ago

Hi @juliusvonkohout, can I use your solution to deploy all pods if PSP is enabled?

Thank you

juliusvonkohout commented 3 years ago

> Hi @juliusvonkohout, can I use your solution to deploy all pods if PSP is enabled?
>
> Thank you

There is not "one" PSP. Please read the Kubernetes documentation on PSPs first; you need to understand Kubernetes before altering Kubeflow. If your company is interested in a managed Kubeflow, contact me (T-Systems) or Arrikto for a managed offer.

sunnythepatel commented 3 years ago

Hi @juliusvonkohout, thank you. I was able to resolve my PSP issue by adding the following:

- PSP
- istio/kubernetes

I still had a few issues, though. I think I am having a problem installing these pods:

manifests-1.3.1 kubectl logs cache-deployer-deployment-6dbb64ddcd-dwplm -n kubeflow
+ kubectl logs cache-deployer-deployment-6dbb64ddcd-dwplm -n kubeflow
+ echo 'Start deploying cache service to existing cluster:'
+ NAMESPACE=kubeflow
Start deploying cache service to existing cluster:
+ MUTATING_WEBHOOK_CONFIGURATION_NAME=cache-webhook-kubeflow
+ WEBHOOK_SECRET_NAME=webhook-server-tls
+ mkdir -p /home/cloudsdk/bin
+ export 'PATH=/home/cloudsdk/bin:/google-cloud-sdk/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'
+ kubectl version --output json
+ jq --raw-output '(.serverVersion.major + "." + .serverVersion.minor)'
+ tr -d '"+'
+ server_version_major_minor=1.20
+ curl -s https://storage.googleapis.com/kubernetes-release/release/stable-1.20.txt
+ stable_build_version=v1.20.10
+ kubectl_url=https://storage.googleapis.com/kubernetes-release/release/v1.20.10/bin/linux/amd64/kubectl
+ curl -L -o /home/cloudsdk/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/v1.20.10/bin/linux/amd64/kubectl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38.3M  100 38.3M    0     0  14.1M      0  0:00:02  0:00:02 --:--:-- 14.1M
+ chmod +x /home/cloudsdk/bin/kubectl
/kfp/cache/deployer/deploy-cache-service.sh: line 47: can't create webhooks.txt: Permission denied
 manifests-1.3.1 kubectl logs cache-server-f84f6bdcc-x9nlf -n kubeflow
+ kubectl logs cache-server-f84f6bdcc-x9nlf -n kubeflow
Error from server (BadRequest): container "server" in pod "cache-server-f84f6bdcc-x9nlf" is waiting to start: PodInitializing


for cache-server

Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    60m                   default-scheduler  Successfully assigned kubeflow/cache-server-f84f6bdcc-x9nlf to k8-prod-dev-m2-k8s-node-nf-1
  Warning  FailedMount  58m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[kubeflow-pipelines-cache-token-8jgpg webhook-tls-certs istiod-ca-cert istio-data istio-envoy istio-token istio-podinfo]: timed out waiting for the condition
  Warning  FailedMount  56m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-token istio-podinfo kubeflow-pipelines-cache-token-8jgpg webhook-tls-certs istiod-ca-cert istio-data istio-envoy]: timed out waiting for the condition
  Warning  FailedMount  49m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-podinfo kubeflow-pipelines-cache-token-8jgpg webhook-tls-certs istiod-ca-cert istio-data istio-envoy istio-token]: timed out waiting for the condition
  Warning  FailedMount  45m (x2 over 54m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[webhook-tls-certs istiod-ca-cert istio-data istio-envoy istio-token istio-podinfo kubeflow-pipelines-cache-token-8jgpg]: timed out waiting for the condition
  Warning  FailedMount  24m (x5 over 47m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istiod-ca-cert istio-data istio-envoy istio-token istio-podinfo kubeflow-pipelines-cache-token-8jgpg webhook-tls-certs]: timed out waiting for the condition
  Warning  FailedMount  20m (x4 over 51m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-envoy istio-token istio-podinfo kubeflow-pipelines-cache-token-8jgpg webhook-tls-certs istiod-ca-cert istio-data]: timed out waiting for the condition
  Warning  FailedMount  15m (x2 over 33m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-data istio-envoy istio-token istio-podinfo kubeflow-pipelines-cache-token-8jgpg webhook-tls-certs istiod-ca-cert]: timed out waiting for the condition
  Warning  FailedMount  5m46s (x35 over 60m)  kubelet            MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found

Also, there are pod admission issues in the user namespace due to the PSP, at the final step "create a new namespace for the default user (named kubeflow-user-example-com)":

kustomize build common/user-namespace/base | kubectl apply -f -

Can you please give me some pointers on how I can resolve this?

Thank you.

juliusvonkohout commented 3 years ago

@sunnythepatel If you had investigated the cache-server issue yourself, you would have found that it is fixed upstream in 1.4 and that there are instructions on how to build a version for 1.3.1: https://github.com/kubeflow/pipelines/pull/5742. I am using a patched 1.5.1 image myself with Kubeflow 1.3.1.

juliusvonkohout commented 3 years ago

> "Also, there are issues with pod admitting for User Namespace due to PSP Finally, create a new namespace for the default user (named kubeflow-user-example-com)."

Why did you deliberately omit "- PSP_SCC_clusterrole" from my instructions? If you do not add the PSP to all user namespaces using the clusterrole it will obviously not work.
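For illustration, granting `use` of the PSP to all service accounts in a user namespace typically amounts to a ClusterRole plus a per-namespace RoleBinding along these lines. The names `kubeflow-psp-user` and `kubeflow-restricted` are assumptions for the sketch; the attached PSP_SCC_clusterrole may differ:

```yaml
# Sketch: a ClusterRole allowing "use" of the PSP, bound in each
# profile/user namespace. Resource names here are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-psp-user
rules:
  - apiGroups: ["policy"]
    resources: ["podsecuritypolicies"]
    resourceNames: ["kubeflow-restricted"]
    verbs: ["use"]
---
# This RoleBinding must exist in every user namespace, e.g. the default
# kubeflow-user-example-com profile namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kubeflow-psp-user
  namespace: kubeflow-user-example-com
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeflow-psp-user
subjects:
  # Grant to all service accounts in the namespace.
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts:kubeflow-user-example-com
```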

sunnythepatel commented 3 years ago

> "Also, there are issues with pod admitting for User Namespace due to PSP Finally, create a new namespace for the default user (named kubeflow-user-example-com)."
>
> Why did you deliberately omit "- PSP_SCC_clusterrole" from my instructions? If you do not add the PSP to all user namespaces using the clusterrole it will obviously not work.

Sorry, I completely missed that. It works now. Thank you!

juliusvonkohout commented 3 years ago

> "Also, there are issues with pod admitting for User Namespace due to PSP Finally, create a new namespace for the default user (named kubeflow-user-example-com)." Why did you deliberately omit "- PSP_SCC_clusterrole" from my instructions? If you do not add the PSP to all user namespaces using the clusterrole it will obviously not work.
>
> Sorry, I completely missed that. It works now. Thank you

Please check everything and confirm whether it works. Then we might be able to persuade the manifests working group to get this upstream.

sunnythepatel commented 3 years ago

> @sunnythepatel If you would have investigated the cache-server issue yourself, you would have found out that it is fixed upstream in 1.4 and there are instruction on how to build a version for 1.3.1. kubeflow/pipelines#5742 ? I am using a patched 1.5.1 image myself with Kubeflow 1.3.1

Hi @juliusvonkohout, thank you for your reply, but I was not able to access the image successfully, and I also can't find the instructions mentioned there.

Thank you

juliusvonkohout commented 3 years ago

> @sunnythepatel If you would have investigated the cache-server issue yourself, you would have found out that it is fixed upstream in 1.4 and there are instruction on how to build a version for 1.3.1. kubeflow/pipelines#5742 ? I am using a patched 1.5.1 image myself with Kubeflow 1.3.1
>
> Hi @juliusvonkohout, thank you for your reply, but I was not able to access the image successfully, and I also can't find the instructions mentioned there.
>
> Thank you

The instruction is the pull request itself. If you are unable to build an OCI image, use mtr.external.otc.telekomcloud.com/ml-pipeline/cache-deployer:1.5.1

sunnythepatel commented 3 years ago

> "Also, there are issues with pod admitting for User Namespace due to PSP Finally, create a new namespace for the default user (named kubeflow-user-example-com)." Why did you deliberately omit "- PSP_SCC_clusterrole" from my instructions? If you do not add the PSP to all user namespaces using the clusterrole it will obviously not work.
>
> Sorry, I completely missed that. It works now. Thank you
>
> Please check everything and confirm whether it works. Then we might be able to persuade the manifests working group to get this upstream.

Yes, I can confirm it works now.

 kubectl get pods -n kubeflow-user-example-com
NAME                                               READY   STATUS    RESTARTS   AGE
ml-pipeline-ui-artifact-767659f9df-lscb9           2/2     Running   0          6m26s
ml-pipeline-visualizationserver-6ff9f47c6b-f62g5   2/2     Running   0          6m26s

I am now trying to fix these few remaining pod issues:

+ kubectl get pods -n kubeflow
NAME                                                        READY   STATUS             RESTARTS   AGE
admission-webhook-deployment-f5d8f47f8-458nx                1/1     Running            0          8h
cache-deployer-deployment-6dbb64ddcd-7tb9p                  1/2     CrashLoopBackOff   71         5h48m
cache-server-f84f6bdcc-x9nlf                                0/2     Init:0/1           0          8h
centraldashboard-5fb844d56d-txz6b                           1/1     Running            0          8h
jupyter-web-app-deployment-bdfb5d69f-wbzbt                  1/1     Running            0          8h
katib-controller-7b98cd6865-v9thk                           1/1     Running            0          8h
katib-db-manager-7689947dc5-kl2fb                           0/1     CrashLoopBackOff   104        8h
katib-mysql-586f79b694-ccvk6                                0/1     CrashLoopBackOff   104        8h
katib-ui-64fbdf4d94-7x59k                                   1/1     Running            0          8h
kfserving-controller-manager-0                              2/2     Running            0          8h
kubeflow-pipelines-profile-controller-6cfd6bf9bd-r5hzn      1/1     Running            0          8h
metacontroller-0                                            1/1     Running            0          8h
metadata-envoy-deployment-95b58bbbb-wsvg2                   1/1     Running            0          8h
metadata-grpc-deployment-7cb87744c7-dwdbr                   2/2     Running            1          8h
metadata-writer-76b6b98985-c9c2p                            2/2     Running            0          8h
minio-5b65df66c9-29p8s                                      2/2     Running            0          8h
ml-pipeline-84858dd97b-mpln6                                2/2     Running            1          8h
ml-pipeline-persistenceagent-6ff46967ff-7rslv               2/2     Running            0          8h
ml-pipeline-scheduledworkflow-66bdf9948d-2vngf              2/2     Running            0          8h
ml-pipeline-ui-867664b965-8kpfl                             2/2     Running            0          8h
ml-pipeline-viewer-crd-64dddf4597-t7xg9                     2/2     Running            1          8h
ml-pipeline-visualizationserver-7f88f8b84b-mfj4m            2/2     Running            0          8h
mpi-operator-d5bfb8489-9p4fz                                0/1     CrashLoopBackOff   102        8h
mysql-f7b9b7dd4-9nfvc                                       2/2     Running            0          8h
notebook-controller-deployment-c88b44b79-qgkpc              1/1     Running            0          8h
profiles-deployment-5c94fd8fbf-d85sd                        2/2     Running            0          8h
tensorboard-controller-controller-manager-d7c68d6df-cb2f5   3/3     Running            1          8h
tensorboards-web-app-deployment-59ff4c7bd8-ssg9v            1/1     Running            0          8h
tf-job-operator-859885c8c4-fb4bm                            1/1     Running            0          8h
volumes-web-app-deployment-6457c9bcfc-gzpjq                 1/1     Running            0          8h
workflow-controller-7b44676dff-mvl6k                        2/2     Running            1          8h
juliusvonkohout commented 3 years ago

> The instruction is the pull request itself. If you are unable to build an OCI image, use mtr.external.otc.telekomcloud.com/ml-pipeline/cache-deployer:1.5.1

For katib-mysql you have to set the fsGroup to the actual user. That is a bug in the MySQL image.

sunnythepatel commented 3 years ago

> The instruction is the pull request itself. If you are unable to build an OCI image, use mtr.external.otc.telekomcloud.com/ml-pipeline/cache-deployer:1.5.1
>
> For katib-mysql you have to set the fsGroup to the actual user. That is a bug in the MySQL image.

Hi @juliusvonkohout, thank you for your reply. I tried your image for now, but I am getting this error in the logs:

+ echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
+ exit 1

I think the issue is related to https://github.com/kubeflow/pipelines/issues/4505, but I am not able to understand the solution. I am using k8s version v1.20.8.

sunnythepatel commented 3 years ago

> The instruction is the pull request itself. If you are unable to build an OCI image, use mtr.external.otc.telekomcloud.com/ml-pipeline/cache-deployer:1.5.1
>
> For katib-mysql you have to set the fsGroup to the actual user. That is a bug in the MySQL image.

Thanks, @juliusvonkohout. For katib-mysql, setting the below securityContext works:

securityContext:
  fsGroup: 999
sunnythepatel commented 3 years ago

Hi, @juliusvonkohout

Thanks to you I was able to fix all the issues except the cache ones. I fixed the mpi-operator issue as well.


+ kubectl get pods -n kubeflow
NAME                                                        READY   STATUS             RESTARTS   AGE
admission-webhook-deployment-f5d8f47f8-458nx                1/1     Running            0          10h
cache-deployer-deployment-6dbb64ddcd-nvcsq                  1/2     CrashLoopBackOff   11         37m
cache-server-f84f6bdcc-jbcgm                                0/2     Init:0/1           0          80m
centraldashboard-5fb844d56d-txz6b                           1/1     Running            0          10h
jupyter-web-app-deployment-bdfb5d69f-wbzbt                  1/1     Running            0          10h
katib-controller-7b98cd6865-v9thk                           1/1     Running            0          10h
katib-db-manager-7689947dc5-kl2fb                           1/1     Running            123        10h
katib-mysql-76cdb996b-8clns                                 1/1     Running            0          27m
katib-ui-64fbdf4d94-7x59k                                   1/1     Running            0          10h
kfserving-controller-manager-0                              2/2     Running            0          10h
kubeflow-pipelines-profile-controller-6cfd6bf9bd-f9rnf      1/1     Running            0          94m
metacontroller-0                                            1/1     Running            0          94m
metadata-envoy-deployment-95b58bbbb-smg84                   1/1     Running            0          94m
metadata-grpc-deployment-7cb87744c7-7dmxd                   2/2     Running            5          94m
metadata-writer-76b6b98985-9hwgs                            2/2     Running            1          94m
minio-5b65df66c9-fbhnd                                      2/2     Running            0          94m
ml-pipeline-84858dd97b-7w6lj                                2/2     Running            4          94m
ml-pipeline-persistenceagent-6ff46967ff-xz2qg               2/2     Running            0          94m
ml-pipeline-scheduledworkflow-66bdf9948d-f9xsp              2/2     Running            0          94m
ml-pipeline-ui-867664b965-8sgx8                             2/2     Running            0          94m
ml-pipeline-viewer-crd-64dddf4597-4xtx8                     2/2     Running            1          94m
ml-pipeline-visualizationserver-7f88f8b84b-h7jnr            2/2     Running            0          94m
mpi-operator-795968c79c-rs5zh                               1/1     Running            0          6m5s
mysql-f7b9b7dd4-z767j                                       2/2     Running            0          30m
notebook-controller-deployment-c88b44b79-qgkpc              1/1     Running            0          10h
profiles-deployment-5c94fd8fbf-d85sd                        2/2     Running            0          10h
tensorboard-controller-controller-manager-d7c68d6df-cb2f5   3/3     Running            1          10h
tensorboards-web-app-deployment-59ff4c7bd8-ssg9v            1/1     Running            0          10h
tf-job-operator-859885c8c4-fb4bm                            1/1     Running            0          10h
volumes-web-app-deployment-6457c9bcfc-gzpjq                 1/1     Running            0          10h
workflow-controller-7b44676dff-87jpl                        2/2     Running            1          94m

I just need to fix the cache-deployer-deployment and cache-server issues:

kubectl describe pods cache-server-5bdf4f4457-bgwt7 -n kubeflow
Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    2m19s                default-scheduler  Successfully assigned kubeflow/cache-server-5bdf4f4457-bgwt7 to k8-prod-dev
  Warning  FailedMount  16s                  kubelet            Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[webhook-tls-certs istiod-ca-cert istio-data istio-envoy istio-token istio-podinfo kubeflow-pipelines-cache-token-xb6nw]: timed out waiting for the condition
  Warning  FailedMount  11s (x9 over 2m19s)  kubelet            MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found
kubectl logs cache-deployer-deployment-79fdf9c5c9-z5lwc  -n kubeflow
echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
+ exit 1

I think the issue is related to kubeflow/pipelines#4505, but I am not able to understand the solution. I am using k8s version v1.20.8.

juliusvonkohout commented 3 years ago

Alright, caching v1 is broken by design in my opinion. Just disable it. It works on my Kubernetes 1.20 but has other limitations. Bobgy has already proposed caching v2.

juliusvonkohout commented 3 years ago

Since another user was able to run everything without root privileges, should I proceed with creating a pull request? I could:

  1. Add istio-cni 1.9.8 for kubernetes and openshift so we have 3 istio options.
  2. Add SCCs and PSPs and clusterroles.
  3. Add a patch for katib-mysql.
  4. Copy and modify https://github.com/kubeflow/manifests/blob/master/example/kustomization.yaml to /secure-example/kustomization.yaml. I could also do it in place.

So we could integrate it into the testing pipelines and evaluate it for some time while the old insecure example is still available.

What do you think? @bobgy @yanniszark @davidspek Or is there someone else I should mention here?

Maybe @elikatsis @kimwnasptd

What do you think @manifests-wg

kimwnasptd commented 2 years ago

Thank you for your time on this effort @juliusvonkohout!

Some initial questions I have:

  1. What security standards does your proposal include? I would expect for Pods to run as non-root, but did you have other policies in mind?
  2. Is this an effort to introduce PodSecurityPolicies that affect Pods in the kubeflow namespace, or Pods in the user profiles/namespaces as well?
  3. Is there a hard dependency on OpenShift for this effort? I think it's a sub-part of this work with SecurityContextConstraints, but I'd like to confirm.

Then there's also the discussion around the deprecation of PodSecurityPolicies with PodSecurity admission, but let's go into this later on since it affects the versions of K8s supported by Kubeflow.

juliusvonkohout commented 2 years ago

> 1. What security standards does your proposal include? I would expect for Pods to run as non-root, but did you have other policies in mind?

Run as non-root and block all capabilities, as described in https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted. This is achievable with istio-cni, which does not need NET_ADMIN and NET_RAW: https://istio.io/latest/docs/ops/deployment/requirements/#pod-requirements. Istio-cni has an init-container limitation that you can work around with a simple pod annotation: https://discuss.istio.io/t/istio-cni-drops-initcontainers-outgoing-traffic/2311. I tested that with KFServing and Seldon (annotations: traffic.sidecar.istio.io/excludeOutboundIPRanges: "0.0.0.0/0"). We might be able to set this on a namespace level.
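As a sketch, the workaround annotation sits on the pod template of the affected workload. The deployment name and image below are made up for illustration; only the annotation itself comes from the discussion above:

```yaml
# Sketch: applying the istio-cni init-container workaround annotation.
# Deployment name and image are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-inference-server
spec:
  selector:
    matchLabels:
      app: example-inference-server
  template:
    metadata:
      labels:
        app: example-inference-server
      annotations:
        # Exclude all outbound IPs from istio-cni's traffic redirection,
        # so init containers can reach the network before the sidecar starts.
        traffic.sidecar.istio.io/excludeOutboundIPRanges: "0.0.0.0/0"
    spec:
      containers:
        - name: server
          image: example/image:latest
```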

In the long term I would even consider enforcing readOnlyRootFilesystem and using an emptyDir or PVC for files like https://github.com/kubeflow/pipelines/blob/ef6e01c90c2c88606a0ad56d848ecc98609410c3/backend/src/cache/deployer/deploy-cache-service.sh#L39. But this is not essential at the moment and, as far as I know, not even enforced by the restricted profile.

> 2. Is this an effort to introduce `PodSecurityPolicies` that affect Pods in the `kubeflow` namespace, or Pods in the user profiles/namespaces as well?

ALL namespaces, including profile namespaces, kubeflow, auth, istio-system, knative-serving, knative-eventing etc. We can start with the non-profile namespaces and handle profile namespaces later on. At the moment I apply them to profile namespaces too, via the kubeflow-pipelines-profile-controller. Soon it will be merged into the profiles controller, as discussed in https://github.com/kubeflow/pipelines/pull/7219#issuecomment-1024086393 and https://github.com/kubeflow/pipelines/pull/6629#issuecomment-930642835

> 3. Is there a hard dependency on OpenShift for this effort? I think it's a sub-part of this work with `SecurityContextConstraints`, but I'd like to confirm.

OpenShift needs SecurityContextConstraints. They have a slightly different syntax and are more annoying and ugly than PodSecurityPolicies or Pod Security Standards. We can support both at the same time.
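For comparison, an SCC expressing roughly the same restrictions might be sketched as follows. Note that SCC fields sit at the top level rather than under `spec`; the name is again illustrative, not the exact attached SCC:

```yaml
# Sketch: an OpenShift SecurityContextConstraints roughly equivalent to
# a restricted PSP. Illustrative only; the attached SCC may differ.
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: kubeflow-restricted
allowPrivilegedContainer: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
  - ALL
runAsUser:
  type: MustRunAsNonRoot
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
volumes:
  - configMap
  - emptyDir
  - projected
  - secret
  - downwardAPI
  - persistentVolumeClaim
# Grant via RBAC ("use" verb) or list users/groups here explicitly.
users: []
groups: []
```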

> Then there's also the discussion around the deprecation of PodSecurityPolicies with PodSecurity admission, but let's go into this later on since it affects the versions of K8s supported by Kubeflow.

This actually does not matter much. We use a PodSecurityPolicy that is equivalent to the Pod Security Standards restricted profile: https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted. If PodSecurityPolicies are removed, we just have to flip a switch.
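Flipping that switch would amount to labeling each namespace for the built-in Pod Security admission controller instead of binding a PSP, along these lines (namespace name taken from the default user profile above):

```yaml
# Sketch: enforcing the same restricted profile via Pod Security
# admission labels once PSPs are gone.
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow-user-example-com
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
```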

I would like to get this into the official build and testing environments too such that security issues get detected in the CICD pipelines for merge requests.

juliusvonkohout commented 1 year ago

@kimwnasptd I will work on it with Cloudflare in https://github.com/kubeflow/manifests/pull/2455

juliusvonkohout commented 1 year ago

/reopen

google-oss-prow[bot] commented 1 year ago

@juliusvonkohout: Reopened this issue.

In response to [this](https://github.com/kubeflow/manifests/issues/2014#issuecomment-1710335599):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
juliusvonkohout commented 1 year ago

Closed in favor of https://github.com/kubeflow/manifests/issues/2528