Closed: juliusvonkohout closed this issue 1 year ago.
Hi @juliusvonkohout, can I use your solution to deploy all pods if PSP is enabled?
Thank you
There is not "one" PSP. Please read the whole Kubernetes documentation on PSPs first; you need to understand Kubernetes before altering Kubeflow. If your company is interested in a managed Kubeflow, contact me (T-Systems) or Arrikto for a managed offer.
Hi @juliusvonkohout, thank you. I was able to resolve my PSP issue by adding the items below (a rough sketch of the kind of PSP involved follows the list).
- PSP
- istio/kubernetes
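For reference, a minimal sketch of what such a restricted PSP could look like, in the spirit of the Pod Security Standards "restricted" profile. The name and exact field values are illustrative assumptions, not the actual policy from the instructions:

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: kubeflow-restricted          # hypothetical name
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities: ["ALL"]  # block all capabilities
  runAsUser:
    rule: MustRunAsNonRoot           # no root containers
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: MustRunAs
    ranges: [{min: 1, max: 65535}]
  fsGroup:
    rule: MustRunAs
    ranges: [{min: 1, max: 65535}]
  volumes:                           # no hostPath
    - configMap
    - emptyDir
    - projected
    - secret
    - downwardAPI
    - persistentVolumeClaim
```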
I still have a few issues, though; I think I am having trouble getting these pods installed:
manifests-1.3.1 kubectl logs cache-deployer-deployment-6dbb64ddcd-dwplm -n kubeflow
+ kubectl logs cache-deployer-deployment-6dbb64ddcd-dwplm -n kubeflow
+ echo 'Start deploying cache service to existing cluster:'
+ NAMESPACE=kubeflow
Start deploying cache service to existing cluster:
+ MUTATING_WEBHOOK_CONFIGURATION_NAME=cache-webhook-kubeflow
+ WEBHOOK_SECRET_NAME=webhook-server-tls
+ mkdir -p /home/cloudsdk/bin
+ export 'PATH=/home/cloudsdk/bin:/google-cloud-sdk/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'
+ kubectl version --output json
+ jq --raw-output '(.serverVersion.major + "." + .serverVersion.minor)'
+ tr -d '"+'
+ server_version_major_minor=1.20
+ curl -s https://storage.googleapis.com/kubernetes-release/release/stable-1.20.txt
+ stable_build_version=v1.20.10
+ kubectl_url=https://storage.googleapis.com/kubernetes-release/release/v1.20.10/bin/linux/amd64/kubectl
+ curl -L -o /home/cloudsdk/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/v1.20.10/bin/linux/amd64/kubectl
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 38.3M 100 38.3M 0 0 14.1M 0 0:00:02 0:00:02 --:--:-- 14.1M
+ chmod +x /home/cloudsdk/bin/kubectl
/kfp/cache/deployer/deploy-cache-service.sh: line 47: can't create webhooks.txt: Permission denied
manifests-1.3.1 kubectl logs cache-server-f84f6bdcc-x9nlf -n kubeflow
+ kubectl logs cache-server-f84f6bdcc-x9nlf -n kubeflow
Error from server (BadRequest): container "server" in pod "cache-server-f84f6bdcc-x9nlf" is waiting to start: PodInitializing
Describe output for cache-server:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 60m default-scheduler Successfully assigned kubeflow/cache-server-f84f6bdcc-x9nlf to k8-prod-dev-m2-k8s-node-nf-1
Warning FailedMount 58m kubelet Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[kubeflow-pipelines-cache-token-8jgpg webhook-tls-certs istiod-ca-cert istio-data istio-envoy istio-token istio-podinfo]: timed out waiting for the condition
Warning FailedMount 56m kubelet Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-token istio-podinfo kubeflow-pipelines-cache-token-8jgpg webhook-tls-certs istiod-ca-cert istio-data istio-envoy]: timed out waiting for the condition
Warning FailedMount 49m kubelet Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-podinfo kubeflow-pipelines-cache-token-8jgpg webhook-tls-certs istiod-ca-cert istio-data istio-envoy istio-token]: timed out waiting for the condition
Warning FailedMount 45m (x2 over 54m) kubelet Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[webhook-tls-certs istiod-ca-cert istio-data istio-envoy istio-token istio-podinfo kubeflow-pipelines-cache-token-8jgpg]: timed out waiting for the condition
Warning FailedMount 24m (x5 over 47m) kubelet Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istiod-ca-cert istio-data istio-envoy istio-token istio-podinfo kubeflow-pipelines-cache-token-8jgpg webhook-tls-certs]: timed out waiting for the condition
Warning FailedMount 20m (x4 over 51m) kubelet Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-envoy istio-token istio-podinfo kubeflow-pipelines-cache-token-8jgpg webhook-tls-certs istiod-ca-cert istio-data]: timed out waiting for the condition
Warning FailedMount 15m (x2 over 33m) kubelet Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-data istio-envoy istio-token istio-podinfo kubeflow-pipelines-cache-token-8jgpg webhook-tls-certs istiod-ca-cert]: timed out waiting for the condition
Warning FailedMount 5m46s (x35 over 60m) kubelet MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found
Also, there are issues with pod admission for the user namespace due to PSP when following the step "Finally, create a new namespace for the default user (named kubeflow-user-example-com)":
kustomize build common/user-namespace/base | kubectl apply -f -
Can you please give me some pointers on how I can resolve this?
Thank you.
@sunnythepatel If you had investigated the cache-server issue yourself, you would have found that it is fixed upstream in 1.4 and that there are instructions on how to build a version for 1.3.1: https://github.com/kubeflow/pipelines/pull/5742. I am using a patched 1.5.1 image myself with Kubeflow 1.3.1.
"Also, there are issues with pod admitting for User Namespace due to PSP Finally, create a new namespace for the default user (named kubeflow-user-example-com)."
Why did you deliberately omit "- PSP_SCC_clusterrole" from my instructions? If you do not add the PSP to all user namespaces using the clusterrole it will obviously not work.
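For reference, a rough sketch of what such a clusterrole plus per-namespace binding could look like. The resource names are hypothetical; the real PSP_SCC_clusterrole may differ:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-kubeflow-restricted            # hypothetical name
rules:
  - apiGroups: ["policy"]
    resources: ["podsecuritypolicies"]
    resourceNames: ["kubeflow-restricted"]  # the PSP sketched above
    verbs: ["use"]
---
# Bind it to every service account in a profile namespace; a binding like
# this has to exist in each user namespace for admission to succeed.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-kubeflow-restricted
  namespace: kubeflow-user-example-com
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-kubeflow-restricted
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts:kubeflow-user-example-com
```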
"Also, there are issues with pod admitting for User Namespace due to PSP Finally, create a new namespace for the default user (named kubeflow-user-example-com)."
Why did you deliberately omit "- PSP_SCC_clusterrole" from my instructions? If you do not add the PSP to all user namespaces using the clusterrole it will obviously not work.
Sorry, I completely missed that. It works now, thank you.
"Also, there are issues with pod admitting for User Namespace due to PSP Finally, create a new namespace for the default user (named kubeflow-user-example-com)." Why did you deliberately omit "- PSP_SCC_clusterrole" from my instructions? If you do not add the PSP to all user namespaces using the clusterrole it will obviously not work.
Sorry, I completely missed that it works now. Thank you
Please check everything and confirm whether it works. Then we might be able to persuade the manifest working group to get this upstream.
@sunnythepatel If you had investigated the cache-server issue yourself, you would have found that it is fixed upstream in 1.4 and that there are instructions on how to build a version for 1.3.1: https://github.com/kubeflow/pipelines/pull/5742. I am using a patched 1.5.1 image myself with Kubeflow 1.3.1.
Hi @juliusvonkohout, thank you for your reply, but I was not able to access the image successfully, and I also can't find the instructions mentioned there.
Thank you
The instruction is the pull request itself. If you are incapable of building an OCI image, use mtr.external.otc.telekomcloud.com/ml-pipeline/cache-deployer:1.5.1
"Also, there are issues with pod admitting for User Namespace due to PSP Finally, create a new namespace for the default user (named kubeflow-user-example-com)." Why did you deliberately omit "- PSP_SCC_clusterrole" from my instructions? If you do not add the PSP to all user namespaces using the clusterrole it will obviously not work.
Sorry, I completely missed that it works now. Thank you
Please check everything and confirm whether it works. Then we might be able to persuade the manifest working group to get this upstream.
Yes, I can confirm it works now.
kubectl get pods -n kubeflow-user-example-com
NAME READY STATUS RESTARTS AGE
ml-pipeline-ui-artifact-767659f9df-lscb9 2/2 Running 0 6m26s
ml-pipeline-visualizationserver-6ff9f47c6b-f62g5 2/2 Running 0 6m26s
I am now trying to fix these few pod issues:
+ kubectl get pods -n kubeflow
NAME READY STATUS RESTARTS AGE
admission-webhook-deployment-f5d8f47f8-458nx 1/1 Running 0 8h
cache-deployer-deployment-6dbb64ddcd-7tb9p 1/2 CrashLoopBackOff 71 5h48m
cache-server-f84f6bdcc-x9nlf 0/2 Init:0/1 0 8h
centraldashboard-5fb844d56d-txz6b 1/1 Running 0 8h
jupyter-web-app-deployment-bdfb5d69f-wbzbt 1/1 Running 0 8h
katib-controller-7b98cd6865-v9thk 1/1 Running 0 8h
katib-db-manager-7689947dc5-kl2fb 0/1 CrashLoopBackOff 104 8h
katib-mysql-586f79b694-ccvk6 0/1 CrashLoopBackOff 104 8h
katib-ui-64fbdf4d94-7x59k 1/1 Running 0 8h
kfserving-controller-manager-0 2/2 Running 0 8h
kubeflow-pipelines-profile-controller-6cfd6bf9bd-r5hzn 1/1 Running 0 8h
metacontroller-0 1/1 Running 0 8h
metadata-envoy-deployment-95b58bbbb-wsvg2 1/1 Running 0 8h
metadata-grpc-deployment-7cb87744c7-dwdbr 2/2 Running 1 8h
metadata-writer-76b6b98985-c9c2p 2/2 Running 0 8h
minio-5b65df66c9-29p8s 2/2 Running 0 8h
ml-pipeline-84858dd97b-mpln6 2/2 Running 1 8h
ml-pipeline-persistenceagent-6ff46967ff-7rslv 2/2 Running 0 8h
ml-pipeline-scheduledworkflow-66bdf9948d-2vngf 2/2 Running 0 8h
ml-pipeline-ui-867664b965-8kpfl 2/2 Running 0 8h
ml-pipeline-viewer-crd-64dddf4597-t7xg9 2/2 Running 1 8h
ml-pipeline-visualizationserver-7f88f8b84b-mfj4m 2/2 Running 0 8h
mpi-operator-d5bfb8489-9p4fz 0/1 CrashLoopBackOff 102 8h
mysql-f7b9b7dd4-9nfvc 2/2 Running 0 8h
notebook-controller-deployment-c88b44b79-qgkpc 1/1 Running 0 8h
profiles-deployment-5c94fd8fbf-d85sd 2/2 Running 0 8h
tensorboard-controller-controller-manager-d7c68d6df-cb2f5 3/3 Running 1 8h
tensorboards-web-app-deployment-59ff4c7bd8-ssg9v 1/1 Running 0 8h
tf-job-operator-859885c8c4-fb4bm 1/1 Running 0 8h
volumes-web-app-deployment-6457c9bcfc-gzpjq 1/1 Running 0 8h
workflow-controller-7b44676dff-mvl6k 2/2 Running 1 8h
The instruction is the pull request itself. If you are incapable of building an OCI image use mtr.external.otc.telekomcloud.com/ml-pipeline/cache-deployer:1.5.1
For katib-mysql you have to set the fsGroup to the actual user; that is a bug in the mysql image.
Hi @juliusvonkohout, thank you for your reply. I tried your image for now, but I am getting this error in the logs:
+ echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
+ exit 1
I think the issue is related to https://github.com/kubeflow/pipelines/issues/4505, but I am not able to understand the solution. I am using k8s version v1.20.8.
Thanks @juliusvonkohout, for katib-mysql setting the below in the securityContext works:
securityContext:
  fsGroup: 999
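Spelled out as a full kustomize strategic-merge patch, the fix might look like this. A sketch: 999 is assumed to be the mysql group in this particular image, so verify it before reusing:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: katib-mysql
  namespace: kubeflow
spec:
  template:
    spec:
      securityContext:
        fsGroup: 999   # group of the mysql user inside the image
```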
Hi @juliusvonkohout,
Thanks to you I was able to fix all the issues except two; I fixed the mpi-operator issue as well.
+ kubectl get pods -n kubeflow
NAME READY STATUS RESTARTS AGE
admission-webhook-deployment-f5d8f47f8-458nx 1/1 Running 0 10h
cache-deployer-deployment-6dbb64ddcd-nvcsq 1/2 CrashLoopBackOff 11 37m
cache-server-f84f6bdcc-jbcgm 0/2 Init:0/1 0 80m
centraldashboard-5fb844d56d-txz6b 1/1 Running 0 10h
jupyter-web-app-deployment-bdfb5d69f-wbzbt 1/1 Running 0 10h
katib-controller-7b98cd6865-v9thk 1/1 Running 0 10h
katib-db-manager-7689947dc5-kl2fb 1/1 Running 123 10h
katib-mysql-76cdb996b-8clns 1/1 Running 0 27m
katib-ui-64fbdf4d94-7x59k 1/1 Running 0 10h
kfserving-controller-manager-0 2/2 Running 0 10h
kubeflow-pipelines-profile-controller-6cfd6bf9bd-f9rnf 1/1 Running 0 94m
metacontroller-0 1/1 Running 0 94m
metadata-envoy-deployment-95b58bbbb-smg84 1/1 Running 0 94m
metadata-grpc-deployment-7cb87744c7-7dmxd 2/2 Running 5 94m
metadata-writer-76b6b98985-9hwgs 2/2 Running 1 94m
minio-5b65df66c9-fbhnd 2/2 Running 0 94m
ml-pipeline-84858dd97b-7w6lj 2/2 Running 4 94m
ml-pipeline-persistenceagent-6ff46967ff-xz2qg 2/2 Running 0 94m
ml-pipeline-scheduledworkflow-66bdf9948d-f9xsp 2/2 Running 0 94m
ml-pipeline-ui-867664b965-8sgx8 2/2 Running 0 94m
ml-pipeline-viewer-crd-64dddf4597-4xtx8 2/2 Running 1 94m
ml-pipeline-visualizationserver-7f88f8b84b-h7jnr 2/2 Running 0 94m
mpi-operator-795968c79c-rs5zh 1/1 Running 0 6m5s
mysql-f7b9b7dd4-z767j 2/2 Running 0 30m
notebook-controller-deployment-c88b44b79-qgkpc 1/1 Running 0 10h
profiles-deployment-5c94fd8fbf-d85sd 2/2 Running 0 10h
tensorboard-controller-controller-manager-d7c68d6df-cb2f5 3/3 Running 1 10h
tensorboards-web-app-deployment-59ff4c7bd8-ssg9v 1/1 Running 0 10h
tf-job-operator-859885c8c4-fb4bm 1/1 Running 0 10h
volumes-web-app-deployment-6457c9bcfc-gzpjq 1/1 Running 0 10h
workflow-controller-7b44676dff-87jpl 2/2 Running 1 94m
I just need to fix the cache-deployer-deployment and cache-server issues:
kubectl describe pods cache-server-5bdf4f4457-bgwt7 -n kubeflow
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m19s default-scheduler Successfully assigned kubeflow/cache-server-5bdf4f4457-bgwt7 to k8-prod-dev
Warning FailedMount 16s kubelet Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[webhook-tls-certs istiod-ca-cert istio-data istio-envoy istio-token istio-podinfo kubeflow-pipelines-cache-token-xb6nw]: timed out waiting for the condition
Warning FailedMount 11s (x9 over 2m19s) kubelet MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found
manifests-1.3.1 kubectl logs cache-deployer-deployment-79fdf9c5c9-z5lwc -n kubeflow
+ echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
+ exit 1
I think the issue is related to kubeflow/pipelines#4505, but I am not able to understand the solution. I am using k8s version v1.20.8.
Alright, caching v1 is broken by design in my opinion. Just disable it (for example by deleting the cache-webhook-kubeflow MutatingWebhookConfiguration that the deployer creates). It works on my Kubernetes 1.20 but has other limitations. Bobgy already proposed caching v2.
Since another user was able to run without root rights, should I proceed by creating a pull request? Then we could integrate it into the testing pipelines and evaluate it for some time while the old insecure example is still available.
What do you think, @bobgy @yanniszark @davidspek? Or is there someone else I should mention here?
Maybe @elikatsis @kimwnasptd
What do you think, @manifests-wg?
Thank you for your time on this effort, @juliusvonkohout!
Some initial questions I have:
1. What security standards does your proposal include? I would expect Pods to run as non-root, but did you have other policies in mind?
2. Is this an effort to introduce `PodSecurityPolicies` that affect Pods in the `kubeflow` namespace, or Pods in the user profiles/namespaces as well?
3. Is there a hard dependency in OpenShift for this effort? I think it's a sub-part of this work with `SecurityContextConstraints`, but I'd like to confirm.

Then there's also the discussion around the deprecation of `PodSecurityPolicies` with `PodSecurity` admission, but let's go into this later on since it affects the versions of K8s supported by Kubeflow.
1. What security standards does your proposal include? I would expect Pods to run as non-root, but did you have other policies in mind?
Run as non-root and block all capabilities, as described here: https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted. This is achievable with istio-cni, which does not need NET_ADMIN and NET_RAW: https://istio.io/latest/docs/ops/deployment/requirements/#pod-requirements. istio-cni has an init container limitation that you can work around with a simple pod annotation: https://discuss.istio.io/t/istio-cni-drops-initcontainers-outgoing-traffic/2311. I tested that with KFServing and Seldon (annotations: traffic.sidecar.istio.io/excludeOutboundIPRanges: "0.0.0.0/0"). We might be able to set this on the namespace level.
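A sketch of that workaround annotation on a single pod, using the value mentioned above (the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example                       # placeholder
  annotations:
    # Exclude all outbound IPs from redirection so init containers can
    # reach the network before the istio-cni-managed proxy is running.
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "0.0.0.0/0"
spec:
  containers:
    - name: main
      image: busybox                  # placeholder
```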
In the long term I would even consider enforcing readOnlyRootFilesystem and using an emptyDir or PVC for things like https://github.com/kubeflow/pipelines/blob/ef6e01c90c2c88606a0ad56d848ecc98609410c3/backend/src/cache/deployer/deploy-cache-service.sh#L39. But this is not essential at the moment and, as far as I know, not even enforced by the restricted profile.
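Sketched out, that long-term idea would look roughly like the partial pod spec below; the mount path and volume name are assumptions, not what the deployer actually uses:

```yaml
# Partial pod spec: read-only root filesystem plus an emptyDir scratch
# volume for the files the deployer script writes (e.g. webhooks.txt).
spec:
  containers:
    - name: main
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: scratch
          mountPath: /tmp             # hypothetical writable location
  volumes:
    - name: scratch
      emptyDir: {}
```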
2. Is this an effort to introduce `PodSecurityPolicies` that affect Pods in the `kubeflow` namespace, or Pods in the user profiles/namespaces as well?
ALL namespaces, including profile namespaces, kubeflow, auth, istio-system, knative-serving, knative-eventing, etc. We can start with the non-profile namespaces and handle the profile namespaces later on. At the moment I apply them to profile namespaces too, via the kubeflow-pipelines-profile-controller. Soon this will be merged into the profiles controller, as discussed in https://github.com/kubeflow/pipelines/pull/7219#issuecomment-1024086393 and https://github.com/kubeflow/pipelines/pull/6629#issuecomment-930642835.
3. Is there a hard dependency in OpenShift for this effort? I think it's a sub-part of this work with `SecurityContextConstraints`, but I'd like to confirm
OpenShift needs SecurityContextConstraints. They have a slightly different syntax and are more annoying and ugly than PodSecurityPolicies or the Pod Security Standards. We can support both at the same time.
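For comparison, a rough sketch of a SecurityContextConstraints resource approximating the restricted PSP above; the name and field values are illustrative:

```yaml
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: kubeflow-restricted           # hypothetical name
allowPrivilegedContainer: false
allowPrivilegeEscalation: false
requiredDropCapabilities: ["ALL"]
runAsUser:
  type: MustRunAsNonRoot
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
volumes:
  - configMap
  - emptyDir
  - projected
  - secret
  - downwardAPI
  - persistentVolumeClaim
```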
Then there's also the discussion around the deprecation of `PodSecurityPolicies` with `PodSecurity` admission, but let's go into this later on since it affects the versions of K8s supported by Kubeflow.
This actually does not matter much. We use a PodSecurityPolicy that is equivalent to the Pod Security Standards restricted profile: https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted. If PodSecurityPolicies are deprecated, we just have to flip a switch.
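Flipping that switch would essentially mean labeling namespaces instead of binding a PSP; a sketch of the Pod Security admission equivalent:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow-user-example-com
  labels:
    # Pod Security admission counterpart of the restricted PSP
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
```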
I would like to get this into the official build and testing environments too, such that security issues get detected in the CI/CD pipelines for merge requests.
@kimwnasptd I will work on it with cloudflare in https://github.com/kubeflow/manifests/pull/2455
/reopen
@juliusvonkohout: Reopened this issue.
Closed in favor of https://github.com/kubeflow/manifests/issues/2528
Related to https://github.com/kubeflow/manifests/pull/1756 @yanniszark @DavidSpek and https://github.com/kubeflow/manifests/issues/1984 @sunnythepatel
Currently there are no PodSecurityPolicies or SecurityContextConstraints to enforce security within Kubeflow. I would like to change that and put the necessary energy into pull requests. I have been using the following on my cluster for months to run everything as non-root, including a rootless istio-cni. It also works for pipelines with the k8sapi or the new emissary executor: https://github.com/kubeflow/pipelines/issues/5718 @Bobgy
I need your feedback on the following solution. If you are satisfied, I will create a pull request.
In the main kustomization YAML: kustomize_istio.zip, kustomize_addons_psp_scc.zip
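To illustrate the wiring, the top-level kustomization entry could look roughly like this; the path is an assumption based on the archive name, not its actual layout:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # ... existing Kubeflow components ...
  - common/psp-scc/base   # hypothetical path from kustomize_addons_psp_scc.zip
```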