kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

Cache deployer fails if the cluster signer is not set #4505

Closed davidspek closed 5 months ago

davidspek commented 3 years ago

What steps did you take:

When deploying Kubeflow using kfctl_istio_dex.v1.1.0.yaml on a Charmed Kubernetes 1.19 cluster, the cache-server and cache-deployer-deployment pods get stuck in PodInitializing and CrashLoopBackOff respectively. The cache-server pod shows the error MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found. Redeploying either or both of the pods does not fix the issue. The cache-deployer-deployment pod gives the following logs:

+ echo 'Start deploying cache service to existing cluster:'
+ NAMESPACE=kubeflow
+ MUTATING_WEBHOOK_CONFIGURATION_NAME=cache-webhook-kubeflow
+ WEBHOOK_SECRET_NAME=webhook-server-tls
Start deploying cache service to existing cluster:
+ kubectl get mutatingwebhookconfigurations cache-webhook-kubeflow --namespace kubeflow --ignore-not-found
+ kubectl get secrets webhook-server-tls --namespace kubeflow --ignore-not-found
+ webhook_config_exists=false
+ grep cache-webhook-kubeflow -w
+ webhook_secret_exists=false
+ grep webhook-server-tls -w
+ '[' false '==' true ]
+ '[' false '==' true ]
+ '[' false '==' true ]
+ export 'CA_FILE=ca_cert'
+ rm -f ca_cert
+ touch ca_cert
+ ./webhook-create-signed-cert.sh --namespace kubeflow --cert_output_path ca_cert --secret webhook-server-tls
+ [[ 6 -gt 0 ]]
+ case ${1} in
+ namespace=kubeflow
+ shift
+ shift
+ [[ 4 -gt 0 ]]
+ case ${1} in
+ cert_output_path=ca_cert
+ shift
+ shift
+ [[ 2 -gt 0 ]]
+ case ${1} in
+ secret=webhook-server-tls
+ shift
+ shift
+ [[ 0 -gt 0 ]]
+ '[' -z ']'
+ service=cache-server
+ '[' -z webhook-server-tls ']'
+ '[' -z kubeflow ']'
+ '[' -z ca_cert ']'
++ command -v openssl
+ '[' '!' -x /usr/bin/openssl ']'
+ csrName=cache-server.kubeflow
++ mktemp -d
+ tmpdir=/tmp/tmp.KGlEMA
+ echo 'creating certs in tmpdir /tmp/tmp.KGlEMA '
creating certs in tmpdir /tmp/tmp.KGlEMA 
+ cat
+ openssl genrsa -out /tmp/tmp.KGlEMA/server-key.pem 2048
Generating RSA private key, 2048 bit long modulus (2 primes)
.......................................................................................+++++
...................................................................+++++
e is 65537 (0x010001)
+ openssl req -new -key /tmp/tmp.KGlEMA/server-key.pem -subj /CN=cache-server.kubeflow.svc -out /tmp/tmp.KGlEMA/server.csr -config /tmp/tmp.KGlEMA/csr.conf
+ echo 'start running kubectl...'
start running kubectl...
+ kubectl delete csr cache-server.kubeflow
certificatesigningrequest.certificates.k8s.io "cache-server.kubeflow" deleted
+ cat
+ kubectl create -f -
++ cat /tmp/tmp.KGlEMA/server.csr
++ base64
++ tr -d '\n'
certificatesigningrequest.certificates.k8s.io/cache-server.kubeflow created
+ true
+ kubectl get csr cache-server.kubeflow
NAME                    AGE   SIGNERNAME                     REQUESTOR                                                             CONDITION
cache-server.kubeflow   0s    kubernetes.io/legacy-unknown   system:serviceaccount:kubeflow:kubeflow-pipelines-cache-deployer-sa   Pending
+ '[' 0 -eq 0 ']'
+ break
+ kubectl certificate approve cache-server.kubeflow
No resources found
error: no kind "CertificateSigningRequest" is registered for version "certificates.k8s.io/v1" in scheme "k8s.io/kubernetes/pkg/kubectl/scheme/scheme.go:28"

The cache-server.kubeflow csr is stuck in a Pending condition. However, manually running kubectl certificate approve cache-server.kubeflow does work.
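
To tell a CSR that is still Pending apart from one that was approved but never issued, the CONDITION column of `kubectl get csr` can be read programmatically. The following is a minimal sketch, assuming the default table layout shown in the log above; `csr_condition` is a hypothetical helper name, not part of the deployer script.

```shell
#!/usr/bin/env bash
# Print the CONDITION column for one CSR from `kubectl get csr` table output.
# Assumes the default table layout, where CONDITION is the last
# whitespace-separated column and NAME is the first.
csr_condition() {
  local table="$1" name="$2"
  awk -v name="$name" '$1 == name { print $NF }' <<<"$table"
}
```

It could be called as `csr_condition "$(kubectl get csr)" cache-server.kubeflow`. An empty result means no row with that name was found.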

The following pull requests seem to be related: https://github.com/openshift/oc/pull/501 https://github.com/openshift/installer/pull/3943

Environment:

Charmed Kubernetes 1.19 running on Ubuntu 20.04.1.

How did you deploy Kubeflow Pipelines (KFP)? full Kubeflow deployment

/kind bug /area backend

davidspek commented 3 years ago

/area backend

Ark-kun commented 3 years ago

I wonder what would be the best way to deal with this issue. The request we send is v1beta1, not v1. This looks like a bug in some version of kubectl.

davidspek commented 3 years ago

@Ark-kun, could this have to do with the cluster being Kubernetes 1.19 and its changes with regard to beta APIs?

Ark-kun commented 3 years ago

Maybe the problem is related to the version mismatch between kubectl version in the container and the Kubernetes server version. Kubectl is v1.16 in the deployer container:

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.9", GitCommit:"a17149e1a189050796ced469dbd78d380f2ed5ef", GitTreeState:"clean", BuildDate:"2020-04-16T11:44:51Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
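
The suspected mismatch can be quantified from the two GitVersion strings. Below is a rough sketch, assuming both versions share the same major number and follow the `vMAJOR.MINOR.PATCH` shape seen above; `minor_skew` is a hypothetical helper name. The Kubernetes version skew policy supports kubectl within one minor version of the API server, so a result above 1 would be outside that range.

```shell
#!/usr/bin/env bash
# Print the absolute minor-version difference between two Kubernetes
# GitVersion strings such as "v1.16.9" and "v1.19.4".
# Assumption: both strings have the same major version.
minor_skew() {
  local client="${1#v}" server="${2#v}"
  local cminor sminor diff
  cminor="${client#*.}"; cminor="${cminor%%.*}"   # "1.16.9" -> "16"
  sminor="${server#*.}"; sminor="${sminor%%.*}"   # "1.19.4" -> "19"
  diff=$(( cminor - sminor ))
  echo "${diff#-}"                                # strip a leading minus sign
}
```

For the versions reported in this thread, `minor_skew v1.16.9 v1.19.4` gives a skew of three minor versions.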

davidspek commented 3 years ago

@Ark-kun It doesn't seem like this issue has been resolved. I just deployed Kubeflow 1.2 on Kubernetes 1.19.4 and the cache-server and cache-deployer-deployment are still stuck with errors.

I have spotted 2 Certificate Signing Requests, both identical with one in namespace istio-system and the other in kubeflow. I remember there was an issue that the CSR was not being approved, which it now is but I don't think it is getting issued.

kubectl describe csr cache-server.kubeflow -n istio-system

Name:               cache-server.kubeflow
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Fri, 20 Nov 2020 22:54:30 +0100
Requesting User:    system:serviceaccount:kubeflow:kubeflow-pipelines-cache-deployer-sa
Signer:             kubernetes.io/legacy-unknown
Status:             Approved
Subject:
  Common Name:    cache-server.kubeflow.svc
  Serial Number:  
Subject Alternative Names:
         DNS Names:  cache-server
                     cache-server.kubeflow
                     cache-server.kubeflow.svc
Events:  <none>

Cache-deployer-deployment logs:


+ shift
+ [[ 2 -gt 0 ]]
+ case ${1} in
+ secret=webhook-server-tls
+ shift
+ shift
+ [[ 0 -gt 0 ]]
+ '[' -z ']'
+ service=cache-server
+ '[' -z webhook-server-tls ']'
+ '[' -z kubeflow ']'
+ '[' -z ca_cert ']'
++ command -v openssl
+ '[' '!' -x /usr/bin/openssl ']'
+ csrName=cache-server.kubeflow
++ mktemp -d
+ tmpdir=/tmp/tmp.meigEj
+ echo 'creating certs in tmpdir /tmp/tmp.meigEj '
creating certs in tmpdir /tmp/tmp.meigEj 
+ cat
+ openssl genrsa -out /tmp/tmp.meigEj/server-key.pem 2048
Generating RSA private key, 2048 bit long modulus (2 primes)
............................................+++++
..........................................................................+++++
e is 65537 (0x010001)
+ openssl req -new -key /tmp/tmp.meigEj/server-key.pem -subj /CN=cache-server.kubeflow.svc -out /tmp/tmp.meigEj/server.csr -config /tmp/tmp.meigEj/csr.conf
+ echo 'start running kubectl...'
+ kubectl delete csr cache-server.kubeflow
start running kubectl...
certificatesigningrequest.certificates.k8s.io "cache-server.kubeflow" deleted
+ cat
+ kubectl create -f -
++ cat /tmp/tmp.meigEj/server.csr
++ base64
++ tr -d '\n'
Warning: certificates.k8s.io/v1beta1 CertificateSigningRequest is deprecated in v1.19+, unavailable in v1.22+; use certificates.k8s.io/v1 CertificateSigningRequest
certificatesigningrequest.certificates.k8s.io/cache-server.kubeflow created
+ true
+ kubectl get csr cache-server.kubeflow
NAME                    AGE   SIGNERNAME                     REQUESTOR                                                             CONDITION
cache-server.kubeflow   0s    kubernetes.io/legacy-unknown   system:serviceaccount:kubeflow:kubeflow-pipelines-cache-deployer-sa   Pending
+ '[' 0 -eq 0 ']'
+ break
+ kubectl certificate approve cache-server.kubeflow
certificatesigningrequest.certificates.k8s.io/cache-server.kubeflow approved
++ seq 10
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ [[ '' == '' ]]
+ echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
+ exit 1
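
The trace above shows the script polling `.status.certificate` ten times, one second apart, before giving up. The same pattern can be sketched as a reusable function. This is an illustration reconstructed from the trace, not the script's actual source; `wait_for_value` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Poll a command until it prints a non-empty value or the attempts run out.
# usage: wait_for_value ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for_value() {
  local attempts="$1" delay="$2" value i
  shift 2
  for i in $(seq "$attempts"); do
    value="$("$@")"
    if [[ -n "$value" ]]; then
      printf '%s\n' "$value"
      return 0
    fi
    sleep "$delay"
  done
  echo "ERROR: no value after ${attempts} attempts" >&2
  return 1
}
```

With the command from the log it would be invoked as `wait_for_value 10 1 kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'`. Raising the attempt count only helps if something in the cluster actually signs the request.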

davidspek commented 3 years ago

I think the issue is caused by the fact that signerName is a required field that is not set, and kubernetes.io/legacy-unknown is not available in the certificates.k8s.io/v1 API introduced in Kubernetes 1.19. It will need to be replaced by kubernetes.io/kube-apiserver-client, kubernetes.io/kube-apiserver-client-kubelet or kubernetes.io/kubelet-serving. https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers
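
A quick guard for the signer name could look like the sketch below. It only knows the three built-in signers listed above and treats everything else, including kubernetes.io/legacy-unknown, as not built in for the v1 API; custom signers are outside its scope. `is_builtin_v1_signer` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Return 0 if the signer name is one of the three built-in signers named
# above. kubernetes.io/legacy-unknown is deliberately excluded because the
# certificates.k8s.io/v1 API does not offer it.
is_builtin_v1_signer() {
  case "$1" in
    kubernetes.io/kube-apiserver-client|\
    kubernetes.io/kube-apiserver-client-kubelet|\
    kubernetes.io/kubelet-serving) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_builtin_v1_signer kubernetes.io/legacy-unknown || echo "pick another signer"` would print the hint.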

davidspek commented 3 years ago

It might also be that --cluster-signing-cert-file and --cluster-signing-key-file need to be set for kube-controller-manager. I'm not sure whether the docs mention that as a requirement anywhere, but if it is indeed required, it should be documented, similarly to how the JWT requirement for Istio is stated.

davidspek commented 3 years ago

As one would expect, it was the fact that the --cluster-signing-cert-file and --cluster-signing-key-file were not set.
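
Whether both flags are present can be checked against the kube-controller-manager command line. The sketch below takes that command line as a single string, so it does not assume how the component is run; `missing_signing_flags` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print each cluster-signing flag that is absent from a
# kube-controller-manager command line (passed as one string).
# Returns 0 when both flags are present, 1 otherwise.
missing_signing_flags() {
  local -a words missing=()
  local flag word found
  read -ra words <<<"$1"
  for flag in --cluster-signing-cert-file --cluster-signing-key-file; do
    found=0
    for word in "${words[@]}"; do
      case "$word" in
        "$flag"|"$flag"=*) found=1 ;;   # matches "--flag value" and "--flag=value"
      esac
    done
    (( found )) || missing+=("$flag")
  done
  if (( ${#missing[@]} > 0 )); then
    printf '%s\n' "${missing[@]}"
    return 1
  fi
}
```

On a control-plane node the input might come from something like `pgrep -af kube-controller-manager`, though where the process runs and who may inspect it depends on the distribution.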

Bobgy commented 3 years ago

/reopen Thanks @DavidSpek for the investigation

k8s-ci-robot commented 3 years ago

@Bobgy: Reopened this issue.

Bobgy commented 3 years ago

To make the webhook setup process more stable, we should seriously consider https://github.com/kubeflow/pipelines/issues/4695

davidspek commented 3 years ago

I would also suggest using cert-manager, as it seems the other applications are using that as well. Also, for my specific situation with Canonical's CDK, it is a manual multi-step process to copy the ca.key from the EasyRSA node to the master nodes due to the permissions on the file.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

grmoktan commented 3 years ago

Hi @DavidSpek :

I am also getting the same error:

+ echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
+ exit 1

And I have no way of setting --cluster-signing-cert-file and --cluster-signing-key-file from my side, as the Rancher Kubernetes deployment is managed elsewhere.

Is there an example of what the cert-manager approach entails?

I'm trying to deploy kubeflow v1.3-branch with kustomize.

cavepopo commented 2 years ago

Getting hit by this very same behaviour. Rancher 2.5.8, k8s v1.20.9.

Any workaround? I'm still too noobish to hack the cert-manager and other resources...

Bobgy commented 2 years ago

I highly recommend checking out v2 caching now, it does not depend on any privilege.

https://www.kubeflow.org/docs/components/pipelines/caching-v2/

cavepopo commented 2 years ago

> I highly recommend checking out v2 caching now, it does not depend on any privilege.
>
> https://www.kubeflow.org/docs/components/pipelines/caching-v2/

Hi @Bobgy, thanks for the tip. Excuse my noobiness, but how should I use caching-v2 on an existing install or for a new install?

Thanks

Bobgy commented 2 years ago

@cavepopo no worries. You'll need to either upgrade your existing install or make a new install. Note the version requirement (actually latest release is KFP 1.7.0-rc.4):

Kubeflow Pipelines 1.7.0

kaben commented 2 years ago

Dunno whether this is solved yet. The problem might be in backend/src/cache/deployer/webhook-create-signed-cert.sh line 118, where a CertificateSigningRequest is created with usages including - server auth.

It might need to be replaced with - client auth. See Certificate Signing Request, Kubernetes signers, 1.4:

Permitted key usages - must include ["client auth"]. Must not include key usages beyond ["digital signature", "key encipherment", "client auth"].

The generated CertificateSigningRequest would then read something like this:

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: ${csrName}
spec:
  groups:
  - system:authenticated
  request: $(cat ${tmpdir}/server.csr | base64 | tr -d '\n')
  signerName: kubernetes.io/kube-apiserver-client
  usages:
  - digital signature
  - key encipherment
  - client auth

with suitable replacements in the metadata.name and spec.request fields.
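
The usage rule quoted above for kubernetes.io/kube-apiserver-client can be expressed as a small check. This is only a sketch of that one rule; `client_signer_usages_ok` is a hypothetical helper name. Whether a client-auth certificate would then work for the webhook's TLS endpoint is not established in this thread.

```shell
#!/usr/bin/env bash
# Check a list of key usages against the quoted rule for
# kubernetes.io/kube-apiserver-client: "client auth" must be present and
# nothing beyond "digital signature", "key encipherment" and "client auth"
# may appear. Each usage is passed as a separate argument.
client_signer_usages_ok() {
  local usage has_client_auth=0
  for usage in "$@"; do
    case "$usage" in
      "client auth") has_client_auth=1 ;;
      "digital signature"|"key encipherment") ;;
      *) return 1 ;;                      # usage outside the permitted set
    esac
  done
  (( has_client_auth ))
}
```

Under this rule, a usage list containing "server auth" is rejected, and so is a list that lacks "client auth".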

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

xubofei1983 commented 8 months ago

Is this still an issue? I have EKS 1.26, deployed from the master branch (2.x), and I still get no CSR certificate and the cache-deployer crashes in a loop.

rimolive commented 5 months ago

Looks like it's not an issue anymore. I'll close it but feel free to reopen if the issue persists.

/close

google-oss-prow[bot] commented 5 months ago

@rimolive: Closing this issue.
