Closed davidspek closed 5 months ago
/area backend
I wonder what would be the best way to deal with this issue.
The request we send is v1beta1, not v1. This looks like a bug in some version of kubectl.
@Ark-kun, could this have to do with the cluster being Kubernetes 1.19 and its changes regarding beta APIs?
Maybe the problem is related to the version mismatch between kubectl version in the container and the Kubernetes server version. Kubectl is v1.16 in the deployer container:
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.9", GitCommit:"a17149e1a189050796ced469dbd78d380f2ed5ef", GitTreeState:"clean", BuildDate:"2020-04-16T11:44:51Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
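For reference, kubectl is only supported within one minor version of the API server, so a v1.16 client against a 1.19 server is outside that window. Below is a minimal sketch for computing the skew from two version strings. The function names minor_of and minor_skew are made up for illustration and are not part of the deployer image.

```shell
# Print the minor number of a version string such as v1.16.9 (prints 16).
minor_of() {
  echo "$1" | sed -E 's/^v?[0-9]+\.([0-9]+).*/\1/'
}

# Print the absolute difference between the minor numbers of two versions.
minor_skew() {
  a=$(minor_of "$1")
  b=$(minor_of "$2")
  d=$((a - b))
  if [ "$d" -lt 0 ]; then
    d=$((0 - d))
  fi
  echo "$d"
}

# Example with the versions reported in this thread:
#   minor_skew v1.16.9 v1.19.4
```

A result greater than 1 would suggest rebuilding the deployer image with a newer kubectl before looking further.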
@Ark-kun It doesn't seem like this issue has been resolved. I just deployed Kubeflow 1.2 on Kubernetes 1.19.4 and the cache-server and cache-deployer-deployment are still stuck with errors.
I have spotted 2 Certificate Signing Requests, both identical, with one in namespace istio-system and the other in kubeflow. I remember there was an issue where the CSR was not being approved. It now is approved, but I don't think the certificate is getting issued.
kubectl describe csr cache-server.kubeflow -n istio-system
Name: cache-server.kubeflow
Labels: <none>
Annotations: <none>
CreationTimestamp: Fri, 20 Nov 2020 22:54:30 +0100
Requesting User: system:serviceaccount:kubeflow:kubeflow-pipelines-cache-deployer-sa
Signer: kubernetes.io/legacy-unknown
Status: Approved
Subject:
  Common Name: cache-server.kubeflow.svc
  Serial Number:
Subject Alternative Names:
  DNS Names: cache-server
             cache-server.kubeflow
             cache-server.kubeflow.svc
Events: <none>
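Note that the status above reads Approved only. Once a signer has actually issued the certificate, kubectl reports the condition as Approved,Issued. Here is a small sketch for telling the two states apart in a script. The helper name csr_is_issued is hypothetical, and it expects the condition string from kubectl's output as its argument.

```shell
# Return 0 if a comma-separated CSR condition string contains "Issued".
csr_is_issued() {
  case ",$1," in
    *,Issued,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   if csr_is_issued "Approved,Issued"; then echo "certificate is ready"; fi
```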
Cache-deployer-deployment logs:
+ shift
+ [[ 2 -gt 0 ]]
+ case ${1} in
+ secret=webhook-server-tls
+ shift
+ shift
+ [[ 0 -gt 0 ]]
+ '[' -z ']'
+ service=cache-server
+ '[' -z webhook-server-tls ']'
+ '[' -z kubeflow ']'
+ '[' -z ca_cert ']'
++ command -v openssl
+ '[' '!' -x /usr/bin/openssl ']'
+ csrName=cache-server.kubeflow
++ mktemp -d
+ tmpdir=/tmp/tmp.meigEj
+ echo 'creating certs in tmpdir /tmp/tmp.meigEj '
creating certs in tmpdir /tmp/tmp.meigEj
+ cat
+ openssl genrsa -out /tmp/tmp.meigEj/server-key.pem 2048
Generating RSA private key, 2048 bit long modulus (2 primes)
............................................+++++
..........................................................................+++++
e is 65537 (0x010001)
+ openssl req -new -key /tmp/tmp.meigEj/server-key.pem -subj /CN=cache-server.kubeflow.svc -out /tmp/tmp.meigEj/server.csr -config /tmp/tmp.meigEj/csr.conf
+ echo 'start running kubectl...'
+ kubectl delete csr cache-server.kubeflow
start running kubectl...
certificatesigningrequest.certificates.k8s.io "cache-server.kubeflow" deleted
+ cat
+ kubectl create -f -
++ cat /tmp/tmp.meigEj/server.csr
++ base64
++ tr -d '\n'
Warning: certificates.k8s.io/v1beta1 CertificateSigningRequest is deprecated in v1.19+, unavailable in v1.22+; use certificates.k8s.io/v1 CertificateSigningRequest
certificatesigningrequest.certificates.k8s.io/cache-server.kubeflow created
+ true
+ kubectl get csr cache-server.kubeflow
NAME AGE SIGNERNAME REQUESTOR CONDITION
cache-server.kubeflow 0s kubernetes.io/legacy-unknown system:serviceaccount:kubeflow:kubeflow-pipelines-cache-deployer-sa Pending
+ '[' 0 -eq 0 ']'
+ break
+ kubectl certificate approve cache-server.kubeflow
certificatesigningrequest.certificates.k8s.io/cache-server.kubeflow approved
++ seq 10
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ [[ '' == '' ]]
+ echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
+ exit 1
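The log above shows the script polling .status.certificate ten times, one second apart, and then giving up. Here is a sketch of the same polling pattern with the attempt count and delay as parameters, which makes it easier to give a slow signer more time. The names wait_for_cert and fetch_csr_cert are illustrative and do not come from the deployer script.

```shell
# wait_for_cert FETCH_CMD ATTEMPTS DELAY_SECONDS
# Runs FETCH_CMD until it prints a non-empty value or ATTEMPTS is exhausted.
# Prints the value and returns 0 on success, returns 1 otherwise.
wait_for_cert() {
  fetch=$1
  attempts=$2
  delay=$3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    cert=$($fetch) || cert=""
    if [ -n "$cert" ]; then
      echo "$cert"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Same query the deployer script runs, wrapped in a function (not called here).
fetch_csr_cert() {
  kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
}

# Example: wait_for_cert fetch_csr_cert 30 2
```

A longer wait only helps if a signer is actually configured. If the certificate never appears, more retries will not change the outcome.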
I think the issue is caused by the fact that signerName is a required field that is not set, and kubernetes.io/legacy-unknown has been removed from Kubernetes 1.19. It will need to be replaced by kubernetes.io/kube-apiserver-client, kubernetes.io/kube-apiserver-client-kubelet or kubernetes.io/kubelet-serving.
https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers
It might also be because --cluster-signing-cert-file and --cluster-signing-key-file need to be set for kube-controller-manager. I'm not sure whether the docs mention that as a requirement, but if it is indeed required they should state it, similar to how the JWT requirement for Istio is stated.
As one would expect, the cause was that --cluster-signing-cert-file and --cluster-signing-key-file were not set.
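Below is a rough sketch for checking a kube-controller-manager command line for the two flags. How you obtain that command line depends on the distribution (static pod manifest, systemd unit, snap arguments), so the function only takes it as a string. The name missing_signing_flags is made up for this example.

```shell
# Print each of the two cluster-signing flags that does not occur in the
# given kube-controller-manager command line. Prints nothing if both occur.
missing_signing_flags() {
  args=$1
  for flag in --cluster-signing-cert-file --cluster-signing-key-file; do
    case "$args" in
      *"$flag"*) ;;
      *) echo "$flag" ;;
    esac
  done
}

# Example:
#   missing_signing_flags "kube-controller-manager --leader-elect=true"
```

Empty output means both flags are present somewhere in the string. It does not verify that the referenced files exist or are readable.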
/reopen Thanks @DavidSpek for the investigation
@Bobgy: Reopened this issue.
To make the webhook setup process more stable, we should seriously consider https://github.com/kubeflow/pipelines/issues/4695
I would also suggest using cert-manager, as it seems the other applications are using that as well. Also, for my specific situation with Canonical's CDK, it is a manual multi-step process to copy the ca.key from the EasyRSA node to the master nodes due to the permissions on the file.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @DavidSpek :
I am also getting the same error:
+ echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
+ exit 1
And I have no way of setting the --cluster-signing-cert-file and --cluster-signing-key-file from my side as the rancher kubernetes deployment is managed elsewhere.
Is there an example of what the cert-manager approach entails?
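As a rough idea of what the cert-manager approach could look like, here is an untested sketch that renders a self-signed Issuer and a Certificate for the cache server. It assumes cert-manager is already installed in the cluster. The resource names kfp-cache-selfsigned-issuer and kfp-cache-cert and the function name are made up. The secret name and DNS names are the ones visible in the logs earlier in this thread.

```shell
# Print an Issuer and a Certificate manifest for the cache server webhook.
# Usage: render_cache_server_cert [NAMESPACE]   (defaults to kubeflow)
render_cache_server_cert() {
  namespace=${1:-kubeflow}
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: kfp-cache-selfsigned-issuer  # hypothetical name
  namespace: ${namespace}
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kfp-cache-cert  # hypothetical name
  namespace: ${namespace}
spec:
  secretName: webhook-server-tls
  dnsNames:
    - cache-server
    - cache-server.${namespace}
    - cache-server.${namespace}.svc
  issuerRef:
    kind: Issuer
    name: kfp-cache-selfsigned-issuer
EOF
}

# Example: render_cache_server_cert kubeflow | kubectl apply -f -
```

Two caveats. cert-manager writes the key pair into the secret as tls.crt and tls.key, and it is not confirmed here that the cache server reads those key names. The webhook configuration would also need the matching CA bundle, which cert-manager can inject via its cert-manager.io/inject-ca-from annotation. Treat this as a starting point rather than a drop-in fix.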
I'm trying to deploy kubeflow v1.3-branch with kustomize.
Getting hit by this very same behaviour. Rancher 2.5.8, k8s v1.20.9.
Any workaround? I'm still too noobish to hack on cert-manager and the other resources...
I highly recommend checking out v2 caching now, it does not depend on any privilege.
https://www.kubeflow.org/docs/components/pipelines/caching-v2/
Hi @Bobgy, thanks for the tips. Excuse my noobiness, but how should I use caching-v2 on an existing install or for a new install?
Thanks
@cavepopo no worries. You'll need to either upgrade your existing install or make a new install. Note the version requirement (actually latest release is KFP 1.7.0-rc.4):
Kubeflow Pipelines 1.7.0
Dunno whether this is solved yet. The problem might be in backend/src/cache/deployer/webhook-create-signed-cert.sh line 118, where a CertificateSigningRequest is created with usages including - server auth.
It might need to be replaced with - client auth. See Certificate Signing Request, Kubernetes signers, 1.4: Permitted key usages - must include ["client auth"]. Must not include key usages beyond ["digital signature", "key encipherment", "client auth"].
The generated CertificateSigningRequest would then read something like this:
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: ${csrName}
spec:
  groups:
  - system:authenticated
  request: $(cat ${tmpdir}/server.csr | base64 | tr -d '\n')
  signerName: kubernetes.io/kube-apiserver-client
  usages:
  - digital signature
  - key encipherment
  - client auth
with suitable replacements in the metadata.name and spec.request fields.
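To make the substitutions concrete, here is a sketch of a helper that renders that manifest from a CSR file, with the signer passed in explicitly. The function name render_csr_v1 is invented for illustration. The usages list simply mirrors the suggestion above, and whether that signer and those usages suit a webhook serving certificate is not verified here.

```shell
# Print a certificates.k8s.io/v1 CSR manifest.
# Usage: render_csr_v1 CSR_NAME CSR_FILE SIGNER_NAME
render_csr_v1() {
  csr_name=$1
  csr_file=$2
  signer=$3
  # Base64-encode the PEM request on a single line, as the original script does.
  request=$(base64 < "$csr_file" | tr -d '\n')
  cat <<EOF
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: ${csr_name}
spec:
  groups:
  - system:authenticated
  request: ${request}
  signerName: ${signer}
  usages:
  - digital signature
  - key encipherment
  - client auth
EOF
}

# Example:
#   render_csr_v1 cache-server.kubeflow "${tmpdir}/server.csr" \
#     kubernetes.io/kube-apiserver-client | kubectl create -f -
```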
Is this still an issue? I have EKS 1.26 and deployed from the master branch (2.x), and I still get no CSR certificate and the cache-deployer is in a crash loop.
Looks like it's not an issue anymore. I'll close it but feel free to reopen if the issue persists.
/close
@rimolive: Closing this issue.
What steps did you take:
When deploying Kubeflow using kfctl_istio_dex.v1.1.0.yaml on a Charmed Kubernetes 1.19 cluster, the cache-server and cache-deployer-deployment pods get stuck in PodInitializing and CrashLoopBackOff respectively. The cache-server pod shows the error MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found. Redeploying either or both of the pods does not fix the issue. The cache-deployer-deployment pod gives the following logs:
The cache-server.kubeflow csr is stuck in a Pending condition. However, manually running kubectl certificate approve cache-server.kubeflow does work.
The following pull requests seem to be related: https://github.com/openshift/oc/pull/501 https://github.com/openshift/installer/pull/3943
Environment:
Charmed Kubernetes 1.19 running on Ubuntu 20.04.1.
How did you deploy Kubeflow Pipelines (KFP)? full Kubeflow deployment
/kind bug /area backend