kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

Webhook certificates validation fails #1512

Closed maanur closed 3 years ago

maanur commented 3 years ago

/kind bug

What steps did you take and what happened: I installed the latest version of Katib by cloning the repo's master tree and running make deploy against our OpenShift 4.6.21 cluster. Then I applied random-example.yaml. The created Experiment remains in the Running condition, the Trials' pods are not injected with sidecar containers, and the logs of `deployment/katib-controller` show the following lines:

2021/04/07 14:57:53 http: TLS handshake error from 10.254.2.1:47974: remote error: tls: bad certificate
2021/04/07 14:57:53 http: TLS handshake error from 10.254.2.1:47972: remote error: tls: bad certificate

What did you expect to happen: Webhook certificates are valid, Trial's pods are injected with metric-gathering sidecars, Experiment successfully gathers metrics and progresses as it should.

Anything else you would like to add: job/katib-cert-generator sets .webhooks[].clientConfig.caBundle in the WebhookConfigurations to the ca.crt from the katib-cert-generator-token secret, which belongs to the katib-cert-generator ServiceAccount. According to the Kubernetes documentation on CSRs, a ServiceAccount's ca.crt is not guaranteed to verify certificates issued through the CSR API:

None of these usages are related to ServiceAccount token secrets .data[ca.crt] in any way. That CA bundle is only guaranteed to verify a connection to the API server using the default service (kubernetes.default.svc).

I fetched tls.crt from secret/katib-webhook-cert and ca.crt from secret/katib-cert-generator-token-***, attached to the corresponding SA. Indeed, the pair is not valid:

[maanur@maanur-notebook katib-webhook-cert]$ openssl verify -verbose -CAfile ca.crt katib.crt
O = system:nodes, CN = system:node:katib-controller.kubeflow.svc
error 20 at 0 depth lookup: unable to get local issuer certificate
error katib.crt: verification failed
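The same failure can be reproduced offline, which may help confirm that the cause is a CA mismatch rather than anything specific to the cluster. The following is a minimal sketch, assuming only that openssl is installed. The two CAs, the file names, and the helper functions make_ca and check are hypothetical and are not part of Katib. The sketch signs a leaf certificate with one throwaway CA and then verifies it against both that CA and an unrelated one.

```shell
#!/bin/sh
set -eu

# Scratch directory for the throwaway keys and certificates.
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT

# make_ca NAME: create a self-signed CA as $tmp/NAME.key and $tmp/NAME.crt.
make_ca() {
  openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -subj "/CN=$1" -keyout "$tmp/$1.key" -out "$tmp/$1.crt" 2>/dev/null
}

make_ca signer-ca   # stands in for the CA that actually signs the CSR
make_ca other-ca    # stands in for the ServiceAccount ca.crt

# Leaf certificate for the webhook service, signed by signer-ca only.
openssl req -newkey rsa:2048 -nodes \
  -subj "/CN=katib-controller.kubeflow.svc" \
  -keyout "$tmp/leaf.key" -out "$tmp/leaf.csr" 2>/dev/null
openssl x509 -req -days 1 -in "$tmp/leaf.csr" \
  -CA "$tmp/signer-ca.crt" -CAkey "$tmp/signer-ca.key" -CAcreateserial \
  -out "$tmp/leaf.crt" 2>/dev/null

# check CAFILE CERT: report whether CERT chains to the CA in CAFILE.
check() {
  if openssl verify -CAfile "$1" "$2" >/dev/null 2>&1; then
    echo "trusted"
  else
    echo "untrusted"
  fi
}

with_signer="$(check "$tmp/signer-ca.crt" "$tmp/leaf.crt")"
with_other="$(check "$tmp/other-ca.crt" "$tmp/leaf.crt")"

echo "verified against signing CA:   $with_signer"
echo "verified against unrelated CA: $with_other"
```

To run the same check against the cluster, the two files can be pulled from the secrets with something like oc -n kubeflow get secret katib-webhook-cert -o jsonpath='{.data.tls\.crt}' | base64 -d > katib.crt, and likewise for ca.crt from the token secret. The kubeflow namespace is an assumption taken from the certificate CN above.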

Environment:

andreyvelich commented 3 years ago

Thank you for creating this issue, @maanur, and for testing Katib on OpenShift!

Could you please try specifying the kubernetes.io/legacy-unknown signerName here: https://github.com/kubeflow/katib/blob/master/hack/cert-generator.sh#L82. Then build and push your custom image for the cert generator:

docker build -t docker.io/<registry>/cert-generator -f cmd/cert-generator/v1beta1/Dockerfile .
docker push docker.io/<registry>/cert-generator

And use your custom image in the manifest: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml#L46.

My concern is that for OpenShift we need a different signerName. /cc @tenzen-y

maanur commented 3 years ago

Changed the cert-generator image, reran the job.

[maanur@toolbox katib]$ oc get csr/katib-controller.kubeflow -o jsonpath="{.spec.signerName}"
kubernetes.io/legacy-unknown

The issue reproduces:

2021/04/08 05:33:33 http: TLS handshake error from 10.254.0.1:32810: remote error: tls: bad certificate
2021/04/08 05:33:33 http: TLS handshake error from 10.254.0.1:32812: remote error: tls: bad certificate
2021/04/08 05:33:33 http: TLS handshake error from 10.254.0.1:32814: remote error: tls: bad certificate
2021/04/08 05:33:34 http: TLS handshake error from 10.254.0.1:32816: remote error: tls: bad certificate

As mentioned in the Kubernetes docs:

Distribution of trust happens out of band for these signers. Any trust outside of those described above are strictly coincidental.

I'll try to write some kustomization overlay for OpenShift to utilize the service serving certificate feature.
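For reference, a rough sketch of what that approach could look like with plain oc commands before it is turned into an overlay. This is untested against Katib. The service.beta.openshift.io annotations are the ones OpenShift's service CA operator watches, the namespace and service name are taken from the certificate CN above, and the webhook configuration name katib.kubeflow.org is an assumption that should be checked with oc get validatingwebhookconfiguration,mutatingwebhookconfiguration. The run helper is hypothetical and only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
set -eu

# run CMD...: print the command by default, execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

NAMESPACE="kubeflow"              # from the CN katib-controller.kubeflow.svc
SERVICE="katib-controller"        # from the CN katib-controller.kubeflow.svc
SECRET="katib-webhook-cert"       # secret the controller already reads
WEBHOOK_NAME="katib.kubeflow.org" # ASSUMPTION: verify the name on the cluster

# Ask the service CA operator to issue a serving certificate for the
# webhook Service and store it in the secret the controller mounts.
run oc -n "$NAMESPACE" annotate service "$SERVICE" --overwrite \
  "service.beta.openshift.io/serving-cert-secret-name=$SECRET"

# Ask the service CA operator to inject its CA bundle into caBundle of
# both webhook configurations, replacing the ServiceAccount ca.crt.
run oc annotate validatingwebhookconfiguration "$WEBHOOK_NAME" --overwrite \
  "service.beta.openshift.io/inject-cabundle=true"
run oc annotate mutatingwebhookconfiguration "$WEBHOOK_NAME" --overwrite \
  "service.beta.openshift.io/inject-cabundle=true"
```

With this in place the cert-generator job would no longer be needed on OpenShift. The katib-webhook-cert secret it created earlier may have to be deleted first so that the operator can create its own, and the controller pod probably needs a restart to pick up the new certificate.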

andreyvelich commented 3 years ago

I'll try to write some kustomization overlay for OpenShift to utilize the service serving certificate feature.

That would be great. Thank you @maanur! Also, check this PR: https://github.com/kubeflow/katib/pull/1498#issuecomment-815343266, please. We are refactoring Katib manifests.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
