pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

Webhook certificate expired after the API server had been running for one year #5520

Open Smityz opened 10 months ago

Smityz commented 10 months ago

Bug Report

What version of Kubernetes are you using?

v1.22

What version of TiDB Operator are you using?

v1.4.4

What did you do?

After running stably for several months, the operator suddenly started reporting errors continuously and could not complete sync. After we disabled the webhook, the operator returned to normal. Related error log:

E0112 17:49:03.476708       1 tidb_cluster_controller.go:133] TidbCluster: x sync failed Internal error occurred: failed calling webhook "defaulting.admission.tidb.pingcap.com": failed to call webhook: Post "https://kubernetes.default.svc:443/apis/admission.tidb.pingcap.com/v1alpha1/pingcapresourcemutations?timeout=10s": x509: certificate has expired or is not yet valid: current time 2024-01-12T17:49:03+08:00 is after 2024-01-10T09:50:21Z, requeuing
E0112 17:49:03.859792       1 tidbcluster_control.go:90] failed to update TidbCluster: [x], error: Internal error occurred: failed calling webhook "defaulting.admission.tidb.pingcap.com": failed to call webhook: Post "https://kubernetes.default.svc:443/apis/admission.tidb.pingcap.com/v1alpha1/pingcapresourcemutations?timeout=10s": x509: certificate has expired or is not yet valid: current time 2024-01-12T17:49:03+08:00 is after 2024-01-10T09:50:21Z

We suspect this is related to the self-signed certificate mechanism of the API server, because the certificate's expiration time is exactly one year after the API server started. We also found a related bug here: https://github.com/openshift/generic-admission-server/issues/33

csuzhangxc commented 10 months ago

As https://github.com/openshift/generic-admission-server/issues/33#issuecomment-620513624 said, k8s.io/apiserver supports reloading the serving certs as of k8s 1.18.

TiDB Operator v1.4.4 uses v1.19 of K8s (https://github.com/pingcap/tidb-operator/blob/v1.4.4/go.mod#L65), and this version of generic-admission-server also uses k8s v1.19 (https://github.com/openshift/generic-admission-server/blob/da96454c926de350e52f6c7a6ee86af49ee96b00/go.mod), so it should reload the certs.

Did your cert simply expire, or was it renewed after it had expired?

iPenx commented 10 months ago

It is not the tidb-webhook certs that expired, but the CA of "kubernetes.default.svc" in the k8s apiserver.

The call flow of TiDB CRD admission is k8s apiserver -> apiservice (kubernetes.default.svc) -> tidb webhook pod, i.e. k8s apiserver -> k8s apiserver (kubernetes.default.svc) -> tidb webhook pod.

When a k8s apiserver runs for more than one year without restarting, the CA of kubernetes.default.svc held in the k8s apiserver's memory expires. As a result, requests from the k8s apiserver to itself fail after a year.

By default, the CA of kubernetes.default.svc in the k8s apiserver's memory is self-signed with a one-year validity when the k8s apiserver starts.

csuzhangxc commented 10 months ago

@Smityz is this caused as iPenx said? Have you resolved it?

Smityz commented 10 months ago

> @Smityz is this caused as iPenx said? Have you resolved it?

Yes, we are on the same team. We ended up disabling the webhook, but I think it's a common problem and it needs to be solved.