kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

KCP should persist etcd client certificate #4304

Closed. randomvariable closed this issue 1 year ago.

randomvariable commented 3 years ago

What steps did you take and what happened:

  1. Deploy a vSphere workload cluster from another vSphere management cluster. All ESXi hosts are NTP synced.
  2. Only one control plane instance appears when 3 replicas were requested. KCP reports etcd as unhealthy; in the meantime, a worker is created and successfully joins.
  3. After a very long time, another control plane instance is created.

In the etcd logs, we see:

2021-03-11 07:59:48.743560 I | embed: rejected connection from "127.0.0.1:39236" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T07:59:48Z is before 2021-03-11T08:09:21Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 07:59:49.773047 I | embed: rejected connection from "127.0.0.1:39244" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T07:59:49Z is before 2021-03-11T08:09:21Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 08:00:06.849986 I | embed: rejected connection from "127.0.0.1:39278" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T08:00:06Z is before 2021-03-11T08:09:39Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 08:00:07.882861 I | embed: rejected connection from "127.0.0.1:39282" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T08:00:07Z is before 2021-03-11T08:09:39Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 08:00:24.663342 I | embed: rejected connection from "127.0.0.1:39318" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T08:00:24Z is before 2021-03-11T08:09:57Z", ServerName "etcd-work176-master-control-plane-5xrrc")
2021-03-11 08:00:26.286498 I | embed: rejected connection from "127.0.0.1:39322" (error "tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2021-03-11T08:00:26Z is before 2021-03-11T08:09:57Z", ServerName "etcd-work176-master-control-plane-5xrrc")
...

KCP logs show:

I0312 00:06:13.548340       1 scale.go:206] controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="work176" "kubeadmControlPlane"="work176-master-control-plane" "namespace"="work176" "failures"="machine work176-master-control-plane-5xrrc reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the work176-master-control-plane-5xrrc node)"

What did you expect to happen:

KCP is able to connect to etcd in < 20 minutes and create new replicas.

Anything else you would like to add:

The etcd logs show that we have a sliding window of certificate validity that keeps moving into the future. For stacked etcd, KCP generates a new etcd client certificate on every connect. This certificate should be persisted and reused, which would cut down on CPU cycles.

In this instance, it appears that for some period the control plane instance's clock was running about 15 minutes slow, but it did eventually synchronise. If KCP had generated a single certificate and persisted it, we would still have seen the same problem, but only for about five minutes (KCP sets the certificate's NotBefore to time.Now() - 5 minutes).
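To make the suggestion concrete, here is a minimal Go sketch, not the actual KCP implementation, of generating an etcd client certificate signed by the etcd CA with a backdated NotBefore and caching it in memory for reuse across connections. The package, type, and function names are hypothetical.

package etcdcert

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"sync"
	"time"
)

// clientCertCache holds a generated etcd client certificate so it can be
// reused instead of being regenerated on every connection.
type clientCertCache struct {
	mu   sync.Mutex
	cert *x509.Certificate
	key  *rsa.PrivateKey
}

// getOrCreate returns the cached client certificate while it is still within
// its validity window; otherwise it generates a new one signed by the etcd CA.
func (c *clientCertCache) getOrCreate(caCert *x509.Certificate, caKey *rsa.PrivateKey) (*x509.Certificate, *rsa.PrivateKey, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	now := time.Now()
	if c.cert != nil && now.After(c.cert.NotBefore) && now.Before(c.cert.NotAfter) {
		return c.cert, c.key, nil
	}

	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}

	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(now.UnixNano()),
		Subject:      pkix.Name{CommonName: "cluster-api-etcd-client"},
		// Backdate NotBefore to tolerate modest clock skew between the
		// management cluster and the workload control plane nodes.
		NotBefore:   now.Add(-5 * time.Minute),
		NotAfter:    now.Add(365 * 24 * time.Hour),
		KeyUsage:    x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
		ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}

	der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, &key.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, nil, err
	}

	c.cert, c.key = cert, key
	return cert, key, nil
}

With a persisted certificate, NotBefore is fixed, so a slow clock eventually catches up instead of chasing a sliding validity window.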

Environment:

/kind bug

fabriziopandini commented 3 years ago

In this instance, it appears that for some period the control plane instance's clock was running about 15 minutes slow, but it did eventually synchronise.

This seems an NTP problem...

fabriziopandini commented 3 years ago

/milestone v0.4.0

vincepri commented 3 years ago

/milestone Next
/priority awaiting-more-evidence

@randomvariable While persisting the certificate could be a good improvement (although that seems mostly on kubeadm side), I'd agree with Fabrizio above that it looks more like an NTP clock skew issue

randomvariable commented 3 years ago

This seems an NTP problem...

Yes, we're aware: the VM did eventually reach NTP synchronisation, it just took longer than expected. We don't expect to see this very often, but it seems certificate persistence would be useful.

Note that it's not a kubeadm issue; it's KCP's client connection to etcd in https://github.com/kubernetes-sigs/cluster-api/blob/master/controlplane/kubeadm/internal/workload_cluster.go#L389
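For illustration only, a minimal sketch of how the generated (or cached) client certificate could feed into the TLS configuration used for that etcd connection; the helper name and signature are hypothetical, not the code behind the link above.

package etcdtls

import (
	"crypto/rsa"
	"crypto/tls"
	"crypto/x509"
)

// clientTLSConfig builds a TLS configuration that presents the etcd client
// certificate and trusts the workload cluster's etcd CA.
func clientTLSConfig(certDER []byte, key *rsa.PrivateKey, etcdCA *x509.Certificate) *tls.Config {
	pool := x509.NewCertPool()
	pool.AddCert(etcdCA)
	return &tls.Config{
		Certificates: []tls.Certificate{{
			Certificate: [][]byte{certDER},
			PrivateKey:  key,
		}},
		RootCAs:    pool,
		MinVersion: tls.VersionTLS12,
	}
}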

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

chymy commented 3 years ago

I encountered a similar problem: when I scaled the control plane replicas from 1 to 3, it failed. KCP logs show:

I0707 00:32:51.445862       1 controller.go:350] controller-runtime/manager/controller/kubeadmcontrolplane "msg"="Scaling up control plane" "cluster"="innercluster" "name"="innercluster-control-plane" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane" "Desired"=3 "Existing"=1
I0707 00:32:51.445992       1 scale.go:205]  "msg"="Waiting for control plane to pass preflight checks" "cluster-name"="innercluster" "name"="innercluster-control-plane" "namespace"="default" "failures"="machine innercluster-control-plane-jmtjk reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the innercluster-control-plane-twpm8 node)"

kcp status:

...
status:
    conditions:
    - lastTransitionTime: "2021-07-06T12:07:15Z"
      message: Scaling up control plane to 3 replicas (actual 1)
      reason: ScalingUp
      severity: Warning
      status: "False"
      type: Ready
    - lastTransitionTime: "2021-07-06T08:14:11Z"
      status: "True"
      type: Available
    - lastTransitionTime: "2021-07-06T08:08:21Z"
      status: "True"
      type: CertificatesAvailable
    - lastTransitionTime: "2021-07-06T08:14:12Z"
      status: "True"
      type: ControlPlaneComponentsHealthy
    - lastTransitionTime: "2021-07-06T08:14:14Z"
      message: 'Following machines are reporting unknown etcd member status: innercluster-control-plane-jmtjk'
      reason: EtcdClusterUnknown
      status: Unknown
      type: EtcdClusterHealthyCondition
    - lastTransitionTime: "2021-07-06T08:08:23Z"
      status: "True"
      type: MachinesCreated
    - lastTransitionTime: "2021-07-06T08:09:28Z"
      status: "True"
      type: MachinesReady
    - lastTransitionTime: "2021-07-06T12:07:15Z"
      message: Scaling up control plane to 3 replicas (actual 1)
      reason: ScalingUp
      severity: Warning
      status: "False"
      type: Resized
    initialized: true
    observedGeneration: 2
    ready: true
    readyReplicas: 1
    replicas: 1
    selector: cluster.x-k8s.io/cluster-name=innercluster,cluster.x-k8s.io/control-plane
    updatedReplicas: 1
...

In the control plane etcd logs:

2021-07-07 00:18:22.919594 I | embed: rejected connection from "127.0.0.1:48141" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:23.969817 I | embed: rejected connection from "127.0.0.1:48155" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:26.301411 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-07-07 00:18:36.301451 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2021-07-07 00:18:41.867121 I | embed: rejected connection from "127.0.0.1:48275" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:43.017143 I | embed: rejected connection from "127.0.0.1:48287" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:44.369349 I | mvcc: store.index: compact 132110
2021-07-07 00:18:44.371512 I | mvcc: finished scheduled compaction at 132110 (took 1.901392ms)
2021-07-07 00:18:44.377872 I | embed: rejected connection from "127.0.0.1:48303" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")
2021-07-07 00:18:45.426647 I | embed: rejected connection from "127.0.0.1:48313" (error "remote error: tls: bad certificate", ServerName "etcd-innercluster-control-plane-twpm8")

Is there any solution?

/remove-lifecycle stale

randomvariable commented 3 years ago

/lifecycle frozen

randomvariable commented 3 years ago

@chymy you should ensure that there's a good NTP time source in your environment, which should prevent the problem from occurring. The fix proposed here is an optimisation of sorts.

Two things to ensure in your environment: first, that your OpenStack hosts are time-synced, so that when VMs boot they are already relatively well synchronised; and second, that each VM guest has a time source configured, either via DHCP Option 42 or by setting NTP servers in the KubeadmConfig and KCP resources (see the sketch below).
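For reference, a minimal sketch of setting NTP servers through the KubeadmControlPlane resource; the names, apiVersion, and server addresses are placeholders, so adjust them to your release and environment (workers get the equivalent via their KubeadmConfigTemplate).

apiVersion: controlplane.cluster.x-k8s.io/v1alpha4
kind: KubeadmControlPlane
metadata:
  name: example-control-plane
spec:
  # ...replicas, version, and the machine/infrastructure template fields for your release...
  kubeadmConfigSpec:
    ntp:
      enabled: true
      servers:
        - 0.pool.ntp.org
        - 1.pool.ntp.org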

chymy commented 3 years ago

I have made sure NTP is good. However, in the same environment, after testing, alpha3 (which should be v0.3.11) is OK, while alpha4 (v0.3.16) is not.

vincepri commented 3 years ago

v0.3.16 is still using v1alpha3; v1alpha4 is on v0.4.x.

randomvariable commented 2 years ago

/area control-plane

sbueringer commented 2 years ago

/assign @yastij to take a closer look

vincepri commented 2 years ago

Any updates on this issue?

fabriziopandini commented 2 years ago

/triage accepted
/unassign @yastij

/help

k8s-ci-robot commented 2 years ago

@fabriziopandini: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/4304):

> /triage accepted
> /unassign @yastij
>
> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

fabriziopandini commented 1 year ago

(doing some cleanup on old issues without updates)
/close
There is still a probability that this issue might happen, but we don't have recent reports of it happening, so I'm closing this for now.

k8s-ci-robot commented 1 year ago

@fabriziopandini: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/4304#issuecomment-1483065283):

> (doing some cleanup on old issues without updates)
> /close
> There is still a probability that this issue might happen, but we don't have recent reports of it happening, so I'm closing this for now.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.