kubermatic / kubeone

Kubermatic KubeOne automates cluster operations on all your cloud, on-prem, edge, and IoT environments.
https://kubeone.io
Apache License 2.0

Reconciliation of request failed: failed to get machine for node #2114

Closed dcardellino closed 2 years ago

dcardellino commented 2 years ago

Hello everyone,

We deploy Kubernetes clusters on Hetzner with kubermatic/kubeone, which uses the kubermatic/machine-controller addon by default. I'm not sure whether this is actually a problem, because as far as I can tell everything works fine.

But when I view the logs of the machine-controller, I see the following entries occurring every second:

I0622 12:51:39.408545       1 node_csr_approver.go:103] Reconciling CSR csr-dnfp5
E0622 12:51:39.409188       1 node_csr_approver.go:89] Reconciliation of request /csr-dnfp5 failed: failed to get machine for node 'control-plane-1': failed to get machine for given node name 'control-plane-1'
I0622 12:51:43.525874       1 node_csr_approver.go:103] Reconciling CSR csr-sw2mt
E0622 12:51:43.527863       1 node_csr_approver.go:89] Reconciliation of request /csr-sw2mt failed: failed to get machine for node 'control-plane-3': failed to get machine for given node name 'control-plane-3'
I0622 12:51:44.107206       1 node_csr_approver.go:103] Reconciling CSR csr-jb2h6
E0622 12:51:44.108756       1 node_csr_approver.go:89] Reconciliation of request /csr-jb2h6 failed: failed to get machine for node 'control-plane-2': failed to get machine for given node name 'control-plane-2'

My kube-system namespace is also flooded with CSRs. I don't think this is good for my clusters.

Regards,

Dome

kron4eg commented 2 years ago

The log you've shown is completely normal, since control-plane (and static worker) nodes may have no corresponding Machine objects.

But what do you mean by "flooded with CSRs"?
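
To check that these errors only concern nodes that are expected to lack a Machine object, the log can be summarized per node. The following is a minimal sketch, not part of machine-controller: the function name nodes_without_machine is made up for illustration, and the regular expression simply assumes the message format shown in the log excerpt above.

```python
import re
from collections import Counter

# Assumed format: the error line logged at node_csr_approver.go:89 above.
ERROR_RE = re.compile(
    r"Reconciliation of request /(?P<csr>\S+) failed: "
    r"failed to get machine for node '(?P<node>[^']+)'"
)


def nodes_without_machine(log_lines):
    """Count 'failed to get machine for node' errors per node name."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group("node")] += 1
    return counts


# Two lines copied from the log excerpt above, plus one informational line.
sample = [
    "I0622 12:51:39.408545       1 node_csr_approver.go:103] Reconciling CSR csr-dnfp5",
    "E0622 12:51:39.409188       1 node_csr_approver.go:89] Reconciliation of request "
    "/csr-dnfp5 failed: failed to get machine for node 'control-plane-1': "
    "failed to get machine for given node name 'control-plane-1'",
    "E0622 12:51:43.527863       1 node_csr_approver.go:89] Reconciliation of request "
    "/csr-sw2mt failed: failed to get machine for node 'control-plane-3': "
    "failed to get machine for given node name 'control-plane-3'",
]

for node, count in sorted(nodes_without_machine(sample).items()):
    print(f"{node}: {count}")
```

If every node reported this way is a control-plane or static worker node, the errors match the explanation above. A node that should be managed by a Machine object would be worth a closer look.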

dcardellino commented 2 years ago

@kron4eg

Just a bunch of them:

k get csr -n kube-system
NAME        AGE     SIGNERNAME                      REQUESTOR                                           REQUESTEDDURATION   CONDITION
csr-244dx   23h     kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-2   <none>              Pending
csr-245mg   8h      kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-2   <none>              Pending
csr-28pkd   22h     kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-3   <none>              Pending
csr-2ccdv   23h     kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-3   <none>              Pending
csr-2dxsc   10h     kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-2   <none>              Pending
csr-2klcw   7h33m   kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-2   <none>              Pending
csr-2qwwc   21h     kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-1   <none>              Pending
csr-47wth   4h58m   kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-2   <none>              Pending
csr-48vcl   6h16m   kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-2   <none>              Pending
csr-49jdp   15h     kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-1   <none>              Pending
csr-4bkxt   178m    kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-1   <none>              Pending
csr-4cbcz   20h     kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-2   <none>              Pending
csr-4dt8b   4h32m   kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-3   <none>              Pending
csr-4h4wl   7h21m   kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-1   <none>              Pending
csr-4pnmt   15h     kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-2   <none>              Pending
csr-4r7tc   6h35m   kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-1   <none>              Pending
csr-4rbgx   6h19m   kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-1   <none>              Pending
csr-4snm6   11h     kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-2   <none>              Pending
csr-4zn6b   19h     kubernetes.io/kubelet-serving   system:node:hcloud-k8s-core-stage-control-plane-3   <none>              Pending

kron4eg commented 2 years ago

@dcardellino I'll move this issue to kubeone

kron4eg commented 2 years ago

@dcardellino what kubeone version are you running?

dcardellino commented 2 years ago

@kron4eg We are running the following version:

kubeone version
{
  "kubeone": {
    "major": "1",
    "minor": "4",
    "gitVersion": "1.4.3",
    "gitCommit": "717787f2287964e5793d80ec8ca2c2169936b0ac",
    "gitTreeState": "",
    "buildDate": "2022-05-11T14:18:03Z",
    "goVersion": "go1.18.1",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "machine_controller": {
    "major": "1",
    "minor": "43",
    "gitVersion": "v1.43.2",
    "gitCommit": "",
    "gitTreeState": "",
    "buildDate": "",
    "goVersion": "",
    "compiler": "",
    "platform": "linux/amd64"
  }
}

dcardellino commented 2 years ago

@kron4eg Any updates here?

kron4eg commented 2 years ago

@dcardellino could you test-drive a dev build from master?

dcardellino commented 2 years ago

@kron4eg

Is there a Make target to build from the master branch?

kron4eg commented 2 years ago

make build will build dist/kubeone

kron4eg commented 2 years ago

I'm not really sure why the kubelet keeps generating new certificate requests; it's probably some kind of bug in the kubelet. But I've observed that once I approve all pending certificates, it stops generating new ones.
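
A minimal sketch of that workaround, assuming kubectl is on the PATH and pointed at the affected cluster. It only uses kubectl get csr -o json and kubectl certificate approve; the function names are made up for illustration. Note that it approves every pending kubelet-serving CSR whose requestor starts with system:node: without checking the requested names, so review the dry-run output first.

```python
import json
import subprocess

KUBELET_SERVING = "kubernetes.io/kubelet-serving"


def pending_kubelet_serving_csrs(csr_list, requestor_prefix="system:node:"):
    """Return names of kubelet-serving CSRs that have not been decided yet."""
    names = []
    for item in csr_list.get("items", []):
        spec = item.get("spec", {})
        conditions = item.get("status", {}).get("conditions") or []
        # A CSR counts as pending when it carries none of these conditions.
        decided = any(
            c.get("type") in ("Approved", "Denied", "Failed") for c in conditions
        )
        if (
            spec.get("signerName") == KUBELET_SERVING
            and spec.get("username", "").startswith(requestor_prefix)
            and not decided
        ):
            names.append(item["metadata"]["name"])
    return names


def approve_pending(dry_run=True):
    """List pending kubelet-serving CSRs and approve them unless dry_run is set."""
    raw = subprocess.run(
        ["kubectl", "get", "csr", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    names = pending_kubelet_serving_csrs(json.loads(raw))
    for name in names:
        if dry_run:
            print(f"would approve {name}")
        else:
            subprocess.run(["kubectl", "certificate", "approve", name], check=True)
    return names
```

Calling approve_pending() prints what would be approved; approve_pending(dry_run=False) performs the approval. Keeping the filter separate from the kubectl calls makes it possible to test the selection logic without a cluster.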

dcardellino commented 2 years ago

@kron4eg Sorry for the delay.

But I still get this issue, although I use the latest release of kubeone and followed the solution suggested in #2199:

This issue is fixed by restarting Kubelet on all control plane nodes after CCM initializes nodes. Kubelet will automatically generate new CSRs when starting, which we approve after a minute or so (we give some time to be sure that all CSRs soaked in).

xmudrii commented 2 years ago

@dcardellino Since #2199, we're not able to reproduce the issue any longer (neither manually, nor in the E2E tests). Can you give KubeOne 1.5.0 a try?

dcardellino commented 2 years ago

@xmudrii Sorry, I forgot to mention that I recently approved the pending CSRs manually with kubectl certificate approve <csr-name>. And now everything works fine! But thank you!

c4tz commented 10 months ago

Hey, I just wanted to let you know that I had the same issue today with KubeOne 1.5.4 and k8s 1.25.11. Sadly, I also manually approved the CSRs before finding this issue.

I still have about 280 pending requests, but they seemingly get deleted when becoming older than 24h. I also have these log entries:

2024-01-31 15:15:21.715 E0131 14:15:21.715009       1 node_csr_approver.go:89] Reconciliation of request /csr-5bjq6 failed: failed to get machine for node 'staging-control-plane-3': failed to get machine for given node name 'staging-control-plane-3'
2024-01-31 15:15:21.715 I0131 14:15:21.714902       1 node_csr_approver.go:103] Reconciling CSR csr-5bjq6
2024-01-31 15:15:21.614 E0131 14:15:21.614419       1 node_csr_approver.go:89] Reconciliation of request /csr-zsg56 failed: failed to get machine for node 'staging-control-plane-2': failed to get machine for given node name 'staging-control-plane-2'
2024-01-31 15:15:21.614 I0131 14:15:21.614329       1 node_csr_approver.go:103] Reconciling CSR csr-zsg56
2024-01-31 15:15:21.515 E0131 14:15:21.515034       1 node_csr_approver.go:89] Reconciliation of request /csr-s99s4 failed: failed to get machine for node 'staging-control-plane-1': failed to get machine for given node name 'staging-control-plane-1'
2024-01-31 15:15:21.515 I0131 14:15:21.514952       1 node_csr_approver.go:103] Reconciling CSR csr-s99s4

even after running KubeOne again (exact command: kubeone apply --manifest kubeone.yaml --tfjson output.json --upgrade-machine-deployments --auto-approve) and deleting the MachineController pod manually, so it gets recreated/restarted.

Either the fix didn't work or my MachineController somehow is running an old version (its image is quay.io/kubermatic/machine-controller:v1.56.2 currently). :thinking: