Cluster autoscaling - new nodes cannot join control plane

cristianrat commented 2 years ago

What happened?

I followed the guide to spin up a kube one cluster with 3 control plane nodes, 2 worker nodes and then I also enabled autoscaling and it worked ok - I tested it

Then a few days later when I scaled it up (by increasing a deployment replicas) new machines are created, but they can't join control plane.

Seems that is the case in 2 clusters - it's fine on the first day, then later it won't work Note: this is on hetzner Log says

Sep 16 10:47:06 dev-pool1-77c8b59b86-9xrgs kubelet[6923]: I0916 10:47:06.652377    6923 kubelet_node_status.go:70] "Attempting to register node" node="dev-pool1-77c8b59b86-9xrgs"

Sep 16 10:47:06 dev-pool1-77c8b59b86-9xrgs kubelet[6923]: E0916 10:47:06.655857    6923 kubelet_node_status.go:92] "Unable to register node with API server" err="nodes is forbidden: User \"system:anonymous\" cannot create resource \"nodes\" in API group \"\" at the cluster scope" node="dev-pool1-77c8b59b86-9xrgs"

And that log just repeats with various flavours / components

Deleting the MachineDeployment and MachineSet and re-creating the MachineDeployment usually fixes it - but not really a practical solution I'm happy to investigate further, if you could point me in the right direction

Expected behavior

Auto scaling to continue working

How to reproduce the issue?

Spin up cluster Setup auto scaling Go away for 1-2 days Come back and try to scale

What KubeOne version are you using?

```console $ kubeone version { "kubeone": { "major": "1", "minor": "5", "gitVersion": "1.5.0-beta.0", "gitCommit": "3800c3d9d244f76d915bb05aa26e45acafc32158", "gitTreeState": "", "buildDate": "2022-08-04T14:03:44Z", "goVersion": "go1.19", "compiler": "gc", "platform": "linux/amd64" }, "machine_controller": { "major": "1", "minor": "53", "gitVersion": "v1.53.0", "gitCommit": "", "gitTreeState": "", "buildDate": "", "goVersion": "", "compiler": "", "platform": "linux/amd64" } } ```

Provide your KubeOneCluster manifest here (if applicable)

```yaml apiVersion: kubeone.k8c.io/v1beta2 kind: KubeOneCluster versions: kubernetes: '1.24.4' cloudProvider: hetzner: {} external: true addons: enable: true addons: - name: cluster-autoscaler - name: unattended-upgrades ```

What cloud provider are you running on?

What operating system are you running in your cluster?

Ubuntu 20.04

Additional information

Happy to help investigate this further, but unsure where to look next

cristianrat commented 2 years ago

update: I've spun up a new cluster with no autoscaling enabled for two new machine deployments and even setting the desired nodes to a higher number gives the same output. So I have to delete the machine deployment and recreate - the only way I can get new nodes

xmudrii commented 2 years ago

@cristianrat Are you still using KubeOne v1.5.0-beta.0? If yes, can you please upgrade to KubeOne 1.5.0? Please make sure to run kubeone apply after upgrading, so that MC and OSM are upgraded to the latest versions. This issue might be fixed in the latest release, so please give it a try again after upgrading.

xmudrii commented 2 years ago

This issue should be fixed by https://github.com/kubermatic/operating-system-manager/pull/205 which is included in 1.5.0 (but not in 1.5.0-beta.0).

cristianrat commented 2 years ago

@xmudrii Will try and let you know - thanks for the suggestion

I've updated so now am getting this

  "kubeone": {
    "major": "1",
    "minor": "5",
    "gitVersion": "1.5.0-rc.0",
    "gitCommit": "16c6bdfd0d219cd1ff0b4beeb9fcdf89d815d84c",
    "gitTreeState": "",
    "buildDate": "2022-08-25T13:18:09Z",
    "goVersion": "go1.19",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "machine_controller": {
    "major": "1",
    "minor": "54",
    "gitVersion": "v1.54.0",
    "gitCommit": "",
    "gitTreeState": "",
    "buildDate": "",
    "goVersion": "",
    "compiler": "",
    "platform": "linux/amd64"
  }
}

let's see

later update : the curl -sfL https://get.kubeone.io | sh way of installing, actually installed 1.5 rc 0...

cristianrat commented 2 years ago

@xmudrii I've redeployed and it didn't seem to make any difference Anything special I need to do before trying a scale?

cristianrat commented 2 years ago

can confirm that having redeployed, this still doesn't work so question is, do I need to spin up a completely new cluster to have this working?

xmudrii commented 2 years ago

@cristianrat I'm trying to reproduce this and I'll get back to you with more information soon.

cristianrat commented 2 years ago

@xmudrii If you want, I am happy to help, I can even give you access to my cluster if it would help

xmudrii commented 2 years ago

@cristianrat Can you please try rolling out all your MachineDeployments (note that this is going to recreate all worker nodes) as described in this document? After that, it should work as expected.

cristianrat commented 2 years ago

@xmudrii I think it fixed it. ran the above command, and just tested the autoscaling after 2 hours Will give it another go tomorrow, give the token a chance to expire :) Thanks so much for your help!

cristianrat commented 2 years ago

Tried now, and the issue is still there, unfortunately

xmudrii commented 2 years ago

@cristianrat Have you tried upgrading to KubeOne 1.5.0? We fixed the release script yesterday, so it should download the stable release instead of a prerelease. Additionally, can you please confirm that you're running OSM v1.0.0 in your cluster:

kubectl get deploy -n kube-system operating-system-manager -o jsonpath='{.spec.template.spec.containers[0].image}'

It should return the following image:

quay.io/kubermatic/operating-system-manager:v1.0.0

I've been trying to reproduce this issue today, while leaving the cluster idle for several hours, but I wasn't able to reproduce it.

cristianrat commented 2 years ago

@xmudrii yes, I've upgraded and then redeployed (via gh actions) the result of said command above is quay.io/kubermatic/operating-system-manager:v1.0.0 Will give it another go, perhaps I missed something

cristianrat commented 2 years ago

@xmudrii I've tested, yet again, and no, doesn't seem to work Perhaps I am doing something wrong, but I have redeployed with 1.5.0 and also did the step above you recommended. Would the only solution then, remain to setup a new cluster?

xmudrii commented 2 years ago

@cristianrat How long did you wait to scale up the cluster? Also, how did you scale it up? Did you use cluster-autoscaler or did you scaled it up manually?

cristianrat commented 2 years ago

@xmudrii I have the machine set at 24 hours, so it's at least 1 day old I scaled it by simply increasing the desired replicas (ie: not via auto scaler) Will try via auto scaler if you think it's worth it

xmudrii commented 2 years ago

I scaled it by simply increasing the desired replicas (ie: not via auto scaler)

You're increasing the desired replicas by editing MachineDeployment object and setting .spec.replicas, right?

Will try via auto scaler if you think it's worth it

I don't think that's important here. cluster-autoscaler is basically doing what you are doing -- changing the number of replicas on the MachineDeployment object.

cristianrat commented 2 years ago

I scaled it by simply increasing the desired replicas (ie: not via auto scaler)

You're increasing the desired replicas by editing MachineDeployment object and setting .spec.replicas, right?

Will try via auto scaler if you think it's worth it

I don't think that's important here. cluster-autoscaler is basically doing what you are doing -- changing the number of replicas on the MachineDeployment object.

That's right, just changing replicas, like so: replicas: 3 instead of 2, for example Under Machine I can see a new one, but it won't connect I ssh into it, tail the logs, and it's always the same, with the anonymous user I realize this is hard to debug and not sure if its just my problem or an actual problem I have 3 clusters and this issue is on all of them, so I guess I either made the same mistake 3 times, or a real problem? :D

xmudrii commented 2 years ago

@cristianrat Can you share your MachineDeployment object, machine-controller and operating-system-manager logs? Just make sure that you redact any secrets if there are any.

mnordstr commented 2 years ago

I am experiencing the same on Hetzner, 1.5.0. Got it to scale by doing a rolling restart of the machinedeployment. Not sure if it works after that (not autoscaling usually very often).

cristianrat commented 2 years ago

I am experiencing the same on Hetzner, 1.5.0. Got it to scale by doing a rolling restart of the machinedeployment. Not sure if it works after that (not autoscaling usually very often).

oh good, so I'm not a total idiot :D

cristianrat commented 2 years ago

@xmudrii Rather than pollute this thread with the files, I've added them to a repo and invited you to it (once we're sorted here, will delete it)

xmudrii commented 2 years ago

@cristianrat Got it, thank you! I'll take a look at this tomorrow and get back to you hopefully with some solution.

cristianrat commented 2 years ago

@cristianrat Got it, thank you! I'll take a look at this tomorrow and get back to you hopefully with some solution.

👍 really appreciate your help, anything else I can do, let me know

xmudrii commented 2 years ago

@cristianrat Can you just also share your kubeone.yaml and terraform.tfvars? Just make sure to redact any secrets if there are any.

cristianrat commented 2 years ago

@xmudrii sure, will add those shortly

kron4eg commented 2 years ago

I was able to reproduce the issue: and like I was suspecting, the problem is in the bootstrap token, which is missing! However it's not clear why exactly it's the this way.

cristianrat commented 2 years ago

I was able to reproduce the issue: and like I was suspecting, the problem is in the bootstrap token, which is missing! However it's not clear why exactly it's the this way.

nice any idea when there will be a fix (or what the fix is)?

xmudrii commented 2 years ago

@cristianrat I haven't confirmed it yet, but I have an assumption about what's going on here. We fixed the original issue that you reported in Operating System Manager (OSM) by:

changing the logic for reconciling OperatingSystemConfig (OSC) objects to regenerate the bootstrap token if the previous one expired
changing the OperatingSystemProfile (OSP) to take the bootstrap kubeconfig from a correct place

All that happened in https://github.com/kubermatic/operating-system-manager/issues/202

However, there is a bug in OSM which causes OSM to not reconcile OperatingSystemProfiles (OSPs) after creating them for the first time. That means, even if you updated OSM from 0.6 to 1.0.0 (which was done when upgrading KubeOne), you don't have the latest OSP and therefore you don't have the needed fix.

This turns out to be a known issue that is being worked on: https://github.com/kubermatic/operating-system-manager/issues/221

In the meanwhile, you should be able to manually mitigate the issue. You need to remove the OSP, restart OSM pods, let OSM create a new and proper OSP, and then rollout all your MachineDeployments. You can accomplish this in the following way:

Remove osp-ubuntu OperatingSystemProfile: kubectl delete osp -n kube-system osp-ubuntu
- You might have to remove a different OSP if you're not using Ubuntu
Delete OSM and OSM webhook pods. Run kubectl get pod -n kube-system, then locate OSM-related pods and remove them using kubectl delete pod -n kube-system <osm-pod-name> <osm-webhook-pod-name>
Wait a few minutes for the new osp-ubuntu to get created. You can run kubectl get osp -n kube-system to see what OSPs you have. Once you see osp-ubuntu, you're good to go to the next step
Rollout all your MachineDeployment objects as described here: https://docs.kubermatic.com/kubeone/v1.5/cheat-sheets/rollout-machinedeployment/
Make sure that the new worker nodes joined the cluster

After that, try to reproduce the issue again. Hopefully, this time you'll not be able to do so, but if you still have the issue, please let us know. As I said earlier, this is only an assumption, but let's hope that we got it right this time. :smile:

cristianrat commented 2 years ago

@xmudrii thank you. Will give it a go today / tomorrow, hopefully.

cristianrat commented 2 years ago

@xmudrii Applied the patches, will test autoscaling tomorrow Give it time to expire the token for anybody reading this, the osm pods are found by their full name operating-system-....

cristianrat commented 2 years ago

@xmudrii I just checked my 3 clusters this morning, I saw some auto scaling activity during the night, just tested now as well by increasing the desired tasks. All worked So I believe this is fixed now Thanks for much for your help!

kubermatic / kubeone