Closed: cristianrat closed this issue 2 years ago
Update: I've spun up a new cluster with no autoscaling enabled for two new MachineDeployments, and even setting the desired nodes to a higher number gives the same output. So I have to delete the MachineDeployment and recreate it: that's the only way I can get new nodes.
@cristianrat Are you still using KubeOne v1.5.0-beta.0? If yes, can you please upgrade to KubeOne 1.5.0? Please make sure to run kubeone apply
after upgrading, so that MC and OSM are upgraded to the latest versions. This issue might be fixed in the latest release, so please give it a try again after upgrading.
This issue should be fixed by https://github.com/kubermatic/operating-system-manager/pull/205 which is included in 1.5.0 (but not in 1.5.0-beta.0).
@xmudrii Will try and let you know - thanks for the suggestion
I've updated, so now I'm getting this:
{
  "kubeone": {
    "major": "1",
    "minor": "5",
    "gitVersion": "1.5.0-rc.0",
    "gitCommit": "16c6bdfd0d219cd1ff0b4beeb9fcdf89d815d84c",
    "gitTreeState": "",
    "buildDate": "2022-08-25T13:18:09Z",
    "goVersion": "go1.19",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "machine_controller": {
    "major": "1",
    "minor": "54",
    "gitVersion": "v1.54.0",
    "gitCommit": "",
    "gitTreeState": "",
    "buildDate": "",
    "goVersion": "",
    "compiler": "",
    "platform": "linux/amd64"
  }
}
let's see
Later update: the curl -sfL https://get.kubeone.io | sh way of installing actually installed 1.5.0-rc.0...
@xmudrii I've redeployed and it didn't seem to make any difference. Anything special I need to do before trying a scale?
Can confirm that, having redeployed, this still doesn't work. So the question is: do I need to spin up a completely new cluster to get this working?
@cristianrat I'm trying to reproduce this and I'll get back to you with more information soon.
@xmudrii If you want, I am happy to help, I can even give you access to my cluster if it would help
@cristianrat Can you please try rolling out all your MachineDeployments (note that this is going to recreate all worker nodes) as described in this document? After that, it should work as expected.
@xmudrii I think that fixed it. I ran the above command and just tested the autoscaling after 2 hours. Will give it another go tomorrow, to give the token a chance to expire :) Thanks so much for your help!
Tried now, and the issue is still there, unfortunately
@cristianrat Have you tried upgrading to KubeOne 1.5.0? We fixed the release script yesterday, so it should download the stable release instead of a prerelease. Additionally, can you please confirm that you're running OSM v1.0.0 in your cluster:
kubectl get deploy -n kube-system operating-system-manager -o jsonpath='{.spec.template.spec.containers[0].image}'
It should return the following image:
quay.io/kubermatic/operating-system-manager:v1.0.0
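If you want to script that check, here's a sketch; the helper function name is mine, not part of KubeOne, and in a real cluster you would feed it the live image from the kubectl command above:

```shell
# Hedged helper: compare a given image string against the expected OSM image.
check_osm_image() {
  expected="quay.io/kubermatic/operating-system-manager:v1.0.0"
  if [ "$1" = "$expected" ]; then
    echo "OK"
  else
    echo "MISMATCH: $1"
  fi
}

# Against a live cluster you'd do something like:
# check_osm_image "$(kubectl get deploy -n kube-system operating-system-manager \
#   -o jsonpath='{.spec.template.spec.containers[0].image}')"
check_osm_image "quay.io/kubermatic/operating-system-manager:v1.0.0"   # prints OK
```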
I've been trying to reproduce this issue today, while leaving the cluster idle for several hours, but I wasn't able to reproduce it.
@xmudrii yes, I've upgraded and then redeployed (via gh actions)
the result of said command above is
quay.io/kubermatic/operating-system-manager:v1.0.0
Will give it another go, perhaps I missed something
@xmudrii I've tested, yet again, and no, it doesn't seem to work. Perhaps I am doing something wrong, but I have redeployed with 1.5.0 and also did the step you recommended above. Would the only solution, then, be to set up a new cluster?
@cristianrat How long did you wait to scale up the cluster? Also, how did you scale it up? Did you use cluster-autoscaler or did you scale it up manually?
@xmudrii I have the machine set at 24 hours, so it's at least 1 day old. I scaled it by simply increasing the desired replicas (i.e. not via the autoscaler). Will try via the autoscaler if you think it's worth it.
I scaled it by simply increasing the desired replicas (ie: not via auto scaler)
You're increasing the desired replicas by editing the MachineDeployment object and setting .spec.replicas, right?
Will try via auto scaler if you think it's worth it
I don't think that's important here. cluster-autoscaler is basically doing what you are doing -- changing the number of replicas on the MachineDeployment object.
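If it helps, here's a sketch of doing that change non-interactively; the MachineDeployment name "pool1" is a placeholder, adjust it and the namespace to match your cluster:

```shell
# Bump .spec.replicas on a MachineDeployment without opening an editor
# (equivalent to kubectl edit machinedeployment pool1 -n kube-system and
# changing replicas from 2 to 3):
kubectl patch machinedeployment pool1 --namespace kube-system \
  --type merge --patch '{"spec":{"replicas":3}}'

# Verify the change took effect:
kubectl get machinedeployment pool1 --namespace kube-system \
  -o jsonpath='{.spec.replicas}'
```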
That's right, just changing replicas, like so: replicas: 3 instead of 2, for example.
Under Machine I can see a new one, but it won't connect
I ssh into it, tail the logs, and it's always the same, with the anonymous user
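For reference, this is roughly how I'm tailing them; this assumes a systemd-based node with the standard kubelet unit name:

```shell
# On the new machine that fails to join:
journalctl -u kubelet --no-pager --since "10 min ago"

# The repeated errors mention the anonymous user, so grep for it:
journalctl -u kubelet --no-pager | grep -i anonymous
```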
I realize this is hard to debug and I'm not sure if it's just my problem or an actual problem.
I have 3 clusters and this issue is on all of them, so I guess either I made the same mistake 3 times, or it's a real problem? :D
@cristianrat Can you share your MachineDeployment object, machine-controller and operating-system-manager logs? Just make sure that you redact any secrets if there are any.
I am experiencing the same on Hetzner, 1.5.0. Got it to scale by doing a rolling restart of the machinedeployment. Not sure if it works after that (not autoscaling usually very often).
oh good, so I'm not a total idiot :D
@xmudrii Rather than pollute this thread with the files, I've added them to a repo and invited you to it (once we're sorted here, will delete it)
@cristianrat Got it, thank you! I'll take a look at this tomorrow and get back to you hopefully with some solution.
👍 really appreciate your help, anything else I can do, let me know
@cristianrat Can you also share your kubeone.yaml and terraform.tfvars? Just make sure to redact any secrets if there are any.
@xmudrii sure, will add those shortly
I was able to reproduce the issue, and as I was suspecting, the problem is in the bootstrap token, which is missing! However, it's not clear why exactly it's this way.
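For anyone who wants to check this on their own cluster: bootstrap tokens are stored as Secrets of a well-known type in kube-system, so a sketch of listing them (not specific to KubeOne) is:

```shell
# List bootstrap token Secrets; if none exist, new machines have nothing to
# join with. Secret values are base64-encoded, including the expiration field.
kubectl get secrets --namespace kube-system \
  --field-selector type=bootstrap.kubernetes.io/token
```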
Nice! Any idea when there will be a fix (or what the fix is)?
@cristianrat I haven't confirmed it yet, but I have an assumption about what's going on here. We fixed the original issue that you reported in Operating System Manager (OSM); all of that happened in https://github.com/kubermatic/operating-system-manager/issues/202
However, there is a bug in OSM which causes OSM to not reconcile OperatingSystemProfiles (OSPs) after creating them for the first time. That means, even if you updated OSM from 0.6 to 1.0.0 (which was done when upgrading KubeOne), you don't have the latest OSP and therefore you don't have the needed fix.
This turns out to be a known issue that is being worked on: https://github.com/kubermatic/operating-system-manager/issues/221
In the meanwhile, you should be able to manually mitigate the issue. You need to remove the OSP, restart the OSM pods, let OSM create a new and proper OSP, and then roll out all your MachineDeployments. You can accomplish this in the following way:
1. Remove the osp-ubuntu OperatingSystemProfile: kubectl delete osp -n kube-system osp-ubuntu
2. Restart the OSM pods: run kubectl get pod -n kube-system, then locate the OSM-related pods and remove them using kubectl delete pod -n kube-system <osm-pod-name> <osm-webhook-pod-name>
3. Wait for osp-ubuntu to get created. You can run kubectl get osp -n kube-system to see what OSPs you have. Once you see osp-ubuntu, you're good to go to the next step
4. Roll out all your MachineDeployments (note that this recreates all worker nodes)
After that, try to reproduce the issue again. Hopefully, this time you'll not be able to do so, but if you still have the issue, please let us know. As I said earlier, this is only an assumption, but let's hope that we got it right this time. :smile:
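The mitigation can be sketched as one shell sequence; the grep pattern is an assumption based on the deployment name used earlier in this thread, and the pod-name placeholders are left as placeholders on purpose:

```shell
# 1. Remove the stale OperatingSystemProfile:
kubectl delete osp -n kube-system osp-ubuntu

# 2. Find and delete the OSM and OSM-webhook pods so they restart and
#    recreate the OSP:
kubectl get pod -n kube-system | grep operating-system-manager
kubectl delete pod -n kube-system <osm-pod-name> <osm-webhook-pod-name>

# 3. Watch for osp-ubuntu to be recreated before rolling out MachineDeployments:
kubectl get osp -n kube-system --watch
```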
@xmudrii thank you. Will give it a go today / tomorrow, hopefully.
@xmudrii Applied the patches, will test autoscaling tomorrow, to give the token time to expire.
For anybody reading this: the OSM pods are found by their full name, operating-system-....
@xmudrii I just checked my 3 clusters this morning. I saw some autoscaling activity during the night, and just tested now as well by increasing the desired replicas. All worked, so I believe this is fixed now. Thanks so much for your help!
What happened?
I followed the guide to spin up a KubeOne cluster with 3 control plane nodes and 2 worker nodes, and then I also enabled autoscaling and it worked OK (I tested it).
Then, a few days later, when I scaled it up (by increasing a MachineDeployment's replicas), new machines are created, but they can't join the control plane.
Seems that is the case in 2 clusters: it's fine on the first day, then later it won't work. Note: this is on Hetzner. The log says:
And that log just repeats with various flavours / components
Deleting the MachineDeployment and MachineSet and re-creating the MachineDeployment usually fixes it, but that's not really a practical solution. I'm happy to investigate further if you could point me in the right direction.
Expected behavior
Auto scaling to continue working
How to reproduce the issue?
1. Spin up a cluster
2. Set up autoscaling
3. Go away for 1-2 days
4. Come back and try to scale
What KubeOne version are you using?
Provide your KubeOneCluster manifest here (if applicable)
What cloud provider are you running on?
What operating system are you running in your cluster?
Ubuntu 20.04
Additional information
Happy to help investigate this further, but unsure where to look next