Open laverya opened 4 days ago
Does 1.29.6 count as "latest"? I really have no idea if I should have that box checked here.
It's important to understand that not everything is synced when using dynamic config. You still have to handle certain node-specific parts of the k0s config manually. Additional user-managed stacks in the manifest folders are expected to be synchronized manually, as well.
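As a rough sketch of the split (my reading of the docs; the field list is illustrative, not exhaustive):

```sh
# Read from the local k0s.yaml on each controller -- never synced:
#   spec.api, spec.storage
#
# Read from the ClusterConfig object in kube-system -- synced to all
# controllers running with --enable-dynamic-config:
#   spec.network, spec.extensions, spec.images, ...
#
# Anything under <data-dir>/manifests/<stack>/ is a user-managed stack and
# is applied independently by each controller, not synced.
```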
> but the clusterconfig was removed / dynamic config was disabled
That's... surprising. How did you verify? AFAIK there's no code in k0s that would delete ClusterConfig resources. Could it be that the second controller overwrote the ClusterConfig of the first one? :thinking:
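To tell "overwritten" apart from "deleted", something like this would help (the resource name `k0s` is the default; the CRD name is per my understanding of the k0s API group):

```sh
# Is the ClusterConfig object still there, and with which contents?
sudo k0s kubectl -n kube-system get clusterconfig k0s -o yaml

# If the object is gone entirely, check whether the CRD itself survived:
sudo k0s kubectl get crd clusterconfigs.k0s.k0sproject.io
```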
> existing helm charts were removed
How were the Helm charts installed? Via manifests in `<data-dir>/manifests` or via k0s config in `spec.extensions.helm`?
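For reference, the two mechanisms look roughly like this (paths assume the default data dir `/var/lib/k0s`; `my-stack` is a placeholder):

```sh
# Option 1: user-managed stack in the manifests folder -- applied per
# controller, NOT synced by dynamic config:
ls /var/lib/k0s/manifests/my-stack/

# Option 2: Helm extension inside the k0s config -- carried in the
# ClusterConfig when dynamic config is enabled:
grep -A10 'helm:' k0s.yaml
```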
> metrics-server/coredns are no longer running
Did you check the logs of the new controller? Are there any files in `<data-dir>/manifests/{coredns,metricserver}`? Maybe it had some trouble getting the ClusterConfig from the existing cluster? I can only speculate that this happens because the two controllers have conflicting service CIDRs, and the cluster networking then gets borked. Need to investigate.
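Something along these lines would be useful (the service name assumes the default `k0s install controller` setup):

```sh
# Controller logs -- k0s install creates a k0scontroller service by default
journalctl -u k0scontroller --no-pager | tail -n 200

# Are the stock manifest stacks still on disk?
ls -l /var/lib/k0s/manifests/coredns/ /var/lib/k0s/manifests/metricserver/
```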
> but the clusterconfig was removed / dynamic config was disabled
Running `kubectl get clusterconfig -A` returned no results. If there's a different way to verify, I can try that.
> How were the Helm charts installed? Via manifests in `<data-dir>/manifests` or via k0s config in `spec.extensions.helm`?
Via a k0s config in `spec.extensions.helm`. I've included the k0s config used for the initial primary node in the ticket above.
> Did you check the logs of the new controller? Are there any files in `<data-dir>/manifests/{coredns,metricserver}`? Maybe it had some trouble getting the ClusterConfig from the existing cluster? I can only speculate that this happens because the two controllers have conflicting service CIDRs, and the cluster networking then gets borked. Need to investigate.
I can spin up a new pair of instances and grab any logs you'd like; this has reproduced consistently for me so far.
### Platform
### Version
v1.29.6+k0s.0
### Sysinfo
`k0s sysinfo`
### What happened?
I created a cluster with calico networking, custom CIDRs, and a minimal helm chart extension to demonstrate the problem (goldpinger).
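The full config is attached under `k0s config` below; as a sketch of its shape, it is roughly along these lines (the CIDRs and chart repo shown here are illustrative placeholders, not the exact values):

```sh
cat <<'EOF' > k0s.yaml
# Illustrative sketch only -- see the attached `k0s config` for the real values.
apiVersion: k0s.k0sproject.io/v1beta1
kind: ClusterConfig
metadata:
  name: k0s
spec:
  network:
    provider: calico
    podCIDR: 10.10.0.0/16        # placeholder custom CIDRs
    serviceCIDR: 10.11.0.0/16
  extensions:
    helm:
      repositories:
        - name: goldpinger-repo  # placeholder repo name/URL
          url: https://okgolove.github.io/helm-charts/
      charts:
        - name: goldpinger
          chartname: goldpinger-repo/goldpinger
          namespace: goldpinger
EOF
```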
k0s was installed with `sudo /usr/local/bin/k0s install controller -c k0s.yaml --enable-dynamic-config --enable-worker --no-taints`. After starting and waiting for things to stabilize, a clusterconfig `k0s` existed in kube-system and the expected pods (including goldpinger, coredns, and metrics-server) were running.

I then created a controller token and joined an additional node with `sudo /usr/local/bin/k0s install controller --token-file token.txt --enable-worker --no-taints --enable-dynamic-config`. After the additional controller joined, there is no longer a clusterconfig present, and only a reduced set of pods is running; goldpinger, coredns, and metrics-server are all no longer present.
### Steps to reproduce
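Condensed from the description above into a script sketch (hostnames, file paths, and the wait step are mine; the k0s.yaml is the one attached under `k0s config` below):

```sh
# Node 1: install and start the first controller+worker
sudo /usr/local/bin/k0s install controller -c k0s.yaml \
  --enable-dynamic-config --enable-worker --no-taints
sudo k0s start

# ...wait for things to stabilize, then confirm the ClusterConfig exists
sudo k0s kubectl -n kube-system get clusterconfig k0s

# Node 1: mint a controller join token
sudo k0s token create --role=controller > token.txt

# Node 2: join with the copied token
sudo /usr/local/bin/k0s install controller --token-file token.txt \
  --enable-worker --no-taints --enable-dynamic-config
sudo k0s start

# Node 1: the clusterconfig is now gone, and the goldpinger/coredns/
# metrics-server pods are no longer present
sudo k0s kubectl -n kube-system get clusterconfig k0s
sudo k0s kubectl get pods -A
```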
### Expected behavior
The node joined as an additional controller and the cluster config was unchanged.
### Actual behavior
The node joined as an additional controller, but the clusterconfig was removed / dynamic config was disabled, existing helm charts were removed, and metrics-server/coredns are no longer running.
### Screenshots and logs
No response
### Additional context
`k0s config`
The SANs on the certificate used by the joining node are also wrong (demonstrated with `openssl s_client -connect 10.11.0.1:443 </dev/null 2>/dev/null | openssl x509 -inform pem -text | grep -A1 "Subject Alternative Name"`), but that will be a different issue.

I have also tested this with 1.30.2, and it has not reproduced there in my testing. I also believe this to be OS-dependent, as I was able to reproduce with Rocky 9 but not Ubuntu.