siderolabs / omni

SaaS-simple deployment of Kubernetes - on your own hardware.

[bug] Cluster patch changes rebooting all control plane nodes at once #244

netthier closed this issue 1 month ago

netthier commented 1 month ago

Is there an existing issue for this?

Current Behavior

After upgrading my Omni-managed cluster from Talos 1.6 to 1.7, applying a cluster-level config patch reboots all of my nodes at once, including the control plane nodes.

Expected Behavior

I expected the control plane nodes to reboot one by one, to prevent downtime or worse.
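To illustrate why a one-by-one reboot matters here, the following is a minimal sketch of majority-quorum arithmetic. It assumes each control plane node runs one etcd member, so a 3-node control plane forms a 3-member etcd cluster; the function names `quorum` and `survives` are made up for this example and are not part of Omni or Talos.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def survives(members: int, down: int) -> bool:
    """True if the remaining members can still form a majority."""
    return members - down >= quorum(members)


if __name__ == "__main__":
    # Assumption: 3 control plane nodes, one etcd member on each.
    for down in range(4):
        state = "quorum kept" if survives(3, down) else "quorum lost"
        print(f"{down} of 3 nodes rebooting: {state}")
```

Under these assumptions, rebooting one node at a time keeps a majority available, while rebooting two or all three at once does not.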

Steps To Reproduce

Not sure if all steps are relevant, but I'm posting everything just in case.

  1. With Sidero-managed Omni v0.35.1, upgrade a cluster that uses the tailscale extension from Talos 1.6 to 1.7.
  2. During the upgrade, create a machine-specific patch containing the ExtensionServiceConfig for each newly upgraded machine.
  3. After the upgrade concludes, replace the machine-specific patches with a single cluster patch containing the same config (no reboot occurs here).
  4. Remove the machine.files section containing the old Tailscale configuration from the cluster patch.
  5. Observe all control plane nodes (the cluster has 3 control plane nodes and 0 workers) rebooting at once: [image]
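The change made in step 4 can be sketched roughly as follows. This is a hypothetical reconstruction, not the reporter's actual patch: the patch is modeled as a list of config documents in Python dicts, and the file path, permissions, and `TS_AUTHKEY` variable are assumptions chosen for illustration. `dropped_documents` is a helper invented for this example.

```python
# Hypothetical stand-in for the old machine.files based Tailscale config.
# The path, permissions, and TS_AUTHKEY variable are assumptions.
legacy_files_doc = {
    "machine": {
        "files": [
            {
                "path": "/var/etc/tailscale/auth.env",
                "permissions": 0o644,
                "op": "create",
                "content": "TS_AUTHKEY=<redacted>",
            }
        ]
    }
}

# Hypothetical stand-in for the ExtensionServiceConfig document.
extension_doc = {
    "apiVersion": "v1alpha1",
    "kind": "ExtensionServiceConfig",
    "name": "tailscale",
    "environment": ["TS_AUTHKEY=<redacted>"],
}

patch_step3 = [legacy_files_doc, extension_doc]  # cluster patch after step 3
patch_step4 = [extension_doc]                    # cluster patch after step 4


def dropped_documents(before, after):
    """Return the documents present in `before` but missing from `after`."""
    return [doc for doc in before if doc not in after]


if __name__ == "__main__":
    for doc in dropped_documents(patch_step3, patch_step4):
        print("removed top-level keys:", sorted(doc))
```

In this model, the only difference between steps 3 and 4 is the removal of the machine.files document, which is the edit that preceded the simultaneous reboot.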

I later upgraded the cluster to k8s 1.30.1 and began hitting https://github.com/siderolabs/talos/issues/8652. After setting machine.features.diskQuotaSupport to false in the cluster patch, I again observed all nodes rebooting in parallel. The cluster seems fine otherwise.

What browsers are you seeing the problem on?

Firefox

Anything else?

I posted about this issue on Slack: https://taloscommunity.slack.com/archives/C04D4PDAJT0/p1715861414751159