Closed M4t7e closed 1 week ago
I just had the idea to manually add the taint node-role.kubernetes.io/control-plane:NoSchedule using registerWithTaints, and it seems to fix the issue. However, I'm not sure if it's safe to add this manually. Neither talosctl gen config nor the Terraform talos_machine_configuration data source applies this taint by default.
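For reference, the workaround described above could look roughly like the machine config patch below. This is only a sketch and not a confirmed recommendation. It assumes the taint is passed through machine.kubelet.extraConfig, which Talos merges into the kubelet configuration, and that registerWithTaints takes the kubelet's usual list of key/effect entries. It is meant for control plane machine configs only.

```yaml
# Hypothetical patch, for control plane machine configs only.
# Assumption: the taint is set through the kubelet's registerWithTaints
# field, passed via machine.kubelet.extraConfig.
machine:
  kubelet:
    extraConfig:
      registerWithTaints:
        # The kubelet registers the node with this taint already in place,
        # rather than relying on it being added after registration.
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
```

Whether this interacts cleanly with the way Talos manages the control plane taint itself is exactly the open question raised above, so treat it as a stopgap until the race is fixed.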
If you apply Longhorn early enough, yes, I guess this race does exist, and it's a good one to get fixed.
Bug Report
Description
During cluster bootstrap, there's a chance that a control plane node briefly lacks taints, inadvertently allowing pod scheduling that should be restricted. This issue surfaced when integrating Longhorn, as a control plane node unintentionally served as a Longhorn storage node. Investigating this, I found that during about 25% of cluster bootstraps, a node remains untainted by node-role.kubernetes.io/control-plane=:NoSchedule longer than the other nodes. Typically, other taints vanish after some time, leaving the node untainted for up to ~3 seconds before the node-role.kubernetes.io/control-plane=:NoSchedule taint is applied:

This brief period is enough for pods, which normally shouldn't be scheduled on control plane nodes, to be deployed there. This was observed when Longhorn was inadvertently assigned to a control plane node:
Environment