siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Control Plane Scheduling Glitch during Bootstrap #9691

Closed: M4t7e closed 1 week ago

M4t7e commented 2 weeks ago

Bug Report

Description

During cluster bootstrap, there's a chance that a control plane node briefly has no taints at all, which allows pods to be scheduled there that should be kept off it. I noticed this when integrating Longhorn: a control plane node unintentionally ended up serving as a Longhorn storage node. Investigating further, I found that in about 25% of cluster bootstraps, one node goes without the node-role.kubernetes.io/control-plane=:NoSchedule taint for longer than the other nodes. Typically, the other taints disappear after some time, leaving the node completely untainted for up to ~3 seconds before the node-role.kubernetes.io/control-plane=:NoSchedule taint is applied (screenshot in the original issue).

This brief window is long enough for pods that normally shouldn't run on control plane nodes to be scheduled there, which is what happened when Longhorn was placed on a control plane node (screenshot in the original issue).
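For anyone trying to reproduce this, here is a minimal sketch of a check for the untainted state described above. It is not part of the original report. It assumes the input has the shape of `kubectl get nodes -o json` output and that the node-role.kubernetes.io/control-plane label is already present on the node while the taint is still missing; the function name and sample data are hypothetical.

```python
# Sketch only: find control plane nodes that lack the NoSchedule taint.
# Assumption: node_list has the shape of `kubectl get nodes -o json` output,
# and control plane nodes carry the node-role.kubernetes.io/control-plane label.

CONTROL_PLANE_KEY = "node-role.kubernetes.io/control-plane"


def untainted_control_plane_nodes(node_list: dict) -> list:
    """Return names of control plane nodes without the NoSchedule taint."""
    missing = []
    for node in node_list.get("items", []):
        labels = node.get("metadata", {}).get("labels") or {}
        if CONTROL_PLANE_KEY not in labels:
            continue  # not a control plane node, no taint expected
        # spec.taints is absent (or null) when the node has no taints
        taints = node.get("spec", {}).get("taints") or []
        has_taint = any(
            t.get("key") == CONTROL_PLANE_KEY and t.get("effect") == "NoSchedule"
            for t in taints
        )
        if not has_taint:
            missing.append(node["metadata"]["name"])
    return missing


# Hypothetical sample data: cp-2 is in the untainted window.
SAMPLE_NODES = {
    "items": [
        {
            "metadata": {"name": "cp-1", "labels": {CONTROL_PLANE_KEY: ""}},
            "spec": {"taints": [{"key": CONTROL_PLANE_KEY, "effect": "NoSchedule"}]},
        },
        {
            "metadata": {"name": "cp-2", "labels": {CONTROL_PLANE_KEY: ""}},
            "spec": {},
        },
        {
            "metadata": {"name": "worker-1", "labels": {}},
            "spec": {},
        },
    ]
}

flagged = untainted_control_plane_nodes(SAMPLE_NODES)
```

Polling this against live node data during bootstrap should show whether a node passes through the untainted state, though a window of only a few seconds is easy to miss with a slow polling interval.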

Environment

M4t7e commented 2 weeks ago

I just had the idea to manually add the taint node-role.kubernetes.io/control-plane:NoSchedule using registerWithTaints, and it seems to fix the issue. However, I'm not sure whether it's safe to add this manually. Neither talosctl gen config nor the Terraform talos_machine_configuration data source applies this taint by default.
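As a rough illustration of the workaround, the sketch below builds such a patch as JSON. The comment above only names registerWithTaints; placing it under machine.kubelet.extraConfig in the Talos machine config is an assumption here, and the function name is hypothetical. Whether the workaround is safe is still the open question raised above.

```python
import json

# Sketch only: build a machine config patch that asks the kubelet to register
# the node with the control plane taint already set.
# Assumption: registerWithTaints is placed under machine.kubelet.extraConfig;
# the issue text itself only mentions the registerWithTaints field.


def control_plane_taint_patch() -> dict:
    """Return a hypothetical config patch for control plane nodes."""
    return {
        "machine": {
            "kubelet": {
                "extraConfig": {
                    "registerWithTaints": [
                        {
                            "key": "node-role.kubernetes.io/control-plane",
                            "effect": "NoSchedule",
                        }
                    ]
                }
            }
        }
    }


# JSON is also valid YAML, so the output can be saved as a patch file.
patch_json = json.dumps(control_plane_taint_patch(), indent=2)
print(patch_json)
```

A patch like this would only make sense on control plane nodes; applied to workers, it would keep regular workloads off them as well.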

smira commented 2 weeks ago

If you apply Longhorn early enough, yes, I guess this race does exist, and it's a good one to get fixed.