I set up a Cluster API environment with the Tinkerbell provider, plus a Tinkerbell stack on a single server, by following https://github.com/tinkerbell/cluster-api-provider-tinkerbell/tree/main/docs. It successfully provisioned a workload K8s cluster (3 control plane nodes + 3 worker nodes) where all servers are physical machines.
When testing a rolling restart of the MachineDeployment that contains the 3 worker nodes, I set maxUnavailable: 3 (an extreme case, to test how refreshing nodes (actually re-image + rejoin) works in parallel). A single Hardware then got assigned to two TinkerbellMachines, and the rolling restart got stuck.
Expected Behaviour
Each node managed by the MachineDeployment should be linked to a different Hardware, even when multiple nodes are being refreshed or restarted at the same time.
Current Behaviour
From the screenshot, the hardware n62-107-74 was linked to two TinkerbellMachines: capi-quickstart-worker-a-2kjwb and capi-quickstart-worker-a-h9f8z.
Possible Solution
Explained in the context below.
Steps to Reproduce (for bugs)
Provision a MachineDeployment with multiple nodes (>= 2).
Set the strategy of the MachineDeployment:
strategy:
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 3 # Set to 3 just for an easy repro.
  type: RollingUpdate
Trigger a rolling restart of the MachineDeployment with clusterctl:
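For example (the MachineDeployment name below is the one from my quickstart setup; adjust the name and namespace to match your environment):

clusterctl alpha rollout restart machinedeployment/capi-quickstart-worker-a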
All 3 Machines are deleted first (along with their TinkerbellMachines and Workflows) and then re-enter the provisioning stage.
It is then very likely that a single Hardware ends up linked to multiple TinkerbellMachines (2 or even 3).
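The duplicate assignment can be confirmed by checking which Hardware each TinkerbellMachine points at, e.g. with something like:

kubectl get tinkerbellmachines -o yaml | grep hardwareName

If two TinkerbellMachines report the same hardwareName, the same Hardware has been assigned twice.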
Context
This issue blocks node upgrades and restarts.
After digging into the code, it looks like the owner labels on the Hardware get deleted twice (or even 3 times).
Let's use the case above as an example; here are the ownership labels involved on the Hardware:
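(Illustrative values only; the label keys are as I recall them from the provider code, not copied from my cluster.)

metadata:
  labels:
    v1alpha1.tinkerbell.org/ownerName: capi-quickstart-worker-a-2kjwb
    v1alpha1.tinkerbell.org/ownerNamespace: default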
The race can play out like this (a simplified code sketch of the ordering follows after these steps):
1. The 3 machines are being deleted at the same time.
2. The reconcile of machine (n62-107-74) calls the DeleteMachineWithDependencies() method, which deletes the ownership labels on the Hardware and then creates a PowerOff BMC job to shut down the (n62-107-74) machine.
3. The reconcile of machine (n62-107.78) might be faster and has already completed its BMC job, so it calls ensureHardware() to select a Hardware for itself. Because the ownership labels of n62-107-74 were deleted in step 2, it may select Hardware n62-107-74 and add its own ownership labels to it, e.g. v1alpha1.tinkerbell.org/ownerName=capi-quickstart-worker-a-h9f8z.
4. The reconcile of machine (n62-107-74) calls DeleteMachineWithDependencies() again to check whether the PowerOff BMC job has completed. Inside DeleteMachineWithDependencies() it deletes the ownership labels again, even though those labels were just created in step 3 by the reconcile of machine (n62-107.78).
5. The reconcile of machine (n62-107-74) continues once the PowerOff BMC job completes, so it calls ensureHardware() to select a Hardware for itself. Because the ownership labels were deleted in step 4, it may select Hardware n62-107-74 and set the ownership label to v1alpha1.tinkerbell.org/ownerName=capi-quickstart-worker-a-2kjwb.
6. As a result, the two machines (n62-107-74) and (n62-107.78) end up linked to the same Hardware (n62-107-74).
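To make the ordering concrete, here is a heavily simplified sketch of the delete flow before the change; it only illustrates the order of operations, and the PowerOff helper name is a placeholder rather than the actual upstream function:

// Simplified sketch of the ordering before the change (ordering only; not the exact upstream code).
func (scope *machineReconcileScope) DeleteMachineWithDependencies() error {
	hardware := &tinkv1.Hardware{}
	if err := scope.getHardwareForMachine(hardware); err != nil {
		return err
	}

	// (a) Ownership labels are removed right away, on every reconcile of the deleting machine.
	if err := scope.removeDependencies(hardware); err != nil {
		return err
	}

	// (b) Only afterwards is the PowerOff BMC job created and checked. While the job is still
	// running, the reconcile returns and runs again later, repeating (a) and wiping labels that
	// another TinkerbellMachine may have just written onto the same Hardware.
	return scope.handlePowerOffJob(hardware) // placeholder name for the PowerOff handling
}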
A possible solution is to move the removal of the ownership labels until after the PowerOff BMC job has completed. Here is the code I used, which works well in my environment:
// DeleteMachineWithDependencies removes the template and workflow objects associated with the given
// machine. The ownership labels on the Hardware are only removed once the PowerOff BMC job has
// completed, so another TinkerbellMachine cannot claim the Hardware while it is still powering off.
func (scope *machineReconcileScope) DeleteMachineWithDependencies() error {
	scope.log.Info("Removing machine", "hardwareName", scope.tinkerbellMachine.Spec.HardwareName)

	// Fetch the Hardware backing this machine.
	hardware := &tinkv1.Hardware{}
	if err := scope.getHardwareForMachine(hardware); err != nil {
		return err
	}

	// If the Hardware has no BMC reference, there is nothing to power off: remove the ownership
	// labels and the finalizer right away and let the machine object be deleted.
	if hardware.Spec.BMCRef == nil {
		scope.log.Info("Hardware BMC reference not present; skipping hardware power off",
			"BMCRef", hardware.Spec.BMCRef, "Hardware", hardware.Name)

		if err := scope.removeDependencies(hardware); err != nil {
			return err
		}

		return scope.removeFinalizer()
	}

	// getOrCreateBMCPowerOffJob() is a helper I wrote that creates (or fetches) a PowerOff-only BMC job.
	bmcJob, err := scope.getOrCreateBMCPowerOffJob(hardware)
	if err != nil {
		return err
	}

	// Surface a failed power off instead of waiting for it forever.
	if bmcJob.HasCondition(rufiov1.JobFailed, rufiov1.ConditionTrue) {
		return fmt.Errorf("bmc job %s/%s failed", bmcJob.Namespace, bmcJob.Name) //nolint:goerr113
	}

	// Requeue until the power off job completes.
	if !bmcJob.HasCondition(rufiov1.JobCompleted, rufiov1.ConditionTrue) {
		scope.log.Info("Waiting for the power off BMCJob to complete",
			"Name", bmcJob.Name,
			"Namespace", bmcJob.Namespace,
		)

		return nil
	}

	// Only remove the ownership labels once the BMC PowerOff job has completed.
	if err := scope.removeDependencies(hardware); err != nil {
		return err
	}

	return scope.removeFinalizer()
}
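With this ordering, removeDependencies() is only reached once the PowerOff job reports JobCompleted, so the Hardware keeps its ownership labels, and therefore cannot be selected by ensureHardware() for another TinkerbellMachine, until the old machine is actually powered off.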
Your Environment
Operating System and version (e.g. Linux, Windows, MacOS):
Debian 10
How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:
I deployed the Tinkerbell stack as the sandbox one, with docker-compose, on a single server.
Link to your project or a code example to reproduce issue:
N/A