tinkerbell / cluster-api-provider-tinkerbell


Same Hardware assigned to multiple TinkerbellMachines when maxUnavailable >= 2 in MachineDeployment #330

Open rockc2020 opened 7 months ago

rockc2020 commented 7 months ago

I set up a Cluster API environment with the Tinkerbell provider, plus a Tinkerbell stack on a single server, by following https://github.com/tinkerbell/cluster-api-provider-tinkerbell/tree/main/docs. It successfully provisioned a workload K8s cluster (3 control plane nodes + 3 worker nodes) where all servers are physical machines. While testing a rolling restart of a MachineDeployment containing the 3 worker nodes, I set maxUnavailable: 3 (an extreme case, to test how refreshing nodes, i.e. re-image + rejoin, works in parallel). A single Hardware was then assigned to two TinkerbellMachines, and the rolling restart got stuck.

Expected Behaviour

Each node managed by the MachineDeployment should be linked to a different Hardware, even when multiple nodes are being refreshed or restarted in parallel.

Current Behaviour

As shown in the screenshot below, the Hardware n62-107-74 was linked to two TinkerbellMachines: capi-quickstart-worker-a-2kjwb and capi-quickstart-worker-a-h9f8z.

[screenshot: two TinkerbellMachines referencing the same Hardware n62-107-74]
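For what it's worth, the duplicate assignment can also be seen from the CLI by listing the Hardware name each TinkerbellMachine records in spec.hardwareName (the same field the code below reads as Spec.HardwareName):

    kubectl get tinkerbellmachine -A -o custom-columns=NAME:.metadata.name,HARDWARE:.spec.hardwareName

Two rows with the same HARDWARE value reproduce the situation in the screenshot.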

Possible Solution

Explained in the context below.

Steps to Reproduce (for bugs)

  1. Provision a MachineDeployment with multiple nodes (>= 2).
  2. Set the strategy of the MachineDeployment:
    strategy:
      rollingUpdate:
        maxSurge: 0
        maxUnavailable: 3 # Just set it to 3 for an easy repro.
      type: RollingUpdate
  3. Rolling restart the MachineDeployment with clusterctl:
    clusterctl alpha rollout restart machinedeployment/capi-quickstart-worker-a
  4. All 3 Machines (including their TinkerbellMachines and Workflows) are deleted first and then re-enter the provisioning stage.
  5. It is then very likely that a single Hardware ends up linked to multiple TinkerbellMachines (2 or 3).

Context

This issue blocks node upgrades and restarts. After digging into the code, it looks like the ownership labels on the Hardware get deleted twice (or even 3 times). Using the case above as an example, here are the labels on the Hardware:

[screenshot: ownership labels on Hardware n62-107-74]
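For readers without the screenshot: based on the label key referenced in step 3 below, the ownership labels on a claimed Hardware look roughly like this (the ownerNamespace key is an assumption, mirroring the ownerName key):

    labels:
      v1alpha1.tinkerbell.org/ownerName: capi-quickstart-worker-a-h9f8z
      v1alpha1.tinkerbell.org/ownerNamespace: default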

The sequence could play out like this:

  1. The 3 machines were all being deleted.
  2. The reconcile of machine n62-107-74 calls the DeleteMachineWithDependencies() method, which deletes the ownership labels on the Hardware and then creates a PowerOff bmc job to shut down the n62-107-74 machine.
  3. The reconcile of machine n62-107-78 might be faster and have already completed its bmc job, so it calls ensureHardware() to select a Hardware for itself (a sketch of this selection follows the list). Because the ownership labels of n62-107-74 were deleted in step 2, it can select Hardware n62-107-74, and it adds its own ownership labels to it, e.g. v1alpha1.tinkerbell.org/ownerName=capi-quickstart-worker-a-h9f8z.
  4. The reconcile of machine n62-107-74 calls DeleteMachineWithDependencies() again to double check whether the PowerOff bmc job has completed. Inside DeleteMachineWithDependencies(), it deletes the ownership labels again, but those labels were just created in step 3 by the reconcile of machine n62-107-78.
  5. Once the PowerOff bmc job completes, the reconcile of machine n62-107-74 continues and calls ensureHardware() to select a Hardware for itself. It can again select Hardware n62-107-74, since the ownership labels were deleted in step 4, and it sets the ownership label to v1alpha1.tinkerbell.org/ownerName=capi-quickstart-worker-a-2kjwb.
  6. As a result, two machines (n62-107-74 and n62-107-78) are linked to the same Hardware (n62-107-74).
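To make the race concrete, here is a minimal sketch of the kind of selection ensureHardware() performs, under the assumption that "available" simply means "carries no v1alpha1.tinkerbell.org/ownerName label" (the label shown above). This is illustrative, not the provider's exact implementation, but it shows why a label removed at the wrong moment makes the same Hardware claimable by a second machine:

import (
	"context"
	"errors"

	"k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Adjust to the tink API version vendored by the provider.
	tinkv1 "github.com/tinkerbell/tink/api/v1alpha1"
)

// nextAvailableHardware returns the first Hardware carrying no ownership label.
// Sketch only: the real selection lives behind ensureHardware().
func nextAvailableHardware(ctx context.Context, c client.Client) (*tinkv1.Hardware, error) {
	// Match objects where the ownerName label does not exist.
	selector, err := labels.Parse("!v1alpha1.tinkerbell.org/ownerName")
	if err != nil {
		return nil, err
	}

	hardwareList := &tinkv1.HardwareList{}
	if err := c.List(ctx, hardwareList, client.MatchingLabelsSelector{Selector: selector}); err != nil {
		return nil, err
	}

	if len(hardwareList.Items) == 0 {
		return nil, errors.New("no available hardware")
	}

	return &hardwareList.Items[0], nil
}

With selection working this way, any window in which the label is absent while the old machine is still mid-teardown (steps 2-5 above) lets two reconciles claim the same Hardware.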

A possible solution is to move the deletion of the ownership labels behind the check that the PowerOff bmc job has completed. Here is the code I used, which works well in my environment:

// DeleteMachineWithDependencies removes template and workflow objects associated with the given machine.
func (scope *machineReconcileScope) DeleteMachineWithDependencies() error {
	scope.log.Info("Removing machine", "hardwareName", scope.tinkerbellMachine.Spec.HardwareName)

	// Fetch hardware for the machine.
	hardware := &tinkv1.Hardware{}
	if err := scope.getHardwareForMachine(hardware); err != nil {
		return err
	}

	// getOrCreateBMCPowerOffJob() is a method I wrote that handles the PowerOff job only.
	bmcJob, err := scope.getOrCreateBMCPowerOffJob(hardware)
	if err != nil {
		return err
	}

	// Check the job conditions to ensure the power off job is complete.
	if !bmcJob.HasCondition(rufiov1.JobCompleted, rufiov1.ConditionTrue) {
		scope.log.Info("Waiting for power off BMCJob to complete",
			"Name", bmcJob.Name,
			"Namespace", bmcJob.Namespace,
		)

		return nil
	}

	if bmcJob.HasCondition(rufiov1.JobFailed, rufiov1.ConditionTrue) {
		return fmt.Errorf("bmc job %s/%s failed", bmcJob.Namespace, bmcJob.Name) //nolint:goerr113
	}

	// Only remove the ownership labels here, after the BMC PowerOff job has completed.
	if err := scope.removeDependencies(hardware); err != nil {
		return err
	}

	// If the hardware has no BMCRef there was nothing to power off;
	// either way, remove the finalizer and let the machine object be deleted.
	if hardware.Spec.BMCRef == nil {
		scope.log.Info("Hardware BMC reference not present; skipping hardware power off",
			"BMCRef", hardware.Spec.BMCRef, "Hardware", hardware.Name)
	}

	return scope.removeFinalizer()
}
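The getOrCreateBMCPowerOffJob() helper is not shown above. For completeness, here is a rough sketch of what such a get-or-create helper could look like. It is illustrative only, not the exact code from my environment; it assumes the machineReconcileScope carries ctx and client fields (as the provider's scope struct does) and builds the rufio Job the same way the provider's existing power-off path does:

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	rufiov1 "github.com/tinkerbell/rufio/api/v1alpha1"
	tinkv1 "github.com/tinkerbell/tink/api/v1alpha1"
)

// getOrCreateBMCPowerOffJob returns the power-off BMC job for this machine,
// creating it if it does not exist yet. Sketch only: names and error handling
// are illustrative. Note it assumes hardware.Spec.BMCRef is non-nil; callers
// handling BMC-less hardware should check that first.
func (scope *machineReconcileScope) getOrCreateBMCPowerOffJob(
	hardware *tinkv1.Hardware,
) (*rufiov1.Job, error) {
	name := scope.tinkerbellMachine.Name + "-poweroff"
	namespace := scope.tinkerbellMachine.Namespace

	bmcJob := &rufiov1.Job{}
	err := scope.client.Get(scope.ctx, client.ObjectKey{Name: name, Namespace: namespace}, bmcJob)
	if err == nil {
		// The job already exists; return it so the caller can inspect its conditions.
		return bmcJob, nil
	}

	if !apierrors.IsNotFound(err) {
		return nil, fmt.Errorf("getting power off BMCJob: %w", err)
	}

	// No job yet: create one that powers the hardware off via its BMC.
	bmcJob = &rufiov1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
		},
		Spec: rufiov1.JobSpec{
			MachineRef: rufiov1.MachineRef{
				Name:      hardware.Spec.BMCRef.Name,
				Namespace: namespace,
			},
			Tasks: []rufiov1.Action{
				{PowerAction: rufiov1.PowerHardOff.Ptr()},
			},
		},
	}
	if err := scope.client.Create(scope.ctx, bmcJob); err != nil {
		return nil, fmt.Errorf("creating power off BMCJob: %w", err)
	}

	return bmcJob, nil
}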


chrisdoherty4 commented 7 months ago

Thank you for the thorough report.

rockc2020 commented 7 months ago

> Thank you for the thorough report.

Sure, feel free to let me know if that's the case, and I'd be happy to raise a PR for a fix as well.