tinkerbell / cluster-api-provider-tinkerbell

Cluster API Infrastructure Provider
Apache License 2.0

Update how network booting is disabled #363

Closed jacobweinstock closed 6 months ago

jacobweinstock commented 7 months ago

Currently, CAPT tells Smee not to allow a machine to network boot after it has been provisioned. CAPT does this by setting two values in the Hardware object: Hardware.Spec.Metadata.Instance.State = provisioned and Hardware.Spec.Metadata.State = in_use.

https://github.com/tinkerbell/cluster-api-provider-tinkerbell/blob/d828a9e7b165b4a2b0e0975ebd67b1a9f2a83d8c/controllers/machine_reconcile_scope.go#L52
https://github.com/tinkerbell/cluster-api-provider-tinkerbell/blob/d828a9e7b165b4a2b0e0975ebd67b1a9f2a83d8c/controllers/machine_reconcile_scope.go#L53
https://github.com/tinkerbell/cluster-api-provider-tinkerbell/blob/d828a9e7b165b4a2b0e0975ebd67b1a9f2a83d8c/controllers/machine_reconcile_scope.go#L110
https://github.com/tinkerbell/cluster-api-provider-tinkerbell/blob/d828a9e7b165b4a2b0e0975ebd67b1a9f2a83d8c/controllers/machine_reconcile_scope.go#L187
https://github.com/tinkerbell/cluster-api-provider-tinkerbell/blob/d828a9e7b165b4a2b0e0975ebd67b1a9f2a83d8c/controllers/machine_reconcile_scope.go#L723
https://github.com/tinkerbell/cluster-api-provider-tinkerbell/blob/d828a9e7b165b4a2b0e0975ebd67b1a9f2a83d8c/controllers/machine_reconcile_scope.go#L724

CAPT set these fields because Smee used them to gate network booting. Since Smee v0.10.0 this is no longer the case: Smee now gates network booting via Hardware.Spec.Interfaces[].Netboot.AllowPXE, so the two state fields no longer affect whether a machine is allowed to network boot.

The effect is that when a provisioned machine reboots and its firmware is set up to network boot first, Smee serves it network boot packets, and the machine boots into HookOS and sits there indefinitely.

Expected Behaviour

A machine provisioned by CAPT should not network boot after a reboot (its Hardware object should be configured to tell Smee not to netboot the machine).

Current Behaviour

Possible Solution

Update CAPT to set Hardware.Spec.Interfaces[].Netboot.AllowPXE = false after a machine is provisioned.
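
A minimal sketch of what that update could look like, not the actual CAPT change. The types below are simplified, hypothetical stand-ins for the Tinkerbell Hardware API (the pointer shapes of Netboot and AllowPXE are assumptions), and disallowPXE is an illustrative helper name, not an existing function in this repo. It walks every interface and sets AllowPXE to false, reporting whether anything changed so a caller could skip an unnecessary update.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Tinkerbell Hardware types.
// Only the fields named in this issue are modeled.
type Netboot struct {
	AllowPXE *bool
}

type Interface struct {
	Netboot *Netboot
}

type HardwareSpec struct {
	Interfaces []Interface
}

type Hardware struct {
	Spec HardwareSpec
}

// disallowPXE (hypothetical helper) sets Netboot.AllowPXE to false on every
// interface of hw. It returns true if any field was modified.
func disallowPXE(hw *Hardware) bool {
	changed := false
	for i := range hw.Spec.Interfaces {
		// Take a pointer into the slice so the change is not made on a copy.
		iface := &hw.Spec.Interfaces[i]
		if iface.Netboot == nil {
			// Assumption: an unset Netboot block is made explicit rather
			// than relying on whatever default Smee applies.
			iface.Netboot = &Netboot{}
		}
		if iface.Netboot.AllowPXE == nil || *iface.Netboot.AllowPXE {
			allow := false
			iface.Netboot.AllowPXE = &allow
			changed = true
		}
	}
	return changed
}

func main() {
	allow := true
	hw := &Hardware{Spec: HardwareSpec{Interfaces: []Interface{
		{Netboot: &Netboot{AllowPXE: &allow}},
		{}, // interface with no netboot settings yet
	}}}

	fmt.Println("changed:", disallowPXE(hw))
	for i, iface := range hw.Spec.Interfaces {
		fmt.Printf("interface %d AllowPXE=%v\n", i, *iface.Netboot.AllowPXE)
	}
}
```

In CAPT this logic would presumably run at the point where the provisioned and in_use states are set today, followed by an update of the Hardware object. The helper is idempotent, so calling it on every reconcile would be harmless.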

Steps to Reproduce (for bugs)

  1. Provision a cluster with CAPT
  2. Set a machine's firmware to network boot
  3. Reboot the machine
  4. See that HookOS is loaded and sits indefinitely

Context

Your Environment