tinkerbell / smee

DHCP and iPXE Server
https://tinkerbell.org
Apache License 2.0

feature/bug: expected behavior around short circuiting a netboot request #243

Open jacobweinstock opened 2 years ago

jacobweinstock commented 2 years ago

Currently, there is logic to short-circuit a netboot request based on some tink/cacher hardware data (see here). I looked through the Tink code base and didn't see any code paths where hardware data is updated based on workflow progression. The tink worker sends report statuses as it progresses (ref here), but tink server doesn't update hardware data in any way. Taking all this into account, it appears that the code here in Boots expects some external entity to update the hardware data in conjunction with a workflow's progress. This makes the Boots -> Tink server combo always netboot unless hardware data is manually updated. This, in my opinion, is not expected behavior. This was also raised in the Tinkerbell community Slack channel, here. This feels like more of a feature request than a bug, but at a bare minimum it is an undocumented quirk that affects generally expected behavior.
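
To make the behavior concrete, here is a minimal Go sketch of the short-circuit idea. It is not the actual Boots code: the `Hardware` type, the `AllowPXE` field, and `shouldNetboot` are hypothetical names chosen for illustration. It only shows that if nothing writes back to the hardware record when a workflow finishes, the decision never changes.

```go
package main

import "fmt"

// Hardware is a hypothetical, trimmed-down stand-in for the hardware
// record that Boots reads from tink server or cacher.
type Hardware struct {
	MAC      string
	AllowPXE bool // assumed flag; only changed by something outside Boots
}

// shouldNetboot sketches the short circuit: answer the netboot request
// only while the hardware record still allows it.
func shouldNetboot(hw Hardware) bool {
	return hw.AllowPXE
}

func main() {
	hw := Hardware{MAC: "00:00:00:00:00:01", AllowPXE: true}
	fmt.Println("before workflow:", shouldNetboot(hw))

	// The workflow completes, but nothing updates the hardware record,
	// so the same decision is made on the next boot.
	fmt.Println("after workflow:", shouldNetboot(hw))

	// Only an explicit, external update changes the outcome.
	hw.AllowPXE = false
	fmt.Println("after manual update:", shouldNetboot(hw))
}
```

Under these assumptions, the first two calls give the same answer, which is the "always netboot unless hardware data is manually updated" behavior described above.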

CC @rothgar

Expected Behaviour

After a machine has been provisioned, we should be able to boot from a local disk without changing the boot order.

Current Behaviour

See above.

Possible Solution

Write a workflow action that updates tink hardware data. This is only offered as a workaround. I don't think it is a viable mid- to long-term solution.

jacobweinstock commented 2 years ago

Thinking about this some more: there is currently functionality that allows a user to provide a custom iPXE script. In this custom iPXE case, tink worker is never launched. So if a solution to this issue is to have Boots look for an "active" workflow, we'll need to think about how to handle this same idea with custom iPXE scripts.

rothgar commented 2 years ago

I updated my tink stack and found that once my system successfully completes the workflow, Boots no longer sends a PXE response to that MAC address on the next boot. This does the intended thing but has a different side effect I didn't anticipate.

Once my servers fail PXE and boot from the local disk, my systems (for reasons unknown to me) add a UEFI boot item to the top of the boot order. This means that if I ever want to PXE boot again, I have to go into the BIOS on each device and change the order.

A better approach would be to respond to all PXE requests and have a default workflow/iPXE target that boots locally (this is how cobbler and RHEL satellite/foreman worked in environments I worked in). There are a couple of different iPXE options to do that, and it would keep boot priority and ownership centralized in Boots rather than relying on different BIOS support/configuration.

An iPXE script with this content will likely work in most situations:

sanboot --no-describe --drive 0x80

More details in the docs https://ipxe.org/cmd/sanboot
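
As a rough illustration of the "always respond, default to local boot" idea, here is a Go sketch that picks which iPXE script to serve. It is a sketch under stated assumptions, not how Boots works today: `scriptFor`, the `activeWorkflow` flag, and the `boots.example` URL are hypothetical. Only the `sanboot` line comes from the comment above.

```go
package main

import "fmt"

// localBootScript is the default served to machines with nothing to
// provision. The sanboot line is the one suggested above.
const localBootScript = "#!ipxe\nsanboot --no-describe --drive 0x80\n"

// scriptFor is a hypothetical helper: it returns a provisioning script
// when a workflow is active, and the local-boot script otherwise, so
// every PXE request gets an answer.
func scriptFor(activeWorkflow bool, provisionURL string) string {
	if activeWorkflow {
		// provisionURL is an assumed location of the provisioning script.
		return fmt.Sprintf("#!ipxe\nchain %s\n", provisionURL)
	}
	return localBootScript
}

func main() {
	// No active workflow: fall through to the local disk.
	fmt.Print(scriptFor(false, ""))

	// Active workflow: chain to the (hypothetical) provisioning script.
	fmt.Print(scriptFor(true, "http://boots.example/auto.ipxe"))
}
```

With this shape, PXE can stay first in the boot order on every machine, and the decision to provision or boot locally is made by the server on each boot.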