Open citrus-it opened 6 months ago
I was able to repro with a bare Propolis server. I have a hunch as to what's happening:
EfiBootManagerRefreshAllBootOption
during VM startup.BmEnumerateBootOptions
, which looks for block I/O devices and FAT file systems that could be viable boot options. 2
and 3
are getting added by some other disambiguation code that runs after this (some evidence for this below).I suspect that what's happening is that when the new drive is added in slot 0, the descriptions are shifting around: before, slot 1 was labeled UEFI
and slot 2 was labeled UEFI 2
; after, slot 0 gets the UEFI
label and the other two slots shift by one.
The reason this matters is that once EfiBootManagerRefreshAllBootOption
loads the boot options, it goes through this code block, which removes "invalid" boot options from the nonvolatile boot order. An option is "invalid" if EfiBootManagerFindLoadOption
doesn't find a match for it in the boot options loaded from the nonvolatile variables. One of the reasons this can happen is a mismatch in the searched-for Description
or FilePath
for the option under consideration. After these entries are pruned, the refresh function calls EfiBootManagerAddLoadOptionVariable
to add back all the options that aren't currently accounted for.
I think this would explain what's being seen above: when the new disk gets added, the existing entries for the disks in slots 1 and 2 end up mismatching either on their file paths or descriptions, so they get trimmed; then all three disks get added back at the end of the boot order, where they happen to be after the UEFI shell entry, which makes the instance boot to the shell.
To see this in action, I added some debug prints to the bootrom to show how these comparisons proceed on a functional and non-functional boot sequence. Here's what I get when I boot a VM with a working boot disk attached at 0.17.0:
Assigned description via handler 4
Assigned description PXEv4 (MAC:020820138D79) via handler 2
EfiBootManagerFindLoadOption: UEFI
EfiBootManagerFindLoadOption: candidate 0 (UEFI )
EBMFLO: Matched!
EfiBootManagerFindLoadOption: UEFI PXEv4 (MAC:020820138D79)
EfiBootManagerFindLoadOption: candidate 0 (UEFI )
EBMFLO: Key->Description UEFI PXEv4 (MAC:020820138D79) != UEFI
EfiBootManagerFindLoadOption: candidate 1 (UEFI PXEv4 (MAC:020820138D79))
EBMFLO: Matched!
EfiBootManagerFindLoadOption: UEFI
EfiBootManagerFindLoadOption: candidate 0 (UiApp)
EBMFLO: Key->Attributes 1 != 265
EfiBootManagerFindLoadOption: candidate 1 (UEFI )
EBMFLO: Matched!
EfiBootManagerFindLoadOption: UEFI PXEv4 (MAC:020820138D79)
EfiBootManagerFindLoadOption: candidate 0 (UiApp)
EBMFLO: Key->Attributes 1 != 265
EfiBootManagerFindLoadOption: candidate 1 (UEFI )
EBMFLO: Key->Description UEFI PXEv4 (MAC:020820138D79) != UEFI
EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))
EBMFLO: Matched!
EfiBootManagerFindLoadOption: EFI Internal Shell
EfiBootManagerFindLoadOption: candidate 0 (UiApp)
EBMFLO: Key->Attributes 1 != 265
EfiBootManagerFindLoadOption: candidate 1 (UEFI )
EBMFLO: Key->Description EFI Internal Shell != UEFI
EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))
EBMFLO: Key->Description EFI Internal Shell != UEFI PXEv4 (MAC:020820138D79)
EfiBootManagerFindLoadOption: candidate 3 (EFI Internal Shell)
EBMFLO: Matched!
Select Item: 0x19
[Bds]OsIndication: 0000000000000000
[Bds]=============Begin Load Options Dumping ...=============
Driver Options:
SysPrep Options:
Boot Options:
Boot0000: UiApp 0x0109
Boot0001: UEFI 0x0001
Boot0002: UEFI PXEv4 (MAC:020820138D79) 0x0001
Boot0003: EFI Internal Shell 0x0001
PlatformRecovery Options:
PlatformRecovery0000: Default PlatformRecovery 0x0001
[Bds]=============End Load Options Dumping=============
(Handler 4 is the NVMe device description handler.)
If I now attach a blank disk at 0.16.0 I get the following:
Assigned description via handler 4
Assigned description via handler 4
Assigned description PXEv4 (MAC:020820138D79) via handler 2
EfiBootManagerFindLoadOption: UEFI
EfiBootManagerFindLoadOption: candidate 0 (UEFI )
EBMFLO: FilePath mismatch
EfiBootManagerFindLoadOption: candidate 1 (UEFI 2)
EBMFLO: Key->Description UEFI != UEFI 2
EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))
EBMFLO: Key->Description UEFI != UEFI PXEv4 (MAC:020820138D79)
EmuVariablesUpdatedCallback
FSOpen: Open 'NvVars' Success
Saved NV Variables to NvVars file
EmuVariablesUpdatedCallback
FSOpen: Open 'NvVars' Success
Saved NV Variables to NvVars file
EfiBootManagerFindLoadOption: UEFI PXEv4 (MAC:020820138D79)
EfiBootManagerFindLoadOption: candidate 0 (UEFI )
EBMFLO: Key->Description UEFI PXEv4 (MAC:020820138D79) != UEFI
EfiBootManagerFindLoadOption: candidate 1 (UEFI 2)
EBMFLO: Key->Description UEFI PXEv4 (MAC:020820138D79) != UEFI 2
EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))
EBMFLO: Matched!
EfiBootManagerFindLoadOption: UEFI
EfiBootManagerFindLoadOption: candidate 0 (UiApp)
EBMFLO: Key->Attributes 1 != 265
EfiBootManagerFindLoadOption: candidate 1 (UEFI )
EBMFLO: FilePath mismatch
EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))
EBMFLO: Key->Description UEFI != UEFI PXEv4 (MAC:020820138D79)
EfiBootManagerFindLoadOption: candidate 3 (EFI Internal Shell)
EBMFLO: Key->Description UEFI != EFI Internal Shell
EmuVariablesUpdatedCallback
FSOpen: Open 'NvVars' Success
Saved NV Variables to NvVars file
EmuVariablesUpdatedCallback
FSOpen: Open 'NvVars' Success
Saved NV Variables to NvVars file
EfiBootManagerFindLoadOption: UEFI 2
EfiBootManagerFindLoadOption: candidate 0 (UiApp)
EBMFLO: Key->Attributes 1 != 265
EfiBootManagerFindLoadOption: candidate 1 (UEFI )
EBMFLO: Key->Description UEFI 2 != UEFI
EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))
EBMFLO: Key->Description UEFI 2 != UEFI PXEv4 (MAC:020820138D79)
EfiBootManagerFindLoadOption: candidate 3 (EFI Internal Shell)
EBMFLO: Key->Description UEFI 2 != EFI Internal Shell
EmuVariablesUpdatedCallback
FSOpen: Open 'NvVars' Success
Saved NV Variables to NvVars file
EmuVariablesUpdatedCallback
FSOpen: Open 'NvVars' Success
Saved NV Variables to NvVars file
EfiBootManagerFindLoadOption: UEFI PXEv4 (MAC:020820138D79)
EfiBootManagerFindLoadOption: candidate 0 (UiApp)
EBMFLO: Key->Attributes 1 != 265
EfiBootManagerFindLoadOption: candidate 1 (UEFI )
EBMFLO: Key->Description UEFI PXEv4 (MAC:020820138D79) != UEFI
EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))
EBMFLO: Matched!
EfiBootManagerFindLoadOption: EFI Internal Shell
EfiBootManagerFindLoadOption: candidate 0 (UiApp)
EBMFLO: Key->Attributes 1 != 265
EfiBootManagerFindLoadOption: candidate 1 (UEFI PXEv4 (MAC:020820138D79))
EBMFLO: Key->Description EFI Internal Shell != UEFI PXEv4 (MAC:020820138D79)
EfiBootManagerFindLoadOption: candidate 2 (EFI Internal Shell)
EBMFLO: Matched!
Select Item: 0x19
[Bds]OsIndication: 0000000000000000
[Bds]=============Begin Load Options Dumping ...=============
Driver Options:
SysPrep Options:
Boot Options:
Boot0000: UiApp 0x0109
Boot0002: UEFI PXEv4 (MAC:020820138D79) 0x0001
Boot0003: EFI Internal Shell 0x0001
Boot0001: UEFI 0x0001
Boot0004: UEFI 2 0x0001
PlatformRecovery Options:
PlatformRecovery0000: Default PlatformRecovery 0x0001
[Bds]=============End Load Options Dumping=============
Notice that the "blank" description gets applied by the NVMe handler twice (so the disambiguating 2
probably came from elsewhere). Then we see that the UEFI
entry ends up being discarded due to a FilePath mismatch
, and the UEFI 2
entry ends up not matching any of the existing descriptions in the nonvolatile variables, so it also gets discarded.
I think is consistent with the following events:
The main thing I think I'm missing at this point is tracing that conclusively demonstrates that the NVMe boot options are getting added/described in PCI slot order--I think the logs above show a lot of smoke, but I'd really like to see the fire.
The main thing I think I'm missing at this point is tracing that conclusively demonstrates that the NVMe boot options are getting added/described in PCI slot order--I think the logs above show a lot of smoke, but I'd really like to see the fire.
The disambiguating integers get added in BmMakeBootOptionDescriptionUnique
, which visits the boot options in the order they were enumerated and adds disambiguating numbers to any options whose descriptions were already used elsewhere.
This fits with the behavior described above provided BmEnumerateBootOptions
visits NVMe devices in PCI slot order. That enumeration is handled by the EFI_LOCATE_HANDLE_BUFFER
function in the EFI boot services table; assuming I have the right implementation of that function, it will search the global handle list in the order that protocols were registered by calls to the boot services' EFI_INSTALL_PROTOCOL_INTERFACE
function. I'll need to do some more reading to figure out when these registrations happen for ESPs on NVMe devices. I'm guessing they're visited in slot order (by the boot device selection code, specifically VisitAllPciInstances
and its callees) but haven't walked through the whole callee tree to be sure.
Reassigning per the discussion at the 22 Aug 2024 hypervisor huddle.
I added a disk to an instance that has been running in the colo for a while, and it failed to boot afterwards, dropping to the UEFI shell. I've replicate this with a fresh instance and the rest of this note is from that replication case.
One notable thing about the VM that I originally saw the problem with is that its two disks were in slots 1 and 2, with nothing present in slot 0. This is likely because it was created before the fix for #5067 was merged.
To replicate the failure, I created a new disk from an image, and then two additional blank ones. By attaching them to a new instance in the right order, then detaching a blank disk again, I was able to end up with an instance in the same configuration, with the boot disk in slot 1 and slot 0 being empty.
I then booted this instance, which was successful, and mounted the EFI System Partition (ESP) to fish out the
NvVars
file which is where the UEFI bootrom stores its persistent variables. Decoding this shows that the bootrom has enumerated all of the possible boot devices, assigned them numbers and configured an initial boot order:So far so good. I rebooted the instance a couple of times to confirm that it booted normally, and that these variables didn't change.
I then shut down the instance and attached a new blank disk to it. This disk was 128G in size and used a 4096 sector size. After this, the database showed that the new disk has been placed in slot 0. This mirrors what happened with the previously failed instance.
On booting the instance back up, it dropped to the EFI shell after failing to boot from
Boot0003
and via PXE:Using the EFI shell to look at the persistent variables now showed something interesting:
The new disk has been enumerated and added as
Boot 0006
, which is not a surprise, but the boot order has been changed so that all three NVMe disks are now at the end. This explains why the instance attempted to boot from Boot0003, which is the cidata volume, and failed, then tried PXE boot and finally dropped to the EFI shell.The bootrom's debug output from this boot also shows this same strange boot order:
To replicate this I faithfully reproduced what happened in the colo -- not all of the steps here may be necessary to trigger it, more experimentation is necessary.