oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Adding a disk to an instance changed the boot order #5112

Open citrus-it opened 6 months ago

citrus-it commented 6 months ago

I added a disk to an instance that has been running in the colo for a while, and it failed to boot afterwards, dropping to the UEFI shell. I've replicated this with a fresh instance and the rest of this note is from that replication case.

One notable thing about the VM that I originally saw the problem with is that its two disks were in slots 1 and 2, with nothing present in slot 0. This is likely because it was created before the fix for #5067 was merged.

To replicate the failure, I created a new disk from an image, and then two additional blank ones. By attaching them to a new instance in the right order, then detaching a blank disk again, I was able to end up with an instance in the same configuration, with the boot disk in slot 1 and slot 0 being empty.

                 name                | slot
-------------------------------------+-------
  test-omnios-bloody-20240215-e87155 |    1
  blank2                             |    2

I then booted this instance, which was successful, and mounted the EFI System Partition (ESP) to fish out the NvVars file which is where the UEFI bootrom stores its persistent variables. Decoding this shows that the bootrom has enumerated all of the possible boot devices, assigned them numbers and configured an initial boot order:

Variable        Value                    Notes
--------        -----                    ------
Boot0000        UIApp
Boot0001        UEFI                    <-- slot 1
Boot0002        UEFI 2                  <-- slot 2
Boot0003        UEFI Non-Block Device   <-- slot 8 (cidata volume)
Boot0004        UEFI PXE v4
Boot0005        EFI Internal Shell
BootOrder       0, 1, 2, 3, 4, 5

So far so good. I rebooted the instance a couple of times to confirm that it booted normally, and that these variables didn't change.
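
For anyone repeating the decoding step: the UEFI specification defines BootOrder as an array of little-endian 16-bit values, each naming a Boot#### variable. Below is a minimal Python sketch that decodes just that payload. It assumes the raw variable data has already been pulled out of NvVars (the file's container format is not handled here), and `decode_boot_order` is a hypothetical helper name, not part of any existing tool.

```python
import struct


def decode_boot_order(payload: bytes) -> list[str]:
    """Turn a raw BootOrder payload into a list of Boot#### variable names."""
    if len(payload) % 2 != 0:
        raise ValueError("BootOrder payload must be a whole number of UINT16 values")
    # Each entry is an unsigned little-endian 16-bit option number.
    numbers = struct.unpack(f"<{len(payload) // 2}H", payload)
    # Boot#### names use four uppercase hexadecimal digits.
    return [f"Boot{n:04X}" for n in numbers]


# Example payload corresponding to the order 0, 1, 2, 3, 4, 5 shown above.
initial_order = decode_boot_order(bytes.fromhex("000001000200030004000500"))
print(initial_order)
```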

I then shut down the instance and attached a new blank disk to it. This disk was 128G in size and used a 4096-byte sector size. After this, the database showed that the new disk had been placed in slot 0. This mirrors what happened with the previously failed instance.

                 name                | slot
-------------------------------------+-------
  test-omnios-bloody-20240215-e87155 |    1
  blank4096                          |    0
  blank2                             |    2

On booting the instance back up, it dropped to the EFI shell after failing to boot from Boot0003 and via PXE:

BdsDxe: failed to load Boot0003 "UEFI Non-Block Boot Device" from PciRoot(0x0)/Pci(0x18,0x0): Not Found

>>Start PXE over IPv4.
  PXE-E16: No valid offer received.
BdsDxe: failed to load Boot0004 "UEFI PXEv4 (MAC:A84025FDD042)" from PciRoot(0x0)/Pci(0x9,0x0)/MAC(A84025FDD042,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0): Not Found
BdsDxe: loading Boot0005 "EFI Internal Shell" from Fv(7CB8BDC9-F8EB-4F34-AAEA-3EE4AF6516A1)/FvFile(7C04A583-9E3E-4F1C-AD65-E05268D0B4D1)
BdsDxe: starting Boot0005 "EFI Internal Shell" from Fv(7CB8BDC9-F8EB-4F34-AAEA-3EE4AF6516A1)/FvFile(7C04A583-9E3E-4F1C-AD65-E05268D0B4D1)
UEFI Interactive Shell v2.2
EDK II
UEFI v2.70 (EDK II, 0x00010000)
Shell>

Using the EFI shell to look at the persistent variables now showed something interesting:

Boot0000        UIApp
Boot0001        UEFI                    <-- slot 1
Boot0002        UEFI 2                  <-- slot 2
Boot0003        UEFI Non-Block Device   <-- slot 8 (cidata volume)
Boot0004        UEFI PXE v4
Boot0005        EFI Internal Shell
Boot0006        UEFI 3                  <-- slot 0 (newly added drive)
BootOrder       0, 3, 4, 5, 1, 2, 6

The new disk has been enumerated and added as Boot0006, which is not a surprise, but the boot order has been changed so that all three NVMe disks are now at the end. This explains why the instance attempted to boot from Boot0003, which is the cidata volume, and failed, then tried PXE boot and finally dropped to the EFI shell.

The bootrom's debug output from this boot also shows this same strange boot order:

[Bds]=============Begin Load Options Dumping ...=============
  Driver Options:
  SysPrep Options:
  Boot Options:
    Boot0000: UiApp              0x0109
    Boot0003: UEFI Non-Block Boot Device                 0x0001
    Boot0004: UEFI PXEv4 (MAC:A84025FAF1FF)              0x0001
    Boot0005: EFI Internal Shell                 0x0001
    Boot0001: UEFI               0x0001
    Boot0002: UEFI  2            0x0001
    Boot0006: UEFI  3            0x0001
  PlatformRecovery Options:
    PlatformRecovery0000: Default PlatformRecovery               0x0001
[Bds]=============End Load Options Dumping=============

To replicate this I faithfully reproduced what happened in the colo. Not all of these steps may be necessary to trigger the problem, and more experimentation is needed to find the minimal set.

gjcolombo commented 6 months ago

I was able to repro with a bare Propolis server. I have a hunch as to what's happening:

When the new drive is added in slot 0, I suspect the descriptions shift around: before, slot 1 was labeled UEFI and slot 2 was labeled UEFI 2; after, slot 0 gets the UEFI label and the other two slots shift by one.
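
To make the shifting concrete, here is a rough Python sketch of the labeling as I understand it. It assumes every NVMe disk gets the same base description ("UEFI"), that disks are enumerated in ascending slot order, and that duplicates are disambiguated with a numeric suffix starting at 2. `label_disks` is a hypothetical name, and this models the hypothesis rather than the bootrom's actual code.

```python
def label_disks(occupied_slots: list[int], base: str = "UEFI") -> dict[int, str]:
    """Assign a description to each occupied slot, in ascending slot order."""
    labels = {}
    for position, slot in enumerate(sorted(occupied_slots), start=1):
        # The first disk keeps the bare description; later ones get " 2", " 3", ...
        labels[slot] = base if position == 1 else f"{base} {position}"
    return labels


before = label_disks([1, 2])     # boot disk in slot 1, blank disk in slot 2
after = label_disks([0, 1, 2])   # new blank disk attached in slot 0

for slot in (1, 2):
    print(f"slot {slot}: {before[slot]!r} -> {after[slot]!r}")
```

Under these assumptions the disks in slots 1 and 2 both end up with a different description than the one stored for them on the previous boot.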

The reason this matters is that once EfiBootManagerRefreshAllBootOption loads the boot options, it goes through this code block, which removes "invalid" boot options from the nonvolatile boot order. An option is "invalid" if EfiBootManagerFindLoadOption doesn't find a match for it in the boot options loaded from the nonvolatile variables. One of the reasons this can happen is a mismatch in the searched-for Description or FilePath for the option under consideration. After these entries are pruned, the refresh function calls EfiBootManagerAddLoadOptionVariable to add back all the options that aren't currently accounted for.

I think this would explain what's being seen above: when the new disk gets added, the existing entries for the disks in slots 1 and 2 end up mismatching either on their file paths or descriptions, so they get trimmed; then all three disks get added back at the end of the boot order, where they happen to be after the UEFI shell entry, which makes the instance boot to the shell.
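
The prune-then-append behavior described above can be modeled in a few lines of Python. This is a simplified sketch of the hypothesis, not the EDK II implementation. The assumptions: two options match only when both description and file path are equal, re-added options take the lowest free Boot#### number, fixed entries such as UiApp and the shell are treated as always present, and the file paths are placeholder strings rather than real device paths. `LoadOption` and `refresh_boot_options` are hypothetical names.

```python
from typing import NamedTuple


class LoadOption(NamedTuple):
    description: str
    file_path: str  # placeholder string, not a real UEFI device path


def refresh_boot_options(stored, boot_order, enumerated):
    """Prune stored options with no exact match, then append the missing ones."""
    stored = dict(stored)
    new_order = []
    # Pass 1: keep a stored option only if this boot produced an identical one.
    for number in boot_order:
        if stored[number] in enumerated:
            new_order.append(number)
        else:
            del stored[number]
    # Pass 2: append anything not yet stored, using the lowest free number.
    for option in enumerated:
        if option not in stored.values():
            number = 0
            while number in stored:
                number += 1
            stored[number] = option
            new_order.append(number)
    return stored, new_order


ui = LoadOption("UiApp", "fv-uiapp")
cidata = LoadOption("UEFI Non-Block Boot Device", "slot8")
pxe = LoadOption("UEFI PXEv4", "nic")
shell = LoadOption("EFI Internal Shell", "fv-shell")

# State saved by the last good boot: disks in slots 1 and 2 only.
stored = {0: ui, 1: LoadOption("UEFI", "slot1"), 2: LoadOption("UEFI 2", "slot2"),
          3: cidata, 4: pxe, 5: shell}
boot_order = [0, 1, 2, 3, 4, 5]

# Options seen on the next boot, after a disk appears in slot 0.
enumerated = [ui, LoadOption("UEFI", "slot0"), LoadOption("UEFI 2", "slot1"),
              LoadOption("UEFI 3", "slot2"), cidata, pxe, shell]

new_stored, new_order = refresh_boot_options(stored, boot_order, enumerated)
print(new_order)
```

With these inputs the model yields the order 0, 3, 4, 5, 1, 2, 6, which is the order reported in the issue. When the enumerated options are identical to the stored ones, nothing is pruned and the order is unchanged.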


To see this in action, I added some debug prints to the bootrom to show how these comparisons proceed on a functional and non-functional boot sequence. Here's what I get when I boot a VM with a working boot disk attached at 0.17.0:

  Assigned description  via handler 4                                                                                                                                                                                                                                                                                        
  Assigned description PXEv4 (MAC:020820138D79) via handler 2                                                                                                                                                                                                                                                                
  EfiBootManagerFindLoadOption: UEFI                                                                                                                                                                                                                                                                                         
  EfiBootManagerFindLoadOption: candidate 0 (UEFI )                                                                                                                                                                                                                                                                          
  EBMFLO: Matched!                                                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: UEFI PXEv4 (MAC:020820138D79)                                                                                                                                                                                                                                                                
  EfiBootManagerFindLoadOption: candidate 0 (UEFI )                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Description UEFI PXEv4 (MAC:020820138D79) != UEFI                                                                                                                                                                                                                                                             
  EfiBootManagerFindLoadOption: candidate 1 (UEFI PXEv4 (MAC:020820138D79))                                                                                                                                                                                                                                                  
  EBMFLO: Matched!                                                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: UEFI                                                                                                                                                                                                                                                                                         
  EfiBootManagerFindLoadOption: candidate 0 (UiApp)                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Attributes 1 != 265                                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: candidate 1 (UEFI )                                                                                                                                                                                                                                                                          
  EBMFLO: Matched!                                                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: UEFI PXEv4 (MAC:020820138D79)                                                                                                                                                                                                                                                                
  EfiBootManagerFindLoadOption: candidate 0 (UiApp)                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Attributes 1 != 265                                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: candidate 1 (UEFI )                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Description UEFI PXEv4 (MAC:020820138D79) != UEFI                                                                                                                                                                                                                                                             
  EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))                                                                                                                                                                                                                                                  
  EBMFLO: Matched!                                                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: EFI Internal Shell                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: candidate 0 (UiApp)                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Attributes 1 != 265                                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: candidate 1 (UEFI )                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Description EFI Internal Shell != UEFI                                                                                                                                                                                                                                                                        
  EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))                                                                                                                                                                                                                                                  
  EBMFLO: Key->Description EFI Internal Shell != UEFI PXEv4 (MAC:020820138D79)                                                                                                                                                                                                                                               
  EfiBootManagerFindLoadOption: candidate 3 (EFI Internal Shell)                                                                                                                                                                                                                                                             
  EBMFLO: Matched!                                                                                                                                                                                                                                                                                                           
  Select Item: 0x19                                                                                                                                                                                                                                                                                                          
  [Bds]OsIndication: 0000000000000000                                                                                                                                                                                                                                                                                        
  [Bds]=============Begin Load Options Dumping ...=============                                                                                                                                                                                                                                                              
    Driver Options:                                                                                                                                                                                                                                                                                                          
    SysPrep Options:                                                                                                                                                                                                                                                                                                         
    Boot Options:                                                                                                                                                                                                                                                                                                            
      Boot0000: UiApp          0x0109                                                                                                                                                                                                                                                                                        
      Boot0001: UEFI           0x0001                                                                                                                                                                                                                                                                                        
      Boot0002: UEFI PXEv4 (MAC:020820138D79)          0x0001                                                                                                                                                                                                                                                                
      Boot0003: EFI Internal Shell         0x0001                                                                                                                                                                                                                                                                            
    PlatformRecovery Options:                                                                                                                                                                                                                                                                                                
      PlatformRecovery0000: Default PlatformRecovery       0x0001                                                                                                                                                                                                                                                            
  [Bds]=============End Load Options Dumping============= 

(Handler 4 is the NVMe device description handler.)

If I now attach a blank disk at 0.16.0 I get the following:

  Assigned description  via handler 4                                                                                                                                                                                                                                                                                        
  Assigned description  via handler 4                                                                                                                                                                                                                                                                                        
  Assigned description PXEv4 (MAC:020820138D79) via handler 2                                                                                                                                                                                                                                                                
  EfiBootManagerFindLoadOption: UEFI                                                                                                                                                                                                                                                                                         
  EfiBootManagerFindLoadOption: candidate 0 (UEFI )                                                                                                                                                                                                                                                                          
  EBMFLO: FilePath mismatch                                                                                                                                                                                                                                                                                                  
  EfiBootManagerFindLoadOption: candidate 1 (UEFI  2)                                                                                                                                                                                                                                                                        
  EBMFLO: Key->Description UEFI  != UEFI  2                                                                                                                                                                                                                                                                                  
  EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))                                                                                                                                                                                                                                                  
  EBMFLO: Key->Description UEFI  != UEFI PXEv4 (MAC:020820138D79)                                                                                                                                                                                                                                                            
  EmuVariablesUpdatedCallback                                                                                                                                                                                                                                                                                                
  FSOpen: Open 'NvVars' Success                                                                                                                                                                                                                                                                                              
  Saved NV Variables to NvVars file                                                                                                                                                                                                                                                                                          
  EmuVariablesUpdatedCallback                                                                                                                                                                                                                                                                                                
  FSOpen: Open 'NvVars' Success                                                                                                                                                                                                                                                                                              
  Saved NV Variables to NvVars file                                                                                                                                                                                                                                                                                          
  EfiBootManagerFindLoadOption: UEFI PXEv4 (MAC:020820138D79)                                                                                                                                                                                                                                                                
  EfiBootManagerFindLoadOption: candidate 0 (UEFI )                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Description UEFI PXEv4 (MAC:020820138D79) != UEFI                                                                                                                                                                                                                                                             
  EfiBootManagerFindLoadOption: candidate 1 (UEFI  2)                                                                                                                                                                                                                                                                        
  EBMFLO: Key->Description UEFI PXEv4 (MAC:020820138D79) != UEFI  2                                                                                                                                                                                                                                                          
  EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))                                                                                                                                                                                                                                                  
  EBMFLO: Matched!                                                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: UEFI                                                                                                                                                                                                                                                                                         
  EfiBootManagerFindLoadOption: candidate 0 (UiApp)                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Attributes 1 != 265                                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: candidate 1 (UEFI )                                                                                                                                                                                                                                                                          
  EBMFLO: FilePath mismatch                                                                                                                                                                                                                                                                                                  
  EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))                                                                                                                                                                                                                                                  
  EBMFLO: Key->Description UEFI  != UEFI PXEv4 (MAC:020820138D79)                                                                                                                                                                                                                                                            
  EfiBootManagerFindLoadOption: candidate 3 (EFI Internal Shell)                                                                                                                                                                                                                                                             
  EBMFLO: Key->Description UEFI  != EFI Internal Shell                                                                                                                                                                                                                                                                       
  EmuVariablesUpdatedCallback                                                                                                                                                                                                                                                                                                
  FSOpen: Open 'NvVars' Success                                                                                                                                                                                                                                                                                              
  Saved NV Variables to NvVars file                                                                                                                                                                                                                                                                                          
  EmuVariablesUpdatedCallback                                                                                                                                                                                                                                                                                                
  FSOpen: Open 'NvVars' Success                                                                                                                                                                                                                                                                                              
  Saved NV Variables to NvVars file                                                                                                                                                                                                                                                                                          
  EfiBootManagerFindLoadOption: UEFI  2                                                                                                                                                                                                                                                                                      
  EfiBootManagerFindLoadOption: candidate 0 (UiApp)                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Attributes 1 != 265                                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: candidate 1 (UEFI )                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Description UEFI  2 != UEFI                                                                                                                                                                                                                                                                                   
  EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))                                                                                                                                                                                                                                                  
  EBMFLO: Key->Description UEFI  2 != UEFI PXEv4 (MAC:020820138D79)                                                                                                                                                                                                                                                          
  EfiBootManagerFindLoadOption: candidate 3 (EFI Internal Shell)                                                                                                                                                                                                                                                             
  EBMFLO: Key->Description UEFI  2 != EFI Internal Shell                                                                                                                                                                                                                                                                     
  EmuVariablesUpdatedCallback                                                                                                                                                                                                                                                                                                
  FSOpen: Open 'NvVars' Success                                                                                                                                                                                                                                                                                              
  Saved NV Variables to NvVars file                                                                                                                                                                                                                                                                                          
  EmuVariablesUpdatedCallback                                                                                                                                                                                                                                                                                                
  FSOpen: Open 'NvVars' Success                                                                                                                                                                                                                                                                                              
  Saved NV Variables to NvVars file                                                                                                                                                                                                                                                                                          
  EfiBootManagerFindLoadOption: UEFI PXEv4 (MAC:020820138D79)                                                                                                                                                                                                                                                                
  EfiBootManagerFindLoadOption: candidate 0 (UiApp)                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Attributes 1 != 265                                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: candidate 1 (UEFI )                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Description UEFI PXEv4 (MAC:020820138D79) != UEFI                                                                                                                                                                                                                                                             
  EfiBootManagerFindLoadOption: candidate 2 (UEFI PXEv4 (MAC:020820138D79))
  EBMFLO: Matched!
  EfiBootManagerFindLoadOption: EFI Internal Shell                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: candidate 0 (UiApp)                                                                                                                                                                                                                                                                          
  EBMFLO: Key->Attributes 1 != 265                                                                                                                                                                                                                                                                                           
  EfiBootManagerFindLoadOption: candidate 1 (UEFI PXEv4 (MAC:020820138D79))                                                                                                                                                                                                                                                  
  EBMFLO: Key->Description EFI Internal Shell != UEFI PXEv4 (MAC:020820138D79)                                                                                                                                                                                                                                               
  EfiBootManagerFindLoadOption: candidate 2 (EFI Internal Shell)                                                                                                                                                                                                                                                             
  EBMFLO: Matched!                                                                                                                                                                                                                                                                                                           
  Select Item: 0x19                                                                                                                                                                                                                                                                                                          
  [Bds]OsIndication: 0000000000000000                                                                                                                                                                                                                                                                                        
  [Bds]=============Begin Load Options Dumping ...=============                                                                                                                                                                                                                                                              
    Driver Options:                                                                                                                                                                                                                                                                                                          
    SysPrep Options:                                                                                                                                                                                                                                                                                                         
    Boot Options:                                                                                                                                                                                                                                                                                                            
      Boot0000: UiApp          0x0109                                                                                                                                                                                                                                                                                        
      Boot0002: UEFI PXEv4 (MAC:020820138D79)          0x0001                                                                                                                                                                                                                                                                
      Boot0003: EFI Internal Shell         0x0001                                                                                                                                                                                                                                                                            
      Boot0001: UEFI           0x0001                                                                                                                                                                                                                                                                                        
      Boot0004: UEFI  2        0x0001                                                                                                                                                                                                                                                                                        
    PlatformRecovery Options:                                                                                                                                                                                                                                                                                                
      PlatformRecovery0000: Default PlatformRecovery       0x0001                                                                                                                                                                                                                                                            
  [Bds]=============End Load Options Dumping=============

Notice that the "blank" description gets applied by the NVMe handler twice (so the disambiguating 2 probably came from elsewhere). Then we see that the UEFI entry ends up being discarded due to a FilePath mismatch, and the UEFI 2 entry ends up not matching any of the existing descriptions in the nonvolatile variables, so it also gets discarded.

I think this is consistent with the following events:

  1. System boots for the first time with just the boot disk in 0.17.0; there are no boot options to load from the nonvolatile variables since this is the first boot
  2. EDK2 enumerates a boot entry for 0.17.0 with description "UEFI" and a file path pointing to the boot application on that disk
  3. This entry gets added to the nonvolatile variables on the disk at 0.17.0
  4. System shuts down; configuration changes to include a blank disk at 0.16.0
  5. EDK2 loads the BootOrder nonvolatile variables from 0.17.0
  6. EDK2 enumerates the disk at 0.16.0, assigns it description "UEFI", and decides it has no boot application (makes sense since the disk is blank and has no ESP)
  7. EDK2 enumerates the disk at 0.17.0, assigns it description "UEFI 2", and finds its boot application
  8. EDK2 tries to match the "UEFI" entry in NV storage to one of the enumerated entries; this fails because the enumerated entry from step 6 doesn't have a file path
  9. EDK2 prunes the "UEFI" entry from the boot order
  10. EDK2 adds back the newly-enumerated entries for 0.16.0 and 0.17.0, but now they're at the end of the boot order
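
The sequence above can be replayed with a toy model. This is only a sketch of the hypothesis, not EDK2's code: `BootOption`, `enumerate_options`, and `refresh` are made-up names, a missing file path is modeled as `None`, matching is assumed to require both description and file path to be equal, and the descriptions are simplified (the real log has `UEFI ` with a trailing space). UiApp is left out because the log shows it being skipped on attributes.

```python
from typing import NamedTuple, Optional


class BootOption(NamedTuple):
    description: str
    file_path: Optional[str]  # None models "no boot application found"


# Non-disk options, enumerated after the disks (descriptions shortened).
OTHER_OPTIONS = [
    BootOption("UEFI PXEv4", "pxe"),
    BootOption("EFI Internal Shell", "shell"),
]


def enumerate_options(disks):
    """Describe disks in PCI device order; `disks` maps device number -> bootable."""
    options = []
    for index, (device, bootable) in enumerate(sorted(disks.items()), start=1):
        description = "UEFI" if index == 1 else f"UEFI {index}"
        file_path = f"0.{device}.0/bootapp" if bootable else None
        options.append(BootOption(description, file_path))
    return options + OTHER_OPTIONS


def refresh(nv_order, enumerated):
    """Prune saved entries with no exact match, then append unmatched new ones."""
    kept = [option for option in nv_order if option in enumerated]
    added = [option for option in enumerated if option not in kept]
    return kept + added


# First boot: only the boot disk at 0.17.0, nothing saved yet.
first_boot = refresh([], enumerate_options({17: True}))
# Second boot: a blank disk has appeared at 0.16.0.
second_boot = refresh(first_boot, enumerate_options({16: False, 17: True}))

for option in second_boot:
    print(option.description)
```

Under those assumptions the second boot prints the PXE and shell entries first, followed by `UEFI` and `UEFI 2`, which is the same relative order as Boot0002, Boot0003, Boot0001, Boot0004 in the load option dump.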

The main thing I think I'm missing at this point is tracing that conclusively demonstrates that the NVMe boot options are getting added/described in PCI slot order--I think the logs above show a lot of smoke, but I'd really like to see the fire.

gjcolombo commented 6 months ago

> The main thing I think I'm missing at this point is tracing that conclusively demonstrates that the NVMe boot options are getting added/described in PCI slot order--I think the logs above show a lot of smoke, but I'd really like to see the fire.

The disambiguating integers get added in BmMakeBootOptionDescriptionUnique, which visits the boot options in the order they were enumerated and adds disambiguating numbers to any options whose descriptions were already used elsewhere.
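
As a rough illustration of why enumeration order matters here, the following paraphrases that behavior in Python. It is a sketch of the description as written in this comment, not a port of BmMakeBootOptionDescriptionUnique; the function name and the PCI location strings are illustrative.

```python
def make_descriptions_unique(descriptions):
    """Append " 2", " 3", ... to repeated descriptions, in enumeration order."""
    counts = {}
    unique = []
    for description in descriptions:
        counts[description] = counts.get(description, 0) + 1
        n = counts[description]
        unique.append(description if n == 1 else f"{description} {n}")
    return unique


# Each tuple is (PCI location, raw description); the blank disk is visited first.
enumerated = [("0.16.0", "UEFI"), ("0.17.0", "UEFI")]
names = make_descriptions_unique([description for _, description in enumerated])
by_location = {location: name for (location, _), name in zip(enumerated, names)}
print(by_location)
```

With the blank disk visited first, the boot disk at 0.17.0 ends up as "UEFI 2", so its description no longer matches the "UEFI" entry saved on the previous boot.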

This fits with the behavior described above provided BmEnumerateBootOptions visits NVMe devices in PCI slot order. That enumeration is handled by the EFI_LOCATE_HANDLE_BUFFER function in the EFI boot services table; assuming I have the right implementation of that function, it will search the global handle list in the order that protocols were registered by calls to the boot services' EFI_INSTALL_PROTOCOL_INTERFACE function. I'll need to do some more reading to figure out when these registrations happen for ESPs on NVMe devices. I'm guessing they're visited in slot order (by the boot device selection code, specifically VisitAllPciInstances and its callees) but haven't walked through the whole callee tree to be sure.
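
To make the ordering assumption concrete, here is a toy handle database in Python. Everything in it is hypothetical (class name, method names, handle strings, the "BlockIo" label); it only models the guess that lookups return handles in the order their protocols were registered, and that registration happens in PCI slot order.

```python
class HandleDatabase:
    """Toy model: handles are kept in the order their protocol was installed."""

    def __init__(self):
        self._entries = []  # (handle, protocol) pairs, in registration order

    def install_protocol_interface(self, handle, protocol):
        self._entries.append((handle, protocol))

    def locate_handle_buffer(self, protocol):
        # Walk the list front to back, so results follow registration order.
        return [handle for handle, installed in self._entries if installed == protocol]


db = HandleDatabase()
# Assumption: the PCI bus is walked in slot order, so 0.16.0 registers before 0.17.0.
for device in (16, 17):
    db.install_protocol_interface(f"nvme@0.{device}.0", "BlockIo")

print(db.locate_handle_buffer("BlockIo"))
```

If both guesses hold, the disk in the lower slot is always described first, regardless of which disk was attached to the instance first.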

gjcolombo commented 3 weeks ago

Reassigning per the discussion at the 22 Aug 2024 hypervisor huddle.