oxidecomputer / omicron

Omicron: Oxide control plane

Want stable PCI slot assignments for disks (to avoid invalidating boot order on disk attach/detach) #3224

Closed: askfongjojo closed this issue 1 year ago

askfongjojo commented 1 year ago

I couldn't replicate this issue consistently, but this is the instance where I ran into it.

Order of events:

  1. Created instance with two disks - disk0: Ubuntu focal, disk1: blank disk
  2. After starting the vm, formatted and mounted disk1
    ubuntu@focal3:~$ lsblk
    NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    loop0          7:0    0 91.9M  1 loop /snap/lxd/24061
    loop1          7:1    0 63.3M  1 loop /snap/core20/1822
    loop2          7:2    0 49.9M  1 loop /snap/snapd/18357
    vda          252:0    0   21K  1 disk 
    nvme0n1      259:0    0    3G  0 disk 
    ├─nvme0n1p1  259:2    0  2.9G  0 part /
    ├─nvme0n1p14 259:3    0    4M  0 part 
    └─nvme0n1p15 259:4    0  106M  0 part /boot/efi
    nvme1n1      259:1    0   30G  0 disk /data
  3. Stopped vm. Added 2 more blank disks (disk2 and disk3) via the console which invoked the disk attach API. Started vm again:
    ubuntu@focal3:~$ lsblk
    NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    loop0          7:0    0 63.3M  1 loop /snap/core20/1822
    loop1          7:1    0 63.3M  1 loop /snap/core20/1879
    loop2          7:2    0 91.9M  1 loop /snap/lxd/24061
    loop3          7:3    0 49.9M  1 loop /snap/snapd/18357
    vda          252:0    0   21K  1 disk 
    nvme0n1      259:0    0   30G  0 disk 
    nvme1n1      259:1    0   30G  0 disk 
    nvme2n1      259:2    0    3G  0 disk 
    ├─nvme2n1p1  259:3    0  2.9G  0 part /
    ├─nvme2n1p14 259:4    0    4M  0 part 
    └─nvme2n1p15 259:5    0  106M  0 part /boot/efi
    nvme3n1      259:6    0   30G  0 disk /data4096

    I noticed the boot disk changed from nvme0n1 to nvme2n1. Assuming that nvme3n1 was one of the newly attached disks, I tried to format it but then found that it was actually disk1 (i.e. the device previously named nvme1n1)

    ubuntu@focal3:~$ sudo mkfs -t ext4 /dev/nvme3n1
    mke2fs 1.45.5 (07-Jan-2020)
    /dev/nvme3n1 contains a ext4 file system
    last mounted on /data on Sat May  6 04:57:54 2023
    Proceed anyway? (y,N) N
  4. Stopped vm again. Removed disk1 via console. Started vm again and it stayed in starting state forever, presumably because it used the incorrect disk to boot?
gjcolombo commented 1 year ago

If we can repro this, it'd be interesting to see the output of ls -la /dev/disk/by-path (in addition to lsblk) before and after the disks are added to see how the guest is assigning PCI slots to disks. The control plane doesn't take any steps to guarantee that disks are assigned the same PCI slots between attempts to boot an instance (slots are assigned based on the order in which they're returned from the CockroachDB query for disks attached to an instance). I would be interested to see if those assignments have changed here in a way that's made the guest think that it should relabel its disks.

Started vm again and it stayed in starting state forever, presumably because it used the incorrect disk to boot?

We'll probably need sled agent and Propolis logs for this one. Propolis moves instances to Running as soon as their vCPUs start; an instance with a missing or misconfigured boot disk will usually just get stuck in a firmware boot menu. If an instance goes to Starting but never leaves it, it's likely that Propolis either got stuck in some VM initialization step or panicked before it successfully started to run.

askfongjojo commented 1 year ago

I have another instance that's inaccessible over SSH after adding a disk. Looking at the serial console, it appears to be using the data disk rather than the boot disk when starting up:

[screenshot: serial console output during instance startup]

Unfortunately, I can't get into the instance to locate the /dev/disk/by-path content.

gjcolombo commented 1 year ago

This is consistent with the boot disk getting assigned a different PCI slot from the one it was assigned when it was first booted from. I'll try to take a closer look at the instance's logs to confirm.

leftwo commented 1 year ago

What are the names of the disks that you have attached to your instance?

I found that on older Omicron/Console, the disks were attached to the instance based on their alphabetical name order. Changing the disk names may help you get your instance back online.

gjcolombo commented 1 year ago

There are disks attached to this instance at PCI BDFs 0.16.0 and 0.17.0. It looks like 0.16.0 is the data disk (note that figuring this out is annoying because you have to match the volume ID from backend creation time to the startup-complete message later on in initialization; filed oxidecomputer/propolis#421 to improve this):

May 22 19:57:26.103 INFO Creating storage device data-for-debian-host of kind Nvme
May 22 19:57:26.103 INFO Creating Crucible disk from request Volume { id: 39f891a2-1f03-4a3a-9e7c-07d07beca223, block_size: 4096, sub_volumes: [Region { block_size: 4096, blocks_per_extent: 16384, extent_count: 160, opts: CrucibleOpts { id: 39f891a2-1f03-4a3a-9e7c-07d07beca223, target: [[fd00:1122:3344:105::b]:19000, [fd00:1122:3344:107::5]:19000, [fd00:1122:3344:107::8]:19000], lossy: false, flush_timeout: None, key: Some("B7qx6JPR8ihobQV8a05nB9aPR8EM9o4reriHcnZRJU0="), cert_pem: None, key_pem: None, root_cert_pem: None, control: None, read_only: false }, gen: 1 }], read_only_parent: None }

...

May 22 19:57:26.822 INFO Sending startup complete to pci-nvme-0.16.0, component: vm_controller
// gjc: note that the Crucible IDs match; I'm assuming the Propolis inventory iterator is walking these in device-backend pairs
May 22 19:57:26.822 INFO Sending startup complete to block-crucible-39f891a2-1f03-4a3a-9e7c-07d07beca223, component: vm_controller

The other disk (run-something-debian11-2a6bb6) appears to have been assigned to 0.17.0. This is consistent with the disks getting assigned PCI slots in alphabetical order, as @leftwo points out above. (IIUC, the reason this dumps you into a UEFI shell is that on first boot, the bootrom writes a boot device order to the EFI system partition on the boot disk, and the disk entries in that boot order are identified by PCI BDF; if you then move the disk to another slot and try to boot, the bootrom finds the saved boot order, but doesn't find a bootable disk with the specified PCI device number, and so dumps you into a shell.)


@leftwo I was thinking over the weekend that we should consider teaching Nexus to assign stable PCI slot numbers to an instance's attached disks. Nexus already does this for NICs, so there's some precedent there (and some code we can reuse to allocate unused slots when new disks are attached). I suspect that would help prevent this kind of problem. WDYT?

Another option is to implement the QEMU fwcfg boot device enlightenment. With this, Nexus could indicate to Propolis which disk is supposed to be the boot disk, and Propolis would surface that information to guest firmware so that it can boot from whatever disk Propolis indicated. The catch there is that then the Nexus DB model needs a way to label a disk as a boot disk; if that concept doesn't exist today, we'd have to add it, design APIs to set/clear it, etc., which might require a bigger lift than just ensuring that disks have stable PCI slot assignments.

leftwo commented 1 year ago

@gjcolombo I think we (not you and me specifically, but the company as a whole) have discussed a few ways to do this, along with the pitfalls associated with each.

Is a given PCI slot number always going to be the "boot disk", or is part of that decision what someone could change? Do all OS's try first to boot from a specific PCI slot number?

It does seem like what you suggest, stable PCI slot numbers and simply exposing which PCI slot a disk will be located at, would get us 90% of the way there, and would at least not allow a new disk to change the numbering for existing ones.

gjcolombo commented 1 year ago

My understanding of our guest firmware's behavior is somewhat limited, but I think the answers are as follows, based on my hasty reading of the EDK2 code (all the below is AIUI):

Is a given PCI slot number always going to be the "boot disk"?

Sometimes!

The guest firmware image has some logic to look for PCI mass storage devices that look like they might contain a properly-formatted EFI system partition (GPT-style partition table with an appropriately named FAT partition). I believe these get searched in PCI slot order, but I haven't confirmed this.

Once one of these disks and partitions is found, the firmware will try to load nonvolatile EFI variables, including the boot order variables, from a file named NvVars in that partition. If no NvVars are found (and so no persistent boot order is present), I believe the firmware will try to (a) load the boot application on the selected disk, and (b) write a new NvVars file containing an EFI_LOAD_OPTION that describes the device and file thereon that the boot manager is currently loading from.

So there's nothing in firmware that says, e.g., that slot 16 is always going to be the boot disk. The firmware will try to load the boot applications specified in the load options in whatever nonvolatile variables it finds. If it finds none, it will fall back to the first viable-looking boot application it found, irrespective of its PCI slot assignment.

To be clear, all this is just what the firmware does. Once it loads the boot application (e.g. GRUB, Windows bootmgr, what-have-you), that application may further enumerate devices/disks/partitions/etc. to present another array of boot options to the user.
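
A condensed sketch of the boot-selection flow described above (Rust-style pseudocode with illustrative names and types, not the actual EDK2 logic):

    // Paraphrase of the firmware behavior described above; all names here are
    // hypothetical, not EDK2's.
    #[derive(Clone, Debug, PartialEq)]
    struct Bdf(u8, u8, u8); // bus, device (slot), function

    struct Disk {
        bdf: Bdf,
        has_esp: bool,        // GPT table with a well-formed EFI system partition
        nv_vars: Option<Bdf>, // boot device named by NvVars, if that file exists
    }

    /// Which PCI device will the firmware try to boot from?
    /// (`disks` is assumed to already be in BDF order.)
    fn select_boot_device(disks: &[Disk]) -> Option<Bdf> {
        // 1. Search PCI mass-storage devices (believed to be in slot order)
        //    for one carrying a properly-formatted EFI system partition.
        let esp_disk = disks.iter().find(|d| d.has_esp)?;
        match &esp_disk.nv_vars {
            // 2. If its ESP holds an NvVars file, honor the persistent boot
            //    option it contains, which names a device by PCI BDF -- even
            //    if that device is no longer the bootable disk.
            Some(saved) => Some(saved.clone()),
            // 3. Otherwise, boot this disk and (not shown here) write a new
            //    NvVars file recording its current BDF.
            None => Some(esp_disk.bdf.clone()),
        }
    }

    fn main() {
        // The failure mode in this issue: the OS disk (ESP + NvVars pointing
        // at 0.16.0) now sits at 0.17.0, while 0.16.0 holds an unformatted
        // data disk. The firmware still tries 0.16.0 and drops to its shell.
        let disks = [
            Disk { bdf: Bdf(0, 16, 0), has_esp: false, nv_vars: None },
            Disk { bdf: Bdf(0, 17, 0), has_esp: true, nv_vars: Some(Bdf(0, 16, 0)) },
        ];
        assert_eq!(select_boot_device(&disks), Some(Bdf(0, 16, 0)));
    }
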

Is part of [the boot device] decision what someone could change?

Strictly speaking, yes: our firmware image has a user interface that lets you change the persistent boot options. (In @askfongjojo's screenshot above I think the exit command will drop out to it.)

However, working with this menu is a cruddy experience. Users won't generally see it unless something is very wrong (as happened above). If they do see it, they have to know to use the serial console to try to fix the problem.

Users might have more control over what the boot application does (e.g. they can more easily edit their GRUB configuration), but that's not what we're looking at here.

Do all OS's try first to boot from a specific PCI slot number?

By the time you're running an OS boot application, I think the danger above has passed--we've already chosen a PCI device to attempt to boot from.


In this specific case, the problem we're dealing with comes from the first part of the above. I think the sequence here is:

  1. World begins: instance has a bootable disk in slot 16 (GPT partition table, EFI system partition, the works) and a data disk (totally unformatted) in slot 17. (This assignment is determined by the order in which the VolumeConstructionRequests are presented to sled agent, which passes them on to Propolis: disks[0] gets slot 16, disks[1] gets 17, etc., up to slot 23.)
  2. Instance starts; guest firmware finds a properly-formatted EFI-partition-bearing disk at BDF 0.16.0
  3. Guest firmware decides the disk has no nonvolatile variables, so it writes a new set of variables with a boot option that says "load bootx64.efi from the device at 0.16.0", then boots into the guest boot application, which loads the guest OS
  4. Instance shuts down and restarts
  5. On the next boot, Nexus enumerates the two attached disks in the other order, such that the data disk is now in slot 16 and the OS disk is now in slot 17. This is possible because the instance start path just queries CRDB for the set of disks that are attached to the instance; the order in which CRDB returns them is the order in which they'll be furnished to sled agent, which determines their PCI slot assignments.
  6. Guest firmware looks for a disk with an EFI partition and finds the disk at 0.17.0
  7. The disk now has non-volatile variables (from step 3) that say to load a boot application from the disk at 0.16.0
  8. But that disk is the data disk and it has no boot application. Uh-oh!
  9. Firmware gets upset and dumps the user out to the firmware shell

If Nexus chose PCI slot assignments explicitly in step 1 and stored them during instance creation, then they would be present in step 5, and the order in which the disk requests were passed to sled agent wouldn't matter.
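
To make the difference concrete, here is a minimal sketch of the two assignment policies (hypothetical types and names; the real mapping is split between Nexus and sled agent):

    // Contrast enumeration-order slot assignment with a persisted assignment;
    // "AttachedDisk" and its fields are illustrative only.
    struct AttachedDisk {
        name: String,
        stored_slot: u8, // slot persisted in the disk record at attach time (0..=7)
    }

    /// Today: disks[0] gets PCI slot 16, disks[1] gets 17, and so on, so a
    /// different CRDB result order on the next start shifts every BDF.
    fn slots_by_enumeration_order(disks: &[AttachedDisk]) -> Vec<(&str, u8)> {
        disks
            .iter()
            .enumerate()
            .map(|(i, d)| (d.name.as_str(), 16 + i as u8))
            .collect()
    }

    /// Proposed: the PCI slot is a function of the stored assignment only, so
    /// the result order no longer matters.
    fn slots_from_stored_assignment(disks: &[AttachedDisk]) -> Vec<(&str, u8)> {
        disks
            .iter()
            .map(|d| (d.name.as_str(), 16 + d.stored_slot))
            .collect()
    }

    fn main() {
        let os_disk = AttachedDisk { name: "os-disk".into(), stored_slot: 0 };
        let data_disk = AttachedDisk { name: "data-disk".into(), stored_slot: 1 };

        // Suppose the query returns [data_disk, os_disk] on the second boot.
        let reordered = [data_disk, os_disk];
        assert_eq!(slots_by_enumeration_order(&reordered)[1], ("os-disk", 17));
        assert_eq!(slots_from_stored_assignment(&reordered)[1], ("os-disk", 16));
    }
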

pfmooney commented 1 year ago

The guest firmware image has some logic to look for PCI mass storage devices that look like they might contain a properly-formatted EFI system partition (GPT-style partition table with an appropriately named FAT partition). I believe these get searched in PCI slot order, but I haven't confirmed this. ...

Thanks for writing this up, Greg. It matches my expectations about the UEFI bootrom behavior today.

Once one of these disks and partitions is found, the firmware will try to load nonvolatile EFI variables, including the boot order variables, from a file named NvVars in that partition. If no NvVars are found (and so no persistent boot order is present), I believe the firmware will try to (a) load the boot application on the selected disk, and (b) write a new NvVars file containing an EFI_LOAD_OPTION that describes the device and file thereon that the boot manager is currently loading from.

We may also expect to have the persistent vars stored by the ROM interface (OVMF_vars.fd), but that has not been implemented yet, and would require some integration with upstack services to ferry that data around.

Is part of [the boot device] decision what someone could change?

Strictly speaking, yes: our firmware image has a user interface that lets you change the persistent boot options. (In @askfongjojo's screenshot above I think the exit command will drop out to it.)

We could also probably influence this via the fw_cfg bootorder interface, but I don't know how effective that is given the complexities of UEFI boot.

gjcolombo commented 1 year ago

We may also expect to have the persistent vars stored by the ROM interface (OVMF_vars.fd), but that has not been implemented yet, and would require some integration with upstack services to ferry that data around.

I agree; we should do this Someday (tm) but I think it's out of scope for us for the time being.

We could also probably influence this via the fw_cfg bootorder interface, but I don't know how effective that is given the complexities of UEFI boot.

I assume this takes precedence over the nonvolatile variables (it seems like it would be hard to use if it didn't), but I'd have to do more EDK2 source diving to be sure. Or we could just try it with QEMU.

I've been tempted to implement this enlightenment anyway, but I haven't (yet) worked out how Nexus would decide which disk is the boot disk (once it decides it's presumably just an instance spec/disk request flag).

askfongjojo commented 1 year ago

@gjcolombo - I appreciate your getting to the bottom of this issue! The use case of adding a second disk is very common for database and log management apps, and customers are going to hit this problem as soon as they deploy real workloads.

The Nexus part, i.e. the attribute for denoting boot disks, was discussed in https://github.com/oxidecomputer/omicron/issues/1417. It was deemed lower priority because users don't need the system to tell them which disk is the boot disk. But it didn't occur to me that the lack of such a flag could cause that information to be lost later on. I had assumed that the labels in /etc/fstab were the source of truth and would be used by the hypervisor.

gjcolombo commented 1 year ago

At today's hypervisor huddle we agreed to try to address this by having the control plane assign persistent PCI slots to attached disks, instead of determining slot assignments on the fly each time an instance starts. I'm accordingly going to transfer this to the Omicron repo and will link some more context there.

gjcolombo commented 1 year ago

The relevant bit of Nexus is here: https://github.com/oxidecomputer/omicron/blob/abfca0b50430593f9aca90902c75f3a3297130c2/nexus/src/app/instance.rs#L623-L636

The disk request has room for a PCI slot, but the slot selection is determined from the disk enumeration order and not from an assignment in CRDB. We should assign persistent slots for disks instead (in their disk records). I think there are DB helpers to do this for NICs, so there's probably a lot of code we can borrow to do that.
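
For illustration, the selection itself amounts to picking the lowest unused slot among the eight available when a disk is attached. A minimal sketch with hypothetical names (in the actual fix this selection runs inside the attach query, as discussed further down):

    use std::collections::HashSet;

    /// Hypothetical sketch: choose a persistent slot for a newly attached disk
    /// by taking the lowest slot in 0..8 not already used by another disk
    /// attached to the same instance (mirroring the NIC slot approach).
    fn allocate_disk_slot(slots_in_use: &HashSet<u8>) -> Option<u8> {
        (0..8u8).find(|slot| !slots_in_use.contains(slot))
    }

    fn main() {
        // Disks already occupy slots 0 and 1, so the next attach gets slot 2.
        let in_use = HashSet::from([0u8, 1]);
        assert_eq!(allocate_disk_slot(&in_use), Some(2));

        // Detaching the slot-0 disk frees that slot for reuse without
        // renumbering the disks that remain attached.
        let in_use = HashSet::from([1u8, 2]);
        assert_eq!(allocate_disk_slot(&in_use), Some(0));
    }
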

gjcolombo commented 1 year ago

@zephraph, @ahl, and I discussed this today. Because this issue... materially reduces the utility of the disk attach/detach APIs, to the point where if we don't fix it we might not be able to comfortably expose those APIs at FCS, we're planning to move it to that milestone.

To fix this specific problem we need to do the following:

  1. Assign each disk a persistent slot in its disk record when it's attached to an instance, picking an unused slot the way we already do for NIC slots.
  2. Populate the slot in each DiskRequest from that stored value when starting the instance, instead of deriving it from the enumeration order.

No Propolis changes should be required here--once the slot in the DiskRequest is populated consistently, Propolis should take care of the rest.

In addition to the work above, we'd like to change the instance create API so that one of the disks is labeled as the "preferred boot disk." The instance create saga would then ensure that this disk is always attached first so that it ends up in slot 0. If the disk is bootable and my reading of the EDK2 code is correct (see above), this will cause its boot application to be chosen before the boot applications on any other attached bootable disks. We'll file a separate issue for this.

In the longer term (MVP or even later), we can also consider supporting the QEMU fw_cfg bootorder interface. I'll capture that idea into a short RFD when I have a moment of free time and inspiration.

bnaecker commented 1 year ago

@gjcolombo I'm familiar with the NextItem query and its use in the NIC allocation stuff, so don't hesitate to give a shout if I can help on that bit of this!

gjcolombo commented 1 year ago

I've looked at this in a little more detail. The NextItem subquery mechanics make sense to me in general. The challenge is working out whether we can use them more or less as-is in the "attach to collection" CTE. I don't currently see an easy way to do that. If that's correct, I think we'll have to do one of the following:

  1. Restructure the "attach to collection" CTE to allow it to function more like the NIC attachment queries while maintaining its generic "attaching X to container Y" character. This would entail allowing callers to supply a subquery similar to the interface validation CTE (see push_interface_validation_cte) and then giving them a way to refer to it in the generic update statement (see the select_from_cte helper routine).
  2. Abandon the generic collection-attachment CTE and hand-roll the necessary CTEs for disks.

Approach #1 might let us unify some disk and network interface code (and could possibly be reused in the future if we have other kinds of resources we want to attach), but it seems like a lot more work than approach #2, if it's even feasible at all (I would rate my Diesel skills at about a 2 out of 10, just enough to know that I have no idea whether this could actually work). I also wonder if YAGNI applies here--there just aren't that many kinds of resources that we attach to other resources this way; is the juice of a highly-configurable, very generic attachment mechanism worth the squeeze given how few resources we try to pair up this way?

gjcolombo commented 1 year ago

I think I've found a way forward here that lets us use the existing NextItem logic alongside the existing DatastoreAttachTarget logic (it turns out to be relatively straightforward to handcraft the SET clause of the UPDATE statement in the disk attach CTE so that the SET can leverage NextItem). The resulting change is not quite done (needs more tests and has some TODOs), but I should be able to get it mostly polished up and posted to GH tomorrow before I go OOO next week.

Here's a sample debug_query dump of the queries the new code is generating; this at least looks like what I intended, and the basic instance creation saga test passes with it:

"WITH collection_by_id AS 
(SELECT \"instance\".\"id\",
    \"instance\".\"name\",
    \"instance\".\"description\",
    \"instance\".\"time_created\",
    \"instance\".\"time_modified\",
    \"instance\".\"time_deleted\",
    \"instance\".\"project_id\",
    \"instance\".\"user_data\",
    \"instance\".\"state\",
    \"instance\".\"time_state_updated\",
    \"instance\".\"state_generation\",
    \"instance\".\"active_sled_id\",
    \"instance\".\"active_propolis_id\",
    \"instance\".\"active_propolis_ip\",
    \"instance\".\"target_propolis_id\",
    \"instance\".\"migration_id\",
    \"instance\".\"propolis_generation\",
    \"instance\".\"ncpus\",
    \"instance\".\"memory\",
    \"instance\".\"hostname\"
    FROM \"instance\"
    WHERE ((\"instance\".\"id\" = $1) AND
              (\"instance\".\"time_deleted\" IS NULL)) FOR UPDATE),
resource_by_id AS
(SELECT \"disk\".\"id\",
    \"disk\".\"name\",
    \"disk\".\"description\",
    \"disk\".\"time_created\",
    \"disk\".\"time_modified\",
    \"disk\".\"time_deleted\",
    \"disk\".\"rcgen\",
    \"disk\".\"project_id\",
    \"disk\".\"volume_id\",
    \"disk\".\"disk_state\",
    \"disk\".\"attach_instance_id\",
    \"disk\".\"state_generation\",
    \"disk\".\"time_state_updated\",
    \"disk\".\"slot\",
    \"disk\".\"size_bytes\",
    \"disk\".\"block_size\",
    \"disk\".\"origin_snapshot\",
    \"disk\".\"origin_image\",
    \"disk\".\"pantry_address\"
    FROM \"disk\"
    WHERE ((\"disk\".\"id\" = $2) AND
          (\"disk\".\"time_deleted\" IS NULL)) FOR UPDATE),
resource_count AS
(SELECT COUNT(*)
 FROM \"disk\"
 WHERE ((\"disk\".\"attach_instance_id\" = $3) AND (\"disk\".\"time_deleted\" IS NULL))),
collection_info AS
(SELECT \"instance\".\"id\",
    \"instance\".\"name\",
    \"instance\".\"description\",
    \"instance\".\"time_created\",
    \"instance\".\"time_modified\",
    \"instance\".\"time_deleted\",
    \"instance\".\"project_id\",
    \"instance\".\"user_data\",
    \"instance\".\"state\",
    \"instance\".\"time_state_updated\",
    \"instance\".\"state_generation\",
    \"instance\".\"active_sled_id\",
    \"instance\".\"active_propolis_id\",
    \"instance\".\"active_propolis_ip\",
    \"instance\".\"target_propolis_id\",
    \"instance\".\"migration_id\",
    \"instance\".\"propolis_generation\",
    \"instance\".\"ncpus\",
    \"instance\".\"memory\",
    \"instance\".\"hostname\"
    FROM \"instance\"
    WHERE (((\"instance\".\"state\" = ANY($4)) AND
          (\"instance\".\"id\" = $5)) AND
          (\"instance\".\"time_deleted\" IS NULL)) FOR UPDATE),
resource_info AS
(SELECT \"disk\".\"id\",
    \"disk\".\"name\",
    \"disk\".\"description\",
    \"disk\".\"time_created\",
    \"disk\".\"time_modified\",
    \"disk\".\"time_deleted\",
    \"disk\".\"rcgen\",
    \"disk\".\"project_id\",
    \"disk\".\"volume_id\",
    \"disk\".\"disk_state\",
    \"disk\".\"attach_instance_id\",
    \"disk\".\"state_generation\",
    \"disk\".\"time_state_updated\",
    \"disk\".\"slot\",
    \"disk\".\"size_bytes\",
    \"disk\".\"block_size\",
    \"disk\".\"origin_snapshot\",
    \"disk\".\"origin_image\",
    \"disk\".\"pantry_address\"
    FROM \"disk\"
    WHERE ((((\"disk\".\"disk_state\" = ANY($6)) AND
          (\"disk\".\"id\" = $7)) AND
          (\"disk\".\"time_deleted\" IS NULL)) AND
          (\"disk\".\"attach_instance_id\" IS NULL)) FOR UPDATE),
do_update AS
(SELECT IF(EXISTS(SELECT \"id\" FROM collection_info) AND
           EXISTS(SELECT \"id\" FROM resource_info) AND
       (SELECT * FROM resource_count) < 8, TRUE, FALSE)),
updated_resource AS
(UPDATE \"disk\" SET \"attach_instance_id\" = $8,
             \"disk_state\" = $9,
             \"slot\" = (SELECT $10 + \"shift\" AS \"slot\" FROM 
                    (SELECT generate_series(0, $11) AS \"shift\"
                UNION ALL SELECT generate_series($12, -1) AS \"shift\")
                LEFT OUTER JOIN \"disk\"
                ON (\"attach_instance_id\", \"slot\", \"time_deleted\" IS NULL) = 
                   ($13, $14 + \"shift\", TRUE)
                WHERE \"slot\" IS NULL
                LIMIT 1)
WHERE (\"disk\".\"id\" = $15)
AND (SELECT * FROM do_update)
RETURNING \"disk\".\"id\",
      \"disk\".\"name\",
      \"disk\".\"description\",
      \"disk\".\"time_created\",
      \"disk\".\"time_modified\",
      \"disk\".\"time_deleted\",
      \"disk\".\"rcgen\",
      \"disk\".\"project_id\",
      \"disk\".\"volume_id\",
      \"disk\".\"disk_state\",
      \"disk\".\"attach_instance_id\",
      \"disk\".\"state_generation\",
      \"disk\".\"time_state_updated\",
      \"disk\".\"slot\",
      \"disk\".\"size_bytes\",
      \"disk\".\"block_size\",
      \"disk\".\"origin_snapshot\",
      \"disk\".\"origin_image\",
      \"disk\".\"pantry_address\")
SELECT * FROM 
    (SELECT * FROM resource_count) LEFT JOIN
    (SELECT * FROM collection_by_id) ON TRUE LEFT JOIN
    (SELECT * FROM resource_by_id) ON TRUE LEFT JOIN
    (SELECT * FROM updated_resource) ON TRUE;",

    binds: [
        // $1: instance ID
        66894238-988f-4ab1-aa85-4cb44a0c94f7,
        // $2: disk ID
        aa1d8ae3-dc72-4c78-a94d-5c013d1034f0,
        // $3: instance ID
        66894238-988f-4ab1-aa85-4cb44a0c94f7,
        // $4: valid instance states for attach
        [
            InstanceState(
                Creating,
            ),
            InstanceState(
                Stopped,
            ),
        ],
        // $5: instance ID
        66894238-988f-4ab1-aa85-4cb44a0c94f7,
        // $6: valid disk states for attach
        [
            "creating",
            "detached",
        ],
        // $7: disk ID
        aa1d8ae3-dc72-4c78-a94d-5c013d1034f0,
        // $8: attach instance ID to set
        66894238-988f-4ab1-aa85-4cb44a0c94f7,
        // $9: disk state to set
        "attached",
        // $10: shift base
        0,
        // $11: shift max
        8,
        // $12: shift min
        0,
        // $13: expected attach instance ID
        66894238-988f-4ab1-aa85-4cb44a0c94f7,
        // $14: shift base
        0,
        // $15: disk ID
        aa1d8ae3-dc72-4c78-a94d-5c013d1034f0,
    ],
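
For readers of the generated SQL above: the slot assignment lives in the updated_resource CTE. The subquery in its SET clause uses generate_series (driven by the shift base/min/max binds) to produce candidate slot offsets, left-joins each candidate against the disks already attached to this instance at that slot, filters to candidates where the join found no row (the WHERE slot IS NULL clause), and takes the first survivor with LIMIT 1; with a base of 0, that is effectively the lowest free slot. The do_update CTE then gates the whole UPDATE on the instance and disk being in attachable states and on fewer than 8 disks already being attached to the instance.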