If we can repro this, it'd be interesting to see the output of ls -la /dev/disk/by-path (in addition to lsblk) before and after the disks are added to see how the guest is assigning PCI slots to disks. The control plane doesn't take any steps to guarantee that disks are assigned the same PCI slots between attempts to boot an instance (slots are assigned based on the order in which they're returned from the CockroachDB query for disks attached to an instance). I would be interested to see if those assignments have changed here in a way that's made the guest think that it should relabel its disks.
Started vm again and it stayed in starting state forever, presumably because it used the incorrect disk to boot?
We'll probably need sled agent and Propolis logs for this one. Propolis moves instances to Running as soon as their vCPUs start; an instance with a missing or misconfigured boot disk will usually just get stuck in a firmware boot menu. If an instance goes to Starting but never leaves it, it's likely that Propolis either got stuck in some VM initialization step or panicked before it successfully started to run.
I have another inaccessible instance from SSH after adding a disk. Looking at the serial console, it appears to use the data disk as opposed to the boot disk when starting up:
Unfortunately, I can't get into the instance to locate the /dev/disk/by-path content.
This is consistent with the boot disk getting assigned a different PCI slot from the one it was assigned when it was first booted from. I'll try to take a closer look at the instance's logs to confirm.
What are the names of the disks that you have attached to your instance?
I found that on older Omicron/Console, the disks were attached to the instance based on their alphabetical name order. Changing the disk names may help you get your instance back online.
There are disks attached to this instance at PCI BDFs 0.16.0 and 0.17.0. It looks like 0.16.0 is the data disk (note that figuring this out is annoying because you have to match the volume ID from backend creation time to the startup-complete message later on in initialization; filed oxidecomputer/propolis#421 to improve this):
May 22 19:57:26.103 INFO Creating storage device data-for-debian-host of kind Nvme
May 22 19:57:26.103 INFO Creating Crucible disk from request Volume { id: 39f891a2-1f03-4a3a-9e7c-07d07beca223, block_size: 4096, sub_volumes: [Region { block_size: 4096, blocks_per_extent: 16384, extent_count: 160, opts: CrucibleOpts { id: 39f891a2-1f03-4a3a-9e7c-07d07beca223, target: [[fd00:1122:3344:105::b]:19000, [fd00:1122:3344:107::5]:19000, [fd00:1122:3344:107::8]:19000], lossy: false, flush_timeout: None, key: Some("B7qx6JPR8ihobQV8a05nB9aPR8EM9o4reriHcnZRJU0="), cert_pem: None, key_pem: None, root_cert_pem: None, control: None, read_only: false }, gen: 1 }], read_only_parent: None }
...
May 22 19:57:26.822 INFO Sending startup complete to pci-nvme-0.16.0, component: vm_controller
// gjc: note that the Crucible IDs match; I'm assuming the Propolis inventory iterator is walking these in device-backend pairs
May 22 19:57:26.822 INFO Sending startup complete to block-crucible-39f891a2-1f03-4a3a-9e7c-07d07beca223, component: vm_controller
The other disk (run-something-debian11-2a6bb6) appears to have been assigned to 0.17.0. This is consistent with the disks getting assigned PCI slots in alphabetical order, as @leftwo points out above. (IIUC, the reason this dumps you into a UEFI shell is that on first boot, the bootrom writes a boot device order to the EFI system partition on the boot disk, and the disk entries in that boot order are identified by PCI BDF; if you then move the disk to another slot and try to boot, the bootrom finds the saved boot order, but doesn't find a bootable disk with the specified PCI device number, and so dumps you into a shell.)
@leftwo I was thinking over the weekend that we should consider teaching Nexus to assign stable PCI slot numbers to an instance's attached disks. Nexus already does this for NICs, so there's some precedent there (and some code we can reuse to allocate unused slots when new disks are attached). I suspect that would help prevent this kind of problem. WDYT?
Another option is to implement the QEMU fwcfg boot device enlightenment. With this, Nexus could indicate to Propolis which disk is supposed to be the boot disk, and Propolis would surface that information to guest firmware so that it can boot from whatever disk Propolis indicated. The catch there is that then the Nexus DB model needs a way to label a disk as a boot disk; if that concept doesn't exist today, we'd have to add it, design APIs to set/clear it, etc., which might require a bigger lift than just ensuring that disks have stable PCI slot assignments.
@gjcolombo I think we (not you and I specifically, but us as a company) have discussed a few ways to do this along with the pitfalls associated with each.
Is a given PCI slot number always going to be the "boot disk", or is part of that decision what someone could change? Do all OS's try first to boot from a specific PCI slot number?
It does seem like what you suggest--assigning stable PCI slot numbers and exposing which PCI slot a disk will be located at--would get us 90% of the way there, and at least wouldn't allow a new disk to change the numbering for existing ones.
My understanding of our guest firmware's behavior is somewhat limited, but I think the answers are as follows, based on my hasty reading of the EDK2 code (all the below is AIUI):
Is a given PCI slot number always going to be the "boot disk"?
Sometimes!
The guest firmware image has some logic to look for PCI mass storage devices that look like they might contain a properly-formatted EFI system partition (GPT-style partition table with an appropriately named FAT partition). I believe these get searched in PCI slot order, but I haven't confirmed this.
Once one of these disks and partitions is found, the firmware will try to load nonvolatile EFI variables, including the boot order variables, from a file named NvVars in that partition. If no NvVars file is found (and so no persistent boot order is present), I believe the firmware will try to (a) load the boot application on the selected disk, and (b) write a new NvVars file containing an EFI_LOAD_OPTION that describes the device and file thereon that the boot manager is currently loading from.
So there's nothing in firmware that says, e.g., that slot 16 is always going to be the boot disk. The firmware will try to load the boot applications specified in the load options in whatever nonvolatile variables it finds. If it finds none, it will fall back to the first viable-looking boot application it found, irrespective of its PCI slot assignment.
To be clear, all this is just what the firmware does. Once it loads the boot application (e.g. GRUB, Windows bootmgr, what-have-you), that application may further enumerate devices/disks/partitions/etc. to present another array of boot options to the user.
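Putting the above together, here's roughly the decision flow as I understand it, written as a small Rust-flavored sketch rather than actual EDK2 code. All of the type and function names here are invented for illustration, and the \EFI\BOOT\BOOTX64.EFI fallback path is just the standard removable-media default, not something I've confirmed our image uses:

```rust
// Invented types standing in for firmware concepts; this is a paraphrase of
// the prose above, not actual EDK2 logic.
struct Disk {
    pci_slot: u8,
    has_efi_system_partition: bool,
}

struct LoadOption {
    pci_device_number: u8, // the PCI device number from the entry's device path (BDF)
    boot_app_path: String,
}

fn choose_boot_app(disks: &[Disk], nv_vars: Option<&[LoadOption]>) -> Option<(u8, String)> {
    match nv_vars {
        // A persisted boot order exists: honor it, matching entries by PCI BDF.
        // If nothing matches (e.g. the disk moved to a different slot), fall
        // through to the UEFI shell (modeled here as returning None).
        Some(options) => options.iter().find_map(|opt| {
            disks
                .iter()
                .find(|d| d.pci_slot == opt.pci_device_number && d.has_efi_system_partition)
                .map(|d| (d.pci_slot, opt.boot_app_path.clone()))
        }),
        // No NvVars yet: pick the first viable-looking ESP (searched in PCI
        // slot order, I believe). The real firmware would also write a new
        // NvVars load option for whatever it picked here.
        None => disks
            .iter()
            .filter(|d| d.has_efi_system_partition)
            .min_by_key(|d| d.pci_slot)
            .map(|d| (d.pci_slot, String::from(r"\EFI\BOOT\BOOTX64.EFI"))),
    }
}
```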
Is part of [the boot device] decision what someone could change?
Strictly speaking, yes: our firmware image has a user interface that lets you change the persistent boot options. (In @askfongjojo's screenshot above I think the exit command will drop out to it.)
However, working with this menu is a cruddy experience. Users won't generally see it unless something is very wrong (as happened above). If they do see it, they have to know to use the serial console to try to fix the problem.
Users might have more control over what the boot application does (e.g. they can more easily edit their GRUB configuration), but that's not what we're looking at here.
Do all OS's try first to boot from a specific PCI slot number?
By the time you're running an OS boot application, I think the danger above has passed--we've already chosen a PCI device to attempt to boot from.
In this specific case, the problem we're dealing with comes from the first part of the above. I think the sequence here is roughly this: the instance's VolumeConstructionRequests are presented to sled agent, which passes them on to Propolis, and slots are assigned from the order of that list (disks[0] gets slot 16, disks[1] gets slot 17, and so on up to slot 23). If Nexus chose PCI slot assignments explicitly at instance creation and stored them, then those assignments would still be in effect when the disk requests are built at instance start, and the order in which the disk requests were passed to sled agent wouldn't matter.
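To make the ordering dependence concrete (the 16 below just mirrors the BDFs observed in the logs above; this is an illustration, not Propolis code):

```rust
// Illustrative only: the guest-visible PCI device number appears to follow
// from a disk's position in the request list, so reordering the list
// renumbers the disks from the guest's point of view.
fn guest_pci_device_number(disk_request_index: u8) -> u8 {
    assert!(disk_request_index < 8, "at most 8 disks per instance");
    16 + disk_request_index // disks[0] -> BDF 0.16.0, ..., disks[7] -> 0.23.0
}
```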
The guest firmware image has some logic to look for PCI mass storage devices that look like they might contain a properly-formatted EFI system partition (GPT-style partition table with an appropriately named FAT partition). I believe these get searched in PCI slot order, but I haven't confirmed this. ...
Thanks for writing this up, Greg. It matches my expectations about the UEFI bootrom behavior today.
Once one of these disks and partitions is found, the firmware will try to load nonvolatile EFI variables, including the boot order variables, from a file named NvVars in that partition. If no NvVars file is found (and so no persistent boot order is present), I believe the firmware will try to (a) load the boot application on the selected disk, and (b) write a new NvVars file containing an EFI_LOAD_OPTION that describes the device and file thereon that the boot manager is currently loading from.
We may also expect to have the persistent vars stored by the ROM interface (OVMF_vars.fd), but that has not been implemented yet, and would require some integration with upstack services to ferry that data around.
Is part of [the boot device] decision what someone could change?
Strictly speaking, yes: our firmware image has a user interface that lets you change the persistent boot options. (In @askfongjojo's screenshot above I think the exit command will drop out to it.)
We could also probably influence this via the fw_cfg bootorder interface, but I don't know how effective that is given the complexities of UEFI boot.
We may also expect to have the persistent vars stored by the ROM interface (OVMF_vars.fd), but that has not been implemented yet, and would require some integration with upstack services to ferry that data around.
I agree; we should do this Someday (tm) but I think it's out of scope for us for the time being.
We could also probably influence this via the fw_cfg bootorder interface, but I don't know how effective that is given the complexities of UEFI boot.
I assume this takes precedence over the nonvolatile variables (it seems like it would be hard to use if it didn't), but I'd have to do more EDK2 source diving to be sure. Or we could just try it with QEMU.
I've been tempted to implement this enlightenment anyway, but I haven't (yet) worked out how Nexus would decide which disk is the boot disk (once it decides it's presumably just an instance spec/disk request flag).
@gjcolombo - I appreciate your getting to the bottom of this issue! The use case of adding a second disk is very common for database and log management apps, and customers are going to hit this problem as soon as they deploy real workloads.
The Nexus part, i.e. the attribute for denoting boot disks, was discussed in https://github.com/oxidecomputer/omicron/issues/1417. It was deemed lower priority because users don't need the system to tell them which disk is the boot disk. But it didn't occur to me that the lack of such a flag could cause the information to be lost afterwards. I had assumed that the labels in /etc/fstab were the source of truth and would be used by the hypervisor.
At today's hypervisor huddle we agreed to try to address this by having the control plane assign and persist PCI slots for disks when they're attached, instead of determining slot assignments on the fly at instance start. I'm accordingly going to transfer this to the Omicron repo and will link some more context there.
The relevant bit of Nexus is here: https://github.com/oxidecomputer/omicron/blob/abfca0b50430593f9aca90902c75f3a3297130c2/nexus/src/app/instance.rs#L623-L636
The disk request has room for a PCI slot, but the slot selection is determined from the disk enumeration order and not from an assignment in CRDB. We should assign persistent slots for disks instead (in their disk records). I think there are DB helpers to do this for NICs, so there's probably a lot of code we can borrow to do that.
@zephraph, @ahl, and I discussed this today. Because this issue... materially reduces the utility of the disk attach/detach APIs, to the point where if we don't fix it we might not be comfortable exposing those APIs at FCS, we're planning to move it to that milestone.
To fix this specific problem we need to do the following:

1. Add a slot column to the disks table in CRDB; this is probably going to be a nullable integer that can take values from 0-7 inclusive.
2. Use the NextItem helper in nexus::db_queries to select a slot for a disk when attaching it. There's precedent here in the existing queries for network interfaces; the general idea is to construct a CTE with a WHERE clause with the semantics "given an instance ID, return the lowest slot number for which there is no disk attached to the instance that has been assigned that slot number" and then write that into the slot column when updating the disk's attachment state (see the sketch after this list).
3. Clear the slot assignment when a disk is detached.
4. Change the disk_reqs construction logic in the previous comment to populate DiskRequest::slot from the checked-out volume record instead of from the index supplied by enumerate.

No Propolis changes should be required here--once the slot in the DiskRequest is populated consistently, Propolis should take care of the rest.
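To make items 2 and 4 concrete, here's a minimal sketch of the intended semantics written in plain Rust over in-memory data, rather than as the actual NextItem CTE or the real Nexus types (the struct and function names here are hypothetical):

```rust
// Hypothetical stand-ins for the relevant pieces of the disk record and the
// sled agent disk request; the real types live in Nexus and sled agent.
struct DiskRecord {
    name: String,
    slot: Option<u8>, // item 1: nullable slot column, 0-7 inclusive
}

struct DiskRequest {
    name: String,
    slot: u8,
}

const MAX_DISKS_PER_INSTANCE: u8 = 8;

/// Item 2's semantics: given the disks already attached to an instance,
/// return the lowest slot number that none of them has been assigned.
fn lowest_free_slot(attached: &[DiskRecord]) -> Option<u8> {
    (0..MAX_DISKS_PER_INSTANCE)
        .find(|candidate| !attached.iter().any(|d| d.slot == Some(*candidate)))
}

/// Item 4's semantics: build disk requests from the persisted slot instead of
/// from the index produced by enumerating the attached disks.
fn disk_requests(attached: &[DiskRecord]) -> Result<Vec<DiskRequest>, String> {
    let mut reqs = Vec::with_capacity(attached.len());
    for d in attached {
        let slot = d
            .slot
            .ok_or_else(|| format!("disk {} has no slot assigned", d.name))?;
        reqs.push(DiskRequest { name: d.name.clone(), slot });
    }
    Ok(reqs)
}
```

The real item 2 would express the lowest-free-slot search inside the attach CTE rather than in Rust, and the real item 4 would read the slot from the record checked out during disk_reqs construction.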
In addition to the work above, we'd like to change the instance create API so that one of the disks is labeled as the "preferred boot disk." The instance create saga would then ensure that this disk is always attached first so that it ends up in slot 0. If the disk is bootable and my reading of the EDK2 code is correct (see above), this will cause its boot application to be chosen before the boot applications on any other attached bootable disks. We'll file a separate issue for this.
In the longer term (MVP or even later), we can also consider supporting the QEMU fw_cfg bootorder interface. I'll capture that idea into a short RFD when I have a moment of free time and inspiration.
@gjcolombo I'm familiar with the NextItem query and its use in the NIC allocation stuff, so don't hesitate to give a shout if I can help on that bit of this!
I've looked at this in a little more detail. The NextItem subquery mechanics make sense to me in general. The challenge is working out whether we can use them more or less as-is in the "attach to collection" CTE. I don't currently see an easy way to do that. If that's correct, I think we'll have to do one of the following:
1. Generalize the attach CTE machinery so callers can supply their own subqueries (along the lines of push_interface_validation_cte) and then give them a way to refer to them in the generic update statement (see the select_from_cte helper routine).
2. Write a disk-specific attach query that handles the slot selection itself.

Approach #1 might let us unify some disk and network interface code (and could possibly be reused in the future if we have other kinds of resources we want to attach), but it seems like a lot more work than approach #2, if it's even feasible at all (I would rate my Diesel skills at about a 2 out of 10, just enough to know that I have no idea whether this could actually work). I also wonder if YAGNI applies here--there just aren't that many kinds of resources that we attach to other resources this way; is the juice of a highly-configurable, very generic attachment mechanism worth the squeeze given how few resources we try to pair up this way?
I think I've found a way forward here that lets us use the existing NextItem logic alongside the existing DatastoreAttachTarget logic (it turns out to be relatively straightforward to handcraft the SET clause of the UPDATE statement in the disk attach CTE so that the SET can leverage NextItem). The resulting change is not quite done (needs more tests and has some TODOs), but I should be able to get it mostly polished up and posted to GH tomorrow before I go OOO next week.
Here's a sample debug_query dump of the queries the new code is generating; this at least looks like what I intended, and the basic instance creation saga test passes with it:
"WITH collection_by_id AS
(SELECT \"instance\".\"id\",
\"instance\".\"name\",
\"instance\".\"description\",
\"instance\".\"time_created\",
\"instance\".\"time_modified\",
\"instance\".\"time_deleted\",
\"instance\".\"project_id\",
\"instance\".\"user_data\",
\"instance\".\"state\",
\"instance\".\"time_state_updated\",
\"instance\".\"state_generation\",
\"instance\".\"active_sled_id\",
\"instance\".\"active_propolis_id\",
\"instance\".\"active_propolis_ip\",
\"instance\".\"target_propolis_id\",
\"instance\".\"migration_id\",
\"instance\".\"propolis_generation\",
\"instance\".\"ncpus\",
\"instance\".\"memory\",
\"instance\".\"hostname\"
FROM \"instance\"
WHERE ((\"instance\".\"id\" = $1) AND
(\"instance\".\"time_deleted\" IS NULL)) FOR UPDATE),
resource_by_id AS
(SELECT \"disk\".\"id\",
\"disk\".\"name\",
\"disk\".\"description\",
\"disk\".\"time_created\",
\"disk\".\"time_modified\",
\"disk\".\"time_deleted\",
\"disk\".\"rcgen\",
\"disk\".\"project_id\",
\"disk\".\"volume_id\",
\"disk\".\"disk_state\",
\"disk\".\"attach_instance_id\",
\"disk\".\"state_generation\",
\"disk\".\"time_state_updated\",
\"disk\".\"slot\",
\"disk\".\"size_bytes\",
\"disk\".\"block_size\",
\"disk\".\"origin_snapshot\",
\"disk\".\"origin_image\",
\"disk\".\"pantry_address\"
FROM \"disk\"
WHERE ((\"disk\".\"id\" = $2) AND
(\"disk\".\"time_deleted\" IS NULL)) FOR UPDATE),
resource_count AS
(SELECT COUNT(*)
FROM \"disk\"
WHERE ((\"disk\".\"attach_instance_id\" = $3) AND (\"disk\".\"time_deleted\" IS NULL))),
collection_info AS
(SELECT \"instance\".\"id\",
\"instance\".\"name\",
\"instance\".\"description\",
\"instance\".\"time_created\",
\"instance\".\"time_modified\",
\"instance\".\"time_deleted\",
\"instance\".\"project_id\",
\"instance\".\"user_data\",
\"instance\".\"state\",
\"instance\".\"time_state_updated\",
\"instance\".\"state_generation\",
\"instance\".\"active_sled_id\",
\"instance\".\"active_propolis_id\",
\"instance\".\"active_propolis_ip\",
\"instance\".\"target_propolis_id\",
\"instance\".\"migration_id\",
\"instance\".\"propolis_generation\",
\"instance\".\"ncpus\",
\"instance\".\"memory\",
\"instance\".\"hostname\"
FROM \"instance\"
WHERE (((\"instance\".\"state\" = ANY($4)) AND
(\"instance\".\"id\" = $5)) AND
(\"instance\".\"time_deleted\" IS NULL)) FOR UPDATE),
resource_info AS
(SELECT \"disk\".\"id\",
\"disk\".\"name\",
\"disk\".\"description\",
\"disk\".\"time_created\",
\"disk\".\"time_modified\",
\"disk\".\"time_deleted\",
\"disk\".\"rcgen\",
\"disk\".\"project_id\",
\"disk\".\"volume_id\",
\"disk\".\"disk_state\",
\"disk\".\"attach_instance_id\",
\"disk\".\"state_generation\",
\"disk\".\"time_state_updated\",
\"disk\".\"slot\",
\"disk\".\"size_bytes\",
\"disk\".\"block_size\",
\"disk\".\"origin_snapshot\",
\"disk\".\"origin_image\",
\"disk\".\"pantry_address\"
FROM \"disk\"
WHERE ((((\"disk\".\"disk_state\" = ANY($6)) AND
(\"disk\".\"id\" = $7)) AND
(\"disk\".\"time_deleted\" IS NULL)) AND
(\"disk\".\"attach_instance_id\" IS NULL)) FOR UPDATE),
do_update AS
(SELECT IF(EXISTS(SELECT \"id\" FROM collection_info) AND
EXISTS(SELECT \"id\" FROM resource_info) AND
(SELECT * FROM resource_count) < 8, TRUE, FALSE)),
updated_resource AS
(UPDATE \"disk\" SET \"attach_instance_id\" = $8,
\"disk_state\" = $9,
\"slot\" = (SELECT $10 + \"shift\" AS \"slot\" FROM
(SELECT generate_series(0, $11) AS \"shift\"
UNION ALL SELECT generate_series($12, -1) AS \"shift\")
LEFT OUTER JOIN \"disk\"
ON (\"attach_instance_id\", \"slot\", \"time_deleted\" IS NULL) =
($13, $14 + \"shift\", TRUE)
WHERE \"slot\" IS NULL
LIMIT 1)
WHERE (\"disk\".\"id\" = $15)
AND (SELECT * FROM do_update)
RETURNING \"disk\".\"id\",
\"disk\".\"name\",
\"disk\".\"description\",
\"disk\".\"time_created\",
\"disk\".\"time_modified\",
\"disk\".\"time_deleted\",
\"disk\".\"rcgen\",
\"disk\".\"project_id\",
\"disk\".\"volume_id\",
\"disk\".\"disk_state\",
\"disk\".\"attach_instance_id\",
\"disk\".\"state_generation\",
\"disk\".\"time_state_updated\",
\"disk\".\"slot\",
\"disk\".\"size_bytes\",
\"disk\".\"block_size\",
\"disk\".\"origin_snapshot\",
\"disk\".\"origin_image\",
\"disk\".\"pantry_address\")
SELECT * FROM
(SELECT * FROM resource_count) LEFT JOIN
(SELECT * FROM collection_by_id) ON TRUE LEFT JOIN
(SELECT * FROM resource_by_id) ON TRUE LEFT JOIN
(SELECT * FROM updated_resource) ON TRUE;",
binds: [
// $1: instance ID
66894238-988f-4ab1-aa85-4cb44a0c94f7,
// $2: disk ID
aa1d8ae3-dc72-4c78-a94d-5c013d1034f0,
// $3: instance ID
66894238-988f-4ab1-aa85-4cb44a0c94f7,
// $4: valid instance states for attach
[
InstanceState(
Creating,
),
InstanceState(
Stopped,
),
],
// $5: instance ID
66894238-988f-4ab1-aa85-4cb44a0c94f7,
// $6: valid disk states for attach
[
"creating",
"detached",
],
// $7: disk ID
aa1d8ae3-dc72-4c78-a94d-5c013d1034f0,
// $8: attach instance ID to set
66894238-988f-4ab1-aa85-4cb44a0c94f7,
// $9: disk state to set
"attached",
// $10: shift base
0,
// $11: shift max
8,
// $12: shift min
0,
// $13: expected attach instance ID
66894238-988f-4ab1-aa85-4cb44a0c94f7,
// $14: shift base
0,
// $15: disk ID
aa1d8ae3-dc72-4c78-a94d-5c013d1034f0,
],
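For anyone who wants to reproduce a dump like the one above: it comes from Diesel's debug_query helper, which wraps a built-but-not-yet-executed query and Debug-prints the SQL along with its bind parameters. A minimal sketch, assuming the postgres backend:

```rust
use diesel::pg::Pg;
use diesel::query_builder::QueryFragment;

// Debug-print the SQL and bind parameters for any built Diesel query without
// executing it (this is how dumps like the one above are produced).
fn dump_query<T: QueryFragment<Pg>>(query: &T) {
    println!("{:#?}", diesel::debug_query::<Pg, _>(query));
}
```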
I couldn't replicate this issue consistently, but this is the instance where I ran into it.
Order of events:
- I noticed the boot disk changed from nvme0n1 to nvme2n1. Assuming that nvme3n1 was one of the newly attached disks, I tried to format it but then found that it was actually disk1 (i.e. the device previously named nvme1n1).
- Started the VM again and it stayed in starting state forever, presumably because it used the incorrect disk to boot?