oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Tracking: Instance Lifecycle Overhaul #3742

Open smklein opened 1 year ago

smklein commented 1 year ago
gjcolombo commented 1 year ago

#2315 also tracks the "instances without sleds" work. It probably depends on #2824, since starting an instance with no resource reservation is a multi-step process.

Nexus can use an RPW to look for instances that are marked as "failed + auto_boot_on_fault", and re-provision them in the background.

If we use the existing Failed state for this, we'll need to make sure that

We might decide to have different failure reasons to help us distinguish these cases.
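As a rough illustration of the background pass described above, here is a minimal sketch of one activation that scans for instances that are both failed and flagged `auto_boot_on_fault`, then tries to place them again. Everything here is an assumption made for illustration: the `Instance` struct, the trimmed `InstanceState` enum, the `pick_sled` callback, and `reprovision_pass` are hypothetical names, not Nexus types or APIs, and a real implementation would presumably go through the database and a start saga rather than mutate records in memory.

```rust
/// Simplified instance states; only the ones this sketch needs.
/// (Hypothetical: not the real Nexus state enum.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Stopped,
    Starting,
    Failed,
}

/// Hypothetical, trimmed-down instance record.
#[derive(Debug, Clone)]
struct Instance {
    id: u32,
    state: InstanceState,
    /// The "auto_boot_on_fault" flag mentioned in the comment above.
    auto_boot_on_fault: bool,
    /// Sled currently reserved for the instance, if any.
    sled: Option<u32>,
}

/// True when the background task should try to bring the instance back.
fn needs_reprovision(instance: &Instance) -> bool {
    instance.state == InstanceState::Failed && instance.auto_boot_on_fault
}

/// One activation of the background task. `pick_sled` stands in for
/// whatever resource reservation step is used (an assumption); it returns
/// `None` when no sled has capacity. Returns the IDs that were restarted.
fn reprovision_pass(
    instances: &mut [Instance],
    mut pick_sled: impl FnMut(&Instance) -> Option<u32>,
) -> Vec<u32> {
    let mut restarted = Vec::new();
    for instance in instances.iter_mut() {
        if !needs_reprovision(instance) {
            continue;
        }
        if let Some(sled) = pick_sled(&*instance) {
            instance.sled = Some(sled);
            instance.state = InstanceState::Starting;
            restarted.push(instance.id);
        }
        // If no sled was available, the instance stays Failed and the
        // next activation retries it.
    }
    restarted
}
```

Because the pass only acts on the `Failed` plus `auto_boot_on_fault` combination and leaves everything else untouched, running it again after a partial failure is harmless, which is the property a periodically activated background task needs.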

smklein commented 1 year ago

See also: https://github.com/oxidecomputer/omicron/issues/2825

hawkw commented 2 months ago

Most of the stuff described in "Updating Instance State Within Nexus" was implemented in a combination of #5611, #5759, and #6503. The proactive registration of sled-agents with Nexus isn't something we've done yet.