Open bnaecker opened 1 year ago
I can dig into this in a bit, but as a very short-term workaround: If you delete `/var/oxide`, you'll remove the sled agent's notion of:
Yeah, we're going to reboot to get a fresh state. But we can definitely delete those files if we hit this again. Thanks @smklein
@leftwo and I are getting Omicron running on the dogfood rack. We're currently on `BRM42220070`. I made a quick update to Omicron using this patch. We can't run `omicron-package uninstall`, since that currently destroys the `cxgbe{0,1}` links and IP addresses that we currently need to log in to the machine. We instead rebuilt, and then ran `OMICRON_NO_UNINSTALL=1 omicron-package install`. That unpacks the tarballs into the correct place, and then calls `svccfg import` with the sled-agent manifest. We see this in the logfile:

That's because `svccfg import` only updates any changed SMF properties and does not actually do anything like `svcadm restart`. So Alan and I ran `svcadm restart` manually. At this point, the sled agent started up. One of the first things it currently does is destroy any existing Oxide zones, VNICs and IP addresses. (Changing this to be more idempotent is tracked in #724.) It then recreates those objects. We then see this in the logs:

You can see the sled agent load a service manifest from `/var/oxide/service.toml`. It then goes on to start Nexus, Oximeter, and the internal DNS service in zones. Towards the end, we then see some messages about loading an RSS plan from disk, and handing off to Nexus. We hit an unwrap at `sled-agent/src/rack_setup/service.rs:451`, where we're apparently looking up Nexus's IP by service name, and panicking if that fails.

I'm not sure what should happen here, or if Alan and I have put us in some unexpected situation by unpacking / uninstalling / reinstalling manually.
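For illustration only, here is a rough sketch of what turning that unwrap into a recoverable error could look like. It assumes the lookup behaves like a map from service name to address, which may not match what `service.rs:451` actually does. `HandoffError` and `lookup_service_ip` are hypothetical names made up for this example, not Omicron APIs.

```rust
use std::collections::HashMap;
use std::fmt;
use std::net::Ipv6Addr;

/// Hypothetical error type for this sketch; not taken from Omicron.
#[derive(Debug, PartialEq)]
pub enum HandoffError {
    /// No address is known for the requested service name.
    ServiceNotFound(String),
}

impl fmt::Display for HandoffError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            HandoffError::ServiceNotFound(name) => {
                write!(f, "no address found for service {name:?}")
            }
        }
    }
}

impl std::error::Error for HandoffError {}

/// Look up a service's address by name.
///
/// Returns an error instead of panicking when the name is missing, so the
/// caller can decide whether to retry, log, or abort the handoff.
/// The map stands in for whatever the real lookup consults (an assumption).
pub fn lookup_service_ip(
    services: &HashMap<String, Ipv6Addr>,
    name: &str,
) -> Result<Ipv6Addr, HandoffError> {
    services
        .get(name)
        .copied()
        .ok_or_else(|| HandoffError::ServiceNotFound(name.to_string()))
}
```

With something along these lines, a missing Nexus entry would surface as an `Err` that the handoff path could report or retry on, rather than taking down the sled agent. Whether retrying is actually the right behavior here is a separate question.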