oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

sled-agent cold boot: "dataset does not exist" #3791

Open jgallagher opened 1 year ago

jgallagher commented 1 year ago

While cold-booting madrid, the sled-agent on the scrimlet bounced twice during startup. The first bounce is covered by #3789 and #3790. The error message on the second bounce was:

sled-agent: Error managing sled agent: Could not start sled agent server: Failed to initialize zone: oxz_internal_dns_68827de9-15c7-494d-b729-97ef376afe21 errored with Failed to install zone 'oxz_internal_dns_68827de9-15c7-494d-b729-97ef376afe21' from '/pool/int/ab65d7ba-d52f-45a9-a428-83dcc480c0b6/install/internal_dns.tar.gz': Failed to execute zoneadm command 'Install' for zone 'oxz_internal_dns_68827de9-15c7-494d-b729-97ef376afe21': Failed to parse command output: exit code 1
stdout:

stderr:
could not verify zfs dataset oxp_b9b2c5ee-136f-4f33-a97e-a61fc742575a/internal_dns: dataset does not exist
zoneadm: zone oxz_internal_dns_68827de9-15c7-494d-b729-97ef376afe21 failed to verify
[ Dec 28 00:05:57 Stopping because all processes in service exited. ]

Searching back through the log for the UUID in that dataset name, it looks like sled-agent had found the pool:

00:05:32.630Z INFO SledAgent (BootstrapAgent): Automatically destroying dataset: oxp_b9b2c5ee-136f-4f33-a97e-a61fc742575a/crypt/zone
...
00:05:35.855Z INFO SledAgent (BootstrapAgent): Storage manager processing zpool: ZpoolInfo {
        name: "oxp_b9b2c5ee-136f-4f33-a97e-a61fc742575a",
        size: 3195455668224,
        allocated: 1948992512,
        free: 3193506675712,
        health: Online,
    }

so it isn't clear why zoneadm would have failed. After SMF restarted sled-agent, it came up successfully on the third attempt, so it seems likely that adding retries around service startup (#3790) will address this too.

The full log from this sled-agent is at /net/catacomb/data/staff/core/madrid/omicron-3789.

jclulow commented 1 year ago

I'm not sure I would move straight to a retry if we're not sure why this is happening. It seems like there could be a race of some kind, or perhaps we're not correctly waiting for a pool import or dataset mount to complete. I suspect that could be bad if sled-agent is removing old things and that's racing with zoneadm creating new ones.
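
To make the "not waiting for the dataset" hypothesis concrete, here is a minimal sketch of an explicit wait for a dataset to become visible before a zone install is attempted. It is a diagnostic illustration only, not sled-agent's actual code: `dataset_exists` and `wait_until` are hypothetical helpers, and the sketch assumes a `zfs` binary on the PATH and that `zfs list` exits non-zero for a missing dataset. It would narrow down a race rather than explain one.

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: returns true if `zfs list` can see the dataset.
/// Assumes `zfs` is on the PATH; any failure to run it counts as "missing".
fn dataset_exists(dataset: &str) -> bool {
    Command::new("zfs")
        .args(["list", "-H", "-o", "name", dataset])
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}

/// Hypothetical helper: polls `probe` up to `attempts` times, sleeping
/// `delay` between checks. Returns Ok(n) with the 1-based check on which
/// the probe succeeded, or Err(attempts) if it never did.
fn wait_until<F: FnMut() -> bool>(
    mut probe: F,
    attempts: u32,
    delay: Duration,
) -> Result<u32, u32> {
    for attempt in 1..=attempts {
        if probe() {
            return Ok(attempt);
        }
        // Do not sleep after the final failed check.
        if attempt < attempts {
            sleep(delay);
        }
    }
    Err(attempts)
}
```

A caller could run something like `wait_until(|| dataset_exists("oxp_b9b2c5ee-136f-4f33-a97e-a61fc742575a/internal_dns"), 10, Duration::from_millis(500))` and log the returned check count. A result above 1 would suggest the dataset was not yet visible when the install started, while an `Err` would point towards something removing it.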