rcgoodfellow opened this issue 9 months ago
Prior to this, one of my `destroy_virtual_hardware.sh` runs had failed in a way that is probably related.
```
Verified ZFS pool and vdev oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03 does not exist
+ for ZPOOL_TYPE in "${ZPOOL_TYPES[@]}"
+ readarray -t ZPOOLS
++ zfs list -d 0 -o name
++ grep '^oxi_'
+ for ZPOOL in "${ZPOOLS[@]}"
+ VDEV_FILE=/home/ry/src/omicron/oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b.vdev
+ zfs destroy -r oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b
cannot unmount '/var/fm/fmd': Device busy
+ warn 'Failed to remove ZFS pool and vdev: oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b'
+ set +x
Failed to remove ZFS pool and vdev: oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b
```
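The enumeration logic visible in the trace can be reproduced off-box. A minimal sketch of the `readarray` + `grep '^oxi_'` pattern, with `zfs list -d 0 -o name` replaced by a stub function (the stub and the canned pool names are mine, not from the script):

```shell
#!/usr/bin/env bash
# Sketch of the pool enumeration seen in the trace above: filter the
# `zfs list` output by the oxi_ prefix and iterate over the matches.
# fake_zfs_list stands in for `zfs list -d 0 -o name` so this snippet
# runs on a machine without ZFS installed.
fake_zfs_list() {
  printf '%s\n' \
    'oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b' \
    'rpool' \
    'oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03'
}

readarray -t ZPOOLS < <(fake_zfs_list | grep '^oxi_')
for ZPOOL in "${ZPOOLS[@]}"; do
  echo "would destroy: $ZPOOL"
done
# prints: would destroy: oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b
```

Because `zfs destroy -r` is not forced, any busy mount inside the pool (here `/var/fm/fmd`) makes the whole destroy fail, which is exactly what the trace shows.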
I "fixed" this with

```
pfexec zfs destroy -fr oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b
```

and then reran the destroy virtual hardware script, which ran to completion without error.
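The manual recovery amounts to forcing the destroy and rerunning the teardown. A sketch with a dry-run guard (`RUN` defaults to `echo`, so it is safe to execute anywhere; clear it on an actual illumos box to apply for real — the pool name is from this report, and the script path is my assumption about the omicron checkout layout):

```shell
#!/usr/bin/env bash
# Recovery sketch for the stuck teardown above. RUN defaults to `echo`
# (dry-run); set RUN= on a real illumos machine to actually execute.
RUN="${RUN-echo}"
POOL='oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b'

# -f forces unmounting of busy filesystems (here /var/fm/fmd),
# -r recurses into child datasets.
$RUN pfexec zfs destroy -fr "$POOL"

# Path is assumed; rerun the teardown script from the omicron checkout.
$RUN ./tools/destroy_virtual_hardware.sh
```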
cc @citrus-it who introduced this code in https://github.com/oxidecomputer/omicron/pull/4201
Another thing to note is that @rcgoodfellow had seen this issue several times today, except that each time it resolved itself after about 15 minutes, which wasn't the case here. Ry pointed out that the log messages being spewed from sled-agent were actually 20 minutes in the future based on running `date -u`. Once Ry resolved the issue by kicking fmd, the sled-agent progressed and the log timestamps were back to normal. I mention this only for completeness and because it was quite odd.
One crazy hypothesis I had was that the call to `/usr/sbin/svcadm disable -s -t svc:/system/fmd:default` was hung because the fmd disable was waiting for some sort of time-related catchup that occurred in the prior runs but not in the latest run that we debugged. This is just a WAG, and I have no reason to believe it to be the case, other than that the timestamps were totally out of whack in sled-agent before we kicked fmd. However, the timestamp mixup could have been a result of the hang, or completely unrelated.
Noting that the

```
cannot unmount '/var/fm/fmd': Device busy
```

from the post above seems to be a recurring issue now when destroying virtual hardware.
Running `svcadm disable fmd` before running the destroy virtual hardware script allows it to run successfully.
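Wrapping the teardown between a temporary fmd disable and a re-enable would make this workaround self-contained. A sketch (this wrapping is my suggestion, not what the omicron script does; `SVCADM` defaults to `echo` for a dry-run, so set `SVCADM='pfexec svcadm'` on the actual test machine):

```shell
#!/usr/bin/env bash
# Workaround sketch: stop fmd before destroying the virtual pools so it
# cannot hold /var/fm/fmd busy, then restart it afterwards.
# SVCADM defaults to `echo` (dry-run); override on a real illumos box.
SVCADM="${SVCADM-echo}"

# -s waits for the state change; -t makes the disable temporary
# (fmd comes back on the next boot even if we crash before re-enabling).
$SVCADM disable -s -t svc:/system/fmd:default

# ... run the destroy virtual hardware script here ...

$SVCADM enable -s svc:/system/fmd:default
```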
> Another thing to note is that @rcgoodfellow had seen this issue several times today, except that each time it resolved itself after about 15 minutes,
I think this delay might actually be `zpool import`. I can see these as sled agent is getting going:

```
ry@masaka:~$ pfexec ptree 6263
6262  ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/
6263    /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml
8562      /usr/sbin/zpool import -f oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03
```
It goes through them one at a time, and each import seems to take a few minutes.
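If the imports really are serial, the earlier ~15-minute self-resolution is consistent with a handful of pools at a few minutes each. A back-of-the-envelope check (both numbers are guesses for illustration, not measurements):

```shell
#!/usr/bin/env bash
# Rough arithmetic only: assumed 5 virtual pools at ~3 minutes per
# `zpool import` would account for the ~15 minute delay seen earlier
# in this issue.
POOLS=5
MIN_PER_IMPORT=3
echo "$((POOLS * MIN_PER_IMPORT)) minutes"
# prints: 15 minutes
```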
FWIW, I don't think we should be dorking with fmd on a workstation. It makes sense on a Gimlet where the system boots from a ramdisk, but on a workstation we're almost certainly going to ruin things (like you see here) and also we're going to discard some of your actual FMA data each time you destroy and recreate the (virtual) pools with those files in them.
I'm now also running into this, but @rcgoodfellow's workaround isn't doing the trick. FMD is not in maintenance mode and I'm getting:

```
cannot destroy 'oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/swap': dataset is busy
```
I ended up resolving this by deleting the swap device:

```
pfexec swap -d /dev/zvol/dsk/oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/swap
```
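The swap case generalizes: a zvol that is in use as a swap device must be removed from swap before `zfs destroy` can succeed. A dry-run sketch (looping over the swap list is my addition; `swap_devices` stubs the real `pfexec swap -l` listing, whose first column on illumos is the swapfile path):

```shell
#!/usr/bin/env bash
# Remove any swap devices backed by the pool before destroying it.
# RUN defaults to `echo` (dry-run); clear it on a real illumos box.
RUN="${RUN-echo}"
POOL='oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b'

swap_devices() {
  # Stub standing in for: pfexec swap -l | awk 'NR>1 {print $1}'
  printf '/dev/zvol/dsk/%s/swap\n' "$POOL"
}

for dev in $(swap_devices); do
  case "$dev" in
    */"$POOL"/*) $RUN pfexec swap -d "$dev" ;;
  esac
done
```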
This may be related to https://github.com/oxidecomputer/omicron/issues/3651.
Spinning up an omicron development environment on my workstation hung at the following place. Sled agent was running, but not making any progress. It had brought up the switch zone, but had not brought up any other zones.
https://github.com/oxidecomputer/omicron/blob/194889b956abbb3e01ce25b11b733c02598c3215/sled-agent/src/backing_fs.rs#L157-L161
We were able to see this on the host OS as follows.
Looking at the `fmd` service, we found it in maintenance mode. Clearing the service with

```
svcadm clear fmd
```

resulted in sled agent springing to life. At this point, it had been stuck for over an hour as I went to work on other things. The control plane came up completely and appears to be functioning normally after getting unstuck. I looked at the logs for `fmd` with

```
cat $(svcs -L fmd)
```

which contained this