oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Sled agent hung on disabling `fmd` service. #4252

Open rcgoodfellow opened 9 months ago

rcgoodfellow commented 9 months ago

Spinning up an omicron development environment on my workstation hung at the following place. Sled agent was running, but not making any progress. It had brought up the switch zone, but had not brought up any other zones.

https://github.com/oxidecomputer/omicron/blob/194889b956abbb3e01ce25b11b733c02598c3215/sled-agent/src/backing_fs.rs#L157-L161

We were able to see this on the host OS as follows.

ry@masaka:~/src/omicron$ pgrep -lf sled-agent
15742 /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml
15741 ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxi
ry@masaka:~/src/omicron$ pfexec ptree 15742
15741  ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxi
  15742  /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml
    15903  /usr/sbin/svcadm disable -s -t svc:/system/fmd:default

Looking at the fmd service, we found it in maintenance mode.

ry@masaka:~/src/omicron$ svcs | grep fmd
maintenance    17:05:33 svc:/system/fmd:default

Clearing the service with svcadm clear fmd resulted in sled agent springing to life. At this point, it had been stuck for over an hour as I went to work on other things. The control plane came up completely and appears to be functioning normally after getting unstuck.

I looked at the logs for fmd with cat $(svcs -L fmd), which contained this:

[ Sep 18 22:28:05 Method "start" exited with status 0. ]
[ Oct 10 15:59:10 Stopping because service disabled. ]
[ Oct 10 15:59:10 Executing stop method (:kill). ]
[ Oct 10 15:59:11 Enabled. ]
[ Oct 10 15:59:11 Executing start method ("/usr/lib/fm/fmd/fmd"). ]
[ Oct 10 16:00:33 Method "start" exited with status 0. ]
[ Oct 10 17:04:33 Stopping because service disabled. ]
[ Oct 10 17:04:33 Executing stop method (:kill). ]
[ Oct 10 17:05:33 Method or service exit timed out.  Killing contract 61062. ]
rcgoodfellow commented 9 months ago

Prior to this, one of my destroy_virtual_hardware.sh commands had failed in a way that is probably related to this.

Verified ZFS pool and vdev oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03 does not exist
+ for ZPOOL_TYPE in "${ZPOOL_TYPES[@]}"
+ readarray -t ZPOOLS
++ zfs list -d 0 -o name
++ grep '^oxi_'
+ for ZPOOL in "${ZPOOLS[@]}"
+ VDEV_FILE=/home/ry/src/omicron/oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b.vdev
+ zfs destroy -r oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b
cannot unmount '/var/fm/fmd': Device busy
+ warn 'Failed to remove ZFS pool and vdev: oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b'
+ set +x
Failed to remove ZFS pool and vdev: oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b

I "fixed" this with

pfexec zfs destroy -fr oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b

and then reran the destroy virtual hardware script which ran to completion without error.

andrewjstone commented 9 months ago

cc @citrus-it who introduced this code in https://github.com/oxidecomputer/omicron/pull/4201

Another thing to note is that @rcgoodfellow had seen this issue several times today, except that each time it resolved itself after about 15 minutes, which wasn't the case here. Ry pointed out that the log messages being spewed from sled-agent were actually 20 minutes in the future based on running date -u. Once Ry resolved the issue by kicking fmd, the sled-agent progressed and the log timestamps were back to normal. I mention this only for completeness and because it was quite odd.

One crazy hypothesis I had was that the call to /usr/sbin/svcadm disable -s -t svc:/system/fmd:default was hung because fmd's disable was waiting for some sort of time-related catchup that occurred in the prior runs but not in the latest run that we debugged. This is just a WAG, and I have no reason to believe it to be the case, other than that the timestamps were totally out of whack in sled-agent before we kicked FMD. However, the timestamp mixup could have been a result of the hang, or completely unrelated.
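For what it's worth, the wait on svcadm disable -s is unbounded, so a hung fmd stop method stalls sled-agent startup indefinitely. A minimal bash sketch of bounding such a wait with a poll loop (illustrative only, not omicron's actual code; sleep 30 stands in for the hung svcadm invocation):

```shell
#!/usr/bin/env bash
# Run a command in the background and kill it if it outlives a deadline.
disable_with_deadline() {
  local deadline=$1; shift
  "$@" &                           # launch the (possibly hanging) command
  local pid=$!
  local waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$deadline" ]; then
      kill "$pid" 2>/dev/null      # give up: terminate the child
      wait "$pid" 2>/dev/null
      return 1                     # signal "timed out" to the caller
    fi
    sleep 1
    waited=$((waited + 1))
  done
  wait "$pid"                      # exited on its own; propagate status
}

# Stand-in for: disable_with_deadline 60 pfexec svcadm disable -s -t fmd
disable_with_deadline 2 sleep 30 || echo "timed out; killed"
```

On a real system the timeout branch would presumably log and retry or surface an error rather than silently killing the child, but the shape of the bounded wait is the same.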

rcgoodfellow commented 9 months ago

Noting that the

cannot unmount '/var/fm/fmd': Device busy

from the post above seems to be a recurring issue now when destroying virtual hardware.

Running svcadm disable fmd before running the destroy virtual hardware script allows it to run successfully.
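Roughly, that workaround sequence looks like this (a sketch using illumos-specific commands; the script path is an assumption about the local checkout):

```shell
# Take fmd down first so nothing holds /var/fm/fmd busy, destroy the
# virtual pools, then bring fmd back. Paths and ordering are illustrative.
pfexec svcadm disable -s fmd            # -s waits for the stop to complete
./tools/destroy_virtual_hardware.sh     # assumed location of the script
pfexec svcadm enable fmd                # restore fault management afterwards
```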

rcgoodfellow commented 9 months ago

Another thing to note is that @rcgoodfellow had seen this issue several times today, except that each time it resolved itself after about 15 minutes,

I think this delay might actually be zpool import.

I can see these imports in the process tree as sled agent is coming up:

ry@masaka:~$ pfexec ptree 6263
6262   ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/
  6263   /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml
    8562   /usr/sbin/zpool import -f oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03

It goes through the pools one at a time, and each import seems to take a few minutes.

jclulow commented 9 months ago

FWIW, I don't think we should be dorking with fmd on a workstation. It makes sense on a Gimlet where the system boots from a ramdisk, but on a workstation we're almost certainly going to ruin things (like you see here) and also we're going to discard some of your actual FMA data each time you destroy and recreate the (virtual) pools with those files in them.

andrewjstone commented 8 months ago

I'm now also running into this, but @rcgoodfellow's workaround isn't doing the trick. FMD is not in maintenance mode and I'm getting:

cannot destroy 'oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/swap': dataset is busy
andrewjstone commented 8 months ago

I'm now also running into this, but @rcgoodfellow's workaround isn't doing the trick. FMD is not in maintenance mode and I'm getting:

cannot destroy 'oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/swap': dataset is busy

I ended up resolving this by deleting the swap device:

pfexec swap -d /dev/zvol/dsk/oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/swap
karencfv commented 8 months ago

This may be related: https://github.com/oxidecomputer/omicron/issues/3651