oxidecomputer / helios-engvm

Tools for creating and using Helios images on i86pc (classic PC) physical and virtual machines
Mozilla Public License 2.0
22 stars 1 forks source link

image script can hang; temporary zpool sadness ensues #19

Open jordanhendricks opened 1 year ago

jordanhendricks commented 1 year ago

I've seen this on two machines. When running image.sh, the script will sometimes hang here:

Jul 14 16:47:18.216 INFO STEP 1: create_dataset, from: /home/jordan/src/helios-engvm/image/templates/helios/generic-ttya-base.json
Jul 14 16:47:18.216 INFO CREATE DATASET: TEMPORARY-helios-generic-ttya-base/home
Jul 14 16:47:18.225 INFO SET DATASET PROPERTY ON TEMPORARY-helios-generic-ttya-base/home: mountpoint = /home
Jul 14 16:47:18.268 INFO STEP 1 (create_dataset) COMPLETE

Jul 14 16:47:18.268 INFO STEP 2: unpack_tar, from: /home/jordan/src/helios-engvm/image/templates/helios/generic-ttya-base.json
Jul 14 16:47:18.282 INFO exec: ["/usr/sbin/tar", "xzeEp@/f", "/rpool/images/output/helios-dev-base.tar.gz", "-C", "/rpool/images/work/helios/generic-ttya-base/a"]

The temporary zpool is in a sad state:

  pool: TEMPORARY-helios-generic-ttya-base
 state: UNAVAIL                                                                                                                   
status: One or more devices are faulted in response to IO failures.                 
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: none requested                                           
config:                                                          

        NAME                                STATE     READ WRITE CKSUM                 
        TEMPORARY-helios-generic-ttya-base  UNAVAIL      0     0     0  insufficient replicas
          c6t1d0                            UNAVAIL      0     0     0  cannot open

errors: 250 data errors, use '-v' for a list                                                                                                                  

I was able to work around this as follows. First with a zpool export -f <temp pool name> (which will hang) and a zpool clear of the pool until the export finishes, then deleting the device in lofiadm.

$ pfexec zpool export -f TEMPORARY-helios-generic-ttya-base

# separate terminal
$ pfexec zpool clear TEMPORARY-helios-generic-ttya-base
jordan@dunkin ~/propolis $ lofiadm
Block Device             File                           Options
$ pfexec lofiadm -d /rpool/images/work/helios/generic-ttya-base/lofi.raw

After that, I was able to run the script again and successfully create an image.

rcgoodfellow commented 12 months ago

I believe I've come across a related issue. I was trying to build a helios-2 iso image as follows.

VARIANT=base ./strap.sh -f
VARIANT=ramdisk ./strap.sh -f
./ufs.sh
SERIAL=vga ./iso.sh

This resulted in the following log output.

Jul 20 17:31:23.216 INFO temporary dataset: rpool/images/tmp/helios/eltorito-efi
Jul 20 17:31:23.225 INFO DESTROY DATASET: rpool/images/tmp/helios/eltorito-efi
Jul 20 17:31:23.288 INFO CREATE DATASET: rpool/images/tmp/helios/eltorito-efi
Jul 20 17:31:23.342 INFO SET DATASET PROPERTY ON rpool/images/tmp/helios/eltorito-efi: sync = disabled
Jul 20 17:31:23.398 INFO temporary directory: /rpool/images/tmp/helios/eltorito-efi
Jul 20 17:31:23.398 INFO work dataset: rpool/images/work/helios/eltorito-efi
Jul 20 17:31:23.411 INFO SET DATASET PROPERTY ON rpool/images/work/helios/eltorito-efi: sync = disabled
Jul 20 17:31:23.470 INFO nothing mounted at "/rpool/images/work/helios/eltorito-efi/a"
Jul 20 17:31:23.483 INFO image file: /rpool/images/work/helios/eltorito-efi/lofi.pcfs
Jul 20 17:31:23.483 INFO lofi exists LofiDevice { id: 1, readonly: false, labeled: false, encrypted: false, filename: "/rpool/images/work/helios/eltorito-efi/lofi.pcfs", devpath: Some("/dev/lofi/1"), rdevpath: Some("/dev/rlofi/1") } -- removing...
Jul 20 17:31:23.483 CRIT fatal error: could not stat path: "/dev/lofi/1": No such file or directory (os error 2)

I also noticed at this point that lofiadm would hang for around 2 minutes before returning a result. This result did show what appear to be stale entries hanging around.

ry@masaka:~/src/helios-engvm/image$ pfexec lofiadm
Block Device             File                           Options
/dev/lofi/1              /rpool/images/work/helios/eltorito-efi/lofi.pcfs   -
/dev/dsk/c4t2d0p0        /rpool/images/work/helios/generic-ttya-full/lofi.raw   Labeled

At this point, there was not actually a file /dev/lofi/1. I could delete the entries with lofiadm -d <device> and that would allow things to get a bit further, but then image-builder would hang on another lofi call and ultimately crash.

This led to getting in a state where lofiadm was returningunknown` entries.

ry@masaka:~$ pfexec lofiadm
Block Device             File                           Options
unknown                  /rpool/images/work/helios/generic-vga-iso/output.iso   Labeled
/dev/lofi/2              /rpool/images/work/helios/eltorito-efi/lofi.pcfs   -

@jclulow had me look at the following.

fmdump -IVp -c 'resource.sysevent.EC_dev_*' -n 'driver_name=lofi'

which showed some storage behavior

Dec 12 2022 16:22:32.702689435 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x29e2309b

Dec 12 2022 16:22:32.703001387 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x29e6f32b

Dec 12 2022 16:22:32.703276047 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x29eb240f

Dec 12 2022 16:22:32.703534911 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x29ef173f

Dec 12 2022 16:22:32.703785103 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x29f2e88f

Dec 12 2022 16:22:32.704027682 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x29f69c22

Dec 12 2022 16:22:32.704252953 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x29fa0c19

Dec 12 2022 16:22:32.704467186 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x29fd50f2

Dec 12 2022 16:22:32.704667303 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x2a005ea7

Dec 12 2022 16:22:32.704857210 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x2a03447a

Dec 12 2022 16:22:32.705041754 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x2a06155a

Dec 12 2022 16:22:32.705209568 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x2a08a4e0

Dec 12 2022 16:22:32.705371408 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x2a0b1d10

Dec 12 2022 16:22:32.705521513 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x2a0d6769

Dec 12 2022 16:22:32.705652682 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x2a0f67ca

Dec 12 2022 16:22:32.705775749 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x2a114885

Dec 12 2022 16:22:32.705888134 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rdsk/c4t1d0
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x6397c5c8 0x2a12ff86

Jul 20 2023 15:49:37.005335312 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rlofi/1
        driver_name = lofi
        instance = 1
        prop-ddi-kernel-ioctl = 1
        prop-zone = global
        prop-instance = 1
        class = resource.sysevent.EC_dev_add.lofi
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9ba01 0x516910

Jul 20 2023 16:18:58.388316169 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rlofi/1
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.lofi
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c0e2 0x17253c09

Jul 20 2023 16:19:00.005279457 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rlofi/1
        driver_name = lofi
        instance = 1
        prop-ddi-kernel-ioctl = 1
        prop-zone = global
        prop-instance = 1
        class = resource.sysevent.EC_dev_add.lofi
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c0e4 0x508ee1

Jul 20 2023 16:19:28.004701940 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@2
        dev_name = /dev/rdsk/c4t2d0
        driver_name = lofi
        instance = 2
        prop-ddi-kernel-ioctl = 1
        prop-zone = global
        prop-labeled = 1
        prop-instance = 2
        prop-device-nblocks = 9830400
        prop-device-blksize = 512
        prop-device-solid-state = 0
        prop-device-rotational = 0
        class = resource.sysevent.EC_dev_add.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c100 0x47bef4

Jul 20 2023 16:19:28.325764145 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@2
        dev_name = /dev/rdsk/c4t2d0
        driver_name = lofi
        instance = 2
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c100 0x136ac431

Jul 20 2023 16:19:28.326938797 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@2
        dev_name = /dev/rdsk/c4t2d0
        driver_name = lofi
        instance = 2
        class = resource.sysevent.EC_dev_remove.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c100 0x137cb0ad

Jul 20 2023 16:19:30.008251078 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@2
        dev_name = /dev/rdsk/c4t2d0
        driver_name = lofi
        instance = 2
        prop-ddi-kernel-ioctl = 1
        prop-zone = global
        prop-labeled = 1
        prop-instance = 2
        prop-device-nblocks = 9830400
        prop-device-blksize = 512
        prop-device-solid-state = 0
        prop-device-rotational = 0
        class = resource.sysevent.EC_dev_add.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c102 0x7de6c6

Jul 20 2023 16:19:30.008296120 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@2
        dev_name = /dev/rdsk/c4t2d0
        driver_name = lofi
        instance = 2
        prop-ddi-kernel-ioctl = 1
        prop-zone = global
        prop-labeled = 1
        prop-instance = 2
        prop-device-nblocks = 9830400
        prop-device-blksize = 512
        prop-device-solid-state = 0
        prop-device-rotational = 0
        class = resource.sysevent.EC_dev_add.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c102 0x7e96b8

Jul 20 2023 16:19:30.008328969 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@2
        dev_name = /dev/rdsk/c4t2d0
        driver_name = lofi
        instance = 2
        prop-ddi-kernel-ioctl = 1
        prop-zone = global
        prop-labeled = 1
        prop-instance = 2
        prop-device-nblocks = 9830400
        prop-device-blksize = 512
        prop-device-solid-state = 0
        prop-device-rotational = 0
        class = resource.sysevent.EC_dev_add.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c102 0x7f1709

Jul 20 2023 16:19:30.008349781 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@2
        dev_name = /dev/rdsk/c4t2d0
        driver_name = lofi
        instance = 2
        prop-ddi-kernel-ioctl = 1
        prop-zone = global
        prop-labeled = 1
        prop-instance = 2
        prop-device-nblocks = 9830400
        prop-device-blksize = 512
        prop-device-solid-state = 0
        prop-device-rotational = 0
        class = resource.sysevent.EC_dev_add.disk
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c102 0x7f6855

Jul 20 2023 16:51:06.287388297 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rlofi/1
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.lofi
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c86a 0x11213289

Jul 20 2023 16:51:07.146319384 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rlofi/1
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.lofi
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c86b 0x8b8a818

Jul 20 2023 16:51:09.005261588 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rlofi/1
        driver_name = lofi
        instance = 1
        prop-ddi-kernel-ioctl = 1
        prop-zone = global
        prop-instance = 1
        class = resource.sysevent.EC_dev_add.lofi
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c86d 0x504914

Jul 20 2023 16:51:55.934344624 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rlofi/1
        driver_name = lofi
        instance = 1
        class = resource.sysevent.EC_dev_remove.lofi
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c89b 0x37b0f7b0

Jul 20 2023 16:51:57.005890715 (absent)
nvlist version: 0
        phys_path = /pseudo/lofi@1
        dev_name = /dev/rlofi/1
        driver_name = lofi
        instance = 1
        prop-ddi-kernel-ioctl = 1
        prop-zone = global
        prop-instance = 1
        class = resource.sysevent.EC_dev_add.lofi
        version = 0x0
        __ttl = 0x1
        __tod = 0x64b9c89d 0x59e29b

On a call to lofiadm that was hanging, we would see this in mdb

> 0t16130::pid2proc | ::walk thread | ::findstack -v
stack pointer for thread fffffe2789251020 (lofiadm/1): fffffcc28a1113c0
[ fffffcc28a1113c0 _resume_from_idle+0x12b() ]
  fffffcc28a1113f0 swtch+0x139()
  fffffcc28a111470 cv_timedwait_hires+0xd7(fffffffffbcc15d8, fffffffffbcc15d0, 1bf08eb000, f4240, 0)
  fffffcc28a1114c0 cv_timedwait+0x52(fffffffffbcc15d8, fffffffffbcc15d0, f63d817)
  fffffcc28a111620 lofi_copy_devpath+0xfb(fffffe27b0f5b500)
  fffffcc28a1116b0 lofi_get_info+0x218(7a00000000, 804155c, 4c4605, fffffe29d6682cb0, 100401)
  fffffcc28a111c80 lofi_ioctl+0x7de(7a00000000, 4c4605, 804155c, 100401, fffffe29d6682cb0, fffffcc28a111dd8)
  fffffcc28a111cc0 cdev_ioctl+0x3f(7a00000000, 4c4605, 804155c, 100401, fffffe29d6682cb0, fffffcc28a111dd8)
  fffffcc28a111d10 spec_ioctl+0x55(fffffe2634e2ce40, 4c4605, 804155c, 100401, fffffe29d6682cb0, fffffcc28a111dd8, 0)
  fffffcc28a111da0 fop_ioctl+0x40(fffffe2634e2ce40, 4c4605, 804155c, 100401, fffffe29d6682cb0, fffffcc28a111dd8, 0)
  fffffcc28a111ec0 ioctl+0x144(3, 4c4605, 804155c)
  fffffcc28a111f10 sys_syscall32+0x105()

After building and installing an ONU with the following commit reverted, the image build process ran to completion without issue.

4b2d0200492d3957fe8ea5057f7fac75b4aa21d0
5688 blkdev and lofi need to properly init cmlb minors
jordanhendricks commented 12 months ago

In case it's relevant at all, I saw this issue on heliosv1 (as dunkin had not yet been updated yet).

jclulow commented 12 months ago

I have made a small adjustment to illumos/image-builder to more closely align what it does with what the lofiadm command does: https://github.com/illumos/image-builder/commit/3ff1e02ac8cb8de0d0fc07ed51f78e914916f39e

If you gmake setup in the image/ subdirectory it should pull this update in and rebuild.

If you do those things, and pain persists, let me know! I haven't seen the issue again, so far, since doing that...

jclulow commented 9 months ago

This may well still be happening. I believe @jordanhendricks saw it again on her machine yesterday.

jordanhendricks commented 9 months ago

Indeed I did. My machine has version helios-2.0.22094, and helios-engvm is at dfa3f8b6.

labbott commented 4 months ago

I saw this on helios-engvm at cfa1eeba940d85efdaf881a53e56288c5df42b4b