jaknoll opened 7 years ago
The zfs service is supposed to exit - it only installs and builds things. After that, the zfs tools should be available in your console.
Our testing so far has zfs working, and the log you've sent doesn't look totally wrong (just truncated).
I wonder - did the initial enable & up complete the build and allow you to set up zfs before rebooting?
What happens when you run `sudo zfs list`?
mmm, I wonder - did you follow the procedure in http://docs.rancher.com/os/storage/using-zfs/ ?
If not, I suspect that if you only enable and then reboot, it will take a non-trivial amount of time to build zfs, and once it's finished, zfs will be available.
I'm planning on making pre-built zfs images, but the way it works atm allows it to work with custom kernels too
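One way to tell whether that one-shot build has finished: the build log later in this thread shows the service touching a sentinel file (`touch /lib/modules/4.9.21-rancher/.zfs-done`). A sketch that checks for it - treating the sentinel path as a stable convention is an assumption:

```shell
# Sketch: check whether the one-shot zfs build has completed by looking
# for the ".zfs-done" sentinel the build touches when it finishes.
zfs_build_done() {
  # $1 (optional): module directory; defaults to the running kernel's.
  dir="${1:-/lib/modules/$(uname -r)}"
  [ -e "$dir/.zfs-done" ]
}
# Usage: zfs_build_done && echo "build finished" || echo "still building"
```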
Yes, I imported my pool after first enabling and bringing the service up. Then I rebooted.
All zfs commands (zfs, zpool) fail with "command not found" after letting the system sit for ~12 hours.
Also, I tried a zfs service up just now, and an up with `--no-build`, with no luck. Are there any logs I can look at?
I would start with `sudo ros service logs zfs` - which should not end at the curl line you have above.
What do you get when you run `sudo system-docker images` and `sudo system-docker ps -qa`?
is the zfs kernel module loaded? available?
`sudo ros service logs zfs` ends at the curl line. I'm assuming that command is essentially just an alias for `sudo system-docker logs zfs` - is that not correct?
sudo system-docker images | grep zfs
zfs-tools latest 5c8279913534 36 hours ago 167 MB
rancher/os-zfs v0.9.0 ab542cc62bac 3 weeks ago 484.9 MB
sudo system-docker ps -qa
6c923f44d730
bba7d47857c1
d8fc2cfe8895
0b24378acdcc
4b01f1c264ff
f13658f95549
bc7f1f033879
0c645bf2a52e
3ca3c4bbef92
641cdb8d0f61
5b86164bdea9
f1189cc81661
c0350b7fa181
f6cd58152f15
3a62d02e03f1
1fc929cd9663
9775a7f35c60
4bcbbaa6be83
040fa5cc45f9
sudo system-docker ps -a | grep zfs
0b24378acdcc rancher/os-zfs:v0.9.0 "/usr/bin/ros entrypo" 36 hours ago Exited (6) 25 minutes ago zfs
`lsmod | grep zfs` and `modprobe zfs` output is empty - assuming that is the correct way to check for zfs kernel modules in RancherOS. If that's not correct, let me know how to do so properly and I'll get that output as well.
At the same time, dmesg shows the module is loaded, though maybe that was from when I first enabled the service?
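As an aside on checking for the module: lsmod is just a formatter over /proc/modules, so the file can be checked directly. A generic-Linux sketch - whether the RancherOS console exposes /proc/modules the same way is an assumption:

```shell
# Check /proc/modules directly; lsmod is a formatter over this file.
module_loaded() {
  # $1: module name; $2 (optional): modules file, overridable for testing.
  grep -q "^$1 " "${2:-/proc/modules}"
}
# Usage: module_loaded zfs && echo "zfs module loaded"
```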
dmesg | grep -i zfs
[ 45.306193] DEBU[0012] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[ 48.118509] INFO[0014] [15/18] [zfs]: Starting
[ 49.536149] time="2017-04-10T11:31:37Z" level=debug msg="Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml"
[ 51.148311] INFO[0017] [17/18] [zfs]: Started
[ 81.763471] time="2017-04-10T11:32:09Z" level=debug msg="Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml"
[ 547.714048] DEBU[0000] START: [ros service status zfs] in /home/rancher
[ 577.797528] DEBU[0000] START: [ros service ps zfs] in /home/rancher
[ 577.983543] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[ 608.363011] DEBU[0000] START: [ros service ps zfs] in /home/rancher
[ 608.545300] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[ 630.434287] DEBU[0000] START: [system-docker logs zfs] in /home/rancher
[ 655.813481] DEBU[0000] START: [system-docker stop zfs] in /home/rancher
[ 665.043907] DEBU[0000] START: [system-docker start zfs] in /home/rancher
[ 665.769545] time="2017-04-10T11:41:53Z" level=debug msg="Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml"
[ 671.166171] DEBU[0000] START: [system-docker logs zfs] in /home/rancher
[ 685.473759] DEBU[0000] START: [system-docker logs zfs] in /home/rancher
[ 700.494119] DEBU[0000] START: [ros service ps zfs] in /home/rancher
[ 700.678942] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[ 1012.020303] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[ 1410.396109] DEBU[0000] START: [ros service logs zfs] in /home/rancher
[ 1410.581570] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[129919.858960] DEBU[0000] START: [ros service up zfs] in /home/rancher
[129920.043632] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[129920.186027] INFO[0000] [0/18] [zfs]: Starting
[129921.054415] time="2017-04-11T23:36:16Z" level=debug msg="Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml"
[129922.060100] INFO[0002] [1/18] [zfs]: Started
[130015.653550] DEBU[0000] START: [ros service up zfs --debug] in /home/rancher
[130029.638409] DEBU[0000] START: [ros service up zfs --no-build] in /home/rancher
[130029.817689] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[130029.961825] INFO[0000] [0/18] [zfs]: Starting
[130030.824198] time="2017-04-11T23:38:05Z" level=debug msg="Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml"
[130031.817942] INFO[0002] [1/18] [zfs]: Started
[131439.951000] DEBU[0000] START: [ros service logs zfs] in /home/rancher
[131440.137878] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[131825.530120] ZFS: Loaded module v0.6.5.9-1, ZFS pool version 5000, ZFS filesystem version 5
Converting that seconds-since-boot timestamp to epoch time, the module was loaded at Wed, 12 Apr 2017 00:11:23 GMT. Current system time shows:
date
Wed Apr 12 00:10:17 UTC 2017
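The conversion being done above (a dmesg prefix is seconds since boot, so wall-clock time is current epoch minus uptime plus the dmesg timestamp) can be sketched as:

```shell
# Wall-clock time of a dmesg entry = (current epoch - uptime) + dmesg timestamp.
# Pass integers; on a live box the first two come from date(1) and /proc/uptime.
dmesg_to_epoch() {
  now=$1; uptime=$2; dmesg_ts=$3
  echo $(( now - uptime + dmesg_ts ))
}
# Live usage (illustrative):
#   dmesg_to_epoch "$(date +%s)" "$(cut -d. -f1 /proc/uptime)" 131825
```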
Thanks for digging into this!
I'm in the middle of building the 0.9.2 release, so I may get the commands wrong :/ but I'd remove the zfs service and image and then re-do it using the steps in the docs.
Basically, I think you need to `ros service disable zfs`, `ros service delete zfs`, and then
`system-docker rm zfs` and
`system-docker rm zfs-tools` - and then similarly
`system-docker rmi` the images to make sure they're re-pulled.
I wonder if you could `ros os upgrade -i rancher/os:v0.9.2` before re-enabling - I've built the images, but they're not fully available yet :)
once you've re-started the zfs service and let it finish building, can you please save the logs and then reboot?
So the AWS box I just tested on has this as the last part of the zfs service logs:
zfs_1 | Step 7 : ADD usr/local/libexec/* /usr/local/libexec/
zfs_1 | ---> c7cf696b84a7
zfs_1 | Removing intermediate container 02c99a8655e0
zfs_1 | Step 8 : ADD setup_wonka.sh /setup_wonka.sh
zfs_1 | ---> aa49f7115add
zfs_1 | Removing intermediate container ec71ce7a5d75
zfs_1 | Step 9 : RUN ldconfig
zfs_1 | ---> Running in 0fbf3e353469
zfs_1 | /sbin/ldconfig.real: /usr/local/lib/libuutil.so.1 is not a symbolic link
zfs_1 |
zfs_1 | /sbin/ldconfig.real: /usr/local/lib/libnvpair.so.1 is not a symbolic link
zfs_1 |
zfs_1 | /sbin/ldconfig.real: /usr/local/lib/libzfs_core.so.1 is not a symbolic link
zfs_1 |
zfs_1 | /sbin/ldconfig.real: /usr/local/lib/libzpool.so.2 is not a symbolic link
zfs_1 |
zfs_1 | /sbin/ldconfig.real: /usr/local/lib/libzfs.so.2 is not a symbolic link
zfs_1 |
zfs_1 | ---> 47dea49acf4d
zfs_1 | Removing intermediate container 0fbf3e353469
zfs_1 | Step 10 : ENTRYPOINT /usr/local/bin/entry.sh
zfs_1 | ---> Running in a67a45be1a16
zfs_1 | ---> c06c2eaae7be
zfs_1 | Removing intermediate container a67a45be1a16
zfs_1 | Successfully built c06c2eaae7be
zfs_1 | + modprobe zfs
zfs_1 | + /dist/arch/setup_wonka.sh console
zfs_1 | installing wonka bin links in console
zfs_1 | installing wonka bin links in zfs
zfs_1 | + /dist/arch/setup_wonka.sh zfs
zfs_1 | + zpool import -a
zfs_1 | no pools available to import
zfs_1 | + touch /lib/modules/4.9.21-rancher/.zfs-done
zfs_1 | + echo ZFS for 4.9.21-rancher installed. Delete /lib/modules/4.9.21-rancher/.zfs-done to reinstall
zfs_1 | ZFS for 4.9.21-rancher installed. Delete /lib/modules/4.9.21-rancher/.zfs-done to reinstall
[root@ip-172-31-15-8 conf]# zfs list
no datasets available
Sorry for the delayed response - it looks like the USB drive I was booting off is dying. I've been struggling to get the system back, so I can't try your suggestions. I'll have a new boot drive in a few days and will follow up then.
cool :) I'm on an Airplane to DockerCon this weekend, so the delay works :)
OK, finally able to rebuild the system. In this particular iteration I'm running v1.0.1-rc1, which I'm sure is out of date by now, but I don't feel like downloading the latest at the moment. It looks like it worked, but it took somewhere between 10 and 15 minutes to reboot and install the image. You mentioned prebuilding the images - are there any plans to do so in the near future?
Here are the steps I executed, starting from install:
sudo ros install -d /dev/sde -c config.yaml -f --debug
[reboot]
sudo ros service enable zfs
sudo ros service up zfs
sudo zpool status
sudo zpool status
pool: knas1
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: scrub repaired 0 in 8h29m with 0 errors on Sat Apr 1 13:29:04 2017
config:
NAME STATE READ WRITE CKSUM
knas1 DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sdf2 ONLINE 0 0 0
12298971764572034139 UNAVAIL 0 0 0 was /dev/sde2
sdb ONLINE 0 0 0
sda ONLINE 0 0 0
errors: No known data errors
sudo reboot
sudo zpool status
[command not found]
sudo watch ps aux | grep zfs
[wait until only the grep itself shows in the ps output]
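That manual watch can also be scripted. A sketch, where the `[z]fs` pattern matches zfs-related processes without matching the grep's own command line:

```shell
# Returns success while any process line on stdin mentions zfs.
# The [z]fs trick keeps the grep itself from matching its own pattern.
zfs_busy() { grep -q '[z]fs'; }
# Usage: while ps aux | zfs_busy; do sleep 5; done; sudo zpool status
```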
sudo zpool status
pool: knas1
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: scrub repaired 0 in 8h29m with 0 errors on Sat Apr 1 13:29:04 2017
config:
NAME STATE READ WRITE CKSUM
knas1 DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sdf2 ONLINE 0 0 0
12298971764572034139 UNAVAIL 0 0 0 was /dev/sde2
sdb ONLINE 0 0 0
sda ONLINE 0 0 0
errors: No known data errors
Ok, recreated. After rebooting and having ZFS come back successfully, I set the docker storage driver to zfs (following the rancher page for using ZFS). I then deployed the rancher server and the rancher agent. I then noticed that none of the default services were coming up (healthcheck, network, ipsec, etc). I tried to reboot but kept getting an error from system-docker that a running container couldn't be stopped. Tried to identify it with sudo system-docker ps
but the command just hung. I ended up doing a power cycle, and that's where the problems began.
On reboot, zfs was not available, with the last message in `sudo ros service logs zfs` being:
+ mv spl-0.6.5.9/ spl/
mv: cannot move 'spl-0.6.5.9/' to 'spl/spl-0.6.5.9': Directory not empty
So, I followed your instructions and removed the zfs stuff. The commands were (roughly) as follows:
sudo ros service disable zfs
for container in `docker ps -a | grep zfs | awk '{print $NF;}'`; do docker rm -f $container; done
for image in `docker images | grep zfs | awk '{print $3;}'`; do docker rmi $image; done
sudo ros service enable zfs
sudo ros service up zfs
2nd command: delete all running/stopped containers with zfs in the name. 3rd command: delete all images with zfs in the name.
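The column-3 parsing in that image loop can be factored out and checked against the `system-docker images` output pasted earlier in the thread. A sketch - the helper name is made up:

```shell
# Extract image IDs (column 3 of `docker images` output) for rows
# matching a pattern - the same parsing the image-removal loop relies on.
zfs_image_ids() { awk -v pat="$1" '$0 ~ pat {print $3}'; }
# Usage: sudo system-docker images | zfs_image_ids zfs | xargs -r sudo system-docker rmi
```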
At the end, zfs was back up and my pool was automatically mounted. Unfortunately, docker had not waited for all of this, so it was in a reboot loop - any attempt to run docker commands against the user docker failed because the socket was not available. I then tried a reboot, and zfs came back successfully. The user-docker container appropriately waited for ZFS to be available, but was again stuck in the boot loop. The repeating logs for that are:
> time="2017-04-23T02:25:17Z" level=info msg="Starting Docker in context: console"
> time="2017-04-23T02:25:17Z" level=info msg="Getting PID for service: console"
> time="2017-04-23T02:25:17Z" level=info msg="console PID 1205"
> time="2017-04-23T02:25:17Z" level=info msg="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=rancher1 HOME=/]"
> time="2017-04-23T02:25:17Z" level=info msg="Running [docker-runc exec -- 5e24f3a28bb00e32ac2e8fe2731d675c063caf14759cd97900a9f1e8bf4ecb77 env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=rancher1 HOME=/ ros docker-init daemon --graph /mnt/knas1/docker --storage-driver zfs --group docker --log-opt max-file=2 --log-opt max-size=25m --host unix:///var/run/docker.sock]"
Any ideas on how to recover the user-docker container? Is there something special that normally happens during a clean reboot that got hosed when I power cycled?
RancherOS Version: v0.9.3-rc3
Where are you running RancherOS? baremetal
At reboot the zfs container is started, but immediately exits. Because of this, zfs is unavailable:
The service is enabled:
A stop/start of the container got the same result (immediate exit). Here are the logs from the container (truncated to the last bit for brevity, if you want the entirety of the logs I can provide them):