jaknoll opened 7 years ago
The zfs service is supposed to exit - it only installs and builds things. After that, the zfs tools should be available in your console.
Our testing so far has zfs working, and the log you've sent doesn't look totally wrong (just truncated).
I wonder - did the initial enable & up complete the build and allow you to set up zfs before rebooting?
What happens when you run `sudo zfs list`?
mmm, I wonder - did you follow the procedure in http://docs.rancher.com/os/storage/using-zfs/ ?
If not, I suspect that if you only enable and then reboot, it will take a non-trivial amount of time to build zfs, and once it's finished, zfs will be available.
I'm planning on making pre-built zfs images, but the way it works atm allows it to work with custom kernels too
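One way to tell whether that one-shot build has finished: the build log later in this thread shows the service touching a sentinel file (`touch /lib/modules/4.9.21-rancher/.zfs-done`). A sketch that checks for it - treating the sentinel path as a stable convention is an assumption:

```shell
# Sketch: check whether the one-shot zfs build has completed by looking
# for the ".zfs-done" sentinel the build touches when it finishes.
zfs_build_done() {
  # $1 (optional): module directory; defaults to the running kernel's.
  dir="${1:-/lib/modules/$(uname -r)}"
  [ -e "$dir/.zfs-done" ]
}
# Usage: zfs_build_done && echo "build finished" || echo "still building"
```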
Yes, I imported my pool after first enabling and bringing the service up. Then I rebooted.
All zfs commands (zfs, zpool) fail with "command not found" after letting the system sit for ~12 hours.
Also, I tried a zfs service up just now, and an up with `--no-build`, with no luck. Are there any logs I can look at?
I would start with `sudo ros service logs zfs` - which should not end at the curl line you have above.
What do you get when you run `sudo system-docker images` and `sudo system-docker ps -qa`?
is the zfs kernel module loaded? available?
`sudo ros service logs zfs` ends at the curl line. I'm assuming that command is essentially just an alias for `sudo system-docker logs zfs` - is that not correct?
sudo system-docker images | grep zfs
zfs-tools latest 5c8279913534 36 hours ago 167 MB
rancher/os-zfs v0.9.0 ab542cc62bac 3 weeks ago 484.9 MB
sudo system-docker ps -qa
6c923f44d730
bba7d47857c1
d8fc2cfe8895
0b24378acdcc
4b01f1c264ff
f13658f95549
bc7f1f033879
0c645bf2a52e
3ca3c4bbef92
641cdb8d0f61
5b86164bdea9
f1189cc81661
c0350b7fa181
f6cd58152f15
3a62d02e03f1
1fc929cd9663
9775a7f35c60
4bcbbaa6be83
040fa5cc45f9
sudo system-docker ps -a | grep zfs
0b24378acdcc rancher/os-zfs:v0.9.0 "/usr/bin/ros entrypo" 36 hours ago Exited (6) 25 minutes ago zfs
`lsmod | grep zfs` and `modprobe zfs` output is empty - assuming that is the correct way to check for zfs kernel modules in RancherOS. If that's not correct, let me know how to do so properly and I'll get that output as well.
At the same time, dmesg shows the module is loaded, though maybe that was from when I first enabled the service?
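As an aside on checking for the module: lsmod is just a formatter over /proc/modules, so the file can be checked directly. A generic-Linux sketch - whether the RancherOS console exposes /proc/modules the same way is an assumption:

```shell
# Check /proc/modules directly; lsmod is a formatter over this file.
module_loaded() {
  # $1: module name; $2 (optional): modules file, overridable for testing.
  grep -q "^$1 " "${2:-/proc/modules}"
}
# Usage: module_loaded zfs && echo "zfs module loaded"
```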
dmesg | grep -i zfs
[ 45.306193] DEBU[0012] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[ 48.118509] INFO[0014] [15/18] [zfs]: Starting
[ 49.536149] time="2017-04-10T11:31:37Z" level=debug msg="Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml"
[ 51.148311] INFO[0017] [17/18] [zfs]: Started
[ 81.763471] time="2017-04-10T11:32:09Z" level=debug msg="Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml"
[ 547.714048] DEBU[0000] START: [ros service status zfs] in /home/rancher
[ 577.797528] DEBU[0000] START: [ros service ps zfs] in /home/rancher
[ 577.983543] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[ 608.363011] DEBU[0000] START: [ros service ps zfs] in /home/rancher
[ 608.545300] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[ 630.434287] DEBU[0000] START: [system-docker logs zfs] in /home/rancher
[ 655.813481] DEBU[0000] START: [system-docker stop zfs] in /home/rancher
[ 665.043907] DEBU[0000] START: [system-docker start zfs] in /home/rancher
[ 665.769545] time="2017-04-10T11:41:53Z" level=debug msg="Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml"
[ 671.166171] DEBU[0000] START: [system-docker logs zfs] in /home/rancher
[ 685.473759] DEBU[0000] START: [system-docker logs zfs] in /home/rancher
[ 700.494119] DEBU[0000] START: [ros service ps zfs] in /home/rancher
[ 700.678942] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[ 1012.020303] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[ 1410.396109] DEBU[0000] START: [ros service logs zfs] in /home/rancher
[ 1410.581570] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[129919.858960] DEBU[0000] START: [ros service up zfs] in /home/rancher
[129920.043632] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[129920.186027] INFO[0000] [0/18] [zfs]: Starting
[129921.054415] time="2017-04-11T23:36:16Z" level=debug msg="Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml"
[129922.060100] INFO[0002] [1/18] [zfs]: Started
[130015.653550] DEBU[0000] START: [ros service up zfs --debug] in /home/rancher
[130029.638409] DEBU[0000] START: [ros service up zfs --no-build] in /home/rancher
[130029.817689] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[130029.961825] INFO[0000] [0/18] [zfs]: Starting
[130030.824198] time="2017-04-11T23:38:05Z" level=debug msg="Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml"
[130031.817942] INFO[0002] [1/18] [zfs]: Started
[131439.951000] DEBU[0000] START: [ros service logs zfs] in /home/rancher
[131440.137878] DEBU[0000] Loaded zfs from https://raw.githubusercontent.com/rancher/os-services/v0.9.2-rc1/z/zfs.yml
[131825.530120] ZFS: Loaded module v0.6.5.9-1, ZFS pool version 5000, ZFS filesystem version 5
Converting that seconds-since-boot timestamp to epoch time, the module was loaded at Wed, 12 Apr 2017 00:11:23 GMT. Current system time shows:
date
Wed Apr 12 00:10:17 UTC 2017
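The conversion being done above (a dmesg prefix is seconds since boot, so wall-clock time is current epoch minus uptime plus the dmesg timestamp) can be sketched as:

```shell
# Wall-clock time of a dmesg entry = (current epoch - uptime) + dmesg timestamp.
# Pass integers; on a live box the first two come from date(1) and /proc/uptime.
dmesg_to_epoch() {
  now=$1; uptime=$2; dmesg_ts=$3
  echo $(( now - uptime + dmesg_ts ))
}
# Live usage (illustrative):
#   dmesg_to_epoch "$(date +%s)" "$(cut -d. -f1 /proc/uptime)" 131825
```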
Thanks for digging into this!
I'm in the middle of building the 0.9.2 release, so I may get the commands wrong :/ but I'd remove the zfs service and image and then re-do it using the steps in the docs.
Basically, I think you need to `ros service disable zfs`, `ros service delete zfs`, and then
`system-docker rm zfs` and
`system-docker rm zfs-tools` - and then similarly
`system-docker rmi` the images to make sure they're re-pulled.
I wonder if you could `ros os upgrade -i rancher/os:v0.9.2` before re-enabling - I've built the images, but they're not fully available yet :)
once you've re-started the zfs service and let it finish building, can you please save the logs and then reboot?
So the AWS box I just tested on has this as the last part of the zfs service logs:
zfs_1 | Step 7 : ADD usr/local/libexec/* /usr/local/libexec/
zfs_1 | ---> c7cf696b84a7
zfs_1 | Removing intermediate container 02c99a8655e0
zfs_1 | Step 8 : ADD setup_wonka.sh /setup_wonka.sh
zfs_1 | ---> aa49f7115add
zfs_1 | Removing intermediate container ec71ce7a5d75
zfs_1 | Step 9 : RUN ldconfig
zfs_1 | ---> Running in 0fbf3e353469
zfs_1 | /sbin/ldconfig.real: /usr/local/lib/libuutil.so.1 is not a symbolic link
zfs_1 |
zfs_1 | /sbin/ldconfig.real: /usr/local/lib/libnvpair.so.1 is not a symbolic link
zfs_1 |
zfs_1 | /sbin/ldconfig.real: /usr/local/lib/libzfs_core.so.1 is not a symbolic link
zfs_1 |
zfs_1 | /sbin/ldconfig.real: /usr/local/lib/libzpool.so.2 is not a symbolic link
zfs_1 |
zfs_1 | /sbin/ldconfig.real: /usr/local/lib/libzfs.so.2 is not a symbolic link
zfs_1 |
zfs_1 | ---> 47dea49acf4d
zfs_1 | Removing intermediate container 0fbf3e353469
zfs_1 | Step 10 : ENTRYPOINT /usr/local/bin/entry.sh
zfs_1 | ---> Running in a67a45be1a16
zfs_1 | ---> c06c2eaae7be
zfs_1 | Removing intermediate container a67a45be1a16
zfs_1 | Successfully built c06c2eaae7be
zfs_1 | + modprobe zfs
zfs_1 | + /dist/arch/setup_wonka.sh console
zfs_1 | installing wonka bin links in console
zfs_1 | installing wonka bin links in zfs
zfs_1 | + /dist/arch/setup_wonka.sh zfs
zfs_1 | + zpool import -a
zfs_1 | no pools available to import
zfs_1 | + touch /lib/modules/4.9.21-rancher/.zfs-done
zfs_1 | + echo ZFS for 4.9.21-rancher installed. Delete /lib/modules/4.9.21-rancher/.zfs-done to reinstall
zfs_1 | ZFS for 4.9.21-rancher installed. Delete /lib/modules/4.9.21-rancher/.zfs-done to reinstall
[root@ip-172-31-15-8 conf]# zfs list
no datasets available
Sorry for the delayed response - it looks like the USB drive I was booting off is dying. I've been struggling to get the system back, so I can't try your suggestions. I'll have a new boot drive in a few days and will follow up then.
cool :) I'm on an Airplane to DockerCon this weekend, so the delay works :)
OK, finally able to rebuild the system. In this particular iteration I'm running v1.0.1-rc1, which I'm sure is out of date by now, but I don't feel like downloading the latest at the moment. It looks like it worked, but it took somewhere between 10 and 15 minutes to reboot and install the image. You mentioned prebuilding the images - are there any plans to do so in the near future?
Here are the steps I executed, starting from install:
sudo ros install -d /dev/sde -c config.yaml -f --debug
[reboot]
sudo ros service enable zfs
sudo ros service up zfs
sudo zpool status
sudo zpool status
pool: knas1
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: scrub repaired 0 in 8h29m with 0 errors on Sat Apr 1 13:29:04 2017
config:
NAME STATE READ WRITE CKSUM
knas1 DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sdf2 ONLINE 0 0 0
12298971764572034139 UNAVAIL 0 0 0 was /dev/sde2
sdb ONLINE 0 0 0
sda ONLINE 0 0 0
errors: No known data errors
sudo reboot
sudo zpool status
[command not found]
sudo watch ps aux | grep zfs
[wait until only the grep itself shows in the ps output]
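That manual watch can also be scripted. A sketch, where the `[z]fs` pattern matches zfs-related processes without matching the grep's own command line:

```shell
# Returns success while any process line on stdin mentions zfs.
# The [z]fs trick keeps the grep itself from matching its own pattern.
zfs_busy() { grep -q '[z]fs'; }
# Usage: while ps aux | zfs_busy; do sleep 5; done; sudo zpool status
```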
sudo zpool status
pool: knas1
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: scrub repaired 0 in 8h29m with 0 errors on Sat Apr 1 13:29:04 2017
config:
NAME STATE READ WRITE CKSUM
knas1 DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sdf2 ONLINE 0 0 0
12298971764572034139 UNAVAIL 0 0 0 was /dev/sde2
sdb ONLINE 0 0 0
sda ONLINE 0 0 0
errors: No known data errors
Ok, recreated. After rebooting and having ZFS come back successfully, I set the docker storage driver to zfs (following the rancher page for using ZFS). I then deployed the rancher server and the rancher agent. I then noticed that none of the default services were coming up (healthcheck, network, ipsec, etc). I tried to reboot but kept getting an error from system-docker that a running container couldn't be stopped. Tried to identify it with sudo system-docker ps
but the command just hung. I ended up doing a power cycle, and that's where the problems began.
On reboot, zfs was not available, with the last message in `sudo ros service logs zfs` being:
+ mv spl-0.6.5.9/ spl/
mv: cannot move 'spl-0.6.5.9/' to 'spl/spl-0.6.5.9': Directory not empty
So, I followed your instructions and removed the zfs stuff. The commands were (roughly) as follows:
sudo ros service disable zfs
for container in `docker ps -a | grep zfs | awk '{print $NF;}'`; do docker rm -f $container; done
for image in `docker images | grep zfs | awk '{print $3;}'`; do docker rmi $image; done
sudo ros service enable zfs
sudo ros service up zfs
2nd command: delete all running/stopped containers with zfs in the name. 3rd command: delete all images with zfs in the name.
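The column-3 parsing in that image loop can be factored out and checked against the `system-docker images` output pasted earlier in the thread. A sketch - the helper name is made up:

```shell
# Extract image IDs (column 3 of `docker images` output) for rows
# matching a pattern - the same parsing the image-removal loop relies on.
zfs_image_ids() { awk -v pat="$1" '$0 ~ pat {print $3}'; }
# Usage: sudo system-docker images | zfs_image_ids zfs | xargs -r sudo system-docker rmi
```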
At the end, zfs was back up and my pool was automatically mounted. Unfortunately, docker had not waited for all of this, so it was in a reboot loop - any attempt to run docker commands against the user docker failed because the socket was not available. I then tried a reboot, and zfs came back successfully. The user-docker container appropriately waited for ZFS to be available, but was again stuck in the boot loop. The repeating logs for that are:
> time="2017-04-23T02:25:17Z" level=info msg="Starting Docker in context: console"
> time="2017-04-23T02:25:17Z" level=info msg="Getting PID for service: console"
> time="2017-04-23T02:25:17Z" level=info msg="console PID 1205"
> time="2017-04-23T02:25:17Z" level=info msg="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=rancher1 HOME=/]"
> time="2017-04-23T02:25:17Z" level=info msg="Running [docker-runc exec -- 5e24f3a28bb00e32ac2e8fe2731d675c063caf14759cd97900a9f1e8bf4ecb77 env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=rancher1 HOME=/ ros docker-init daemon --graph /mnt/knas1/docker --storage-driver zfs --group docker --log-opt max-file=2 --log-opt max-size=25m --host unix:///var/run/docker.sock]"
Any ideas on how to recover the user-docker container? Is there something special that normally happens during a clean reboot that got hosed when I power cycled?
RancherOS Version: v0.9.3-rc3
Where are you running RancherOS? baremetal
At reboot the zfs container is started, but immediately exits. Because of this, zfs is unavailable:
The service is enabled:
A stop/start of the container got the same result (immediate exit). Here are the logs from the container (truncated to the last bit for brevity, if you want the entirety of the logs I can provide them):