openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

OpenRC Init Scripts Fail when /var is Separate Dataset #3513

Closed jameslikeslinux closed 9 years ago

jameslikeslinux commented 9 years ago

With the new set of OpenRC init scripts (zfs-zed, zfs-import, zfs-mount, zfs-share), when /var is a separate dataset from /, zfs-mount fails with:

 * Mounting ZFS filesystem(s)  ...
cannot mount '/var': directory is not empty
 * 1

The services are configured as follows:

> rc-update show boot default | grep zfs
           zfs-import | boot        
            zfs-mount | boot        
            zfs-share |      default
              zfs-zed | boot

The datasets are structured as follows:

> zfs list
NAME                              USED  AVAIL  REFER  MOUNTPOINT
beaglebone1                      3.55G  3.11G    19K  none
beaglebone1/ROOT                 3.54G  3.11G    19K  none
beaglebone1/ROOT/funtoo          3.54G  3.11G  3.44G  /
beaglebone1/ROOT/funtoo/var      70.2M  3.11G  54.8M  /var
beaglebone1/home                 2.16M  3.11G    19K  /home
beaglebone1/home/jlee            2.14M  3.11G  2.03M  /home/jlee

I put find /var -ls immediately before the zfs mount -a in /etc/init.d/zfs-mount to see what gets written to /var before the /var dataset gets mounted:

     8    2 drwxr-xr-x   5 root     root            7 Jun 22 17:05 /var                                                                                             
331818    1 drwxr-xr-x   4 root     root            4 Jun 22 17:05 /var/cache
331824    1 drwxrwxr-x   2 portage  portage         2 Jun 22 17:05 /var/cache/eix
331826    1 drwxr-sr-x   2 man      root            2 Jun 22 17:05 /var/cache/man
331788    1 drwxr-xr-x   2 root     root            4 Jun 22 17:05 /var/log
331805    1 -rw-rw-r--   1 root     utmp            0 Jun 22 17:05 /var/log/wtmp
331808   14 -rw-r-----   1 root     root        25754 Jun 22 17:05 /var/log/dmesg
331782    1 drwxr-xr-x   5 root     root            5 Jun 22 17:05 /var/lib
331827    1 drwxr-xr-x   5 root     root            5 Jun 22 17:05 /var/lib/nfs
331838    1 drwxr-xr-x   2 root     root            2 Jun 22 17:05 /var/lib/nfs/v4root
331834    1 drwxr-xr-x   2 root     root            2 Jun 22 17:05 /var/lib/nfs/rpc_pipefs
331835    1 drwxr-xr-x   2 root     root            2 Jun 22 17:05 /var/lib/nfs/v4recovery
331789    1 drwxr-xr-x   2 root     root            3 Jun 22 17:05 /var/lib/misc
331843    1 -rw-------   1 root     root          512 Jun 22 17:05 /var/lib/misc/random-seed
331783    1 drwxr-xr-x   2 root     root            3 Jun 22 17:05 /var/lib/run
331784    1 -rw-r--r--   1 root     root           24 Jun 22 17:05 /var/lib/run/zed.state
331794    1 lrwxrwxrwx   1 root     root            9 Jun 22 17:05 /var/lock -> /run/lock
331795    1 lrwxrwxrwx   1 root     root            4 Jun 22 17:05 /var/run -> /run
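
The check above can be narrowed to a yes/no answer per mountpoint. The following is a minimal sketch, not part of the shipped init scripts: the function names `dir_is_empty` and `report_if_populated` are hypothetical, and it only assumes a POSIX shell and `ls -A`.

```shell
#!/bin/sh
# Hypothetical helpers (not from the ZFS init scripts): report whether a
# directory that is about to be used as a mountpoint already has entries.

# Succeeds only if "$1" is an existing directory with no entries.
dir_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1" 2>/dev/null)" ]
}

# Prints one status line for the directory given as "$1".
report_if_populated() {
    if [ ! -d "$1" ]; then
        echo "$1: missing"
    elif dir_is_empty "$1"; then
        echo "$1: empty"
    else
        echo "$1: not empty"
    fi
}

# Example: call `report_if_populated /var` just before `zfs mount -a`.
```

A "not empty" result for /var at that point would mean some earlier service has already written to the root dataset's /var.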

There is a mix of stuff written by the bootmisc service and the zfs-zed service. Indeed, we see at the top of /etc/init.d/zfs-zed the following:

# bootmisc will log to /var which may be a different zfs than root.
before bootmisc logger zfs-import

so clearly it is known that bootmisc needs to run after /var gets mounted. I added before bootmisc to /etc/init.d/zfs-mount and zfs-mount still fails with the same error, but the set of files is reduced to just that which is written by zfs-zed:

 * Registering already-mounted ZFS filesystems and volumes  ...
 [ ok ]
     8    1 drwxr-xr-x   3 root     root            3 Jun 22 17:01 /var
331781    1 drwxr-xr-x   3 root     root            3 Jun 22 17:01 /var/lib
331782    1 drwxr-xr-x   2 root     root            3 Jun 22 17:01 /var/lib/run
331783    1 -rw-r--r--   1 root     root           24 Jun 22 17:01 /var/lib/run/zed.state
 * Mounting ZFS filesystem(s)  ...
cannot mount '/var': directory is not empty

It seems clear to me that, since zfs-zed runs first, it needs to take care to have mounted everything that it writes to. I'd say it should check for /var, but it is conceivable that you may have separate /var/lib and /var/run datasets as well. I'm not sure what the solution is for that.

It also seems clear to me that zfs-mount should run before bootmisc.

FransUrbo commented 9 years ago

I think this is more complicated than that. We must/want to have zed started before we run import, import needs to run before mount, and mount needs to run before share… But we can't start zed TOO early.

But, for you, this leads to a "Catch 22". You want/need zed to run AFTER the mount, because zed writes its state under /var (your listing shows /var/lib/run/zed.state), and that is what "pulls in" /var before your dataset is mounted.

I honestly don't know how to solve this for you. Debian GNU/Linux has solved stuff like this by having a volatile /run directory, which is more appropriate, because it should be empty at every start anyway.

@behlendorf @ryao Any ideas?

FransUrbo commented 9 years ago

Actually, I can see only one solution to this.

The reason we want zed to start first is that we want it to start consuming events at the very instant the pool(s) are imported. However, if you can live without that for the few seconds it takes to get the pool(s) imported and filesystems mounted, you can put it between mount and share. Or even last if you want…

Let's see what the others think about this, but it's the only one I've been able to come up with.

behlendorf commented 9 years ago

I agree. The simplest solution for now would just be to start the ZED after zfs-mount. No events will be lost since they'll be cached by the kernel until the ZED is started and then consumed. So that's easy and safe. Alternatively, you could specify a different location for the ZED to store its state and pid files.
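
As a rough illustration of the second option: zed accepts `-p` for the pid file and `-s` for the state file, so both can be pointed away from /var. The paths below are assumptions chosen for the example (a volatile /run that exists before /var is mounted), and `zed_args` is a hypothetical helper, not something the init script defines. Whether a state file that is lost at reboot is acceptable on a given system is also an assumption.

```shell
#!/bin/sh
# Assumed locations on a volatile /run; adjust for the local system.
ZED_PIDFILE="/run/zed.pid"
ZED_STATEFILE="/run/zed.state"

# Hypothetical helper: build the zed command-line options that relocate
# the pid file (-p) and the state file (-s).
zed_args() {
    printf '%s\n' "-p $ZED_PIDFILE -s $ZED_STATEFILE"
}

# Example: the daemon would then be started as `zed $(zed_args)`.
```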

FransUrbo commented 9 years ago

Ok, so you agree then @behlendorf ? This is a "Won't Fix - local 'peculiarities'" then?

I guess I could write something up about this in the README, but on the other hand, I HAVE described the theory and reasoning behind why we've done it this way, and it is/should be up to the local admin to 'draw conclusions' from that. Also, I bet there are many more of these where this came from, and I don't know if it's possible or reasonable to document every single exception.

And if there's another issue like this, we just redirect them to the list, because it's not a bug/issue, it's a support issue?

jameslikeslinux commented 9 years ago

/var as a separate dataset is a default option in Solaris, and with good reason--since the data is more active in /var compared with the rest of /, you may want to have a different snapshotting policy. It's hardly a "local peculiarity."

I agree that ZED can and should run after zfs-mount. Indeed, for systems with a ZFS root pool, that is imported and mounted by the initrd, well before ZED is started; and if, as @behlendorf said, the events are cached and can be consumed when ZED does start, then there should be no issue starting it later in the boot process.

In fact, there is a good argument to be made for starting ZED very late in the boot process, considering it may try to send emails, which could require a mail service or networking to be running.

As a start, this will get things mostly working:

diff --git a/etc/init.d/zfs-import.in b/etc/init.d/zfs-import.in
index dc674c4..0bb16d3 100755
--- a/etc/init.d/zfs-import.in
+++ b/etc/init.d/zfs-import.in
@@ -34,7 +34,7 @@

 do_depend()
 {
-       after sysfs udev zfs-zed
+       after sysfs udev
        keyword -lxc -openvz -prefix -vserver
 }

diff --git a/etc/init.d/zfs-mount.in b/etc/init.d/zfs-mount.in
index 50a0aef..3c9cc1b 100755
--- a/etc/init.d/zfs-mount.in
+++ b/etc/init.d/zfs-mount.in
@@ -46,6 +46,16 @@ chkroot() {

 do_depend()
 {
+       # Try to allow people to mix and match fstab with ZFS in a way that makes sense.
+       if [ "$(mountinfo -s /)" = 'zfs' ]
+       then
+               before localmount
+       else
+               after localmount
+       fi
+
+       # bootmisc will log to /var which may be a different zfs than root.
+       before bootmisc logger
        after procfs zfs-import sysfs procps
        use mtab
        keyword -lxc -openvz -prefix -vserver
diff --git a/etc/init.d/zfs-zed.in b/etc/init.d/zfs-zed.in
index 1458387..1a931bb 100755
--- a/etc/init.d/zfs-zed.in
+++ b/etc/init.d/zfs-zed.in
@@ -45,17 +45,7 @@ ZED_PIDFILE="@runstatedir@/$ZED_NAME.pid"

 do_depend()
 {
-       # Try to allow people to mix and match fstab with ZFS in a way that makes sense.
-       if [ "$(mountinfo -s /)" = 'zfs' ]
-       then
-               before localmount
-       else
-               after localmount
-       fi
-
-       # bootmisc will log to /var which may be a different zfs than root.
-       before bootmisc logger zfs-import
-       after sysfs
+       after sysfs zfs-mount
 }

 do_start()

I say "mostly" because having zfs-mount run before localmount but after procfs results in a dependency issue:

 * ERROR: cannot start procfs as localmount would not start

as procfs has a dependency:

need localmount

I think @ryao made the "before localmount" dependency in the old zfs OpenRC script, so he may have to weigh in to resolve which should go first.

jameslikeslinux commented 9 years ago

Actually, taking a closer look, the procfs service just takes care of mounting things like binfmt_misc and usbfs--neither of which is required for the zfs-mount service. OpenRC itself takes care of mounting /proc at the beginning of the boot process. So we can simplify the dependencies of zfs-mount and get rid of the procfs error message:

diff --git a/etc/init.d/zfs-import.in b/etc/init.d/zfs-import.in
index dc674c4..0bb16d3 100755
--- a/etc/init.d/zfs-import.in
+++ b/etc/init.d/zfs-import.in
@@ -34,7 +34,7 @@

 do_depend()
 {
-       after sysfs udev zfs-zed
+       after sysfs udev
        keyword -lxc -openvz -prefix -vserver
 }

diff --git a/etc/init.d/zfs-mount.in b/etc/init.d/zfs-mount.in
index 50a0aef..ea45de6 100755
--- a/etc/init.d/zfs-mount.in
+++ b/etc/init.d/zfs-mount.in
@@ -46,7 +46,17 @@ chkroot() {

 do_depend()
 {
-       after procfs zfs-import sysfs procps
+       # Try to allow people to mix and match fstab with ZFS in a way that makes sense.
+       if [ "$(mountinfo -s /)" = 'zfs' ]
+       then
+               before localmount
+       else
+               after localmount
+       fi
+
+       # bootmisc will log to /var which may be a different zfs than root.
+       before bootmisc logger
+       after zfs-import sysfs
        use mtab
        keyword -lxc -openvz -prefix -vserver
 }
diff --git a/etc/init.d/zfs-zed.in b/etc/init.d/zfs-zed.in
index 1458387..1a931bb 100755
--- a/etc/init.d/zfs-zed.in
+++ b/etc/init.d/zfs-zed.in
@@ -45,17 +45,7 @@ ZED_PIDFILE="@runstatedir@/$ZED_NAME.pid"

 do_depend()
 {
-       # Try to allow people to mix and match fstab with ZFS in a way that makes sense.
-       if [ "$(mountinfo -s /)" = 'zfs' ]
-       then
-               before localmount
-       else
-               after localmount
-       fi
-
-       # bootmisc will log to /var which may be a different zfs than root.
-       before bootmisc logger zfs-import
-       after sysfs
+       after sysfs zfs-mount
 }

 do_start()

I'd also recommend changing the documentation to say that zfs-zed should be added to the default runlevel, not boot, such that:

> rc-update show boot default | grep zfs 
           zfs-import | boot        
            zfs-mount | boot        
            zfs-share |      default
              zfs-zed |      default

The corresponding runlevels would need to be adjusted for the SVR4 interpretation of the init scripts.
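
On an existing OpenRC install, moving zfs-zed between runlevels could look roughly like the sketch below. `rc-update del` and `rc-update add` are the standard OpenRC commands; the `run` wrapper and the `RUN` variable are hypothetical additions so that the commands are only printed unless explicitly enabled.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: print the command unless RUN=1 is set,
# in which case execute it (needs root and an OpenRC system).
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Remove zfs-zed from the boot runlevel and add it to default.
move_zed_to_default() {
    run rc-update del zfs-zed boot
    run rc-update add zfs-zed default
}

# Example: `move_zed_to_default` prints the two commands;
# `RUN=1 move_zed_to_default` would apply them.
```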

behlendorf commented 9 years ago

I'm just saying this feels like a distribution-specific issue. We should focus on providing the needed functionality so each distribution can integrate it with their system as appropriate. If those changes can be merged back into our tree in a generic way, so much the better.

jameslikeslinux commented 9 years ago

This is no more distribution specific than it's always been. Having a separate /var dataset used to work, and it worked for years, then init scripts were changed, and now it doesn't. That is a bug that needs to be fixed.

And, though I haven't tested it, based on the runlevels and dependencies specified in the init scripts for Debian and Red Hat, this issue would exist on both of those platforms as well.

behlendorf commented 9 years ago

@MrStaticVoid fair point. Could you open a pull request with the fix you've proposed above, with one small tweak? Let's start zfs-zed after zfs-mount and before zfs-share. This will be important going forward because we'd like to delegate the responsibility for sharing a filesystem to the ZED, so it must be running prior to zfs-share.
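
A minimal sketch of what that ordering might look like in zfs-zed's dependency function, assuming the `do_depend()` shape from the diffs above. The `after` and `before` stubs are stand-ins added only so the fragment runs outside OpenRC; under OpenRC those keywords are supplied by the service manager and the stubs should be dropped. This is not the change that was eventually merged.

```shell
#!/bin/sh
# Stand-ins for OpenRC's dependency keywords, for standalone demonstration
# only. Do not include these in a real init script.
after()  { echo "after $*"; }
before() { echo "before $*"; }

# Proposed ordering: zed starts once filesystems are mounted, but before
# zfs-share, so it is running by the time sharing begins.
do_depend()
{
	after sysfs zfs-mount
	before zfs-share
}

# Example: running `do_depend` with the stubs prints the declared ordering.
```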

jameslikeslinux commented 9 years ago

I will submit a pull request this weekend.