openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

PANIC on umount due to iput taskq #3281

Closed chrisrd closed 9 years ago

chrisrd commented 9 years ago

We're occasionally seeing stack traces like:

[ 4810.313652] VERIFY(list_is_empty(list)) failed
[ 4810.313672] PANIC at list.h:88:list_destroy()
[ 4810.313681] Showing stack for process 30963
[ 4810.313684] CPU: 0 PID: 30963 Comm: umount Tainted: P           OE  3.19.2-201.fc21.x86_64 #1
[ 4810.313686] Hardware name: Red Hat RHEV Hypervisor, BIOS 0.5.1 01/01/2007
[ 4810.313688]  ffffffff8176e025 000000008756bd80 ffff88005183bc08 ffffffff8176e025
[ 4810.313692]  0000000000000000 ffffffffa05024f7 ffff88005183bc18 ffffffffa02ff8f4
[ 4810.313695]  ffff88005183bda8 ffffffffa02ff9bb 00000000ee83ee83 000000008756bd80
[ 4810.313698] Call Trace:
[ 4810.313706]  [<ffffffff8176e025>] ? dump_stack+0x45/0x57
[ 4810.313709]  [<ffffffff8176e025>] dump_stack+0x45/0x57
[ 4810.313719]  [<ffffffffa02ff8f4>] spl_dumpstack+0x44/0x50 [spl]
[ 4810.313724]  [<ffffffffa02ff9bb>] spl_panic+0xbb/0x100 [spl]
[ 4810.313730]  [<ffffffff81776c37>] ? ftrace_call+0x5/0x34
[ 4810.313737]  [<ffffffffa02ff905>] ? spl_panic+0x5/0x100 [spl]
[ 4810.313957]  [<ffffffffa0464b8c>] ? zfs_fuid_destroy+0x21c/0x2f0 [zfs]
[ 4810.313992]  [<ffffffffa0476547>] zfs_sb_free+0x307/0x310 [zfs]
[ 4810.314027]  [<ffffffffa0476c39>] zfs_umount+0xe9/0x1a0 [zfs]
[ 4810.314075]  [<ffffffffa0476b55>] ? zfs_umount+0x5/0x1a0 [zfs]
[ 4810.314108]  [<ffffffffa04a38fe>] zpl_put_super+0xe/0x40 [zfs]
[ 4810.314113]  [<ffffffff8121ba36>] generic_shutdown_super+0x76/0x100
[ 4810.314115]  [<ffffffff8121b9c5>] ? generic_shutdown_super+0x5/0x100
[ 4810.314118]  [<ffffffff8121bd66>] kill_anon_super+0x16/0x30
[ 4810.314149]  [<ffffffffa04a3a5e>] ? zpl_kill_sb+0x1e/0x30 [zfs]
[ 4810.314180]  [<ffffffffa04a3a5e>] zpl_kill_sb+0x1e/0x30 [zfs]
[ 4810.314187]  [<ffffffff8121c159>] ? deactivate_locked_super+0x49/0x60
[ 4810.314190]  [<ffffffff8121c159>] deactivate_locked_super+0x49/0x60
[ 4810.314193]  [<ffffffff8121c5ac>] deactivate_super+0x6c/0x80
[ 4810.314197]  [<ffffffff8123a353>] ? cleanup_mnt+0x43/0xa0
[ 4810.314200]  [<ffffffff8123a353>] cleanup_mnt+0x43/0xa0
[ 4810.314202]  [<ffffffff8123a402>] ? __cleanup_mnt+0x12/0x20
[ 4810.314205]  [<ffffffff8123a402>] __cleanup_mnt+0x12/0x20
[ 4810.314209]  [<ffffffff810b89a4>] task_work_run+0xd4/0xf0
[ 4810.314214]  [<ffffffff81014cf7>] do_notify_resume+0x97/0xb0
[ 4810.314219]  [<ffffffff81774aa7>] int_signal+0x12/0x17

E.g.:

http://buildbot.zfsonlinux.org/builders/centos-7.0-x86_64-builder/builds/1817/steps/shell_17/logs/stdio
http://buildbot.zfsonlinux.org/builders/fedora-18-x86_64-builder/builds/2996/steps/shell_17/logs/stdio
http://buildbot.zfsonlinux.org/builders/fedora-21-x86_64-builder/builds/920/steps/shell_17/logs/stdio

Commit @fd23720 (with my own Signed-off-by!) resolved a deadlock (#1988) by moving the draining of the iput taskq outside z_teardown_lock:

zpl_put_super
+ zfs_umount
  + zfs_sb_teardown
    + taskq_wait(dsl_pool_iput_taskq(dmu_objset_pool(zsb->z_os)));    # current
    + rrw_enter(&zsb->z_teardown_lock, RW_WRITER, FTAG);              # ZFS_ENTER blocked
    + taskq_wait(dsl_pool_iput_taskq(dmu_objset_pool(zsb->z_os)));    # original

With the drain outside this lock there's nothing to prevent further work being submitted between the taskq_wait and the z_teardown_lock being obtained, leading to the PANIC above if the taskq isn't empty by the time we're freeing resources in zfs_sb_free.

It seems the correct solution is to move the drain back inside z_teardown_lock and resolve the original deadlock some other way...

The z_teardown_lock is used in ZFS_ENTER/ZFS_EXIT to prevent further processes submitting work to ZFS whilst we're unmounting the filesystem. The only time z_teardown_lock is held for write (i.e. blocking ZFS_ENTER) is in zfs_sb_teardown above.
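For reference, the macros look roughly like this. (This is paraphrased from the ZoL headers of this era rather than copied verbatim, so treat the exact error path and field names as approximate.) Every ZPL entry point takes z_teardown_lock as a reader and bails out with EIO if the filesystem has already been torn down:

#define ZFS_ENTER(zsb)                                                  \
        {                                                               \
                rrw_enter(&(zsb)->z_teardown_lock, RW_READER, FTAG);   \
                if ((zsb)->z_unmounted) {                               \
                        ZFS_EXIT(zsb);                                  \
                        return (EIO);                                   \
                }                                                       \
        }

#define ZFS_EXIT(zsb)                                                   \
        rrw_exit(&(zsb)->z_teardown_lock, FTAG)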

The original deadlock was caused by dsl_pool_iput_taskq calling iput: with the drain inside z_teardown_lock, the umount thread held the lock for write whilst waiting for the taskq to drain, and the taskq thread was blocked in ZFS_ENTER waiting for that same lock:

taskq_dispatch(dsl_pool_iput_taskq..., (task_func_t *)iput, ...);
+ iput 
  + iput_final
    + write_inode_now(inode, 1);
      + writeback_single_inode( wbc.sync_mode == WB_SYNC_ALL ) 
        + __writeback_single_inode
          + do_writepages
            + zpl_writepages          # via a_ops->writepages
              + ZFS_ENTER             # deadlock
              + zpl_putpage
                + zfs_putpage
                  + ZFS_ENTER         # deadlock

The ZFS_ENTER in zfs_putpage was added in @4c837f0d, with the comment:

    It also adds the missing ZFS_ENTER()/ZFS_EXIT() macros which
    aren't strictly required due to the VFS holding a reference.
    However, this makes the code more consistent with the upstream
    code and there's no harm in being extra careful here.

I.e. it seems there's no actual need for ZFS_ENTER in zfs_putpage.

zpl_writepages was added in @119a394:

static int
zpl_writepages(struct address_space *mapping, struct writeback_control *wbc)
{       
        znode_t         *zp = ITOZ(mapping->host);
        zfs_sb_t        *zsb = ITOZSB(mapping->host);
        enum writeback_sync_modes sync_mode;
        int result;     

        ZFS_ENTER(zsb);
        if (zsb->z_os->os_sync == ZFS_SYNC_ALWAYS)
                wbc->sync_mode = WB_SYNC_ALL;
        ZFS_EXIT(zsb);
        sync_mode = wbc->sync_mode;

        /*
         * We don't want to run write_cache_pages() in SYNC mode here, because
         * that would make putpage() wait for a single page to be committed to
         * disk every single time, resulting in atrocious performance. Instead
         * we run it once in non-SYNC mode so that the ZIL gets all the data,
         * and then we commit it all in one go.
         */     
        wbc->sync_mode = WB_SYNC_NONE;
        result = write_cache_pages(mapping, wbc, zpl_putpage, mapping);
        if (sync_mode != wbc->sync_mode) {
                ZFS_ENTER(zsb);
                ZFS_VERIFY_ZP(zp); 
                if (zsb->z_log != NULL)
                        zil_commit(zsb->z_log, zp->z_id);
                ZFS_EXIT(zsb);

                /*
                 * We need to call write_cache_pages() again (we can't just
                 * return after the commit) because the previous call in
                 * non-SYNC mode does not guarantee that we got all the dirty
                 * pages (see the implementation of write_cache_pages() for
                 * details). That being said, this is a no-op in most cases.
                 */
                wbc->sync_mode = sync_mode;
                result = write_cache_pages(mapping, wbc, zpl_putpage, mapping);
        }
        return (result);
}

It's not clear to me why the first ZFS_ENTER instance is there. Perhaps to protect against z_os disappearing? However zpl_writepages is either called before we start to unmount, or as part of the final cleanup before we unmount, i.e. z_os is always valid. I think this first ZFS_ENTER can be safely removed.

We can see from the iput call chain above that zpl_writepages is called with wbc->sync_mode == WB_SYNC_ALL. This means we get into the if (sync_mode != wbc->sync_mode) section of zpl_writepages, where we hit the second ZFS_ENTER protecting the zil_commit. However, following this we change the sync mode back to WB_SYNC_ALL and call write_cache_pages again, which also does a zil_commit (via zfs_putpage, from which we're also removing ZFS_ENTER):

write_cache_pages
  zpl_putpage
    zfs_putpage
      if (wbc->sync_mode != WB_SYNC_NONE)
        zil_commit

I.e. I think the second ZFS_ENTER, and the zil_commit it's protecting, can be removed.
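Putting those two together, zpl_writepages would reduce to something like this (just a sketch of what I'm proposing, not a tested patch):

static int
zpl_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
        zfs_sb_t        *zsb = ITOZSB(mapping->host);
        enum writeback_sync_modes sync_mode;
        int result;

        /* No ZFS_ENTER: z_os is valid for the lifetime of the mount. */
        if (zsb->z_os->os_sync == ZFS_SYNC_ALWAYS)
                wbc->sync_mode = WB_SYNC_ALL;
        sync_mode = wbc->sync_mode;

        /* First pass in non-SYNC mode so the ZIL sees all the data at once. */
        wbc->sync_mode = WB_SYNC_NONE;
        result = write_cache_pages(mapping, wbc, zpl_putpage, mapping);
        if (sync_mode != wbc->sync_mode) {
                /*
                 * Second pass in the caller's original (SYNC) mode;
                 * zfs_putpage does the zil_commit itself when
                 * sync_mode != WB_SYNC_NONE, so no separate
                 * ZFS_ENTER-protected commit is needed here.
                 */
                wbc->sync_mode = sync_mode;
                result = write_cache_pages(mapping, wbc, zpl_putpage, mapping);
        }
        return (result);
}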

Overall, ZFS_ENTER is used to protect against further activity being added to our file system whilst we're tearing it down. It shouldn't be used to prevent "in progress" work being completed, e.g. that being finished off by the iput taskq.

At least that's what I think. It would be good to get some other eyes on this!

behlendorf commented 9 years ago

@chrisrd nice analysis. I think what you're proposing will work, and may be a good short-term fix. But I think we can, and should, go further. Let me try and summarize what I'm thinking.

So the ZFS_ENTER/ZFS_EXIT macros are there, as you know, to coordinate a file system unmount. What isn't clear from the code is that this is actually entirely redundant under Linux. Unlike under illumos (and maybe FreeBSD too?), the Linux VFS already makes certain guarantees we can take advantage of. We just aren't.

For all the zpl_* callbacks the Linux VFS ensures that there is a reference held on any passed file handle, dentry, or inode pointer. In turn these data structures indirectly hold a reference on the super block, so the super block is not going anywhere during the callbacks. The VFS will also never ask ZFS to clean up the filesystem until the reference count on the super block drops to zero. Basically, this makes the ZFS_ENTER/ZFS_EXIT macros redundant. The only possible exception here is related to rollback.

What would be nice is to define ZFS_ENTER/ZFS_EXIT as no-ops, or perhaps tweak them so they instead take/drop a reference on the super block rather than a lock. That would eliminate the locking issue you've observed, likely be more performant, and be more Linuxy.

There definitely is one mystery remaining which needs to be explained, and that's this exact bug. When entering zfs_sb_teardown() as part of the unmount, the reference count on the super block must have dropped to zero. That means all other filesystem consumers will now be locked out from making new system calls, so once the taskq is drained we shouldn't see any new work being added. If we are seeing items being added we need to determine what they are and where they came from.

chrisrd commented 9 years ago

@behlendorf Re: https://github.com/zfsonlinux/zfs/pull/3282#issuecomment-92492875 "has this patch resolved the issue for you in practice?"

It's too early to tell... I haven't been able to provoke the issue in my own testing; I've only seen it on the buildbots, when looking at the failures for the Illumos 5056 pull request (#3254). The latest test run there gave a green board, but that doesn't mean much given the infrequency of the problem: I've only seen it 3 times in maybe 5 or 6 buildbot runs x 14(?) buildbots == 70+ individual test runs.

Given that I've only observed the problem with #3254, it could well be related specifically to that pull request. Not being able to reproduce the problem locally meant I was constrained to static analysis, which led me down this path. (Obviously I didn't delve deep enough into it to realise that, per your other comments, the Linux VFS should be preventing this occurring in the first place.)

I don't know if the issue has ever cropped up elsewhere, but google suggests it hasn't been seen in the wild (or at least hasn't been reported). Google doesn't appear to index the buildbot output - is there a way of searching for similar failures in other buildbot runs (i.e. unrelated to that pull request)?

This output displays the problem nicely, with some additional debug logging per @4348dc8a:

http://buildbot.zfsonlinux.org/builders/centos-7.0-x86_64-builder/builds/1817/steps/shell_17/logs/stdio

In that case, when we hit zfs_sb_teardown() the iput queue was enormous, at 17,503 entries, and it still had 1,647 entries by the time we hit zfs_sb_free().

...argh. I've just realised that all the iput queue entries seen in zfs_sb_free() were already there in zfs_sb_teardown(), i.e. there are no new entries. So the observable problem in that case wasn't that additional work was being added to the queue, it was that the taskq_wait hadn't waited for all the entries. Or they'd somehow been pushed back onto the queue. That's a completely different problem. Sigh. Back to the drawing board.

Obviously it would be far easier if I could provoke the problem myself: are there any guidelines on configuring and running the buildbot test suite locally?

With respect to your ZFS_ENTER/ZFS_EXIT comments, "more Linuxy" sounds like a good approach, but I'd have to understand what's happening on the Linux side before I could do anything about it.

chrisrd commented 9 years ago

Just to clarify...

In that case, when we hit zfs_sb_teardown() the iput queue was enormous, at 17,503 entries, and it still had 1,647 entries by the time we hit zfs_sb_free().

and:

I've just realised that all the iput queue entries seen in zfs_sb_free() were already there in zfs_sb_teardown(), i.e. there are no new entries.

...are incorrect and a fundamental mischaracterisation of the issue. The issue is not that the iput taskq contains these entries, it's that the zfs_all_znodes list contains these entries. Although they are related, there's no immediate first order link between the iput taskq and the zfs_all_znodes list.

With stock zfs-0.6.4 plus debug logging per @4348dc8, and using the test script and trivial file-create C code below, I can regularly see 10,000-20,000 entries (and once, nearly 80,000!) remaining on the zfs_all_znodes list in zfs_sb_teardown() (note: this is after we've waited on the taskq). However in several hundred test runs I've yet to run into the VERIFY(list_is_empty(list)) problem, indicating the list has been emptied by the time we destroy it in zfs_sb_free(). But I can't see anything between the list having entries in zfs_sb_teardown() and the list being destroyed in zfs_sb_free() which guarantees the list is emptied. In fact I don't know what's actually still working on the list to make it (almost?) always empty by the time we hit zfs_sb_free(), even when there were thousands of entries in zfs_sb_teardown().

It probably means something - but I don't know what - that this only happens on a multi-core VM: on a single-core VM I haven't seen any entries in the zfs_all_znodes list in zfs_sb_teardown().

Separately, it looks like the iput taskq can run:

iput
 iput_final
  evict
   destroy_inode
    zpl_inode_destroy
     zfs_inode_destroy
      zfs_iput_async(ZTOI(zp->z_xattr_parent))
       taskq_dispatch(dsl_pool_iput_taskq..., iput, ...)

i.e. it seems, if we're iput-ing a (dir-based) xattr, we'll put its parent on the iput taskq. Should we be waiting for the taskq twice? E.g. in zfs_sb_teardown():

if (zsb->z_os) {
  /*
   * Twice, because the first lot of iputs may add the parents of (dir-based) xattrs
   * to the taskq (it'll be clean after the second round because we don't have xattrs
   * on xattrs).
   */
  taskq_wait(dsl_pool_iput_taskq(dmu_objset_pool(zsb->z_os)));
  taskq_wait(dsl_pool_iput_taskq(dmu_objset_pool(zsb->z_os)));
} 

Test script and file-create C code to provoke zfs_all_znodes list entries during file system shutdown:

#!/bin/bash
# test.sh - assumes the file-create binary (compiled from file-create.c below) is on $PATH
if ! zpool list test1 >& /dev/null
then
        echo "creating pool test1"
        truncate -s 256M /var/tmp/file-vdev{0..3}
        zpool create test1 /var/tmp/file-vdev{0..3}
fi
zfs create -o xattr=on test1/fs
for i in {1..30}
do
        mkdir /test1/fs/$i && cd /test1/fs/$i && file-create &
done
sleep 60
killall file-create
killall file-create  # sometimes there are some left over...
zfs destroy test1/fs
/* file-create.c */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/xattr.h>

int
main(int argc, char *argv[])
{
        char fn[1024];
        int i = 0;
        int fd;
        while (1) {
                sprintf(fn, "%d", ++i);
                if ((fd = open(fn, O_CREAT, 0444)) < 0) {
                        fprintf(stderr, "open failed: %s\n", strerror(errno));
                        exit(1);
                }
                if (fsetxattr(fd, "user.a", "a", 1, 0) < 0) {
                        fprintf(stderr, "fsetxattr failed: %s\n", strerror(errno));
                        exit(1);
                }
                if (close(fd) < 0) {
                        fprintf(stderr, "close failed: %s\n", strerror(errno));
                        exit(1);
                }
                /*
                 * the numbers don't mean anything, removing
                 * some xattrs, and some files, "just because"
                 */
                if (i % 3 == 0) {
                        sprintf(fn, "%d", i / 3);
                        if (removexattr(fn, "user.a") < 0) {
                          fprintf(stderr, "removexattr %s failed: %s\n", fn, strerror(errno));
                          exit(1);
                        }
                }
                if (i % 5 == 0) {
                        sprintf(fn, "%d", i / 5);
                        if (unlink(fn) < 0) {
                          fprintf(stderr, "unlink %s failed: %s\n", fn, strerror(errno));
                          exit(1);
                        }
                }
        }
}
behlendorf commented 9 years ago

i.e. it seems, if we're iput-ing a (dir-based) xattr, we'll put its parent on the iput taskq. Should we be waiting for the taskq twice? E.g. in zfs_sb_teardown():

Ahh, that's it! You've struck on a reasonable root cause. That's right, it is possible to redispatch an iput for the parent inode to the taskq. This is also exactly the sort of thing the zfs-stress torture tests would expose but would be amazingly unlikely on real systems.

I would think a reasonable fix for this would be to simply block in zfs_sb_teardown in the unmount case until zsb->z_all_znodes drops to zero.
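Roughly something like the following, just to convey the idea (not a tested patch; it assumes zfs_sb_teardown()'s unmounting flag and the z_znodes_lock/z_all_znodes fields, and that at this point the only remaining references are the pending iputs):

        if (zsb->z_os && unmounting) {
                taskq_t *tq = dsl_pool_iput_taskq(dmu_objset_pool(zsb->z_os));

                /*
                 * Keep draining the iput taskq until z_all_znodes is empty,
                 * since each pass may re-dispatch iputs for the parents of
                 * dir-based xattrs.
                 */
                taskq_wait(tq);
                mutex_enter(&zsb->z_znodes_lock);
                while (!list_is_empty(&zsb->z_all_znodes)) {
                        mutex_exit(&zsb->z_znodes_lock);
                        taskq_wait(tq);
                        mutex_enter(&zsb->z_znodes_lock);
                }
                mutex_exit(&zsb->z_znodes_lock);
        }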

chrisrd commented 9 years ago

@behlendorf Like that?