openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs_ioc_rollback() ASSERTION(atomic_read(&ZTOI(zp)->i_count) > 0) failed #1417

Closed. chrisrd closed this issue 10 years ago.

chrisrd commented 11 years ago

linux-3.8.7, zfsonlinux/spl@352bd194, zfsonlinux/zfs@b28e57cb + behlendorf/zfs@d08c3f7

# zfs rollback tank/xxxx@yyyyyy

Apr 22 16:37:17 b4 kernel: [271873.024645] SPLError: 30820:0:(zfs_vfsops.c:1097:zfs_sb_teardown()) ASSERTION(atomic_read(&ZTOI(zp)->i_count) > 0) failed
Apr 22 16:37:17 b4 kernel: [271873.024736] SPLError: 30820:0:(zfs_vfsops.c:1097:zfs_sb_teardown()) SPL PANIC
Apr 22 16:37:17 b4 kernel: [271873.025928] SPL: Showing stack for process 30820
Apr 22 16:37:17 b4 kernel: [271873.025970] Pid: 30820, comm: zfs Tainted: P           O 3.8.7-otn-00004-g70bd04f #1
Apr 22 16:37:17 b4 kernel: [271873.026044] Call Trace:
Apr 22 16:37:17 b4 kernel: [271873.026090]  [<ffffffffa01ae557>] spl_debug_dumpstack+0x27/0x40 [spl]
Apr 22 16:37:17 b4 kernel: [271873.026140]  [<ffffffffa01af6cf>] spl_debug_bug+0x7f/0xe0 [spl]
Apr 22 16:37:17 b4 kernel: [271873.026221]  [<ffffffffa0921df1>] zfs_sb_teardown+0x5a1/0x5b0 [zfs]
Apr 22 16:37:17 b4 kernel: [271873.026300]  [<ffffffffa0921e18>] zfs_suspend_fs+0x18/0x40 [zfs]
Apr 22 16:37:17 b4 kernel: [271873.026378]  [<ffffffffa0910bdf>] zfs_ioc_rollback+0x11f/0x200 [zfs]
Apr 22 16:37:17 b4 kernel: [271873.026461]  [<ffffffffa091792d>] zfsdev_ioctl+0xdd/0x1d0 [zfs]
Apr 22 16:37:17 b4 kernel: [271873.026517]  [<ffffffff8114cd67>] do_vfs_ioctl+0x97/0x530
Apr 22 16:37:17 b4 kernel: [271873.026566]  [<ffffffff81146926>] ? final_putname+0x26/0x50
Apr 22 16:37:17 b4 kernel: [271873.026614]  [<ffffffff81146b2b>] ? putname+0x2b/0x40
Apr 22 16:37:17 b4 kernel: [271873.026661]  [<ffffffff8114d250>] sys_ioctl+0x50/0x90
Apr 22 16:37:17 b4 kernel: [271873.026711]  [<ffffffff813a2e19>] system_call_fastpath+0x16/0x1b
behlendorf commented 11 years ago

Can you reproduce this easily?

chrisrd commented 11 years ago

Not at the moment, that machine is running other stuff that I don't want to interrupt just yet. But I'll reorganise resources so I can prod at that machine at will and see if I can readily reproduce.

chrisrd commented 11 years ago

Yes, I can reproduce this easily (four times in a row) with:

rsync -a remote:/xxxx/ /tank/xxxx/ &&
zfs snapshot tank/xxxx@yyyyy &&
zfs rollback tank/xxxx@yyyyy

Same trace each time, e.g.:

Apr 30 12:52:15 b4 kernel: [  964.618094] SPLError: 5695:0:(zfs_vfsops.c:1097:zfs_sb_teardown()) ASSERTION(atomic_read(&ZTOI(zp)->i_count) > 0) failed
Apr 30 12:52:15 b4 kernel: [  964.618157] SPLError: 5695:0:(zfs_vfsops.c:1097:zfs_sb_teardown()) SPL PANIC
Apr 30 12:52:15 b4 kernel: [  964.618205] SPL: Showing stack for process 5695
Apr 30 12:52:15 b4 kernel: [  964.618252] Pid: 5695, comm: zfs Tainted: P           O 3.8.7-otn-00004-g70bd04f #1
Apr 30 12:52:15 b4 kernel: [  964.618298] Call Trace:
Apr 30 12:52:15 b4 kernel: [  964.618353]  [<ffffffffa05f7557>] spl_debug_dumpstack+0x27/0x40 [spl]
Apr 30 12:52:15 b4 kernel: [  964.618401]  [<ffffffffa05f86cf>] spl_debug_bug+0x7f/0xe0 [spl]
Apr 30 12:52:15 b4 kernel: [  964.618490]  [<ffffffffa0f1ddf1>] zfs_sb_teardown+0x5a1/0x5b0 [zfs]
Apr 30 12:52:15 b4 kernel: [  964.618561]  [<ffffffffa0f1de18>] zfs_suspend_fs+0x18/0x40 [zfs]
Apr 30 12:52:15 b4 kernel: [  964.618630]  [<ffffffffa0f0cbdf>] zfs_ioc_rollback+0x11f/0x200 [zfs]
Apr 30 12:52:15 b4 kernel: [  964.618701]  [<ffffffffa0f1392d>] zfsdev_ioctl+0xdd/0x1d0 [zfs]
Apr 30 12:52:15 b4 kernel: [  964.618749]  [<ffffffff8114cd67>] do_vfs_ioctl+0x97/0x530
Apr 30 12:52:15 b4 kernel: [  964.618793]  [<ffffffff81146926>] ? final_putname+0x26/0x50
Apr 30 12:52:15 b4 kernel: [  964.618837]  [<ffffffff81146b2b>] ? putname+0x2b/0x40
Apr 30 12:52:15 b4 kernel: [  964.618880]  [<ffffffff8114d250>] sys_ioctl+0x50/0x90
Apr 30 12:52:15 b4 kernel: [  964.618925]  [<ffffffff813a2e19>] system_call_fastpath+0x16/0x1b
chrisrd commented 11 years ago

For completeness, same thing on linux-3.8.10, zfsonlinux/spl@352bd19, zfsonlinux/zfs@57f5a20:

Apr 30 14:44:35 b4 kernel: [  831.189563] SPLError: 5555:0:(zfs_vfsops.c:1097:zfs_sb_teardown()) ASSERTION(atomic_read(&ZTOI(zp)->i_count) > 0) failed
Apr 30 14:44:35 b4 kernel: [  831.189626] SPLError: 5555:0:(zfs_vfsops.c:1097:zfs_sb_teardown()) SPL PANIC
Apr 30 14:44:35 b4 kernel: [  831.189675] SPL: Showing stack for process 5555
Apr 30 14:44:35 b4 kernel: [  831.189721] Pid: 5555, comm: zfs Tainted: G           O 3.8.10-otn-00009-ga955612 #1
Apr 30 14:44:35 b4 kernel: [  831.189767] Call Trace:
Apr 30 14:44:35 b4 kernel: [  831.189822]  [<ffffffffa01a4557>] spl_debug_dumpstack+0x27/0x40 [spl]
Apr 30 14:44:35 b4 kernel: [  831.189871]  [<ffffffffa01a56cf>] spl_debug_bug+0x7f/0xe0 [spl]
Apr 30 14:44:35 b4 kernel: [  831.189962]  [<ffffffffa09eedf1>] zfs_sb_teardown+0x5a1/0x5b0 [zfs]
Apr 30 14:44:35 b4 kernel: [  831.190034]  [<ffffffffa09eee18>] zfs_suspend_fs+0x18/0x40 [zfs]
Apr 30 14:44:35 b4 kernel: [  831.190104]  [<ffffffffa09ddbdf>] zfs_ioc_rollback+0x11f/0x200 [zfs]
Apr 30 14:44:35 b4 kernel: [  831.190175]  [<ffffffffa09e492d>] zfsdev_ioctl+0xdd/0x1d0 [zfs]
Apr 30 14:44:35 b4 kernel: [  831.190223]  [<ffffffff8114ce47>] do_vfs_ioctl+0x97/0x530
Apr 30 14:44:35 b4 kernel: [  831.190268]  [<ffffffff8106ffca>] ? finish_task_switch+0x4a/0xf0
Apr 30 14:44:35 b4 kernel: [  831.190315]  [<ffffffff81399636>] ? __schedule+0x2c6/0x750
Apr 30 14:44:35 b4 kernel: [  831.190358]  [<ffffffff8114d330>] sys_ioctl+0x50/0x90
Apr 30 14:44:35 b4 kernel: [  831.190403]  [<ffffffff813a2e99>] system_call_fastpath+0x16/0x1b
behlendorf commented 10 years ago

@chrisrd There have been quite a large number of changes/fixes to the rollback code since this was reported. Do you know if this is still an issue in the latest master source?

chrisrd commented 10 years ago

Unfortunately, still happening: linux-3.10.18, zfsonlinux/zfs@64ad2b26e24ae9f70d3a41c786144552c2e6ac12 (+ chrisrd/zfs@bb145f1cd2881d8b2901776013301a52f4798645), zfsonlinux/spl@dd33a169ef1e071c2eb777328fe91df30ce542ae

Nov 20 01:50:36 b5 kernel: [34178.970127] SPLError: 16877:0:(zfs_vfsops.c:1148:zfs_sb_teardown()) ASSERTION(atomic_read(&ZTOI(zp)->i_count) > 0) failed
Nov 20 01:50:36 b5 kernel: [34178.970175] SPLError: 16877:0:(zfs_vfsops.c:1148:zfs_sb_teardown()) SPL PANIC
Nov 20 01:50:36 b5 kernel: [34178.970202] SPL: Showing stack for process 16877
Nov 20 01:50:36 b5 kernel: [34178.970223] CPU: 7 PID: 16877 Comm: zfs Tainted: G           O 3.10.18-otn-00011-g883dc78 #1
Nov 20 01:50:36 b5 kernel: [34178.970282] Hardware name: Supermicro X8DTH-i/6/iF/6F/X8DTH, BIOS 2.0c       07/19/11  
Nov 20 01:50:36 b5 kernel: [34178.970357]  ffff8807605fc528 ffff88019ef8bcf8 ffffffff813c611e ffff88019ef8bd08
Nov 20 01:50:36 b5 kernel: [34178.970441]  ffffffffa006d557 ffff88019ef8bd48 ffffffffa006e6ba ffffffffa00823dd
Nov 20 01:50:36 b5 kernel: [34178.970524]  ffff8807605fc000 ffff8807605fc528 ffff8807605fc4f8 0000000000000000
Nov 20 01:50:36 b5 kernel: [34178.970611] Call Trace:
Nov 20 01:50:36 b5 kernel: [34178.970653]  [<ffffffff813c611e>] dump_stack+0x19/0x1b
Nov 20 01:50:36 b5 kernel: [34178.970705]  [<ffffffffa006d557>] spl_debug_dumpstack+0x27/0x40 [spl]
Nov 20 01:50:36 b5 kernel: [34178.970755]  [<ffffffffa006e6ba>] spl_debug_bug+0x7a/0xe0 [spl]
Nov 20 01:50:36 b5 kernel: [34178.970847]  [<ffffffffa0d00851>] zfs_sb_teardown+0x561/0x570 [zfs]
Nov 20 01:50:36 b5 kernel: [34178.970911]  [<ffffffffa0c543be>] ? dmu_objset_rele+0x3e/0x50 [zfs]
Nov 20 01:50:36 b5 kernel: [34178.970984]  [<ffffffffa0d00870>] zfs_suspend_fs+0x10/0x20 [zfs]
Nov 20 01:50:36 b5 kernel: [34178.971058]  [<ffffffffa0cf22a5>] zfs_ioc_rollback+0x35/0xb0 [zfs]
Nov 20 01:50:36 b5 kernel: [34178.971131]  [<ffffffffa0cf56f1>] zfsdev_ioctl+0x1a1/0x5b0 [zfs]
Nov 20 01:50:36 b5 kernel: [34178.971179]  [<ffffffff813ce5cc>] ? __do_page_fault+0x26c/0x4f0
Nov 20 01:50:36 b5 kernel: [34178.971227]  [<ffffffff81149af0>] do_vfs_ioctl+0x90/0x520
Nov 20 01:50:36 b5 kernel: [34178.971273]  [<ffffffff8110b44b>] ? SyS_mmap_pgoff+0x5b/0x280
Nov 20 01:50:36 b5 kernel: [34178.971319]  [<ffffffff81149fd0>] SyS_ioctl+0x50/0x90
Nov 20 01:50:36 b5 kernel: [34178.971363]  [<ffffffff813d3082>] system_call_fastpath+0x16/0x1b
chrisrd commented 10 years ago

It looks like this is the same problem as #1536.

dweeezil commented 10 years ago

Yes, it certainly appears to be so. Your test case is effectively the same as my original reproducer, with the exception that you're using rollback, which needs to remount a dirtied filesystem. In my case, the receive was causing it. In either case, the results seem to be the same.

As @behlendorf mentioned in his comments in #1853, nobody has been able to explain why the assertion isn't correct (effectively, why i_count can be zero). I'll see if I can re-try my reproducer within the next couple of days and do a bit more thorough examination to see why it can happen. I can't seem to dig up the comment, but I'm pretty sure I discovered that FreeBSD changed the assertion to >= 0, which helped me believe it was a correct fix, but I certainly couldn't provide a rational explanation for it.

chrisrd commented 10 years ago

@dweeezil Is this what you were referring to?

http://svnweb.freebsd.org/base/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c?r1=196297&r2=196296&pathrev=196297

Revision 196297, modified Mon Aug 17 09:13:22 2009 UTC by pjd (diff to previous revision 193440):
Fix panic in zfs recv code. The last vnode (mountpoint's vnode) can have
0 usecount.

Reported by:    Thomas Backman <serenity@exscape.org>
Approved by:    re (kib)

http://svnweb.freebsd.org/base/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c?view=patch&r1=193440&r2=196297&pathrev=196297

--- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c    2009/06/04 16:18:07 193440
+++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c    2009/08/17 09:13:22 196297
@@ -917,7 +917,7 @@
    for (zp = list_head(&zfsvfs->z_all_znodes); zp != NULL;
        zp = list_next(&zfsvfs->z_all_znodes, zp))
        if (zp->z_dbuf) {
-           ASSERT(ZTOV(zp)->v_count > 0);
+           ASSERT(ZTOV(zp)->v_count >= 0);
            zfs_znode_dmu_fini(zp);
        }
    mutex_exit(&zfsvfs->z_znodes_lock);

Does the comment "The last vnode (mountpoint's vnode) can have 0 usecount" sound plausible/reasonable?

If that's the case, is i_count going negative something worth checking for, or should the ASSERT be removed altogether?

Or maybe the ASSERT could perhaps be changed to "i_count > 0 unless we're looking at the mountpoint's inode"?
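
A minimal sketch of that last variant (hypothetical and untested; it assumes zsb->z_root, the object number of the root znode, identifies the mountpoint's inode):

if (zp->z_sa_hdl) {
    /* Hypothetical: tolerate a zero count only for the root
     * (mountpoint) znode, and still catch a negative count. */
    ASSERT(atomic_read(&ZTOI(zp)->i_count) > 0 ||
        zp->z_id == zsb->z_root);
    ASSERT3S(atomic_read(&ZTOI(zp)->i_count), >=, 0);
    zfs_znode_dmu_fini(zp);
}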

behlendorf commented 10 years ago

@chrisrd Yes, that's the patch. I took a little time to look at this again and after a little reflection I do see a case where the assertion is just wrong. And it's somewhat specific to the Linux VFS, so it's possible Illumos doesn't suffer from this. It's also possible that it's just rare, and this ASSERT is disabled by default on FreeBSD/Illumos so it has gone unnoticed.

When the VFS destroys an inode through destroy_inode()->zpl_inode_destroy()->zfs_inode_destroy(), it drops the last reference on the inode in destroy_inode(). That is, ip->i_count drops to 0. However, the inode isn't removed from the zsb->z_all_znodes list until zfs_inode_destroy() is called. That means there is a small window where it may be on this list with an i_count of 0. If zfs_sb_teardown() were to run in that window, we'd hit the assertion.
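
A simplified sketch of that window, paraphrasing (not quoting) the 3.10-era call paths involved:

void iput(struct inode *inode)
{
    /* The last reference goes away here... */
    if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock))
        iput_final(inode);   /* ...then evict() -> destroy_inode() */
}

void zfs_inode_destroy(struct inode *ip)   /* via sop->destroy_inode */
{
    znode_t *zp = ITOZ(ip);
    zfs_sb_t *zsb = ZTOZSB(zp);

    /* ...but only here does the znode leave z_all_znodes, so a
     * teardown running in between sees it with i_count == 0. */
    mutex_enter(&zsb->z_znodes_lock);
    list_remove(&zsb->z_all_znodes, zp);
    mutex_exit(&zsb->z_znodes_lock);
    /* ... */
}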

The good news is that because zfs_sb_teardown() does all of its work under the zsb->z_znodes_lock, the code is safe. The inode can't be freed from underneath us while the lock is held. On Linux it's also impossible for ip->i_count to ever drop below zero. So it looks to me like the right thing to do is to simply remove this assertion entirely from the code. There is clearly a safe case on Linux which can occur where it's just wrong.

@chrisrd can you just drop the ASSERT from your code and make sure everything still works as expected? It's a very low risk change since the ASSERT is already disabled in the vast majority of builds. If everything works as expected for you, we'll just drop the ASSERT entirely from our tree.

diff --git a/module/zfs/zfs_vfsops.c b/module/zfs/zfs_vfsops.c
index 5c65b96..2c0e923 100644
--- a/module/zfs/zfs_vfsops.c
+++ b/module/zfs/zfs_vfsops.c
@@ -1144,10 +1144,8 @@ zfs_sb_teardown(zfs_sb_t *zsb, boolean_t unmounting)
        mutex_enter(&zsb->z_znodes_lock);
        for (zp = list_head(&zsb->z_all_znodes); zp != NULL;
            zp = list_next(&zsb->z_all_znodes, zp)) {
-               if (zp->z_sa_hdl) {
-                       ASSERT(atomic_read(&ZTOI(zp)->i_count) > 0);
+               if (zp->z_sa_hdl)
                        zfs_znode_dmu_fini(zp);
-               }
        }
        mutex_exit(&zsb->z_znodes_lock);
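
For context on why this is so low risk: without --enable-debug, SPL compiles ASSERT away entirely. Schematically (this is not the exact SPL macro text):

/* Debug build: failed assertions produce the SPL PANIC seen above. */
#define ASSERT(x)   do { if (!(x)) PANIC("ASSERTION(" #x ") failed"); } while (0)

/* Non-debug build: the assertion compiles to nothing. */
#define ASSERT(x)   ((void)0)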
chrisrd commented 10 years ago

@behlendorf I agree about that window between i_count being decremented to 0 in iput() and the node being removed from z_all_znodes in zfs_inode_destroy(); however, that seems to be a pretty small window for me to be hitting so consistently. E.g. I tried:

--- a/module/zfs/zfs_vfsops.c
+++ b/module/zfs/zfs_vfsops.c
@@ -1145,7 +1145,11 @@ zfs_sb_teardown(zfs_sb_t *zsb, boolean_t unmounting)
    for (zp = list_head(&zsb->z_all_znodes); zp != NULL;
        zp = list_next(&zsb->z_all_znodes, zp)) {
        if (zp->z_sa_hdl) {
-           ASSERT(atomic_read(&ZTOI(zp)->i_count) > 0);
+           /* ASSERT(atomic_read(&ZTOI(zp)->i_count) > 0); */
+           if (atomic_read(&ZTOI(zp)->i_count) == 0) {
+               cmn_err(CE_WARN, "zfs_sb_teardown: i_count is zero for inode %lu",
+                   ZTOI(zp)->i_ino);
+           }
            zfs_znode_dmu_fini(zp);
        }
    }

It seems I can hit it every time, with:

# zfs create test/foo
# touch /test/foo/bar
# ls -lai /test/foo
total 7
 4 drwxr-xr-x 2 root root    3 Nov 25 11:28 .
27 drwxr-xr-x 4 root root 4096 Nov 25 11:28 ..
 7 -rw-r--r-- 1 root root    0 Nov 25 11:28 bar
# zfs snapshot test/foo@snap1
# zfs rollback test/foo@snap1
# tail /var/log/kern.log
Nov 25 11:28:37 b5 kernel: [419660.082881] SPLError: 14136:0:(spl-err.c:67:vcmn_err()) WARNING: zfs_sb_teardown: i_count is zero for inode 7

The inode number shows it's definitely not "the last vnode (mountpoint's vnode)" of the FreeBSD commit comment that's causing the problem here either: inode 7 is the bar file, not the mountpoint.

So whilst it certainly seems the ASSERT is bogus, I'm a bit uncomfortable that we don't really seem to have a handle on why.

Hmmm, looky here... I think we're returning from iput_final() before getting to the evict(), and hence we don't get to zfs_inode_destroy() where we remove the node from our z_all_znodes list:

static void iput_final(struct inode *inode)
{
    struct super_block *sb = inode->i_sb;
    const struct super_operations *op = inode->i_sb->s_op;
    int drop;

    if (op->drop_inode)
        drop = op->drop_inode(inode);
    else
        drop = generic_drop_inode(inode);

    if (!drop && (sb->s_flags & MS_ACTIVE)) {
        inode->i_state |= I_REFERENCED;
        inode_add_lru(inode);
        spin_unlock(&inode->i_lock);
        return;                          ### returning here?
    }

    ...

    evict(inode);
}
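
For reference, generic_drop_inode() in this era is just the following (assuming ZFS doesn't install its own drop_inode here), so a still-linked, still-hashed inode returns drop == 0 and takes the early-return path above while MS_ACTIVE is set:

static inline int generic_drop_inode(struct inode *inode)
{
    /* Keep (don't evict) a hashed inode that still has links. */
    return !inode->i_nlink || inode_unhashed(inode);
}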

Oh, here we go:

static int
zfs_ioc_rollback(const char *fsname, nvlist_t *args, nvlist_t *outnvl)
{
    ...
    if (get_zfs_sb(fsname, &zsb) == 0) {
        error = zfs_suspend_fs(zsb);     ### ASSERT hit in here
        ...
        deactivate_super(zsb->z_sb);     ### MS_ACTIVE unset here

where MS_ACTIVE is unset via:

deactivate_super
  deactivate_locked_super
    zpl_kill_sb           ### via fs->kill_sb(s) 
      kill_anon_super
        generic_shutdown_super

void generic_shutdown_super(struct super_block *sb)
{
        const struct super_operations *sop = sb->s_op;

        if (sb->s_root) {
            shrink_dcache_for_umount(sb);
            sync_filesystem(sb);
            sb->s_flags &= ~MS_ACTIVE;       ###

So then the question is, is the correct fix to remove the ASSERT, or unset MS_ACTIVE somewhere earlier when we're unmounting?

I think, given the window you've outlined would still be present, removing the ASSERT is the correct path - it's certainly the simplest!

behlendorf commented 10 years ago

@chrisrd Interesting, I see what you're saying. It looks to me like the simplest and safest thing to do for now is to just remove the assert. We now know of at least two call paths where the ASSERT is simply wrong. Longer term we should revisit this code and see if it can be refactored for Linux. For example, I'd love to eliminate the zsb->z_all_znodes list entirely; this information is already being tracked in the super block, so this list is redundant under Linux. It just leads to strange interactions like this with the VFS.
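
A conceptual sketch of that refactor (hypothetical; the VFS already tracks every inode on the per-superblock s_inodes list, though safely locking that list from a filesystem module is the awkward part, and may be part of why the private list exists):

/* Hypothetical: walk the VFS's own sb->s_inodes list instead of a
 * private z_all_znodes; locking for s_inodes is internal to the VFS. */
struct inode *ip;

list_for_each_entry(ip, &zsb->z_sb->s_inodes, i_sb_list) {
    znode_t *zp = ITOZ(ip);
    if (zp->z_sa_hdl)
        zfs_znode_dmu_fini(zp);
}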

behlendorf commented 10 years ago

@dweeezil @chrisrd I've removed the assertion from master. There are clearly additional improvements which can be made here, but this is a safe first step.