openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

EBUSY upon ZPOOL_EXPORT #1045

Closed stephane-chazelas closed 11 years ago

stephane-chazelas commented 12 years ago

I got a second occurrence of the issue described at http://thread.gmane.org/gmane.linux.file-systems.zfs.user/4661

I've been doing an "offsite backup" every week, whereby I zfs-send|zfs-recv a number of datasets from one zpool onto another zpool on a pair of hard drives (well, LUKS devices on top of hard drives). I do a zfs export and a luksClose before taking the drives offsite.
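Roughly, the weekly run looks like this (the snapshot, dataset and LUKS mapping names here are only illustrative, not my exact script):

cryptsetup luksOpen /dev/sdX1 offsite-05a        # one per backup drive
zpool import offsite-backup-05
zfs send -R main/servers/skywalker@weekly | zfs recv -u offsite-backup-05/main/servers/skywalker
zpool export offsite-backup-05
cryptsetup luksClose offsite-05a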

Today, for some reason, the zfs export fails with:

zpool export offsite-backup-05
cannot export 'offsite-backup-05': pool is busy

There is no zfs command running, nothing is mounted from there any more (zpool export managed that part; I checked /proc/mounts as well), nothing uses the zvols in there, and there is no loop device or anything. I've tried killall -STOP udevd in case udev was somehow accessing the devices while the export was trying to tidy them away.

I've captured a sysrq-t output, but I'm not sure what to look for to see what may be holding it.
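For the record, the checks and the sysrq capture amounted to roughly this (assuming the usual Linux device paths, e.g. /dev/zd* for the zvols):

grep offsite-backup-05 /proc/mounts    # nothing mounted
lsof /dev/zd*                          # nothing holding the zvols open
losetup -a                             # no loop devices set up on them
killall -STOP udevd                    # in case udev was poking at the devices
echo t > /proc/sysrq-trigger           # dump all task states/backtraces to the kernel log
dmesg > sysrq-t.txt                    # saved for later inspection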

Trying to "zfs mount -a" to see if I can mount it back, it says for every mount point:

filesystem 'offsite-backup-05/main/servers/skywalker/shadow_nbd/c' is already mounted
cannot mount 'offsite-backup-05/main/servers/skywalker/shadow_nbd/c': Resource temporarily unavailable

While "grep offsite-backup-05 /proc/mounts" returns nothing.

So there's something definitely going wrong there.

I can still read the zvols on there, though.

I have the zevents going to the console (zfs_zevent_console=1) and there has been nothing: no IO error, nothing at all. (I used to get a lot of oopses, but since upgrading the memory to 48GB it has been quite stable until now.)
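(For reference, that's just the zfs module parameter, set either persistently or at runtime, something along these lines:)

echo 1 > /sys/module/zfs/parameters/zfs_zevent_console
# or, persistently, in /etc/modprobe.d/zfs.conf:
options zfs zfs_zevent_console=1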

Before rebooting, I also tried to export the other zpool (the one I was zfs send'ing from) and got the same EBUSY error (successful umount but EBUSY upon the ioctl(ZPOOL_EXPORT), as for the other one).

I noticed (in top) arc_adapt taking 100% of one CPU. Running sysrq-l a few times showed it each time in:

Pid: 477, comm: arc_adapt Tainted: P           O 3.2.0-29-generic #46-Ubuntu Dell Inc. PowerEdge R515/03X0MN
RIP: 0010:[<ffffffff81179f4d>]  [<ffffffff81179f4d>] __put_super+0x6d/0x80
RSP: 0018:ffff8806470b5dc0  EFLAGS: 00000202
RAX: 0000000000000001 RBX: ffff880ab13b9c00 RCX: 0000000000000001
RDX: 000000000000bec5 RSI: ffff880653a41700 RDI: ffff880ab13b9c00
RBP: ffff8806470b5dd0 R08: ffff8806470b4000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff880ab13b9c00
R13: ffffffffa0214850 R14: ffffffff81f03c20 R15: ffffffffa01e7f20
FS:  00007f235cb4b700(0000) GS:ffff880c7f600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000207f6d0 CR3: 0000000001c05000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process arc_adapt (pid: 477, threadinfo ffff8806470b4000, task ffff880653a41700)
Stack:
 ffff8806470b5dd0 ffff880ab13b9c00 ffff8806470b5e20 ffffffff8117a0d7
 ffff8806470b5e38 ffff880ab13b9c68 ffffffffa02197e0 ffff8806470a5740
 0000000000000000 ffff8806470a5760 ffffffffa01e7f50 ffffffffffffffff
Call Trace:
 [<ffffffff8117a0d7>] iterate_supers_type+0xa7/0xe0
 [<ffffffffa01e7f50>] ? zpl_prune_sb+0x30/0x30 [zfs]
 [<ffffffffa01e7f8f>] zpl_prune_sbs+0x3f/0x50 [zfs]
 [<ffffffffa01489b1>] arc_adjust_meta+0x121/0x1e0 [zfs]
 [<ffffffffa0148a70>] ? arc_adjust_meta+0x1e0/0x1e0 [zfs]
 [<ffffffffa0148a70>] ? arc_adjust_meta+0x1e0/0x1e0 [zfs]
 [<ffffffffa0148ada>] arc_adapt_thread+0x6a/0xd0 [zfs]
 [<ffffffffa00830b8>] thread_generic_wrapper+0x78/0x90 [spl]
 [<ffffffffa0083040>] ? __thread_create+0x310/0x310 [spl]
 [<ffffffff81089fbc>] kthread+0x8c/0xa0
 [<ffffffff81664034>] kernel_thread_helper+0x4/0x10
 [<ffffffff81089f30>] ? flush_kthread_worker+0xa0/0xa0
 [<ffffffff81664030>] ? gs_change+0x13/0x13

In case that means something to anybody.

dechamps commented 12 years ago

Your call trace matches #861.

behlendorf commented 12 years ago

Right, this looks like a duplicate of #861.

stephane-chazelas commented 12 years ago

Well, it is different in that I don't get any "rcu_sched detected stall", the umount returns fine, and the export doesn't hang but returns EBUSY; still, they do look similar (and similar to #790).

Any recommendation on what I should try and do the next time it happens?

behlendorf commented 12 years ago

Once it happens there's nothing really which can be done. What needs to happen is for us to identify the exact flaw and see if/how it can be worked around and then properly fixed.