Closed: eqvinox closed this issue 7 years ago
Additional note: deduplication is disabled everywhere, but some snapshots/clones are in use as thin-provisioned VM containers.
@eqvinox has deduplication ever been enabled on this pool? For this crash to make sense it would need to have been enabled at some point in the past, because it crashed in the deduplication code. The crash also suggests that either the checksum type in the block pointer is invalid or there's an issue with the in-core deduplication table. Can you run zdb -DD rpool to dump the deduplication table stats?
@behlendorf let's put it like this: I've never consciously enabled it. And the pool is only a few weeks old.
# zdb -DD saspool
All DDTs are empty
Since I had to reboot the box, I've added a little bit of guard code around the section in question:
diff --git a/module/zfs/zio.c b/module/zfs/zio.c
index 0ba167c..0ee0986 100644
--- a/module/zfs/zio.c
+++ b/module/zfs/zio.c
@@ -2434,6 +2434,13 @@ zio_ddt_free(zio_t *zio)
ASSERT(BP_GET_DEDUP(bp));
ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
+ if (unlikely(!ddt)) {
+ printk(KERN_CRIT "PAN PAN PAN: zfs hit ddt==NULL in zio_ddt_free(%p), checksum=%d\n",
+ zio, (int)(BP_GET_CHECKSUM(bp)));
+ dump_stack();
+ return (ZIO_PIPELINE_CONTINUE);
+ }
+
ddt_enter(ddt);
freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
if (dde) {
I'll see if it triggers again. (For the record, I have no clue what the code does, this is just blindly adding a safety check.)
@eqvinox OK, that's interesting. I was going to suggest a debugging patch like this, but you beat me to it. Definitely let us know if you manage to trip the debugging.
@behlendorf well, I'm not sure my debug patch actually prints any useful information (such as where the taskq was scheduled from; I'm not sure that information is even available) - if you can improve on that - :)
For now my best guess is that it's somehow related to the large background copy happening onto a zfs that has sync=disabled while, at the same time, an fsync was issued on a file in another zfs; but this is based solely on my observation that it froze at the exact instant my mail client tried to send an e-mail.
@behlendorf Heh. I decided to resume my background copy and it just triggered again, except my debug code apparently did not work (I probably loaded the unmodified version of the kernel module by accident).
Trace looks the same:
[62078.332571] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a9
[62078.332602] IP: [<ffffffffc03143f3>] ddt_enter+0x3/0x30 [zfs]
[62078.332602] PGD 0
[62078.332604] Oops: 0000 [#1] SMP
[62078.332614] Modules linked in: xfs wp512 target_core_pscsi target_core_file target_core_iblock tcm_fc libfc tcm_qla2xxx tcm_loop iscsi_target_mod ib_srpt target_core_mod intel_powerclamp mei_me mei zfs(O) zunicode(O) zcommon(O) znvpair(O) zavl(O) spl(O) mlx4_en isci mlx4_core qla2xxx scsi_transport_fc scsi_tgt efivarfs
[62078.332617] CPU: 4 PID: 8116 Comm: z_fr_iss_3/3 Tainted: G O 3.14.22-grsec-00006-g3c1449f #12
[62078.332617] Hardware name: Supermicro X9DR7/E-(J)LN4F/X9DR7/E-(J)LN4F, BIOS 3.0 07/31/2013
[62078.332618] task: ffff8808aded95d0 ti: ffff8808aded9c40 task.ti: ffff8808aded9c40
[62078.332635] RIP: 0010:[<ffffffffc03143f3>] [<ffffffffc03143f3>] ddt_enter+0x3/0x30 [zfs]
[62078.332636] RSP: 0000:ffffc9073281bd10 EFLAGS: 00010246
[62078.332637] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
[62078.332637] RDX: 00000000006c6562 RSI: 0000000000000000 RDI: 0000000000000001
[62078.332638] RBP: ffffc9073281bd38 R08: 0000000000000000 R09: 0000000000000001
[62078.332638] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88072bdd0660
[62078.332639] R13: ffff88072bdd05f0 R14: ffff8808aded95d0 R15: 0000000000000000
[62078.332640] FS: 0000000000000000(0000) GS:ffff88103e600000(0000) knlGS:0000000000000000
[62078.332640] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[62078.332641] CR2: 00000000000000a9 CR3: 000000000dad7000 CR4: 00000000001607f0
[62078.332641] Stack:
[62078.332643] ffffc9073281bd38 ffffffffc03abd1c 0000000000000400 ffff88072bdd05f0
[62078.332644] 8000000000000000 ffffc9073281bd78 ffffffffc03acd13 ffff88072bdd0ba8
[62078.332646] ffff8808adf8fc00 ffff8808adf8fc10 ffff8808aded95d0 ffff88072bdd0ba8
[62078.332646] Call Trace:
[62078.332678] [<ffffffffc03abd1c>] ? zio_ddt_free+0x2c/0xb0 [zfs]
[62078.332706] [<ffffffffc03acd13>] zio_execute+0xc3/0x170 [zfs]
[62078.332714] [<ffffffffc021bfc9>] taskq_thread+0x229/0x4a0 [spl]
[62078.332718] [<ffffffff8d12dc70>] ? wake_up_state+0x20/0x20
[62078.332723] [<ffffffffc021bda0>] ? taskq_cancel_id+0x1e0/0x1e0 [spl]
[62078.332725] [<ffffffff8d11cb2d>] kthread+0xfd/0x120
[62078.332727] [<ffffffff8d11ca30>] ? kthread_create_on_node+0x250/0x250
[62078.332730] [<ffffffff8dabc464>] ret_from_fork+0x74/0xa0
[62078.332731] [<ffffffff8d11ca30>] ? kthread_create_on_node+0x250/0x250
[62078.332746] Code: 00 00 48 89 e5 48 89 d1 48 c1 e9 27 83 e1 01 75 07 48 c1 ea 28 0f b6 c2 48 8b 84 c7 28 13 00 00 5d 48 0f ba 2c 24 3f c3 55 31 f6 <48> 8b 87 a8 00 00 00 48 89 e5 53 48 89 fb e8 9a b0 79 cd 65 48
[62078.332763] RIP [<ffffffffc03143f3>] ddt_enter+0x3/0x30 [zfs]
[62078.332763] RSP <ffffc9073281bd10>
[62078.332763] CR2: 00000000000000a9
No userspace process got stuck this time, but the txg_sync kernel thread is now in D state.
My debug code didn't trigger because ddt is not actually NULL; the pointer is 1, and the faulting address is 1 + an offset of 0xa8 = 0xa9.
ddt_enter():
15d0: 55 push %rbp
15d1: 31 f6 xor %esi,%esi
15d3: 48 8b 87 a8 00 00 00 mov 0xa8(%rdi),%rax <--------------- rdi = 1
15da: 48 89 e5 mov %rsp,%rbp
15dd: 53 push %rbx
15de: 48 89 fb mov %rdi,%rbx
15e1: e8 00 00 00 00 callq 15e6 <ddt_enter+0x16>
So now the question is: where do we get "1" as a pointer from? Any ideas?
Here we go: (just restarted the rsync from external device onto saspool once again)
(checksum=62 looks wrong to me, access beyond end of array?)
[ 1192.362404] PAN PAN PAN: zfs hit ddt==0000000000000001 in zio_ddt_free(ffff880f30eccd30), spa=ffff881007834000, bp=ffff880f30eccda0, checksum=62
[ 1192.418661] dumping spa->spa_ddt contents:
[ 1192.443936] [0] = ffffc900580e1180
[ 1192.468581] [1] = ffffc900580dafc0
[ 1192.493581] [2] = ffffc900580d4e00
[ 1192.517426] [3] = ffffc900580cec40
[ 1192.539875] [4] = ffffc900580c8a80
[ 1192.561445] [5] = ffffc900580c28c0
[ 1192.582387] [6] = ffffc900580bc700
[ 1192.604568] [7] = ffffc900580b6540
[ 1192.625026] [8] = ffffc900580b0380
[ 1192.645097] [9] = ffffc900580aa1c0
[ 1192.664453] CPU: 3 PID: 2728 Comm: z_fr_iss_4/3 Tainted: G O 3.14.22-grsec-00006-g3c1449f #12
[ 1192.704817] Hardware name: Supermicro X9DR7/E-(J)LN4F/X9DR7/E-(J)LN4F, BIOS 3.0 07/31/2013
[ 1192.730445] ffff880f30eccda0 ffffc90057f43d08 ffffffffa4aa8533 000000000000000a
[ 1192.754065] ffffc90057f43d38 ffffffffc03b4ded 0000000000000400 ffff880f30eccd30
[ 1192.777266] 8000000000000000 ffff880fe2cdc170 ffffc90057f43d78 ffffffffc03b5d83
[ 1192.799630] Call Trace:
[ 1192.816880] [<ffffffffa4aa8533>] dump_stack+0x4d/0x6c
[ 1192.837149] [<ffffffffc03b4ded>] zio_ddt_free+0xfd/0x120 [zfs]
[ 1192.857746] [<ffffffffc03b5d83>] zio_execute+0xc3/0x170 [zfs]
[ 1192.877386] [<ffffffffc0224fc9>] taskq_thread+0x229/0x4a0 [spl]
[ 1192.897233] [<ffffffffa412dc70>] ? wake_up_state+0x20/0x20
[ 1192.917302] [<ffffffffc0224da0>] ? taskq_cancel_id+0x1e0/0x1e0 [spl]
[ 1192.937420] [<ffffffffa411cb2d>] kthread+0xfd/0x120
[ 1192.955425] [<ffffffffa411ca30>] ? kthread_create_on_node+0x250/0x250
[ 1192.974638] [<ffffffffa4abc464>] ret_from_fork+0x74/0xa0
[ 1192.992181] [<ffffffffa411ca30>] ? kthread_create_on_node+0x250/0x250
code used to get this:
@@ -2434,6 +2434,21 @@ zio_ddt_free(zio_t *zio)
ASSERT(BP_GET_DEDUP(bp));
ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
+ if (unlikely((unsigned long)ddt < 4096)) {
+ printk(KERN_CRIT "PAN PAN PAN: zfs hit ddt==%p in zio_ddt_free(%p), spa=%p, bp=%p, checksum=%x\n",
+ ddt, zio, spa, bp, (int)(BP_GET_CHECKSUM(bp)));
+ if (unlikely((unsigned long)spa < 4096)) {
+ printk(KERN_CRIT "can't dump spa->spa_ddt, pointer invalid!\n");
+ } else {
+ int c;
+ printk(KERN_CRIT "dumping spa->spa_ddt contents:\n");
+ for (c = 0; c < ZIO_CHECKSUM_FUNCTIONS; c++)
+ printk(KERN_CRIT " [%d] = %p\n", c, spa->spa_ddt[c]);
+ }
+ dump_stack();
+ return (ZIO_PIPELINE_CONTINUE);
+ }
+
ddt_enter(ddt);
@behlendorf I think that's it on my end. If I understand correctly, this is zfs tripping up because the checksum type in some bp (whatever that is) is stored wrong on disk?
I guess safety checks against invalid input need to be added in zfs so it doesn't trip up on this, and on the other side I'm wondering if I need to somehow fix the pool? Though I'm not entirely sure how I ended up with bad data in the first place... (Memory is ECC, and the disks are encrypted, so a random mis-read would result in a value much farther off than "62".)
@eqvinox it definitely looks like the block pointer (bp) has been damaged. It would be helpful if we could inspect the entire block pointer. Can you update your debugging patch to call the snprintf_blkptr function and then print the resulting buffer? This should give us an idea of what kind of block pointer we're dealing with and how exactly it's damaged. And hopefully it will show why we ended up in the dedup code.
In the meantime, it should be safe to handle the error as you've already done.
@behlendorf sorry for the delay, was otherwise busy...
I had already added a call to snprintf_blkptr, but then I looked at what it actually does:
checksum = zio_checksum_table[BP_GET_CHECKSUM(bp)].ci_name;
which, due to the same missing error checking, would cause another kernel oops.
So, no, I'm afraid I won't do that, and I think this is about as far as I can go on debugging this issue... sorry.
@eqvinox No problem, thanks for the additional debugging.
Closing as stale.
spl de2a22fcb3e7e8d7c4f695aeb0caf9cdc8f82c9a, zfs e82cdc3acc4afe52c3ddc3fe9e4f456fdef80f41, Linux 3.14.22 (+ grsecurity)
System has 64GB of ECC memory. pool25 is being decommissioned, so it's unlikely to have been the origin of the issue. The backing devices ("Cvdev-*") for ZFS are dm-crypt (LUKS) -> lvmcache (SSD) -> LVM -> SAS disks.
The problem occurred when trying to use mutt to send an e-mail while a lot of background copying was happening (from an external device onto a zfs in saspool with sync=disabled); one zvol is also exported through a qla2462 FC card in target mode. Several processes were left hanging in the kernel, among them the IMAP server accessing the mailspool. Accessing one particular subdirectory in the maildir hangs the userspace process (for example, ls /saspool/d/m/m/e hangs in a getdents() syscall).