openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

kernel oops at zrl_add+0xc/0xa0 #2495

Closed: pipitone closed this issue 9 years ago

pipitone commented 10 years ago

The system was running fine up until the oops. Users tell me there were a few copy and tar operations going on at the time. NFS suddenly stopped responding. On the server, zfs list would hang, but zfs list tank would return.

Jul 14 13:25:23 kimsrv kernel: [84083.028139] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
Jul 14 13:25:23 kimsrv kernel: [84083.028144] IP: [<ffffffffa01f7a9c>] zrl_add+0xc/0xa0 [zfs]
Jul 14 13:25:23 kimsrv kernel: [84083.028189] PGD a1cb83067 PUD 9b9c5b067 PMD 0 
Jul 14 13:25:23 kimsrv kernel: [84083.028192] Oops: 0000 [#1] SMP 
Jul 14 13:25:23 kimsrv kernel: [84083.028195] Modules linked in: xfs libcrc32c nfsv3 pci_stub vboxpci(OF) vboxnetadp(OF) vboxnetflt(OF) vboxdrv(OF) autofs4 nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc bonding psmouse dcdbas gpio_ich lpc_ich i7core_edac edac_core serio_raw joydev wmi mac_hid acpi_power_meter ipmi_si lp parport zfs(POF) zcommon(POF) znvpair(POF) zavl(POF) zunicode(POF) spl(OF) ses enclosure hid_generic usbhid hid usb_storage megaraid_sas bnx2
Jul 14 13:25:23 kimsrv kernel: [84083.028222] CPU: 13 PID: 19607 Comm: perl Tainted: PF         IO 3.13.0-30-generic #55~precise1-Ubuntu
Jul 14 13:25:23 kimsrv kernel: [84083.028224] Hardware name: Dell Inc. PowerEdge R710/00NH4P, BIOS 6.4.0 07/23/2013
Jul 14 13:25:23 kimsrv kernel: [84083.028226] task: ffff8808f2db17f0 ti: ffff8807cda5c000 task.ti: ffff8807cda5c000
Jul 14 13:25:23 kimsrv kernel: [84083.028228] RIP: 0010:[<ffffffffa01f7a9c>]  [<ffffffffa01f7a9c>] zrl_add+0xc/0xa0 [zfs]
Jul 14 13:25:23 kimsrv kernel: [84083.028259] RSP: 0018:ffff8807cda5da68  EFLAGS: 00010213
Jul 14 13:25:23 kimsrv kernel: [84083.028260] RAX: 0000000000000005 RBX: ffff8807cf13db28 RCX: 000000000000002c
Jul 14 13:25:23 kimsrv kernel: [84083.028261] RDX: 0000000000000003 RSI: ffff8807cda5dcf8 RDI: 0000000000000000
Jul 14 13:25:23 kimsrv kernel: [84083.028263] RBP: ffff8807cda5da78 R08: 00000000000000c3 R09: 0000000000000000
Jul 14 13:25:23 kimsrv kernel: [84083.028264] R10: 0000000001980000 R11: 0000000000000013 R12: 0000000000000001
Jul 14 13:25:23 kimsrv kernel: [84083.028265] R13: 0000000000020000 R14: 0000000000004000 R15: 00000000000000a8
Jul 14 13:25:23 kimsrv kernel: [84083.028267] FS:  00007f05d38fd700(0000) GS:ffff88091bcc0000(0000) knlGS:0000000000000000
Jul 14 13:25:23 kimsrv kernel: [84083.028268] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 14 13:25:23 kimsrv kernel: [84083.028270] CR2: 0000000000000028 CR3: 000000113851b000 CR4: 00000000000007e0
Jul 14 13:25:23 kimsrv kernel: [84083.028271] Stack:
Jul 14 13:25:23 kimsrv kernel: [84083.028272]  ffff8807cf13db28 0000000000000001 ffff8807cda5da98 ffffffffa0154854
Jul 14 13:25:23 kimsrv kernel: [84083.028276]  ffff8807cf13db28 ffff88004c0f8fc0 ffff8807cda5dd68 ffffffffa0158f66
Jul 14 13:25:23 kimsrv kernel: [84083.028279]  0000000000000001 000000000000028a 0000000000a28000 0000000000004000
Jul 14 13:25:23 kimsrv kernel: [84083.028282] Call Trace:
Jul 14 13:25:23 kimsrv kernel: [84083.028300]  [<ffffffffa0154854>] dbuf_is_metadata+0x34/0x80 [zfs]
Jul 14 13:25:23 kimsrv kernel: [84083.028314]  [<ffffffffa0158f66>] __dbuf_stats_hash_table_data+0x1c6/0x410 [zfs]
Jul 14 13:25:23 kimsrv kernel: [84083.028330]  [<ffffffffa0159269>] dbuf_stats_hash_table_data+0xb9/0x130 [zfs]
Jul 14 13:25:23 kimsrv kernel: [84083.028342]  [<ffffffffa00b6299>] kstat_seq_show+0xc9/0x460 [spl]
Jul 14 13:25:23 kimsrv kernel: [84083.028349]  [<ffffffffa00b5e25>] ? kstat_seq_data_addr+0x185/0x1a0 [spl]
Jul 14 13:25:23 kimsrv kernel: [84083.028355]  [<ffffffffa00ab6cb>] ? vmem_free_debug+0x4b/0x150 [spl]
Jul 14 13:25:23 kimsrv kernel: [84083.028362]  [<ffffffffa00b5eb8>] ? kstat_seq_next+0x78/0x150 [spl]
Jul 14 13:25:23 kimsrv kernel: [84083.028367]  [<ffffffff811ed526>] seq_read+0x256/0x3e0
Jul 14 13:25:23 kimsrv kernel: [84083.028371]  [<ffffffff8122fdc3>] proc_reg_read+0x43/0x70
Jul 14 13:25:23 kimsrv kernel: [84083.028375]  [<ffffffff811ca02b>] vfs_read+0xab/0x180
Jul 14 13:25:23 kimsrv kernel: [84083.028378]  [<ffffffff811ca2b2>] SyS_read+0x52/0xa0
Jul 14 13:25:23 kimsrv kernel: [84083.028383]  [<ffffffff8176693f>] tracesys+0xe1/0xe6
Jul 14 13:25:23 kimsrv kernel: [84083.028384] Code: 66 66 66 90 55 b8 ff ff ff ff 48 89 e5 f0 0f c1 47 28 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 <8b> 77 28 48 89 fb 83 fe ff 74 39 8d 7e 01 48 8d 4b 28 89 f0 f0
Jul 14 13:25:23 kimsrv kernel: [84083.028409] RIP  [<ffffffffa01f7a9c>] zrl_add+0xc/0xa0 [zfs]
Jul 14 13:25:23 kimsrv kernel: [84083.028438]  RSP <ffff8807cda5da68>
Jul 14 13:25:23 kimsrv kernel: [84083.028439] CR2: 0000000000000028
Jul 14 13:25:23 kimsrv kernel: [84083.028441] ---[ end trace 9dc31912869fdd8f ]---

Running Ubuntu 12.04.4 LTS with ECC RAM.

$ uname -a
Linux kandel 3.13.0-30-generic #55~precise1-Ubuntu SMP Fri Jul 4 21:52:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

$ dmesg | grep '0\.6\.3'
[   22.347980] SPL: Loaded module v0.6.3-1~precise
[   22.604470] ZFS: Loaded module v0.6.3-2~precise, ZFS pool version 5000, ZFS filesystem version 5

$ cat /etc/modprobe.d/zfs.conf 
options zfs zfs_arc_min=16106127360
options zfs zfs_arc_max=64424509440

I'm running with two pools:

$ sudo zpool status -v
  pool: local
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    local       ONLINE       0     0     0
      sdd       ONLINE       0     0     0
    logs
      mirror-1  ONLINE       0     0     0
        sdr     ONLINE       0     0     0
        sds     ONLINE       0     0     0
    cache
      sde1      ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sdh     ONLINE       0     0     0
        sdi     ONLINE       0     0     0
        sdj     ONLINE       0     0     0
        sdk     ONLINE       0     0     0
      raidz2-1  ONLINE       0     0     0
        sdl     ONLINE       0     0     0
        sdm     ONLINE       0     0     0
        sdn     ONLINE       0     0     0
        sdo     ONLINE       0     0     0
        sdp     ONLINE       0     0     0
        sdq     ONLINE       0     0     0
    cache
      sdb       ONLINE       0     0     0

errors: No known data errors
dweeezil commented 10 years ago

The stack trace appears to be bogus following kstat_seq_show(). The perl program must have been reading something from /proc/spl/kstat/zfs; do you know which items it might have been reading?

pipitone commented 10 years ago

Ah. I was running ./arcstat.pl -f hit%,miss%,read,l2hit%,l2miss%,l2read,ph%,pm%,mm% 1 at the time, so that could be it.

dweeezil commented 10 years ago

From the looks of it, it was only reading /proc/spl/kstat/zfs/arcstats. Has this ever happened before? If it happens again, try to get a stack trace of all processes on the system. I just tried a trivial reproducer and couldn't get it to cause any oopses.
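
(For reference, the kind of trivial reproducer meant here is just something that re-reads a kstat file in a tight loop while the pool is busy. A minimal sketch, not the exact program used, might look like the following; the file path and buffer size are just illustrative.)

/*
 * Minimal reproducer sketch: re-read a ZFS kstat file in a tight loop
 * (arcstats here; point it at /proc/spl/kstat/zfs/dbufs to exercise the
 * dbuf stats path instead).  Hypothetical illustration only.
 */
#include <stdio.h>

int
main(void)
{
	char buf[65536];
	FILE *fp;

	for (;;) {
		fp = fopen("/proc/spl/kstat/zfs/arcstats", "r");
		if (fp == NULL) {
			perror("fopen");
			return (1);
		}
		while (fread(buf, 1, sizeof (buf), fp) > 0)
			;	/* discard the data; only the read path matters */
		fclose(fp);
	}
	/* not reached */
	return (0);
}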

chrisrd commented 10 years ago

@dweeezil "The stack trace appears to be bogus following the kstat_seq_show()."

Because you can't see how to get from kstat_seq_show() to dbuf_stats_hash_table_data()?

I think the missing piece of the puzzle is:

dbuf_stats_hash_table_init()
{
  ...
  kstat_set_raw_ops(ksp, dbuf_stats_hash_table_headers,
    dbuf_stats_hash_table_data, dbuf_stats_hash_table_addr);
  ...
}

I.e. ksp->ks_raw_ops.data() == dbuf_stats_hash_table_data(), giving us:

kstat_seq_show()
{
  ...
  rc = ksp->ks_raw_ops.data(
    ksp->ks_raw_buf, ksp->ks_raw_bufsize, p);
  ...
}

=> 

kstat_seq_show()
{
  ...
  rc = dbuf_stats_hash_table_data(
    ksp->ks_raw_buf, ksp->ks_raw_bufsize, p);
  ...
}

...which explains the call trace.
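
(For completeness, the SPL hooks that wire this up look roughly like the following; this is paraphrased from the 0.6.x-era SPL kstat header, so treat the exact declarations as approximate.)

/* Approximate declarations, paraphrased from the SPL's sys/kstat.h. */
typedef struct kstat_raw_ops {
	int (*headers)(char *buf, size_t size);
	int (*data)(char *buf, size_t size, void *data);
	void *(*addr)(kstat_t *ksp, loff_t index);
} kstat_raw_ops_t;

/*
 * Installs the raw callbacks on a KSTAT_TYPE_RAW kstat.  Once set,
 * kstat_seq_show() dispatches through ksp->ks_raw_ops.data() instead of
 * its generic per-type formatter, which is how dbuf_stats_hash_table_data()
 * ends up on the stack under a /proc read.
 */
extern void kstat_set_raw_ops(kstat_t *ksp,
    int (*headers)(char *buf, size_t size),
    int (*data)(char *buf, size_t size, void *data),
    void *(*addr)(kstat_t *ksp, loff_t index));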

dweeezil commented 10 years ago

@chrisrd Yes, I guess I didn't look very closely. Are you running with spl_kmem_cache_slab_limit=16384 (or set to anything > 0)?

pipitone commented 10 years ago

@chrisrd The only module settings I've made are zfs_arc_min/max (listed above).

dweeezil commented 10 years ago

@pipitone OK, some people would put the spl settings in spl.conf. It looks like the dbuf's db_dnode_handle is NULL. I gather the script was reading /proc/spl/kstat/zfs/dbufs. Were you also running dbufstat.py?

In any case, I've got to wonder if simply skipping those dbufs would be a sufficient fix. Presumably the dbuf had been dbuf_clear()'d.
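
For reference, that lines up with the fault address in the oops. dbuf_is_metadata() enters the dnode through the dbuf's dnode handle, roughly like this (paraphrased from the 0.6.3-era sources, so the exact code is approximate):

boolean_t
dbuf_is_metadata(dmu_buf_impl_t *db)
{
	boolean_t is_metadata;

	if (db->db_level > 0)
		return (B_TRUE);

	/* DB_DNODE_ENTER(db) expands to zrl_add(&db->db_dnode_handle->dnh_zrlock) */
	DB_DNODE_ENTER(db);
	is_metadata = DMU_OT_IS_METADATA(DB_DNODE(db)->dn_type);
	DB_DNODE_EXIT(db);

	return (is_metadata);
}

With db->db_dnode_handle == NULL, zrl_add() gets handed a (near-)NULL pointer, and its first load sits at offset 0x28 on this build (the 8b 77 28 bytes at the faulting RIP decode to mov 0x28(%rdi),%esi), so the access faults at 0x28: exactly RDI = 0 and CR2 = 0000000000000028 in the oops above.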

pipitone commented 10 years ago

@dweeezil I wasn't running dbufstat.py, no. Just arcstat.pl.

So, does any of this explain why ZFS/NFS stopped responding?

behlendorf commented 10 years ago

@dweeezil I agree: for this to have happened, db->db_dnode_handle must be NULL, and something on the system must have been reading /proc/spl/kstat/zfs/dbufs. It also looks like db->db_dnode_handle can be set to NULL outside the protection of db->db_mtx; see dbuf_do_evict->dbuf_destroy. I think we should skip any dbuf in the DB_EVICTING state, since this flag is set under the lock.

diff --git a/module/zfs/dbuf_stats.c b/module/zfs/dbuf_stats.c
index 0cad9ef..b95f254 100644
--- a/module/zfs/dbuf_stats.c
+++ b/module/zfs/dbuf_stats.c
@@ -147,6 +147,12 @@ dbuf_stats_hash_table_data(char *buf, size_t size, void *data)
                }

                mutex_enter(&db->db_mtx);
+
+               if (db->db_state == DB_EVICTING) {
+                       mutex_exit(&db->db_mtx);
+                       continue;
+               }
+
                mutex_exit(DBUF_HASH_MUTEX(h, dsh->idx));

                length = __dbuf_stats_hash_table_data(buf, size, db);

@pipitone Yes, basically you hit a kernel panic. For now, just make sure nothing on your system is accessing /proc/spl/kstat/zfs/dbufs (including monitoring software) and it shouldn't happen again. In the meantime we'll fix this race/bug for the next release.