openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Linux 4.2.5, ZFS 0.6.5.3, NULL pointer dereference in dsl_deadlist_insert #4086

Closed. OrdinaryMagician closed this issue 8 years ago.

OrdinaryMagician commented 8 years ago

Today I was just installing system updates as usual when all of a sudden my entire system froze. Naturally I rebooted, and when I came back I was greeted by a kernel oops (which I couldn't do anything about, because it interrupted the boot process entirely). I decided to boot into SystemRescueCd and check it out, and noticed the oops would show up again if I tried to mount my root filesystem. I lost two weeks of work rolling back to a snapshot, then tried again. The same thing happened.

I've managed to isolate the oops message, though, and saved it to a file on a USB stick.

[  689.566502] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[  689.566510] IP: [<ffffffffa02391c4>] dsl_deadlist_insert+0x164/0x360 [zfs]
[  689.566537] PGD c0aeb067 PUD c0afd067 PMD 0 
[  689.566542] Oops: 0000 [#1] PREEMPT SMP 
[  689.566546] Modules linked in: nvidia_modeset(PO) iptable_filter ip_tables x_tables sha256_ssse3 sha256_generic hmac drbg ansi_cprng ctr ccm arc4 nvidia(PO) rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 cfg80211 crc_ccitt rfkill iTCO_wdt coretemp snd_hda_codec_hdmi iTCO_vendor_support kvm_intel kvm ppdev xpad loop ff_memless drm snd_hda_codec_realtek psmouse input_leds snd_hda_codec_generic fuse serio_raw pcspkr atl1c joydev led_class snd_hda_intel evdev lpc_ich snd_hda_codec mousedev snd_hda_core mac_hid snd_aloop snd_hwdep intel_agp parport_pc asus_atk0110 rng_core intel_gtt parport shpchp snd_pcm acpi_cpufreq snd_timer snd soundcore processor button uas usb_storage hid_generic usbhid hid zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) sr_mod cdrom sd_mod ata_generic pata_acpi atkbd
[  689.566603]  libps2 ata_piix libata scsi_mod ehci_pci xhci_pci uhci_hcd xhci_hcd ehci_hcd usbcore usb_common i8042 serio
[  689.566603] CPU: 2 PID: 1187 Comm: txg_sync Tainted: P           O    4.2.5-1-ARCH #1
[  689.566603] Hardware name: System manufacturer System Product Name/P5G41T-M LX, BIOS 1101    06/08/2012
[  689.566603] task: ffff8800da164c80 ti: ffff880213750000 task.ti: ffff880213750000
[  689.566603] RIP: 0010:[<ffffffffa02391c4>]  [<ffffffffa02391c4>] dsl_deadlist_insert+0x164/0x360 [zfs]
[  689.566603] RSP: 0018:ffff880213753738  EFLAGS: 00010246
[  689.566603] RAX: 0000000000000000 RBX: ffff880214f060f0 RCX: 0000000000000001
[  689.566603] RDX: 0000000000000000 RSI: ffff88021392ae40 RDI: ffff880213959000
[  689.566603] RBP: ffff880213753818 R08: 0000000000000000 R09: 0000000000000000
[  689.566603] R10: ffffea0002841e00 R11: ffff8800a10769a0 R12: ffff8800a1bac040
[  689.566603] R13: ffff8800c9b79a80 R14: 0000000000000000 R15: ffff880214f06168
[  689.566603] FS:  0000000000000000(0000) GS:ffff88021fd00000(0000) knlGS:0000000000000000
[  689.566603] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  689.566603] CR2: 0000000000000060 CR3: 00000000d815e000 CR4: 00000000000406e0
[  689.566603] Stack:
[  689.566603]  ffff880213753768 00000001801b0019 ffff8800da164c80 ffff8800da164c80
[  689.566603]  ffff8802137537e8 ffffffffa0213787 ffff8800bbb9eb90 0000000000000000
[  689.566603]  00000000137537a8 ffff8802138a5600 ffff8802137537a8 ffff8800db499c00
[  689.566603] Call Trace:
[  689.566603]  [<ffffffffa0213787>] ? dbuf_dirty+0x547/0x9b0 [zfs]
[  689.566603]  [<ffffffffa023058d>] dsl_dataset_block_kill+0xed/0x490 [zfs]
[  689.566603]  [<ffffffffa022eacf>] free_blocks+0x10f/0x1d0 [zfs]
[  689.566603]  [<ffffffffa01364c2>] ? spl_kmem_cache_free+0x142/0x1c0 [spl]
[  689.566603]  [<ffffffffa022f10e>] dnode_sync_free_range+0x25e/0x2c0 [zfs]
[  689.566603]  [<ffffffffa022eeb0>] ? free_children+0x320/0x320 [zfs]
[  689.566603]  [<ffffffffa025263c>] range_tree_vacate+0x5c/0xe0 [zfs]
[  689.566603]  [<ffffffffa022f829>] dnode_sync+0x289/0x930 [zfs]
[  689.566603]  [<ffffffffa0214ade>] ? dbuf_sync_list+0xce/0x100 [zfs]
[  689.566603]  [<ffffffffa022f8a6>] ? dnode_sync+0x306/0x930 [zfs]
[  689.566603]  [<ffffffffa021f27e>] dmu_objset_sync_dnodes+0xce/0xf0 [zfs]
[  689.566603]  [<ffffffffa021f45b>] dmu_objset_sync+0x1bb/0x320 [zfs]
[  689.566603]  [<ffffffffa021d8c0>] ? recordsize_changed_cb+0x20/0x20 [zfs]
[  689.566603]  [<ffffffffa021f5c0>] ? dmu_objset_sync+0x320/0x320 [zfs]
[  689.566603]  [<ffffffffa02381a2>] dsl_dataset_sync+0x52/0xa0 [zfs]
[  689.566603]  [<ffffffffa024144f>] dsl_pool_sync+0x9f/0x430 [zfs]
[  689.566603]  [<ffffffffa025cfa9>] spa_sync+0x369/0xb30 [zfs]
[  689.566603]  [<ffffffff810b4c96>] ? autoremove_wake_function+0x16/0x40
[  689.566603]  [<ffffffffa026e5ae>] txg_sync_thread+0x3be/0x630 [zfs]
[  689.566603]  [<ffffffff810a9071>] ? put_prev_entity+0x31/0x420
[  689.566603]  [<ffffffffa026e1f0>] ? txg_delay+0x180/0x180 [zfs]
[  689.566603]  [<ffffffffa0136ed1>] thread_generic_wrapper+0x71/0x80 [spl]
[  689.566603]  [<ffffffffa0136e60>] ? __thread_exit+0x20/0x20 [spl]
[  689.566603]  [<ffffffff81092578>] kthread+0xd8/0xf0
[  689.566603]  [<ffffffff810924a0>] ? kthread_worker_fn+0x170/0x170
[  689.566603]  [<ffffffff8157665f>] ret_from_fork+0x3f/0x70
[  689.566603]  [<ffffffff810924a0>] ? kthread_worker_fn+0x170/0x170
[  689.566603] Code: 4c 89 f7 48 89 85 58 ff ff ff e8 48 cf e4 ff 31 d2 48 85 c0 0f 84 9d 01 00 00 4c 89 f7 48 89 c6 e8 42 ce e4 ff 49 89 c6 48 8b 3b <4d> 8b 7e 60 e8 d3 77 fe ff 4c 3b b8 40 01 00 00 0f 84 c8 00 00 
[  689.566603] RIP  [<ffffffffa02391c4>] dsl_deadlist_insert+0x164/0x360 [zfs]
[  689.566603]  RSP <ffff880213753738>
[  689.566603] CR2: 0000000000000060
[  689.566603] ---[ end trace dfb2e660cce2f45b ]---

It will happen every single time. During the updates it suddenly gets stuck after a few packages, and most running programs just freeze. I've scrubbed the pool just in case, but there are no errors whatsoever.

I will help in any way I can to have this fixed. My only computer is now unusable due to this issue.

behlendorf commented 8 years ago

@OrdinaryMagician I'm sorry to hear you've run into problems. The good news is that, based on the backtrace and the fact that you were able to import the pool, it appears to be intact. But before debugging this further I suggest you make a backup of your data by booting into a rescue image and importing the pool read-only. You should be able to copy all of your data off somewhere safe.

zpool import -o readonly=on <pool>
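For example, once the pool is imported read-only, existing snapshots can be streamed off with zfs send, or the mounted filesystems copied with rsync. A rough sketch only; the pool name "tank", the snapshot name, and the destination paths are placeholders:

$ sudo zpool import -o readonly=on tank

# stream an existing recursive snapshot to a file on an external disk
$ sudo zfs send -R tank@some-existing-snap > /mnt/external/tank-backup.zfs

# or simply copy the read-only mounted filesystems
$ sudo rsync -aHAX /tank/ /mnt/external/tank-copy/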

Once that's done you can either just recreate the pool and copy the data back, or we can debug this further by building a debug version of ZFS and enabling some additional logging.

To build a debug version you'll want to start from source as follows (roughly).

$ git clone https://github.com/zfsonlinux/spl.git
$ git clone https://github.com/zfsonlinux/zfs.git

$ cd spl
$ ./autogen.sh
$ ./configure --enable-debug
$ make -s -j$(nproc)

$ cd ../zfs
$ ./autogen.sh
$ ./configure --enable-debug --with-spl=<path to spl> --with-spl-obj=<path to spl>
$ make -s -j$(nproc)

At this point you'll be able to load the debug modules by running scripts/zfs.sh in the ZFS source tree. You'll also be able to execute the zfs, zpool, and zdb commands directly from the ZFS source tree. So go ahead and load the debug modules and then import the pool. This time you may see a VERIFY error on the console before it panics, which should give us some additional information, and we can go from there, adding logging as needed.
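Roughly, that workflow looks like this (a sketch; the in-tree paths are approximate for a 0.6.x build and may differ in your checkout):

# from the top of the zfs source tree
$ sudo ./scripts/zfs.sh              # load the freshly built debug spl/zfs modules
$ sudo ./cmd/zpool/zpool import <pool>
$ sudo ./cmd/zdb/zdb -u <pool>       # print the uberblock, including the current txg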

tuxoko commented 8 years ago

The crash point is dsl_deadlist_insert -> dle_enqueue but dle is NULL. Though, I don't know how to interpret it. Maybe bp->blk_birth is corrupted?

OrdinaryMagician commented 8 years ago

It's OK, I already sent all my snapshots off to an external disk, so everything should be fine for now. I'll begin building the debug modules; hopefully SystemRescueCd has all the tools I need.

Hm... unfortunately no, it does not. It's going to take me a while to build some sort of live CD with all I need, but I'll still do it.

ryao commented 8 years ago

@OrdinaryMagician You can grab a modified System Rescue CD with ZFS support from the "System Rescue CD w/ ZFS Mirrors" links on this page:

https://wiki.gentoo.org/wiki/User:Fearedbliss

OrdinaryMagician commented 8 years ago

I already have one of those; unfortunately, it does not come with anything I can use to build the debug modules.

I tried building the whole thing on the desktop, then regenerating the initial ramdisk and checking what happens at boot. There are no messages on import before it hangs.

Before actually throwing in the towel and recreating the pool, I'm having someone build a custom systemrescuecd+zfs ISO, since they use Gentoo and it's easier to build one there.

ryao commented 8 years ago

@OrdinaryMagician Out of curiosity, does this system use ECC? If it does, we can rule out a really unlucky bit flip.

If you have backups, you can rebuild the pool from the backups, although it might be possible to do some analysis with zdb. I suggest talking to people in #zfsonlinux on the freenode IRC network. I hang out there, although I probably will not be free to assist until tomorrow.
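For reference, a few zdb invocations sometimes used for this kind of post-mortem (a sketch only; the pool name is a placeholder, and the -e flag is needed when the pool is not imported):

$ sudo zdb -e -u tank     # print the uberblock, including the current txg
$ sudo zdb -e -d tank     # list the pool's datasets
$ sudo zdb -e -bb tank    # traverse block pointers and report space usage (slow)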

behlendorf commented 8 years ago

The crash point is dsl_deadlist_insert -> dle_enqueue but dle is NULL. Though, I don't know how to interpret it. Maybe bp->blk_birth is corrupted?

Yes, very possibly bp->blk_birth is corrupted in a very specific way. If a bit were cleared in bp->blk_birth such that the txg birth were now lower than any other txg birth in the deadlist, then there would be no entry before this one in the tree. This is normally impossible because of how the deadlists are managed, so there's no error handling code here. But in the case of memory damage, or some other very specific damage, this could happen. And since it persists across reboots, that strongly suggests the block pointer is damaged. Plus zfs_blkptr_verify() wouldn't catch this, since there's no reliable way to check the blk_birth.

@OrdinaryMagician before you throw in the towel, could you apply the patch I've just posted in #4089 to your debug build and then reproduce the issue? It will log the damaged blk_birth value to the console so we can see what it is. Also please run zdb -u <pool-name>, which will report the current txg number for comparison.

You should then be able to prevent the issue by setting the zfs_recover=1 module option and importing your pool again. This will enable the new error handling code, which will allow the damaged blk_birth value to be used. This may result in a tiny bit of leaked space in your pool, but it should otherwise function fine, assuming only this one block pointer was damaged. You can check the console for messages like this to verify that only one block pointer was affected:

blkptr at X has invalid BLK_BIRTH Y
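For reference, enabling the recovery mode might look roughly like this (a sketch; the setting does not persist across reboots unless it is made permanent in the module options):

# set the option when loading the module...
$ sudo modprobe zfs zfs_recover=1

# ...or flip it at runtime if the module is already loaded
$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover

$ sudo zpool import <pool>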

@ryao @tuxoko can you please review this proposed change? As I mentioned above, by default there's no functional change aside from detecting and logging the error; the recovery path is only taken when zfs_recover=1 is set.

OrdinaryMagician commented 8 years ago

@ryao my system's pretty old, so I don't think it has ECC. It's actually not the first time a sudden bit flip has screwed everything up for me. I once had a bootloader stop working overnight, just from one byte suddenly having changed.

OrdinaryMagician commented 8 years ago

Ah, there we go.

PANIC: blkptr at ffff8801fc994040 has invalid BLK_BIRTH 0

With zfs_recover=1 set, I've re-imported the pool and I'm going to go through all the steps that triggered this whole mess again. Wish me luck.

Everything seems OK so far, no failures. It seems it was just that one block pointer after all.

behlendorf commented 8 years ago

@OrdinaryMagician glad to hear that it was just that one block pointer. Now that it's been successfully processed, you should be able to return to the distribution packages whenever you like. You won't need to set zfs_recover again.

OrdinaryMagician commented 8 years ago

Thank you. Next computer I build, it'll surely have ECC memory.

mstromb commented 8 years ago

I have also encountered an oops in the same place. I don't have ECC memory either, so I could be the victim of an unlucky bit flip as well, but I thought I'd mention it so you have another data point. My pool generally seems to keep working, though a scrub will hang forever.

[20034.654416] Oops: 0000 [#1] SMP
[20034.654457] Modules linked in: pci_stub vboxpci(OX) vboxnetadp(OX) vboxnetflt(OX) vboxdrv(OX) ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables binfmt_misc dm_crypt x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm eeepc_wmi asus_wmi sparse_keymap crct10dif_pclmul crc32_pclmul aesni_intel dm_multipath scsi_dh aes_x86_64 snd_hda_codec_realtek lrw snd_hda_codec_hdmi gf128mul glue_helper ablk_helper cryptd snd_hda_intel snd_hda_codec snd_hwdep snd_pcm serio_raw snd_page_alloc snd_timer snd soundcore shpchp mei_me mei lpc_ich mac_hid parport_pc ppdev lp parport nls_iso8859_1 zfs(POX) zunicode(POX) zcommon(POX) znvpair(POX) spl(OX) zavl(POX) i915 i2c_algo_bit e1000e drm_kms_helper ahci ptp psmouse drm libahci pps_core video wmi
[20034.655567] CPU: 2 PID: 1486 Comm: txg_sync Tainted: P           OX 3.13.0-74-generic #118-Ubuntu
[20034.655659] Hardware name: ASUS All Series/H87I-PLUS, BIOS 1005 01/06/2014
[20034.655732] task: ffff880404b74800 ti: ffff88039c588000 task.ti: ffff88039c588000
[20034.655809] RIP: 0010:[<ffffffffa02ae191>]  [<ffffffffa02ae191>] dsl_deadlist_insert+0x161/0x370 [zfs]
[20034.655957] RSP: 0018:ffff88039c589890  EFLAGS: 00010246
[20034.656013] RAX: 0000000000000000 RBX: ffff8803ffe1b8f0 RCX: 0000000000000001
[20034.656086] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8804044bd000
[20034.656160] RBP: ffff88039c589950 R08: ffff88041fb15fa0 R09: ffff88040f401700
[20034.656233] R10: ffff88040f401700 R11: ffffffffa01d9b10 R12: ffff8800cb763240
[20034.656305] R13: ffff8803a022ad80 R14: 0000000000000000 R15: ffff8803ffe1b968
[20034.656379] FS:  0000000000000000(0000) GS:ffff88041fb00000(0000) knlGS:0000000000000000
[20034.656462] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20034.656521] CR2: 0000000000000060 CR3: 0000000002c0e000 CR4: 00000000001407e0
[20034.656595] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[20034.656668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[20034.656740] Stack:
[20034.656763]  ffff880035117a00 ffff880404938c00 ffffffff8172b392 ffff880405e78da0
[20034.656852]  ffff88039c589918 0000000000000000 0000000305e78da0 ffff880405e78ec8
[20034.656940]  ffffffffffff1c00 ffff8803a022ad80 ffff880405e78ea0 ffff880406b45af0
[20034.657028] Call Trace:
[20034.657062]  [<ffffffff8172b392>] ? mutex_lock+0x12/0x2f
[20034.657157]  [<ffffffffa0289316>] ? dmu_buf_will_dirty+0x56/0xa0 [zfs]
[20034.657274]  [<ffffffffa02a5c8b>] dsl_dataset_block_kill+0x24b/0x4c0 [zfs]
[20034.657392]  [<ffffffffa02a3e4f>] free_blocks+0x14f/0x270 [zfs]
[20034.657497]  [<ffffffffa02a468a>] dnode_sync_free_range+0x25a/0x2b0 [zfs]
[20034.657603]  [<ffffffffa0288cd3>] ? dbuf_dirty+0x6e3/0x9c0 [zfs]
[20034.657746]  [<ffffffffa0313cef>] ? zfs_space_delta_cb+0xaf/0x190 [zfs]
[20034.657859]  [<ffffffffa02a4430>] ? free_children+0x330/0x330 [zfs]
[20034.657983]  [<ffffffffa02c7746>] range_tree_vacate+0x56/0x110 [zfs]
[20034.658094]  [<ffffffffa02a4bf4>] dnode_sync+0x274/0x9a0 [zfs]
[20034.658191]  [<ffffffffa0289ded>] ? dbuf_sync_list+0xcd/0x100 [zfs]
[20034.658299]  [<ffffffffa02a4c71>] ? dnode_sync+0x2f1/0x9a0 [zfs]
[20034.658403]  [<ffffffffa0294452>] dmu_objset_sync_dnodes+0xd2/0xf0 [zfs]
[20034.658513]  [<ffffffffa029461c>] dmu_objset_sync+0x1ac/0x2f0 [zfs]
[20034.658617]  [<ffffffffa0292aa0>] ? recordsize_changed_cb+0x20/0x20 [zfs]
[20034.658726]  [<ffffffffa0294760>] ? dmu_objset_sync+0x2f0/0x2f0 [zfs]
[20034.658839]  [<ffffffffa02ad41e>] dsl_dataset_sync+0x4e/0x90 [zfs]
[20034.658954]  [<ffffffffa02b67a8>] dsl_pool_sync+0x98/0x470 [zfs]
[20034.659077]  [<ffffffffa02d12b5>] spa_sync+0x365/0xb20 [zfs]
[20034.659143]  [<ffffffff810ab472>] ? autoremove_wake_function+0x12/0x40
[20034.659216]  [<ffffffff8101b709>] ? read_tsc+0x9/0x20
[20034.659335]  [<ffffffffa02e3639>] txg_sync_thread+0x3b9/0x620 [zfs]
[20034.659464]  [<ffffffffa02e3280>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[20034.659548]  [<ffffffffa01dbe81>] thread_generic_wrapper+0x71/0x80 [spl]
[20034.659631]  [<ffffffffa01dbe10>] ? __thread_exit+0x20/0x20 [spl]
[20034.659698]  [<ffffffff8108b8a2>] kthread+0xd2/0xf0
[20034.659754]  [<ffffffff8108b7d0>] ? kthread_create_on_node+0x1c0/0x1c0
[20034.659825]  [<ffffffff817356a8>] ret_from_fork+0x58/0x90
[20034.659884]  [<ffffffff8108b7d0>] ? kthread_create_on_node+0x1c0/0x1c0
[20034.659951] Code: 4c 89 f7 48 89 85 68 ff ff ff e8 bb 4f d7 ff 31 d2 48 85 c0 0f 84 b0 01 00 00 4c 89 f7 48 89 c6 e8 75 4e d7 ff 49 89 c6 48 8b 3b <4d> 8b 7e 60 e8 66 6f fe ff 4c 3b b8 40 01 00 00 0f 84 d9 00 00
[20034.660326] RIP  [<ffffffffa02ae191>] dsl_deadlist_insert+0x161/0x370 [zfs]
[20034.660449]  RSP <ffff88039c589890>
[20034.660487] CR2: 0000000000000060
[20034.792087] ---[ end trace 7292a3293341304d ]---

Eventually my system completely hung waiting on I/O to the pool. @behlendorf's patch fixed my problem (although the --with-spl options to zfs's configure caused make to fail).