BUG: unable to handle kernel paging request at ffffc90029404000

openzfs / zfs

OpenZFS on Linux and FreeBSD

https://openzfs.github.io/openzfs-docs

Other

10.64k stars 1.75k forks source link

BUG: unable to handle kernel paging request at ffffc90029404000 #5742

Closed hunbalazs closed 7 years ago

hunbalazs commented 7 years ago

System information

Type	Version/Name
Distribution Name	Ubuntu
Distribution Version	16.04
Linux Kernel	4.4.0-31-generic
Architecture	x86_64
ZFS Version	0.7.0-rc3
SPL Version	0.7.0-rc3

Describe the problem you're observing

Saved a pool with the same version (fully upgraded pool) zfs send -R -D -e -v rpool@beforesend. When I want to recv it back I get the following BUG

Describe how to reproduce the problem

zfs upgrade with 0.7.0-rc3 zfs send -R -D -e -v <pool>@<snap> zfs revc -v -s -F <pool>

Include any warning/errors/backtraces from the system logs

Feb  5 10:20:43 ubuntu kernel: [ 2953.988382] ZFS: Loaded module v0.7.0-rc3_72_g298ec40, ZFS pool version 5000, ZFS filesystem version 5
Feb  5 10:20:43 ubuntu kernel: [ 2954.846751] ZPIOS: Loaded module v0.7.0-rc3_72_g298ec40
Feb  5 10:21:43 ubuntu kernel: [ 3014.796064] SPL: using hostid 0x007f0101
Feb  5 10:23:43 ubuntu kernel: [ 3134.134301] BUG: unable to handle kernel paging request at ffffc90029404000
Feb  5 10:23:43 ubuntu kernel: [ 3134.134330] IP: [<ffffffff813ff224>] __memmove+0x24/0x1a0
Feb  5 10:23:43 ubuntu kernel: [ 3134.134342] PGD 2250a1067 PUD 2250a2067 PMD 15c02e067 PTE 0
Feb  5 10:23:43 ubuntu kernel: [ 3134.134354] Oops: 0000 [#1] SMP 
Feb  5 10:23:43 ubuntu kernel: [ 3134.134361] Modules linked in: zpios(OE) zfs(POE) zcommon(POE) zavl(POE) icp(POE) znvpair(POE) zunicode(POE) splat(OE) spl(OE) binfmt_misc ctr ccm drbg ansi_cprng dm_crypt algif_skcipher af_alg rfcomm bnep snd_hda_codec_hdmi arc4 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp iwldvm kvm_intel dell_wmi sparse_keymap mac80211 kvm dell_laptop dcdbas irqbypass dell_smm_hwmon crct10dif_pclmul crc32_pclmul aesni_intel btusb btrtl snd_hda_codec_idt btbcm btintel snd_hda_codec_generic bluetooth aes_x86_64 lrw iwlwifi gf128mul snd_hda_intel glue_helper ablk_helper snd_hda_codec joydev input_leds snd_seq_midi snd_seq_midi_event cfg80211 snd_hda_core cryptd serio_raw snd_hwdep snd_rawmidi snd_pcm lpc_ich snd_seq snd_seq_device snd_timer mei_me mei snd 8250_fintek soundcore shpchp mac_hid dell_smo8800 dell_rbtn parport_pc ppdev lp parport autofs4 overlay nls_utf8 isofs nls_iso8859_1 dm_mirror dm_region_hash dm_log uas usb_storage nouveau ahci mxm_wmi libahci i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect psmouse sysimgblt fb_sys_fops e1000e drm ptp wmi pps_core video fjes [last unloaded: spl]
Feb  5 10:23:43 ubuntu kernel: [ 3134.134611] CPU: 6 PID: 25398 Comm: receive_writer Tainted: P           OE   4.4.0-31-generic #50-Ubuntu
Feb  5 10:23:43 ubuntu kernel: [ 3134.134620] Hardware name: Dell Inc. Latitude E6530/0GMVN7, BIOS A17 08/19/2015
Feb  5 10:23:43 ubuntu kernel: [ 3134.134626] task: ffff8801fd97b700 ti: ffff88007f130000 task.ti: ffff88007f130000
Feb  5 10:23:43 ubuntu kernel: [ 3134.134635] RIP: 0010:[<ffffffff813ff224>]  [<ffffffff813ff224>] __memmove+0x24/0x1a0
Feb  5 10:23:43 ubuntu kernel: [ 3134.134644] RSP: 0018:ffff88007f133cc8  EFLAGS: 00010206
Feb  5 10:23:43 ubuntu kernel: [ 3134.134650] RAX: ffffc90014a3f000 RBX: ffff88005cdf5800 RCX: 0000000000015000
Feb  5 10:23:43 ubuntu kernel: [ 3134.134656] RDX: 0000000000020000 RSI: ffffc90029404000 RDI: ffffc90014a4a000
Feb  5 10:23:43 ubuntu kernel: [ 3134.134662] RBP: ffff88007f133d00 R08: 0000000000000001 R09: 00000000090f5000
Feb  5 10:23:43 ubuntu kernel: [ 3134.134668] R10: ffff88011e2dd080 R11: ffffffff81ccf1ea R12: ffffffffc0c89090
Feb  5 10:23:43 ubuntu kernel: [ 3134.134674] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Feb  5 10:23:43 ubuntu kernel: [ 3134.134680] FS:  0000000000000000(0000) GS:ffff88022dd80000(0000) knlGS:0000000000000000
Feb  5 10:23:43 ubuntu kernel: [ 3134.134687] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  5 10:23:43 ubuntu kernel: [ 3134.134692] CR2: ffffc90029404000 CR3: 00000001e339f000 CR4: 00000000001406e0
Feb  5 10:23:43 ubuntu kernel: [ 3134.134699] Stack:
Feb  5 10:23:43 ubuntu kernel: [ 3134.134702]  ffffffffc0b65db5 0000000000000000 000000027f133ce8 ffff88005cdf5800
Feb  5 10:23:43 ubuntu kernel: [ 3134.134714]  ffffffffc0c89090 0000000000000000 0000000000000000 ffff88007f133d48
Feb  5 10:23:43 ubuntu kernel: [ 3134.134726]  ffffffffc0b66118 ffff88011d3ff1c0 00000000f912e000 ffff88007f133e30
Feb  5 10:23:43 ubuntu kernel: [ 3134.134738] Call Trace:
Feb  5 10:23:43 ubuntu kernel: [ 3134.134803]  [<ffffffffc0b65db5>] ? __dbuf_hold_impl+0x2d5/0x590 [zfs]
Feb  5 10:23:43 ubuntu kernel: [ 3134.134827]  [<ffffffffc0b66118>] dbuf_hold_impl+0xa8/0xd0 [zfs]
Feb  5 10:23:43 ubuntu kernel: [ 3134.134851]  [<ffffffffc0b67676>] dbuf_hold+0x36/0x60 [zfs]
Feb  5 10:23:43 ubuntu kernel: [ 3134.134876]  [<ffffffffc0b6eea5>] dmu_buf_hold_noread+0x85/0xe0 [zfs]
Feb  5 10:23:43 ubuntu kernel: [ 3134.134904]  [<ffffffffc0b6ef2f>] dmu_buf_hold+0x2f/0x80 [zfs]
Feb  5 10:23:43 ubuntu kernel: [ 3134.134930]  [<ffffffffc0b79d25>] receive_writer_thread+0x425/0x6f0 [zfs]
Feb  5 10:23:43 ubuntu kernel: [ 3134.134963]  [<ffffffffc0b79905>] ? receive_writer_thread+0x5/0x6f0 [zfs]
Feb  5 10:23:43 ubuntu kernel: [ 3134.135001]  [<ffffffffc0b79900>] ? receive_freeobjects.isra.12+0xe0/0xe0 [zfs]
Feb  5 10:23:43 ubuntu kernel: [ 3134.135031]  [<ffffffffc07d74f1>] thread_generic_wrapper+0x71/0x80 [spl]
Feb  5 10:23:43 ubuntu kernel: [ 3134.135046]  [<ffffffffc07d7480>] ? __thread_exit+0x20/0x20 [spl]
Feb  5 10:23:43 ubuntu kernel: [ 3134.135059]  [<ffffffff810a0808>] kthread+0xd8/0xf0
Feb  5 10:23:43 ubuntu kernel: [ 3134.135066]  [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
Feb  5 10:23:43 ubuntu kernel: [ 3134.135075]  [<ffffffff8182decf>] ret_from_fork+0x3f/0x70
Feb  5 10:23:43 ubuntu kernel: [ 3134.135081]  [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
Feb  5 10:23:43 ubuntu kernel: [ 3134.135087] Code: 88 0c 17 88 0f c3 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20 
Feb  5 10:23:43 ubuntu kernel: [ 3134.135204] RIP  [<ffffffff813ff224>] __memmove+0x24/0x1a0
Feb  5 10:23:43 ubuntu kernel: [ 3134.135212]  RSP <ffff88007f133cc8>
Feb  5 10:23:43 ubuntu kernel: [ 3134.135216] CR2: ffffc90029404000
Feb  5 10:23:43 ubuntu kernel: [ 3134.144797] ---[ end trace 58627849f4392b17 ]---

gmelikov commented 7 years ago

Receiving stream from zfs send -D had a bug, which was fixed in https://github.com/zfsonlinux/zfs/commit/2e0e443ac40c7e825a02519a497328226bd866ff . You have 2 options: -zfs send without -D -use master branch

hunbalazs commented 7 years ago

@gmelikov is it send or receive related? I already tried master on the recv side, same bug ocurred at the same time.

gmelikov commented 7 years ago

@hunbalazs IIRC it's receive related. If it's not your case - try to send without -D first.

hunbalazs commented 7 years ago

I double checked, v0.7.0-rc3_72_g298ec40 is actually current master. I was ~~brave~~ foolish enough to delete the original pool so I can't try to send w/o -D flag, sorry :disappointed: .

Feb  5 18:35:51 ubuntu kernel: [  637.617914] ZFS: Loaded module v0.7.0-rc3_72_g298ec40, ZFS pool version 5000, ZFS filesystem version 5
Feb  5 18:35:53 ubuntu kernel: [  639.710317] ZPIOS: Loaded module v0.7.0-rc3_72_g298ec40
Feb  5 18:37:40 ubuntu kernel: [  746.773649] SPL: using hostid 0x00000000
Feb  5 18:40:36 ubuntu kernel: [  922.922338] BUG: unable to handle kernel paging request at ffffc9002eea6000
Feb  5 18:40:36 ubuntu kernel: [  922.922352] IP: [<ffffffff813ff224>] __memmove+0x24/0x1a0
Feb  5 18:40:36 ubuntu kernel: [  922.922360] PGD 2250a1067 PUD 2250a2067 PMD 17977c067 PTE 0
Feb  5 18:40:36 ubuntu kernel: [  922.922366] Oops: 0000 [#1] SMP 
Feb  5 18:40:36 ubuntu kernel: [  922.922370] Modules linked in: dm_crypt algif_skcipher af_alg zpios(OE) zfs(POE) zavl(POE) zcommon(POE) znvpair(POE) zunicode(POE) icp(POE) splat(OE) spl(OE) binfmt_misc drbg ansi_cprng ctr ccm rfcomm bnep snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel arc4 kvm iwldvm irqbypass mac80211 crct10dif_pclmul dell_laptop dell_wmi sparse_keymap dcdbas crc32_pclmul dell_smm_hwmon snd_hda_codec_idt snd_hda_codec_generic aesni_intel btusb btrtl btbcm btintel snd_hda_intel aes_x86_64 bluetooth lrw iwlwifi snd_hda_codec gf128mul glue_helper snd_seq_midi snd_seq_midi_event snd_hda_core ablk_helper snd_rawmidi snd_hwdep cryptd joydev input_leds cfg80211 serio_raw snd_seq snd_pcm snd_seq_device lpc_ich snd_timer snd mei_me mei soundcore shpchp 8250_fintek mac_hid dell_smo8800 dell_rbtn parport_pc ppdev lp parport autofs4 overlay nls_utf8 isofs nls_iso8859_1 dm_mirror dm_region_hash dm_log uas usb_storage nouveau mxm_wmi i2c_algo_bit ttm psmouse drm_kms_helper syscopyarea sysfillrect sysimgblt e1000e fb_sys_fops ahci drm libahci ptp pps_core wmi video fjes
Feb  5 18:40:36 ubuntu kernel: [  922.922513] CPU: 2 PID: 15386 Comm: receive_writer Tainted: P           OE   4.4.0-31-generic #50-Ubuntu
Feb  5 18:40:36 ubuntu kernel: [  922.922518] Hardware name: Dell Inc. Latitude E6530/0GMVN7, BIOS A17 08/19/2015
Feb  5 18:40:36 ubuntu kernel: [  922.922522] task: ffff8801fffc8000 ti: ffff88007db28000 task.ti: ffff88007db28000
Feb  5 18:40:36 ubuntu kernel: [  922.922525] RIP: 0010:[<ffffffff813ff224>]  [<ffffffff813ff224>] __memmove+0x24/0x1a0
Feb  5 18:40:36 ubuntu kernel: [  922.922531] RSP: 0018:ffff88007db2bcc8  EFLAGS: 00010206
Feb  5 18:40:36 ubuntu kernel: [  922.922534] RAX: ffffc9000ea00000 RBX: ffff88016e60e000 RCX: 0000000000002a00
Feb  5 18:40:36 ubuntu kernel: [  922.922537] RDX: 0000000000008a00 RSI: ffffc9002eea6000 RDI: ffffc9000ea06000
Feb  5 18:40:36 ubuntu kernel: [  922.922540] RBP: ffff88007db2bd00 R08: 0000000000000001 R09: 0000000000000000
Feb  5 18:40:36 ubuntu kernel: [  922.922543] R10: ffff88019777c580 R11: ffffffff81ccf1ea R12: ffffffffc0a9b090
Feb  5 18:40:36 ubuntu kernel: [  922.922547] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Feb  5 18:40:36 ubuntu kernel: [  922.922550] FS:  0000000000000000(0000) GS:ffff88022dc80000(0000) knlGS:0000000000000000
Feb  5 18:40:36 ubuntu kernel: [  922.922554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  5 18:40:36 ubuntu kernel: [  922.922557] CR2: ffffc9002eea6000 CR3: 0000000002e0a000 CR4: 00000000001406e0
Feb  5 18:40:36 ubuntu kernel: [  922.922560] Stack:
Feb  5 18:40:36 ubuntu kernel: [  922.922562]  ffffffffc0977db5 000000007db2bd60 000000027db2be30 ffff88016e60e000
Feb  5 18:40:36 ubuntu kernel: [  922.922570]  ffffffffc0a9b090 0000000000000000 0000000000000000 ffff88007db2bd48
Feb  5 18:40:36 ubuntu kernel: [  922.922575]  ffffffffc0978118 ffff880050329380 00000000f76f4800 ffff88007db2be30
Feb  5 18:40:36 ubuntu kernel: [  922.922581] Call Trace:
Feb  5 18:40:36 ubuntu kernel: [  922.922630]  [<ffffffffc0977db5>] ? __dbuf_hold_impl+0x2d5/0x590 [zfs]
Feb  5 18:40:36 ubuntu kernel: [  922.922651]  [<ffffffffc0978118>] dbuf_hold_impl+0xa8/0xd0 [zfs]
Feb  5 18:40:36 ubuntu kernel: [  922.922671]  [<ffffffffc0979676>] dbuf_hold+0x36/0x60 [zfs]
Feb  5 18:40:36 ubuntu kernel: [  922.922692]  [<ffffffffc0980ea5>] dmu_buf_hold_noread+0x85/0xe0 [zfs]
Feb  5 18:40:36 ubuntu kernel: [  922.922713]  [<ffffffffc0980f2f>] dmu_buf_hold+0x2f/0x80 [zfs]
Feb  5 18:40:36 ubuntu kernel: [  922.922733]  [<ffffffffc097ba8c>] ? bqueue_dequeue+0xbc/0xd0 [zfs]
Feb  5 18:40:36 ubuntu kernel: [  922.922757]  [<ffffffffc098bd25>] receive_writer_thread+0x425/0x6f0 [zfs]
Feb  5 18:40:36 ubuntu kernel: [  922.922767]  [<ffffffffc06b0e4a>] ? spl_kmem_free+0x2a/0x40 [spl]
Feb  5 18:40:36 ubuntu kernel: [  922.922795]  [<ffffffffc098b900>] ? receive_freeobjects.isra.12+0xe0/0xe0 [zfs]
Feb  5 18:40:36 ubuntu kernel: [  922.922805]  [<ffffffffc06b34f1>] thread_generic_wrapper+0x71/0x80 [spl]
Feb  5 18:40:36 ubuntu kernel: [  922.922813]  [<ffffffffc06b3480>] ? __thread_exit+0x20/0x20 [spl]
Feb  5 18:40:36 ubuntu kernel: [  922.922823]  [<ffffffff810a0808>] kthread+0xd8/0xf0
Feb  5 18:40:36 ubuntu kernel: [  922.922829]  [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
Feb  5 18:40:36 ubuntu kernel: [  922.922839]  [<ffffffff8182decf>] ret_from_fork+0x3f/0x70
Feb  5 18:40:36 ubuntu kernel: [  922.922843]  [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
Feb  5 18:40:36 ubuntu kernel: [  922.922846] Code: 88 0c 17 88 0f c3 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20 
Feb  5 18:40:36 ubuntu kernel: [  922.922917] RIP  [<ffffffff813ff224>] __memmove+0x24/0x1a0
Feb  5 18:40:36 ubuntu kernel: [  922.922924]  RSP <ffff88007db2bcc8>
Feb  5 18:40:36 ubuntu kernel: [  922.922927] CR2: ffffc9002eea6000
Feb  5 18:40:36 ubuntu kernel: [  922.932837] ---[ end trace f82349dba0ad7f3e ]---

loli10K commented 7 years ago

@hunbalazs do you still have the send stream lying around? It'd be interesting to see how the record that is causing the fault looks like, can you patch zstreamdump:

diff --git a/cmd/zstreamdump/zstreamdump.c b/cmd/zstreamdump/zstreamdump.c
index aa594b8..dc1f9d0 100644
--- a/cmd/zstreamdump/zstreamdump.c
+++ b/cmd/zstreamdump/zstreamdump.c
@@ -95,6 +95,8 @@ ssread(void *buf, size_t len, zio_cksum_t *cksum)
        if ((outlen = fread(buf, len, 1, send_stream)) == 0)
                return (0);

+       fwrite(buf, len, 1, stderr);
+
        if (do_cksum) {
                if (do_byteswap)
                        fletcher_4_incremental_byteswap(buf, len, cksum);

and then try to receive the send stream again with:

{ cat $streamfile | /path/to/patched/zstreamdump -vv 2>&1 1>&3 3>&- | cat > zfs recv $dest; } 3>&1 1>&2

I'm not sure it's going to work but if it's the only data remaining from the old pool it's probably worth a try.

hunbalazs commented 7 years ago

@loli10K thank you for the help I still have the stream lying on an external hard drive. I allied the patch to current master and built zfs, but I'm sure it wont work because cat > zfs just overwrites the zfs binary

loli10K commented 7 years ago

@hunbalazs hum, i'm not very good with the shell, maybe remove the "cat >" part? { cat $streamfile | /path/to/patched/zstreamdump -vv 2>&1 1>&3 3>&- | zfs recv $dest; } 3>&1 1>&2

The idea here is to "print" the stream data in human readable form (via zstreamdump) just before we receive it, hopefully the very last DRR printed is the one causing the fault. We try to do that with the shell by piping cat $streamfile to the patched zstreamdump which acts just as a "printing filter" before the actual receive.

hunbalazs commented 7 years ago

@loli10K I'm not a shell wizard either :smile: We got our little bad object:

WRITE object = 1740980 type = 19 checksum type = 8 compression type = 15
    offset = 0 logical_size = 15872 compressed_size = 8192 payload_size = 8192 props = f000f001e

no checksum at the end, as I can see other object have checksum after offset line

loli10K commented 7 years ago

@hunbalazs that doesn't look like anything to me. Unfortunately. I would have expected a WRITE_BYREF record to be the cause.

Can you share the stream that is causing the fault? The next step i'd take to troubleshoot something like this is to get kernel+zfs with debugging symbols, try some stap printf'ing around the functions dumped in the stacktrace while receiving the stream and/or attach a debugger and see what happens.

hunbalazs commented 7 years ago

@loli10K compiled everything again with --debug-* and imported everything without a hitch. Do you still need the backup, should I try something else to debug it further?

loli10K commented 7 years ago

@hunbalazs if you feel comfortable i'd encourage you to try and debug this issue on a live kernel by yourself (i think you can find a lot of tutorials online on how to do it, and hopefully one day on the ZoL wiki https://github.com/zfsonlinux/zfs/wiki/Debugging).

The way i usually do this for development is a QEMU box with a serial port via gdb, so if anything goes wrong i can just bounce it and restart the debugging session. Basically just attach gdb to the virtual machine, load all the symbols, setup some breakpoints (i'd start with receive_writer_thread and the other functions in the backtrace) and start receiving the backup stream: when you hit the breakpoint inspect the frame and look for suspicious variables that could lead to the issue. Rinse and repeat.

If you don't mind sharing the backup stream i'd also like to try and debug this issue. It's a real pity we lost the source pool, now we don't know if it's reproducible on the send side.

williamstein commented 7 years ago

I'm finding this very easy to reproduce using tag:zfs-0.7.2 built from source on Ubuntu 16.04.3. (And this bug itself is a huge nightmare for me, but that's besides the point.)

Create a pool with dedup and compression called a on a sparse image file (say).
Rsync /usr/bin/ for a bit into the pool (maybe around 100MB worth), then just do the same again to get about 5 copies of /usr/bin, called bin, bin2, bin3, bin4, etc.
Snapshot the pool: zfs snapshot a@0
Export the zpool using zfs send -cDR a@0 > a.zfs
Destroy and then input: zfs destroy -r a and then cat a.zfs | zfs recv -F -v a

Get this kernel crash and reboot of VM:

ct  5 23:00:49 kucalc-ws1-node-nepb systemd[1]: Started Session 95 of user cocalc.
Oct  5 23:00:50 kucalc-ws1-node-nepb systemd[1]: Started Session 96 of user cocalc.
[  466.243320] BUG: unable to handle kernel paging request at ffffa5030586a000
[  466.250428] IP: __memmove+0x24/0x1a0
[  466.254102] PGD 12b495067 
[  466.254103] PUD 12b496067 
[  466.256909] PMD 8eb8b067 
[  466.259733] PTE 0
[  466.262471] 
[  466.266086] Oops: 0000 [#1] SMP
[  466.269325] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache xt_set xt_multiport ipt
able_mangle iptable_raw ipip ip_set_hash_ip tunnel4 ip_set_hash_net ip_tunnel ip_set veth nf_conntrack_netlink nfne
tlink xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_comment xt_mark ipt_MASQUERADE nf_nat_ma
squerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack
nf_nat nf_conntrack br_netfilter bridge stp llc aufs zfs(POE) ip6table_filter zunicode(POE) ip6_tables zavl(POE) i
ptable_filter ip_tables icp(POE) x_tables zcommon(POE) znvpair(POE) spl(OE) ppdev input_leds serio_raw pvpanic parp
ort_pc binfmt_misc parport ib_iser rdma_cm iw_cm ib_cm ib_core sunrpc configfs iscsi_tcp libiscsi_tcp libiscsi scsi
_transport_iscsi
[  466.340918]  autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_p
q libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86
_64 crypto_simd glue_helper cryptd psmouse virtio_net virtio_scsi
[  466.366829] CPU: 0 PID: 14925 Comm: receive_writer Tainted: P           OE   4.10.0-35-generic #39~16.04.1-Ubunt
u
[  466.377188] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[  466.386507] task: ffff95e392ad9680 task.stack: ffffa50301e5c000
[  466.392531] RIP: 0010:__memmove+0x24/0x1a0
[  466.396729] RSP: 0018:ffffa50301e5fcd8 EFLAGS: 00010287
[  466.402053] RAX: ffffa50309292000 RBX: ffff95e382c72800 RCX: 000000000000b000
[  466.409287] RDX: 0000000000020000 RSI: ffffa5030586a000 RDI: ffffa503092a7000
[  466.416536] RBP: ffffa50301e5fd10 R08: ffffa50305875000 R09: 0000000000000000
[  466.423770] R10: ffff95e374b213e0 R11: ffffdcf201fcf820 R12: ffffffffc08fd2b0
[  466.431003] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  466.438238] FS:  0000000000000000(0000) GS:ffff95e42fc00000(0000) knlGS:0000000000000000
[  466.446428] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  466.452272] CR2: ffffa5030586a000 CR3: 0000000105135000 CR4: 00000000001406f0
[  466.459510] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  466.466745] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  466.473980] Call Trace:
[  466.476599]  ? __dbuf_hold_impl+0x4e9/0x640 [zfs]
[  466.481456]  dbuf_hold_impl+0xa8/0xd0 [zfs]
[  466.485779]  dbuf_hold+0x36/0x60 [zfs]
[  466.489668]  dmu_buf_hold_noread+0x85/0x100 [zfs]
[  466.494527]  dmu_buf_hold+0x2f/0x80 [zfs]
[  466.498710]  receive_writer_thread+0x5da/0xa00 [zfs]
[  466.503795]  ? set_next_entity+0xc3/0x1b0
[  466.507911]  ? set_curr_task_fair+0x2b/0x60
[  466.512207]  ? spl_kmem_free+0x33/0x40 [spl]
[  466.516620]  ? receive_freeobjects.isra.14+0xf0/0xf0 [zfs]
[  466.522214]  thread_generic_wrapper+0x72/0x80 [spl]
[  466.527197]  kthread+0x109/0x140
[  466.530545]  ? __thread_exit+0x20/0x20 [spl]
[  466.534920]  ? kthread_create_on_node+0x60/0x60
[  466.539575]  ret_from_fork+0x2c/0x40
[  466.543253] Code: 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 4
9 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20 
[  466.562236] RIP: __memmove+0x24/0x1a0 RSP: ffffa50301e5fcd8
[  466.567909] CR2: ffffa5030586a000
[  466.571325] ---[ end trace 9dbe9f086e2fcc44 ]---
[  466.576042] Kernel panic - not syncing: Fatal exception
[  466.582116] Kernel Offset: 0x19800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbf
ffffff)
[  466.593014] Rebooting in 10 seconds..
[  476.519578] ACPI MEMORY or I/O RESET_REG.

For me also using -L in step 4 makes absolutely no difference; it still crashes.

The exact same filesystem sent without -D recv's fine (everything as above but no D -- still compressed, etc.).

If you try the above and still can't reproduce the problem, ping me and I'll provide more help.

williamstein commented 7 years ago

One I'm seeing is I guess exactly the same as https://github.com/zfsonlinux/zfs/issues/6334

So please I guess disregard my above remark in favor of "I'm another person seeing the same thing as was just reported in #6334".

chrisrd commented 7 years ago

It's failing in memmove(), called from __dbuf_hold_impl(), which looks like:

static int
__dbuf_hold_impl(struct dbuf_hold_impl_data *dh)
{
...
                        bcopy(dh->dh_dr->dt.dl.dr_data->b_data,
                            dh->dh_db->db.db_data, dh->dh_db->db.db_size);
...
}

And judging from RDX: 0000000000020000, it's trying to copy 0x20000 bytes (== 128KB).

Copying a linear 128KB in kernel memory sounds off. Maybe something to do with the ABD changes, e.g. trying to do a linear copy on a non-linear buffer?

chrisrd commented 7 years ago

@williamstein are you able to provide a disassembly of your memmove? E.g. if you have your vmlinux file:

gdb -batch -ex 'file vmlinux' -ex 'disassemble /r memmove'

Reason being, I'm trying to work out what the actual args to the memmove are, but the my local memmove doesn't contain the Code from the crash dump:

echo '[  466.543253] Code: 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20' |
perl -ne '
  chomp;  s/.*Code: //;
  s/[<>]//g; printf("char Code[] = \"\\x%s\";\nint main(){}\n", join("\\x", split));
' > /tmp/tmp.c
$ gcc -g -o /tmp/tmp.exe /tmp/tmp.c && gdb -batch -ex "file /tmp/tmp.exe" -ex "disassemble/r Code"
Dump of assembler code for function Code:
   0x0000000000601040 <+0>:     90      nop     
   0x0000000000601041 <+1>:     90      nop     
   0x0000000000601042 <+2>:     90      nop
   0x0000000000601043 <+3>:     90      nop 
   0x0000000000601044 <+4>:     90      nop
   0x0000000000601045 <+5>:     90      nop
   0x0000000000601046 <+6>:     90      nop
   0x0000000000601047 <+7>:     48 89 f8        mov    %rdi,%rax                # %rdi:ffffa503092a7000 %rax:ffffa50309292000
   0x000000000060104a <+10>:    48 83 fa 20     cmp    $0x20,%rdx               # %rdx:0000000000020000
   0x000000000060104e <+14>:    0f 82 03 01 00 00       jb     0x601157
   0x0000000000601054 <+20>:    48 39 fe        cmp    %rdi,%rsi
   0x0000000000601057 <+23>:    7d 0f   jge    0x601068 <Code+40> 
   0x0000000000601059 <+25>:    49 89 f0        mov    %rsi,%r8
   0x000000000060105c <+28>:    49 01 d0        add    %rdx,%r8
   0x000000000060105f <+31>:    49 39 f8        cmp    %rdi,%r8
   0x0000000000601062 <+34>:    0f 8f 9f 00 00 00       jg     0x601107
   0x0000000000601068 <+40>:    48 89 d1        mov    %rdx,%rcx
   0x000000000060106b <+43>:    f3 a4   rep movsb %ds:(%rsi),%es:(%rdi)
   0x000000000060106d <+45>:    c3      retq    
   0x000000000060106e <+46>:    48 81 fa a8 02 00 00    cmp    $0x2a8,%rdx
   0x0000000000601075 <+53>:    72 05   jb     0x60107c <Code+60>
   0x0000000000601077 <+55>:    40 38 fe        cmp    %dil,%sil
   0x000000000060107a <+58>:    74 3b   je     0x6010b7
   0x000000000060107c <+60>:    48 83 ea 20     sub    $0x20,%rdx
   0x0000000000601080 <+64>:    00 00   add    %al,(%rax)
End of assembler dump.

haraldschilly commented 7 years ago

Hi @chrisrd I'm with @williamstein .... and I have to admit I barely know what I'm doing here. The kernel is from Ubuntu, and running this gdb command on the variant with debug symbols gives this:

$ gdb -batch -ex 'file vmlinux-4.10.0-35-generic' -ex 'disassemble /r memmove' > /tmp/vmlinux-4.10.0-35-generic-memmove.txt
$ cat /tmp/vmlinux-4.10.0-35-generic-memmove.txt
Dump of assembler code for function memmove:
   0xffffffff814585d0 <+0>:     48 89 f8        mov    %rdi,%rax
   0xffffffff814585d3 <+3>:     48 83 fa 20     cmp    $0x20,%rdx
   0xffffffff814585d7 <+7>:     0f 82 03 01 00 00       jb     0xffffffff814586e0 <memmove+272>
   0xffffffff814585dd <+13>:    48 39 fe        cmp    %rdi,%rsi
   0xffffffff814585e0 <+16>:    7d 0f   jge    0xffffffff814585f1 <memmove+33>
   0xffffffff814585e2 <+18>:    49 89 f0        mov    %rsi,%r8
   0xffffffff814585e5 <+21>:    49 01 d0        add    %rdx,%r8
   0xffffffff814585e8 <+24>:    49 39 f8        cmp    %rdi,%r8
   0xffffffff814585eb <+27>:    0f 8f 9f 00 00 00       jg     0xffffffff81458690 <memmove+192>
   0xffffffff814585f1 <+33>:    90      nop
   0xffffffff814585f2 <+34>:    90      nop
   0xffffffff814585f3 <+35>:    90      nop
   0xffffffff814585f4 <+36>:    90      nop
   0xffffffff814585f5 <+37>:    90      nop
   0xffffffff814585f6 <+38>:    90      nop
   0xffffffff814585f7 <+39>:    48 81 fa a8 02 00 00    cmp    $0x2a8,%rdx
   0xffffffff814585fe <+46>:    72 05   jb     0xffffffff81458605 <memmove+53>
   0xffffffff81458600 <+48>:    40 38 fe        cmp    %dil,%sil
   0xffffffff81458603 <+51>:    74 3b   je     0xffffffff81458640 <memmove+112>
   0xffffffff81458605 <+53>:    48 83 ea 20     sub    $0x20,%rdx
   0xffffffff81458609 <+57>:    48 83 ea 20     sub    $0x20,%rdx
   0xffffffff8145860d <+61>:    4c 8b 1e        mov    (%rsi),%r11
   0xffffffff81458610 <+64>:    4c 8b 56 08     mov    0x8(%rsi),%r10
   0xffffffff81458614 <+68>:    4c 8b 4e 10     mov    0x10(%rsi),%r9
   0xffffffff81458618 <+72>:    4c 8b 46 18     mov    0x18(%rsi),%r8
   0xffffffff8145861c <+76>:    48 8d 76 20     lea    0x20(%rsi),%rsi
   0xffffffff81458620 <+80>:    4c 89 1f        mov    %r11,(%rdi)
   0xffffffff81458623 <+83>:    4c 89 57 08     mov    %r10,0x8(%rdi)
   0xffffffff81458627 <+87>:    4c 89 4f 10     mov    %r9,0x10(%rdi)
   0xffffffff8145862b <+91>:    4c 89 47 18     mov    %r8,0x18(%rdi)
   0xffffffff8145862f <+95>:    48 8d 7f 20     lea    0x20(%rdi),%rdi
   0xffffffff81458633 <+99>:    73 d4   jae    0xffffffff81458609 <memmove+57>
   0xffffffff81458635 <+101>:   48 83 c2 20     add    $0x20,%rdx
   0xffffffff81458639 <+105>:   e9 a2 00 00 00  jmpq   0xffffffff814586e0 <memmove+272>
   0xffffffff8145863e <+110>:   66 90   xchg   %ax,%ax
   0xffffffff81458640 <+112>:   48 89 d1        mov    %rdx,%rcx
   0xffffffff81458643 <+115>:   4c 8b 5c 16 f8  mov    -0x8(%rsi,%rdx,1),%r11
   0xffffffff81458648 <+120>:   4c 8d 54 17 f8  lea    -0x8(%rdi,%rdx,1),%r10
   0xffffffff8145864d <+125>:   48 c1 e9 03     shr    $0x3,%rcx
   0xffffffff81458651 <+129>:   f3 48 a5        rep movsq %ds:(%rsi),%es:(%rdi)
   0xffffffff81458654 <+132>:   4d 89 1a        mov    %r11,(%r10)
   0xffffffff81458657 <+135>:   e9 0c 01 00 00  jmpq   0xffffffff81458768 <memmove+408>
   0xffffffff8145865c <+140>:   0f 1f 40 00     nopl   0x0(%rax)
   0xffffffff81458660 <+144>:   48 89 d1        mov    %rdx,%rcx
   0xffffffff81458663 <+147>:   4c 8b 1e        mov    (%rsi),%r11
   0xffffffff81458666 <+150>:   49 89 fa        mov    %rdi,%r10
   0xffffffff81458669 <+153>:   48 8d 74 16 f8  lea    -0x8(%rsi,%rdx,1),%rsi
   0xffffffff8145866e <+158>:   48 8d 7c 17 f8  lea    -0x8(%rdi,%rdx,1),%rdi
   0xffffffff81458673 <+163>:   48 c1 e9 03     shr    $0x3,%rcx
   0xffffffff81458677 <+167>:   fd      std
   0xffffffff81458678 <+168>:   f3 48 a5        rep movsq %ds:(%rsi),%es:(%rdi)
   0xffffffff8145867b <+171>:   fc      cld
   0xffffffff8145867c <+172>:   4d 89 1a        mov    %r11,(%r10)
   0xffffffff8145867f <+175>:   e9 e4 00 00 00  jmpq   0xffffffff81458768 <memmove+408>
   0xffffffff81458684 <+180>:   66 90   xchg   %ax,%ax
   0xffffffff81458686 <+182>:   66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)
   0xffffffff81458690 <+192>:   48 81 fa a8 02 00 00    cmp    $0x2a8,%rdx
   0xffffffff81458697 <+199>:   72 05   jb     0xffffffff8145869e <memmove+206>
   0xffffffff81458699 <+201>:   40 38 fe        cmp    %dil,%sil
   0xffffffff8145869c <+204>:   74 c2   je     0xffffffff81458660 <memmove+144>
   0xffffffff8145869e <+206>:   48 01 d6        add    %rdx,%rsi
   0xffffffff814586a1 <+209>:   48 01 d7        add    %rdx,%rdi
   0xffffffff814586a4 <+212>:   48 83 ea 20     sub    $0x20,%rdx
   0xffffffff814586a8 <+216>:   48 83 ea 20     sub    $0x20,%rdx
   0xffffffff814586ac <+220>:   4c 8b 5e f8     mov    -0x8(%rsi),%r11
   0xffffffff814586b0 <+224>:   4c 8b 56 f0     mov    -0x10(%rsi),%r10
   0xffffffff814586b4 <+228>:   4c 8b 4e e8     mov    -0x18(%rsi),%r9
   0xffffffff814586b8 <+232>:   4c 8b 46 e0     mov    -0x20(%rsi),%r8
   0xffffffff814586bc <+236>:   48 8d 76 e0     lea    -0x20(%rsi),%rsi
   0xffffffff814586c0 <+240>:   4c 89 5f f8     mov    %r11,-0x8(%rdi)
   0xffffffff814586c4 <+244>:   4c 89 57 f0     mov    %r10,-0x10(%rdi)
   0xffffffff814586c8 <+248>:   4c 89 4f e8     mov    %r9,-0x18(%rdi)
   0xffffffff814586cc <+252>:   4c 89 47 e0     mov    %r8,-0x20(%rdi)
   0xffffffff814586d0 <+256>:   48 8d 7f e0     lea    -0x20(%rdi),%rdi
   0xffffffff814586d4 <+260>:   73 d2   jae    0xffffffff814586a8 <memmove+216>
   0xffffffff814586d6 <+262>:   48 83 c2 20     add    $0x20,%rdx
   0xffffffff814586da <+266>:   48 29 d6        sub    %rdx,%rsi
   0xffffffff814586dd <+269>:   48 29 d7        sub    %rdx,%rdi
   0xffffffff814586e0 <+272>:   48 83 fa 10     cmp    $0x10,%rdx
   0xffffffff814586e4 <+276>:   72 2a   jb     0xffffffff81458710 <memmove+320>
   0xffffffff814586e6 <+278>:   4c 8b 1e        mov    (%rsi),%r11
   0xffffffff814586e9 <+281>:   4c 8b 56 08     mov    0x8(%rsi),%r10
   0xffffffff814586ed <+285>:   4c 8b 4c 16 f0  mov    -0x10(%rsi,%rdx,1),%r9
   0xffffffff814586f2 <+290>:   4c 8b 44 16 f8  mov    -0x8(%rsi,%rdx,1),%r8
   0xffffffff814586f7 <+295>:   4c 89 1f        mov    %r11,(%rdi)
   0xffffffff814586fa <+298>:   4c 89 57 08     mov    %r10,0x8(%rdi)
   0xffffffff814586fe <+302>:   4c 89 4c 17 f0  mov    %r9,-0x10(%rdi,%rdx,1)
   0xffffffff81458703 <+307>:   4c 89 44 17 f8  mov    %r8,-0x8(%rdi,%rdx,1)
   0xffffffff81458708 <+312>:   eb 5e   jmp    0xffffffff81458768 <memmove+408>
   0xffffffff8145870a <+314>:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
   0xffffffff81458710 <+320>:   48 83 fa 08     cmp    $0x8,%rdx
   0xffffffff81458714 <+324>:   72 12   jb     0xffffffff81458728 <memmove+344>
   0xffffffff81458716 <+326>:   4c 8b 1e        mov    (%rsi),%r11
   0xffffffff81458719 <+329>:   4c 8b 54 16 f8  mov    -0x8(%rsi,%rdx,1),%r10
   0xffffffff8145871e <+334>:   4c 89 1f        mov    %r11,(%rdi)
   0xffffffff81458721 <+337>:   4c 89 54 17 f8  mov    %r10,-0x8(%rdi,%rdx,1)
   0xffffffff81458726 <+342>:   eb 40   jmp    0xffffffff81458768 <memmove+408>
   0xffffffff81458728 <+344>:   48 83 fa 04     cmp    $0x4,%rdx
   0xffffffff8145872c <+348>:   72 12   jb     0xffffffff81458740 <memmove+368>
   0xffffffff8145872e <+350>:   44 8b 1e        mov    (%rsi),%r11d
   0xffffffff81458731 <+353>:   44 8b 54 16 fc  mov    -0x4(%rsi,%rdx,1),%r10d
   0xffffffff81458736 <+358>:   44 89 1f        mov    %r11d,(%rdi)
   0xffffffff81458739 <+361>:   44 89 54 17 fc  mov    %r10d,-0x4(%rdi,%rdx,1)
   0xffffffff8145873e <+366>:   eb 28   jmp    0xffffffff81458768 <memmove+408>
   0xffffffff81458740 <+368>:   48 83 fa 02     cmp    $0x2,%rdx
   0xffffffff81458744 <+372>:   72 16   jb     0xffffffff8145875c <memmove+396>
   0xffffffff81458746 <+374>:   66 44 8b 1e     mov    (%rsi),%r11w
   0xffffffff8145874a <+378>:   66 44 8b 54 16 fe       mov    -0x2(%rsi,%rdx,1),%r10w
   0xffffffff81458750 <+384>:   66 44 89 1f     mov    %r11w,(%rdi)
   0xffffffff81458754 <+388>:   66 44 89 54 17 fe       mov    %r10w,-0x2(%rdi,%rdx,1)
   0xffffffff8145875a <+394>:   eb 0c   jmp    0xffffffff81458768 <memmove+408>
   0xffffffff8145875c <+396>:   48 83 fa 01     cmp    $0x1,%rdx
   0xffffffff81458760 <+400>:   72 06   jb     0xffffffff81458768 <memmove+408>
   0xffffffff81458762 <+402>:   44 8a 1e        mov    (%rsi),%r11b
   0xffffffff81458765 <+405>:   44 88 1f        mov    %r11b,(%rdi)
   0xffffffff81458768 <+408>:   c3      retq
End of assembler dump.

chrisrd commented 7 years ago

Hi @haraldschilly, thanks, that's exactly what I was after. However your disassembly is (practically) identical to my disassembly (using a different kernel version and likely a different compiler version), i.e. neither contains the code sequence from the crash dump. That's a mystery!

haraldschilly commented 7 years ago

ok, wait, maybe there is an inconsistency I'm not aware of

chrisrd commented 7 years ago

Scratch that! If you ignore the 'nops' at the start of the Code disassembly, you see it matches the start of your (and my disassembly). Then the 'nops' in your disassembly (6 bytes, <+33> to <+38> are different to the Code disassembly (6 bytes, <+40> to <+45>), but after that they both match up again. Looks like there must be some kind if live patching going on.

I think that gives me enough to work on, I'll see what I can see.

Thanks again!

haraldschilly commented 7 years ago

I was just checking the kernel versions, they're all the same.

chrisrd commented 7 years ago

So, looks like there's something wrong with the source address (and/or size), i.e. b_data:

bcopy(dh->dh_dr->dt.dl.dr_data->b_data,
                            dh->dh_db->db.db_data, dh->dh_db->db.db_size);

Basically:

$ linux/scripts/decodecode <<END
  [466.543253] Code: 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20
END
All code
========
   0:   90                      nop
   1:   90                      nop
   2:   90                      nop
   3:   90                      nop
   4:   90                      nop
   5:   90                      nop
   6:   90                      nop
   7:   48 89 f8                mov    %rdi,%rax
   a:   48 83 fa 20             cmp    $0x20,%rdx
   e:   0f 82 03 01 00 00       jb     0x117
  14:   48 39 fe                cmp    %rdi,%rsi
  17:   7d 0f                   jge    0x28
  19:   49 89 f0                mov    %rsi,%r8
  1c:   49 01 d0                add    %rdx,%r8
  1f:   49 39 f8                cmp    %rdi,%r8
  22:   0f 8f 9f 00 00 00       jg     0xc7
  28:   48 89 d1                mov    %rdx,%rcx
  2b:   f3 a4                   rep movsb %ds:(%rsi),%es:*(%rdi)                <-- trapping instruction
  2d:   c3                      retq
  2e:   48 81 fa a8 02 00 00    cmp    $0x2a8,%rdx
  35:   72 05                   jb     0x3c
  37:   40 38 fe                cmp    %dil,%sil
  3a:   74 3b                   je     0x77
  3c:   48 83 ea 20             sub    $0x20,%rdx

Code starting with the faulting instruction
===========================================
   0:   f3 a4                   rep movsb %ds:(%rsi),%es:(%rdi)
   2:   c3                      retq
   3:   48 81 fa a8 02 00 00    cmp    $0x2a8,%rdx
   a:   72 05                   jb     0x11
   c:   40 38 fe                cmp    %dil,%sil
   f:   74 3b                   je     0x4c
  11:   48 83 ea 20             sub    $0x20,%rdx

The faulting instruction, rep movsb %ds:(%rsi),%es:*(%rdi), copies %rcx bytes from the location pointed to by %rsi, to the location pointed to by %rdi, and for each byte copy, decrementing %rcx and incrementing %rsi and %rdi.

The CR2 register contains the faulting address, ffffa5030586a000, i.e. it's %rsi, the source address.

The previous instruction, mov %rdx,%rcx shows that %rcx starts out at %rdx, i.e. 0x2000 == 128K. By the time we fault %rcx is 0xb000. I.e. we've already copied 0x2000 - 0xb000 == 0x15000 bytes.

And now it's way past my bedtime! G'night.

dweeezil commented 7 years ago

@hunbalazs The problem is the difference between compressed size and uncompressed size. The bcopy is trying to copy the uncompressed size for a compressed buffer. The ARC buffer in dr_data is compressed but the space allocated in the prior dbuf_set_data() call is based on an uncompressed size. It seems like either the newly-allocated buffer should be sized based on the logical size of the ARC buffer in dr_data. This seems to be a place where a specific compression test may be necessary.

dweeezil commented 7 years ago

Actually, it looks like an ABD issue after all, although compression likely comes into play, too.

dweeezil commented 7 years ago

The __dbuf_hold_impl() function is doing at least one thing wrong here: it's never allocating a compressed ARC buffer, regardless of the compression of the arc buffer in the dbuf dirty record (dh->dh_dr->db.dl.dr_data). This causes various mis-matches of source and destination buffer sizes.

Also, it seems that in some cases, a compressed ARC buffer is in the dirty record, but its header indicates no compression.

I'm been playing with a fix in dweeezil:issue-5742 but its as likely as not that the real problem lies well outside the __dbuf_hold_impl() function.

The goal of the patch referenced above is to allocate a new compressed ARC buffer when the dirty record's arc buffer is compressed. Unfortunately, there's no API for the dbuf layer to determine this so I added the arc_buf_compressed() function (which also indirectly returns the type of the compression in the header, for debugging).

At this patch stands now, it does eliminate the panics, however, there is typically at least one corrupted file after the receive is done and I suspect it has to do with the mismatch between the compression flag on the ARC buffer and its corresponding header.

I'm not sure where to go with this because I'm not really very tuned into the intricacies of the compressed ARC. FWIW, it seems that this bug is only triggered when receiving send streams with both compression and dedup enabled.

williamstein commented 7 years ago

FWIW, it seems that this bug is only triggered when receiving send streams with both compression and dedup enabled.

Just to clarify, do you mean: (a) compression and dedup enabled in the stream itself, e.g., by using the -LDc option to zfs send, or (b) compression and dedup enabled in the receiving pool. If so, does the compression setting matter.

I hope you mean (a) and that (b) is completely irrelevant to this bug.

I'm only asking, because I'm trying to find a workaround -- I'm going to be storing over a half million ZFS streams soon, where some of the data is very dedup-friendly (collected homework assignments in classes), so it would be very good if I could use dedup on them. If completely avoiding zfs stream compression, and instead just piping through lz4 would work, that would be fine for me.

dweeezil commented 7 years ago

@williamstein I meant (a), but need to run a few more tests to be sure. That said, however, it's not clear to me why this is only every triggered by the act of receiving specifically-formatted send streams as opposed to normal filesystem access. I have a feeling that dbuf_fix_old_data() may be involved but, again, I need to do a bit more testing.

Keep in mind that deduped send streams (zfs send -D) don't really have anything to do with ZFS' dedup facility but are more of a fancy form of compression; they (dedup send stream) can be received on a file system which is not dedup-enabled.

williamstein commented 7 years ago

@dweeezil thanks. I also tested a lot yesterday, and (a) seems to be true, at least in all my tests. I'm now storing my zfs streams using the output of zfs send -DL ... | lz4 -c.

dweeezil commented 7 years ago

@williamstein That's good information. For my part, I'll not likely have a chance to do more work on this until the coming weekend during which I'll try to track down how we're getting a compressed ARC buffer with an associated ARC header indicating no compression (which is what I think is causing the corruption I'm seeing even with the WIP patch referenced above).

dweeezil commented 7 years ago

I worked on this issue with @grwilson at the OpenZFS summit and I believe we've gotten to the root causes. There should be a PR posted within the next couple of days with a fix.

hunbalazs commented 7 years ago

@dweeezil Thank you very much! I no longer have the exported stream because my storage capacity is very limited (also it contained lot of stuff I couldn't share).

dweeezil commented 7 years ago

I tested a complete version of the patch and it now receives the stream properly. After cleaning up the patch, it'll be posted as a PR tomorrow.