Closed hunbalazs closed 7 years ago
Receiving stream from zfs send -D
had a bug, which was fixed in https://github.com/zfsonlinux/zfs/commit/2e0e443ac40c7e825a02519a497328226bd866ff .
You have 2 options:
-zfs send
without -D
-use master branch
@gmelikov is it send or receive related? I already tried master on the recv side, same bug ocurred at the same time.
@hunbalazs IIRC it's receive related.
If it's not your case - try to send without -D
first.
I double checked, v0.7.0-rc3_72_g298ec40 is actually current master.
I was brave foolish enough to delete the original pool so I can't try to send w/o -D flag, sorry :disappointed: .
Feb 5 18:35:51 ubuntu kernel: [ 637.617914] ZFS: Loaded module v0.7.0-rc3_72_g298ec40, ZFS pool version 5000, ZFS filesystem version 5
Feb 5 18:35:53 ubuntu kernel: [ 639.710317] ZPIOS: Loaded module v0.7.0-rc3_72_g298ec40
Feb 5 18:37:40 ubuntu kernel: [ 746.773649] SPL: using hostid 0x00000000
Feb 5 18:40:36 ubuntu kernel: [ 922.922338] BUG: unable to handle kernel paging request at ffffc9002eea6000
Feb 5 18:40:36 ubuntu kernel: [ 922.922352] IP: [<ffffffff813ff224>] __memmove+0x24/0x1a0
Feb 5 18:40:36 ubuntu kernel: [ 922.922360] PGD 2250a1067 PUD 2250a2067 PMD 17977c067 PTE 0
Feb 5 18:40:36 ubuntu kernel: [ 922.922366] Oops: 0000 [#1] SMP
Feb 5 18:40:36 ubuntu kernel: [ 922.922370] Modules linked in: dm_crypt algif_skcipher af_alg zpios(OE) zfs(POE) zavl(POE) zcommon(POE) znvpair(POE) zunicode(POE) icp(POE) splat(OE) spl(OE) binfmt_misc drbg ansi_cprng ctr ccm rfcomm bnep snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel arc4 kvm iwldvm irqbypass mac80211 crct10dif_pclmul dell_laptop dell_wmi sparse_keymap dcdbas crc32_pclmul dell_smm_hwmon snd_hda_codec_idt snd_hda_codec_generic aesni_intel btusb btrtl btbcm btintel snd_hda_intel aes_x86_64 bluetooth lrw iwlwifi snd_hda_codec gf128mul glue_helper snd_seq_midi snd_seq_midi_event snd_hda_core ablk_helper snd_rawmidi snd_hwdep cryptd joydev input_leds cfg80211 serio_raw snd_seq snd_pcm snd_seq_device lpc_ich snd_timer snd mei_me mei soundcore shpchp 8250_fintek mac_hid dell_smo8800 dell_rbtn parport_pc ppdev lp parport autofs4 overlay nls_utf8 isofs nls_iso8859_1 dm_mirror dm_region_hash dm_log uas usb_storage nouveau mxm_wmi i2c_algo_bit ttm psmouse drm_kms_helper syscopyarea sysfillrect sysimgblt e1000e fb_sys_fops ahci drm libahci ptp pps_core wmi video fjes
Feb 5 18:40:36 ubuntu kernel: [ 922.922513] CPU: 2 PID: 15386 Comm: receive_writer Tainted: P OE 4.4.0-31-generic #50-Ubuntu
Feb 5 18:40:36 ubuntu kernel: [ 922.922518] Hardware name: Dell Inc. Latitude E6530/0GMVN7, BIOS A17 08/19/2015
Feb 5 18:40:36 ubuntu kernel: [ 922.922522] task: ffff8801fffc8000 ti: ffff88007db28000 task.ti: ffff88007db28000
Feb 5 18:40:36 ubuntu kernel: [ 922.922525] RIP: 0010:[<ffffffff813ff224>] [<ffffffff813ff224>] __memmove+0x24/0x1a0
Feb 5 18:40:36 ubuntu kernel: [ 922.922531] RSP: 0018:ffff88007db2bcc8 EFLAGS: 00010206
Feb 5 18:40:36 ubuntu kernel: [ 922.922534] RAX: ffffc9000ea00000 RBX: ffff88016e60e000 RCX: 0000000000002a00
Feb 5 18:40:36 ubuntu kernel: [ 922.922537] RDX: 0000000000008a00 RSI: ffffc9002eea6000 RDI: ffffc9000ea06000
Feb 5 18:40:36 ubuntu kernel: [ 922.922540] RBP: ffff88007db2bd00 R08: 0000000000000001 R09: 0000000000000000
Feb 5 18:40:36 ubuntu kernel: [ 922.922543] R10: ffff88019777c580 R11: ffffffff81ccf1ea R12: ffffffffc0a9b090
Feb 5 18:40:36 ubuntu kernel: [ 922.922547] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Feb 5 18:40:36 ubuntu kernel: [ 922.922550] FS: 0000000000000000(0000) GS:ffff88022dc80000(0000) knlGS:0000000000000000
Feb 5 18:40:36 ubuntu kernel: [ 922.922554] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 5 18:40:36 ubuntu kernel: [ 922.922557] CR2: ffffc9002eea6000 CR3: 0000000002e0a000 CR4: 00000000001406e0
Feb 5 18:40:36 ubuntu kernel: [ 922.922560] Stack:
Feb 5 18:40:36 ubuntu kernel: [ 922.922562] ffffffffc0977db5 000000007db2bd60 000000027db2be30 ffff88016e60e000
Feb 5 18:40:36 ubuntu kernel: [ 922.922570] ffffffffc0a9b090 0000000000000000 0000000000000000 ffff88007db2bd48
Feb 5 18:40:36 ubuntu kernel: [ 922.922575] ffffffffc0978118 ffff880050329380 00000000f76f4800 ffff88007db2be30
Feb 5 18:40:36 ubuntu kernel: [ 922.922581] Call Trace:
Feb 5 18:40:36 ubuntu kernel: [ 922.922630] [<ffffffffc0977db5>] ? __dbuf_hold_impl+0x2d5/0x590 [zfs]
Feb 5 18:40:36 ubuntu kernel: [ 922.922651] [<ffffffffc0978118>] dbuf_hold_impl+0xa8/0xd0 [zfs]
Feb 5 18:40:36 ubuntu kernel: [ 922.922671] [<ffffffffc0979676>] dbuf_hold+0x36/0x60 [zfs]
Feb 5 18:40:36 ubuntu kernel: [ 922.922692] [<ffffffffc0980ea5>] dmu_buf_hold_noread+0x85/0xe0 [zfs]
Feb 5 18:40:36 ubuntu kernel: [ 922.922713] [<ffffffffc0980f2f>] dmu_buf_hold+0x2f/0x80 [zfs]
Feb 5 18:40:36 ubuntu kernel: [ 922.922733] [<ffffffffc097ba8c>] ? bqueue_dequeue+0xbc/0xd0 [zfs]
Feb 5 18:40:36 ubuntu kernel: [ 922.922757] [<ffffffffc098bd25>] receive_writer_thread+0x425/0x6f0 [zfs]
Feb 5 18:40:36 ubuntu kernel: [ 922.922767] [<ffffffffc06b0e4a>] ? spl_kmem_free+0x2a/0x40 [spl]
Feb 5 18:40:36 ubuntu kernel: [ 922.922795] [<ffffffffc098b900>] ? receive_freeobjects.isra.12+0xe0/0xe0 [zfs]
Feb 5 18:40:36 ubuntu kernel: [ 922.922805] [<ffffffffc06b34f1>] thread_generic_wrapper+0x71/0x80 [spl]
Feb 5 18:40:36 ubuntu kernel: [ 922.922813] [<ffffffffc06b3480>] ? __thread_exit+0x20/0x20 [spl]
Feb 5 18:40:36 ubuntu kernel: [ 922.922823] [<ffffffff810a0808>] kthread+0xd8/0xf0
Feb 5 18:40:36 ubuntu kernel: [ 922.922829] [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
Feb 5 18:40:36 ubuntu kernel: [ 922.922839] [<ffffffff8182decf>] ret_from_fork+0x3f/0x70
Feb 5 18:40:36 ubuntu kernel: [ 922.922843] [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
Feb 5 18:40:36 ubuntu kernel: [ 922.922846] Code: 88 0c 17 88 0f c3 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20
Feb 5 18:40:36 ubuntu kernel: [ 922.922917] RIP [<ffffffff813ff224>] __memmove+0x24/0x1a0
Feb 5 18:40:36 ubuntu kernel: [ 922.922924] RSP <ffff88007db2bcc8>
Feb 5 18:40:36 ubuntu kernel: [ 922.922927] CR2: ffffc9002eea6000
Feb 5 18:40:36 ubuntu kernel: [ 922.932837] ---[ end trace f82349dba0ad7f3e ]---
@hunbalazs do you still have the send stream lying around? It'd be interesting to see how the record that is causing the fault looks like, can you patch zstreamdump
:
diff --git a/cmd/zstreamdump/zstreamdump.c b/cmd/zstreamdump/zstreamdump.c
index aa594b8..dc1f9d0 100644
--- a/cmd/zstreamdump/zstreamdump.c
+++ b/cmd/zstreamdump/zstreamdump.c
@@ -95,6 +95,8 @@ ssread(void *buf, size_t len, zio_cksum_t *cksum)
if ((outlen = fread(buf, len, 1, send_stream)) == 0)
return (0);
+ fwrite(buf, len, 1, stderr);
+
if (do_cksum) {
if (do_byteswap)
fletcher_4_incremental_byteswap(buf, len, cksum);
and then try to receive the send stream again with:
{ cat $streamfile | /path/to/patched/zstreamdump -vv 2>&1 1>&3 3>&- | cat > zfs recv $dest; } 3>&1 1>&2
I'm not sure it's going to work but if it's the only data remaining from the old pool it's probably worth a try.
@loli10K thank you for the help
I still have the stream lying on an external hard drive.
I allied the patch to current master and built zfs, but I'm sure it wont work because cat > zfs
just overwrites the zfs binary
@hunbalazs hum, i'm not very good with the shell, maybe remove the "cat >" part? { cat $streamfile | /path/to/patched/zstreamdump -vv 2>&1 1>&3 3>&- | zfs recv $dest; } 3>&1 1>&2
The idea here is to "print" the stream data in human readable form (via zstreamdump) just before we receive it, hopefully the very last DRR printed is the one causing the fault. We try to do that with the shell by piping cat $streamfile
to the patched zstreamdump which acts just as a "printing filter" before the actual receive.
@loli10K I'm not a shell wizard either :smile: We got our little bad object:
WRITE object = 1740980 type = 19 checksum type = 8 compression type = 15
offset = 0 logical_size = 15872 compressed_size = 8192 payload_size = 8192 props = f000f001e
no checksum at the end, as I can see other object have checksum after offset line
@hunbalazs that doesn't look like anything to me. Unfortunately. I would have expected a WRITE_BYREF
record to be the cause.
Can you share the stream that is causing the fault? The next step i'd take to troubleshoot something like this is to get kernel+zfs with debugging symbols, try some stap
printf'ing around the functions dumped in the stacktrace while receiving the stream and/or attach a debugger and see what happens.
@loli10K compiled everything again with --debug-* and imported everything without a hitch. Do you still need the backup, should I try something else to debug it further?
@hunbalazs if you feel comfortable i'd encourage you to try and debug this issue on a live kernel by yourself (i think you can find a lot of tutorials online on how to do it, and hopefully one day on the ZoL wiki https://github.com/zfsonlinux/zfs/wiki/Debugging).
The way i usually do this for development is a QEMU box with a serial port via gdb, so if anything goes wrong i can just bounce it and restart the debugging session. Basically just attach gdb to the virtual machine, load all the symbols, setup some breakpoints (i'd start with receive_writer_thread
and the other functions in the backtrace) and start receiving the backup stream: when you hit the breakpoint inspect the frame and look for suspicious variables that could lead to the issue. Rinse and repeat.
If you don't mind sharing the backup stream i'd also like to try and debug this issue. It's a real pity we lost the source pool, now we don't know if it's reproducible on the send side.
I'm finding this very easy to reproduce using tag:zfs-0.7.2 built from source on Ubuntu 16.04.3. (And this bug itself is a huge nightmare for me, but that's besides the point.)
a
on a sparse image file (say).zfs snapshot a@0
zfs send -cDR a@0 > a.zfs
zfs destroy -r a
and then cat a.zfs | zfs recv -F -v a
ct 5 23:00:49 kucalc-ws1-node-nepb systemd[1]: Started Session 95 of user cocalc.
Oct 5 23:00:50 kucalc-ws1-node-nepb systemd[1]: Started Session 96 of user cocalc.
[ 466.243320] BUG: unable to handle kernel paging request at ffffa5030586a000
[ 466.250428] IP: __memmove+0x24/0x1a0
[ 466.254102] PGD 12b495067
[ 466.254103] PUD 12b496067
[ 466.256909] PMD 8eb8b067
[ 466.259733] PTE 0
[ 466.262471]
[ 466.266086] Oops: 0000 [#1] SMP
[ 466.269325] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache xt_set xt_multiport ipt
able_mangle iptable_raw ipip ip_set_hash_ip tunnel4 ip_set_hash_net ip_tunnel ip_set veth nf_conntrack_netlink nfne
tlink xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_comment xt_mark ipt_MASQUERADE nf_nat_ma
squerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack
nf_nat nf_conntrack br_netfilter bridge stp llc aufs zfs(POE) ip6table_filter zunicode(POE) ip6_tables zavl(POE) i
ptable_filter ip_tables icp(POE) x_tables zcommon(POE) znvpair(POE) spl(OE) ppdev input_leds serio_raw pvpanic parp
ort_pc binfmt_misc parport ib_iser rdma_cm iw_cm ib_cm ib_core sunrpc configfs iscsi_tcp libiscsi_tcp libiscsi scsi
_transport_iscsi
[ 466.340918] autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_p
q libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86
_64 crypto_simd glue_helper cryptd psmouse virtio_net virtio_scsi
[ 466.366829] CPU: 0 PID: 14925 Comm: receive_writer Tainted: P OE 4.10.0-35-generic #39~16.04.1-Ubunt
u
[ 466.377188] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 466.386507] task: ffff95e392ad9680 task.stack: ffffa50301e5c000
[ 466.392531] RIP: 0010:__memmove+0x24/0x1a0
[ 466.396729] RSP: 0018:ffffa50301e5fcd8 EFLAGS: 00010287
[ 466.402053] RAX: ffffa50309292000 RBX: ffff95e382c72800 RCX: 000000000000b000
[ 466.409287] RDX: 0000000000020000 RSI: ffffa5030586a000 RDI: ffffa503092a7000
[ 466.416536] RBP: ffffa50301e5fd10 R08: ffffa50305875000 R09: 0000000000000000
[ 466.423770] R10: ffff95e374b213e0 R11: ffffdcf201fcf820 R12: ffffffffc08fd2b0
[ 466.431003] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 466.438238] FS: 0000000000000000(0000) GS:ffff95e42fc00000(0000) knlGS:0000000000000000
[ 466.446428] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 466.452272] CR2: ffffa5030586a000 CR3: 0000000105135000 CR4: 00000000001406f0
[ 466.459510] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 466.466745] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 466.473980] Call Trace:
[ 466.476599] ? __dbuf_hold_impl+0x4e9/0x640 [zfs]
[ 466.481456] dbuf_hold_impl+0xa8/0xd0 [zfs]
[ 466.485779] dbuf_hold+0x36/0x60 [zfs]
[ 466.489668] dmu_buf_hold_noread+0x85/0x100 [zfs]
[ 466.494527] dmu_buf_hold+0x2f/0x80 [zfs]
[ 466.498710] receive_writer_thread+0x5da/0xa00 [zfs]
[ 466.503795] ? set_next_entity+0xc3/0x1b0
[ 466.507911] ? set_curr_task_fair+0x2b/0x60
[ 466.512207] ? spl_kmem_free+0x33/0x40 [spl]
[ 466.516620] ? receive_freeobjects.isra.14+0xf0/0xf0 [zfs]
[ 466.522214] thread_generic_wrapper+0x72/0x80 [spl]
[ 466.527197] kthread+0x109/0x140
[ 466.530545] ? __thread_exit+0x20/0x20 [spl]
[ 466.534920] ? kthread_create_on_node+0x60/0x60
[ 466.539575] ret_from_fork+0x2c/0x40
[ 466.543253] Code: 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 4
9 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20
[ 466.562236] RIP: __memmove+0x24/0x1a0 RSP: ffffa50301e5fcd8
[ 466.567909] CR2: ffffa5030586a000
[ 466.571325] ---[ end trace 9dbe9f086e2fcc44 ]---
[ 466.576042] Kernel panic - not syncing: Fatal exception
[ 466.582116] Kernel Offset: 0x19800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbf
ffffff)
[ 466.593014] Rebooting in 10 seconds..
[ 476.519578] ACPI MEMORY or I/O RESET_REG.
For me also using -L in step 4 makes absolutely no difference; it still crashes.
The exact same filesystem sent without -D recv's fine (everything as above but no D -- still compressed, etc.).
If you try the above and still can't reproduce the problem, ping me and I'll provide more help.
One I'm seeing is I guess exactly the same as https://github.com/zfsonlinux/zfs/issues/6334
So please I guess disregard my above remark in favor of "I'm another person seeing the same thing as was just reported in #6334".
It's failing in memmove()
, called from __dbuf_hold_impl()
, which looks like:
static int
__dbuf_hold_impl(struct dbuf_hold_impl_data *dh)
{
...
bcopy(dh->dh_dr->dt.dl.dr_data->b_data,
dh->dh_db->db.db_data, dh->dh_db->db.db_size);
...
}
And judging from RDX: 0000000000020000
, it's trying to copy 0x20000 bytes (== 128KB).
Copying a linear 128KB in kernel memory sounds off. Maybe something to do with the ABD changes, e.g. trying to do a linear copy on a non-linear buffer?
@williamstein are you able to provide a disassembly of your memmove
? E.g. if you have your vmlinux
file:
gdb -batch -ex 'file vmlinux' -ex 'disassemble /r memmove'
Reason being, I'm trying to work out what the actual args to the memmove
are, but the my local memmove
doesn't contain the Code
from the crash dump:
echo '[ 466.543253] Code: 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20' |
perl -ne '
chomp; s/.*Code: //;
s/[<>]//g; printf("char Code[] = \"\\x%s\";\nint main(){}\n", join("\\x", split));
' > /tmp/tmp.c
$ gcc -g -o /tmp/tmp.exe /tmp/tmp.c && gdb -batch -ex "file /tmp/tmp.exe" -ex "disassemble/r Code"
Dump of assembler code for function Code:
0x0000000000601040 <+0>: 90 nop
0x0000000000601041 <+1>: 90 nop
0x0000000000601042 <+2>: 90 nop
0x0000000000601043 <+3>: 90 nop
0x0000000000601044 <+4>: 90 nop
0x0000000000601045 <+5>: 90 nop
0x0000000000601046 <+6>: 90 nop
0x0000000000601047 <+7>: 48 89 f8 mov %rdi,%rax # %rdi:ffffa503092a7000 %rax:ffffa50309292000
0x000000000060104a <+10>: 48 83 fa 20 cmp $0x20,%rdx # %rdx:0000000000020000
0x000000000060104e <+14>: 0f 82 03 01 00 00 jb 0x601157
0x0000000000601054 <+20>: 48 39 fe cmp %rdi,%rsi
0x0000000000601057 <+23>: 7d 0f jge 0x601068 <Code+40>
0x0000000000601059 <+25>: 49 89 f0 mov %rsi,%r8
0x000000000060105c <+28>: 49 01 d0 add %rdx,%r8
0x000000000060105f <+31>: 49 39 f8 cmp %rdi,%r8
0x0000000000601062 <+34>: 0f 8f 9f 00 00 00 jg 0x601107
0x0000000000601068 <+40>: 48 89 d1 mov %rdx,%rcx
0x000000000060106b <+43>: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
0x000000000060106d <+45>: c3 retq
0x000000000060106e <+46>: 48 81 fa a8 02 00 00 cmp $0x2a8,%rdx
0x0000000000601075 <+53>: 72 05 jb 0x60107c <Code+60>
0x0000000000601077 <+55>: 40 38 fe cmp %dil,%sil
0x000000000060107a <+58>: 74 3b je 0x6010b7
0x000000000060107c <+60>: 48 83 ea 20 sub $0x20,%rdx
0x0000000000601080 <+64>: 00 00 add %al,(%rax)
End of assembler dump.
Hi @chrisrd I'm with @williamstein .... and I have to admit I barely know what I'm doing here. The kernel is from Ubuntu, and running this gdb command on the variant with debug symbols gives this:
$ gdb -batch -ex 'file vmlinux-4.10.0-35-generic' -ex 'disassemble /r memmove' > /tmp/vmlinux-4.10.0-35-generic-memmove.txt
$ cat /tmp/vmlinux-4.10.0-35-generic-memmove.txt
Dump of assembler code for function memmove:
0xffffffff814585d0 <+0>: 48 89 f8 mov %rdi,%rax
0xffffffff814585d3 <+3>: 48 83 fa 20 cmp $0x20,%rdx
0xffffffff814585d7 <+7>: 0f 82 03 01 00 00 jb 0xffffffff814586e0 <memmove+272>
0xffffffff814585dd <+13>: 48 39 fe cmp %rdi,%rsi
0xffffffff814585e0 <+16>: 7d 0f jge 0xffffffff814585f1 <memmove+33>
0xffffffff814585e2 <+18>: 49 89 f0 mov %rsi,%r8
0xffffffff814585e5 <+21>: 49 01 d0 add %rdx,%r8
0xffffffff814585e8 <+24>: 49 39 f8 cmp %rdi,%r8
0xffffffff814585eb <+27>: 0f 8f 9f 00 00 00 jg 0xffffffff81458690 <memmove+192>
0xffffffff814585f1 <+33>: 90 nop
0xffffffff814585f2 <+34>: 90 nop
0xffffffff814585f3 <+35>: 90 nop
0xffffffff814585f4 <+36>: 90 nop
0xffffffff814585f5 <+37>: 90 nop
0xffffffff814585f6 <+38>: 90 nop
0xffffffff814585f7 <+39>: 48 81 fa a8 02 00 00 cmp $0x2a8,%rdx
0xffffffff814585fe <+46>: 72 05 jb 0xffffffff81458605 <memmove+53>
0xffffffff81458600 <+48>: 40 38 fe cmp %dil,%sil
0xffffffff81458603 <+51>: 74 3b je 0xffffffff81458640 <memmove+112>
0xffffffff81458605 <+53>: 48 83 ea 20 sub $0x20,%rdx
0xffffffff81458609 <+57>: 48 83 ea 20 sub $0x20,%rdx
0xffffffff8145860d <+61>: 4c 8b 1e mov (%rsi),%r11
0xffffffff81458610 <+64>: 4c 8b 56 08 mov 0x8(%rsi),%r10
0xffffffff81458614 <+68>: 4c 8b 4e 10 mov 0x10(%rsi),%r9
0xffffffff81458618 <+72>: 4c 8b 46 18 mov 0x18(%rsi),%r8
0xffffffff8145861c <+76>: 48 8d 76 20 lea 0x20(%rsi),%rsi
0xffffffff81458620 <+80>: 4c 89 1f mov %r11,(%rdi)
0xffffffff81458623 <+83>: 4c 89 57 08 mov %r10,0x8(%rdi)
0xffffffff81458627 <+87>: 4c 89 4f 10 mov %r9,0x10(%rdi)
0xffffffff8145862b <+91>: 4c 89 47 18 mov %r8,0x18(%rdi)
0xffffffff8145862f <+95>: 48 8d 7f 20 lea 0x20(%rdi),%rdi
0xffffffff81458633 <+99>: 73 d4 jae 0xffffffff81458609 <memmove+57>
0xffffffff81458635 <+101>: 48 83 c2 20 add $0x20,%rdx
0xffffffff81458639 <+105>: e9 a2 00 00 00 jmpq 0xffffffff814586e0 <memmove+272>
0xffffffff8145863e <+110>: 66 90 xchg %ax,%ax
0xffffffff81458640 <+112>: 48 89 d1 mov %rdx,%rcx
0xffffffff81458643 <+115>: 4c 8b 5c 16 f8 mov -0x8(%rsi,%rdx,1),%r11
0xffffffff81458648 <+120>: 4c 8d 54 17 f8 lea -0x8(%rdi,%rdx,1),%r10
0xffffffff8145864d <+125>: 48 c1 e9 03 shr $0x3,%rcx
0xffffffff81458651 <+129>: f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi)
0xffffffff81458654 <+132>: 4d 89 1a mov %r11,(%r10)
0xffffffff81458657 <+135>: e9 0c 01 00 00 jmpq 0xffffffff81458768 <memmove+408>
0xffffffff8145865c <+140>: 0f 1f 40 00 nopl 0x0(%rax)
0xffffffff81458660 <+144>: 48 89 d1 mov %rdx,%rcx
0xffffffff81458663 <+147>: 4c 8b 1e mov (%rsi),%r11
0xffffffff81458666 <+150>: 49 89 fa mov %rdi,%r10
0xffffffff81458669 <+153>: 48 8d 74 16 f8 lea -0x8(%rsi,%rdx,1),%rsi
0xffffffff8145866e <+158>: 48 8d 7c 17 f8 lea -0x8(%rdi,%rdx,1),%rdi
0xffffffff81458673 <+163>: 48 c1 e9 03 shr $0x3,%rcx
0xffffffff81458677 <+167>: fd std
0xffffffff81458678 <+168>: f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi)
0xffffffff8145867b <+171>: fc cld
0xffffffff8145867c <+172>: 4d 89 1a mov %r11,(%r10)
0xffffffff8145867f <+175>: e9 e4 00 00 00 jmpq 0xffffffff81458768 <memmove+408>
0xffffffff81458684 <+180>: 66 90 xchg %ax,%ax
0xffffffff81458686 <+182>: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:0x0(%rax,%rax,1)
0xffffffff81458690 <+192>: 48 81 fa a8 02 00 00 cmp $0x2a8,%rdx
0xffffffff81458697 <+199>: 72 05 jb 0xffffffff8145869e <memmove+206>
0xffffffff81458699 <+201>: 40 38 fe cmp %dil,%sil
0xffffffff8145869c <+204>: 74 c2 je 0xffffffff81458660 <memmove+144>
0xffffffff8145869e <+206>: 48 01 d6 add %rdx,%rsi
0xffffffff814586a1 <+209>: 48 01 d7 add %rdx,%rdi
0xffffffff814586a4 <+212>: 48 83 ea 20 sub $0x20,%rdx
0xffffffff814586a8 <+216>: 48 83 ea 20 sub $0x20,%rdx
0xffffffff814586ac <+220>: 4c 8b 5e f8 mov -0x8(%rsi),%r11
0xffffffff814586b0 <+224>: 4c 8b 56 f0 mov -0x10(%rsi),%r10
0xffffffff814586b4 <+228>: 4c 8b 4e e8 mov -0x18(%rsi),%r9
0xffffffff814586b8 <+232>: 4c 8b 46 e0 mov -0x20(%rsi),%r8
0xffffffff814586bc <+236>: 48 8d 76 e0 lea -0x20(%rsi),%rsi
0xffffffff814586c0 <+240>: 4c 89 5f f8 mov %r11,-0x8(%rdi)
0xffffffff814586c4 <+244>: 4c 89 57 f0 mov %r10,-0x10(%rdi)
0xffffffff814586c8 <+248>: 4c 89 4f e8 mov %r9,-0x18(%rdi)
0xffffffff814586cc <+252>: 4c 89 47 e0 mov %r8,-0x20(%rdi)
0xffffffff814586d0 <+256>: 48 8d 7f e0 lea -0x20(%rdi),%rdi
0xffffffff814586d4 <+260>: 73 d2 jae 0xffffffff814586a8 <memmove+216>
0xffffffff814586d6 <+262>: 48 83 c2 20 add $0x20,%rdx
0xffffffff814586da <+266>: 48 29 d6 sub %rdx,%rsi
0xffffffff814586dd <+269>: 48 29 d7 sub %rdx,%rdi
0xffffffff814586e0 <+272>: 48 83 fa 10 cmp $0x10,%rdx
0xffffffff814586e4 <+276>: 72 2a jb 0xffffffff81458710 <memmove+320>
0xffffffff814586e6 <+278>: 4c 8b 1e mov (%rsi),%r11
0xffffffff814586e9 <+281>: 4c 8b 56 08 mov 0x8(%rsi),%r10
0xffffffff814586ed <+285>: 4c 8b 4c 16 f0 mov -0x10(%rsi,%rdx,1),%r9
0xffffffff814586f2 <+290>: 4c 8b 44 16 f8 mov -0x8(%rsi,%rdx,1),%r8
0xffffffff814586f7 <+295>: 4c 89 1f mov %r11,(%rdi)
0xffffffff814586fa <+298>: 4c 89 57 08 mov %r10,0x8(%rdi)
0xffffffff814586fe <+302>: 4c 89 4c 17 f0 mov %r9,-0x10(%rdi,%rdx,1)
0xffffffff81458703 <+307>: 4c 89 44 17 f8 mov %r8,-0x8(%rdi,%rdx,1)
0xffffffff81458708 <+312>: eb 5e jmp 0xffffffff81458768 <memmove+408>
0xffffffff8145870a <+314>: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0xffffffff81458710 <+320>: 48 83 fa 08 cmp $0x8,%rdx
0xffffffff81458714 <+324>: 72 12 jb 0xffffffff81458728 <memmove+344>
0xffffffff81458716 <+326>: 4c 8b 1e mov (%rsi),%r11
0xffffffff81458719 <+329>: 4c 8b 54 16 f8 mov -0x8(%rsi,%rdx,1),%r10
0xffffffff8145871e <+334>: 4c 89 1f mov %r11,(%rdi)
0xffffffff81458721 <+337>: 4c 89 54 17 f8 mov %r10,-0x8(%rdi,%rdx,1)
0xffffffff81458726 <+342>: eb 40 jmp 0xffffffff81458768 <memmove+408>
0xffffffff81458728 <+344>: 48 83 fa 04 cmp $0x4,%rdx
0xffffffff8145872c <+348>: 72 12 jb 0xffffffff81458740 <memmove+368>
0xffffffff8145872e <+350>: 44 8b 1e mov (%rsi),%r11d
0xffffffff81458731 <+353>: 44 8b 54 16 fc mov -0x4(%rsi,%rdx,1),%r10d
0xffffffff81458736 <+358>: 44 89 1f mov %r11d,(%rdi)
0xffffffff81458739 <+361>: 44 89 54 17 fc mov %r10d,-0x4(%rdi,%rdx,1)
0xffffffff8145873e <+366>: eb 28 jmp 0xffffffff81458768 <memmove+408>
0xffffffff81458740 <+368>: 48 83 fa 02 cmp $0x2,%rdx
0xffffffff81458744 <+372>: 72 16 jb 0xffffffff8145875c <memmove+396>
0xffffffff81458746 <+374>: 66 44 8b 1e mov (%rsi),%r11w
0xffffffff8145874a <+378>: 66 44 8b 54 16 fe mov -0x2(%rsi,%rdx,1),%r10w
0xffffffff81458750 <+384>: 66 44 89 1f mov %r11w,(%rdi)
0xffffffff81458754 <+388>: 66 44 89 54 17 fe mov %r10w,-0x2(%rdi,%rdx,1)
0xffffffff8145875a <+394>: eb 0c jmp 0xffffffff81458768 <memmove+408>
0xffffffff8145875c <+396>: 48 83 fa 01 cmp $0x1,%rdx
0xffffffff81458760 <+400>: 72 06 jb 0xffffffff81458768 <memmove+408>
0xffffffff81458762 <+402>: 44 8a 1e mov (%rsi),%r11b
0xffffffff81458765 <+405>: 44 88 1f mov %r11b,(%rdi)
0xffffffff81458768 <+408>: c3 retq
End of assembler dump.
Hi @haraldschilly, thanks, that's exactly what I was after. However your disassembly is (practically) identical to my disassembly (using a different kernel version and likely a different compiler version), i.e. neither contains the code sequence from the crash dump. That's a mystery!
ok, wait, maybe there is an inconsistency I'm not aware of
Scratch that! If you ignore the 'nops' at the start of the Code
disassembly, you see it matches the start of your (and my disassembly). Then the 'nops' in your disassembly (6 bytes, <+33>
to <+38>
are different to the Code
disassembly (6 bytes, <+40>
to <+45>
), but after that they both match up again. Looks like there must be some kind if live patching going on.
I think that gives me enough to work on, I'll see what I can see.
Thanks again!
I was just checking the kernel versions, they're all the same.
So, looks like there's something wrong with the source address (and/or size), i.e. b_data
:
bcopy(dh->dh_dr->dt.dl.dr_data->b_data,
dh->dh_db->db.db_data, dh->dh_db->db.db_size);
Basically:
$ linux/scripts/decodecode <<END
[466.543253] Code: 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20
END
All code
========
0: 90 nop
1: 90 nop
2: 90 nop
3: 90 nop
4: 90 nop
5: 90 nop
6: 90 nop
7: 48 89 f8 mov %rdi,%rax
a: 48 83 fa 20 cmp $0x20,%rdx
e: 0f 82 03 01 00 00 jb 0x117
14: 48 39 fe cmp %rdi,%rsi
17: 7d 0f jge 0x28
19: 49 89 f0 mov %rsi,%r8
1c: 49 01 d0 add %rdx,%r8
1f: 49 39 f8 cmp %rdi,%r8
22: 0f 8f 9f 00 00 00 jg 0xc7
28: 48 89 d1 mov %rdx,%rcx
2b: f3 a4 rep movsb %ds:(%rsi),%es:*(%rdi) <-- trapping instruction
2d: c3 retq
2e: 48 81 fa a8 02 00 00 cmp $0x2a8,%rdx
35: 72 05 jb 0x3c
37: 40 38 fe cmp %dil,%sil
3a: 74 3b je 0x77
3c: 48 83 ea 20 sub $0x20,%rdx
Code starting with the faulting instruction
===========================================
0: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
2: c3 retq
3: 48 81 fa a8 02 00 00 cmp $0x2a8,%rdx
a: 72 05 jb 0x11
c: 40 38 fe cmp %dil,%sil
f: 74 3b je 0x4c
11: 48 83 ea 20 sub $0x20,%rdx
The faulting instruction, rep movsb %ds:(%rsi),%es:*(%rdi)
, copies %rcx bytes from the location pointed to by %rsi, to the location pointed to by %rdi, and for each byte copy, decrementing %rcx and incrementing %rsi and %rdi.
The CR2 register contains the faulting address, ffffa5030586a000
, i.e. it's %rsi, the source address.
The previous instruction, mov %rdx,%rcx
shows that %rcx starts out at %rdx, i.e. 0x2000 == 128K. By the time we fault %rcx is 0xb000. I.e. we've already copied 0x2000 - 0xb000 == 0x15000 bytes.
And now it's way past my bedtime! G'night.
@hunbalazs The problem is the difference between compressed size and uncompressed size. The bcopy is trying to copy the uncompressed size for a compressed buffer. The ARC buffer in dr_data
is compressed but the space allocated in the prior dbuf_set_data()
call is based on an uncompressed size. It seems like either the newly-allocated buffer should be sized based on the logical size of the ARC buffer in dr_data
. This seems to be a place where a specific compression test may be necessary.
Actually, it looks like an ABD issue after all, although compression likely comes into play, too.
The __dbuf_hold_impl()
function is doing at least one thing wrong here: it's never allocating a compressed ARC buffer, regardless of the compression of the arc buffer in the dbuf dirty record (dh->dh_dr->db.dl.dr_data
). This causes various mis-matches of source and destination buffer sizes.
Also, it seems that in some cases, a compressed ARC buffer is in the dirty record, but its header indicates no compression.
I'm been playing with a fix in dweeezil:issue-5742 but its as likely as not that the real problem lies well outside the __dbuf_hold_impl()
function.
The goal of the patch referenced above is to allocate a new compressed ARC buffer when the dirty record's arc buffer is compressed. Unfortunately, there's no API for the dbuf layer to determine this so I added the arc_buf_compressed()
function (which also indirectly returns the type of the compression in the header, for debugging).
At this patch stands now, it does eliminate the panics, however, there is typically at least one corrupted file after the receive is done and I suspect it has to do with the mismatch between the compression flag on the ARC buffer and its corresponding header.
I'm not sure where to go with this because I'm not really very tuned into the intricacies of the compressed ARC. FWIW, it seems that this bug is only triggered when receiving send streams with both compression and dedup enabled.
FWIW, it seems that this bug is only triggered when receiving send streams with both compression and dedup enabled.
Just to clarify, do you mean:
(a) compression and dedup enabled in the stream itself, e.g., by using the -LDc
option to zfs send
, or
(b) compression and dedup enabled in the receiving pool. If so, does the compression setting matter.
I hope you mean (a) and that (b) is completely irrelevant to this bug.
I'm only asking, because I'm trying to find a workaround -- I'm going to be storing over a half million ZFS streams soon, where some of the data is very dedup-friendly (collected homework assignments in classes), so it would be very good if I could use dedup on them. If completely avoiding zfs stream compression, and instead just piping through lz4 would work, that would be fine for me.
@williamstein I meant (a), but need to run a few more tests to be sure. That said, however, it's not clear to me why this is only every triggered by the act of receiving specifically-formatted send streams as opposed to normal filesystem access. I have a feeling that dbuf_fix_old_data()
may be involved but, again, I need to do a bit more testing.
Keep in mind that deduped send streams (zfs send -D) don't really have anything to do with ZFS' dedup facility but are more of a fancy form of compression; they (dedup send stream) can be received on a file system which is not dedup-enabled.
@dweeezil thanks. I also tested a lot yesterday, and (a) seems to be true, at least in all my tests. I'm now storing my zfs streams using the output of zfs send -DL ... | lz4 -c
.
@williamstein That's good information. For my part, I'll not likely have a chance to do more work on this until the coming weekend during which I'll try to track down how we're getting a compressed ARC buffer with an associated ARC header indicating no compression (which is what I think is causing the corruption I'm seeing even with the WIP patch referenced above).
I worked on this issue with @grwilson at the OpenZFS summit and I believe we've gotten to the root causes. There should be a PR posted within the next couple of days with a fix.
@dweeezil Thank you very much! I no longer have the exported stream because my storage capacity is very limited (also it contained lot of stuff I couldn't share).
I tested a complete version of the patch and it now receives the stream properly. After cleaning up the patch, it'll be posted as a PR tomorrow.
System information
Describe the problem you're observing
Saved a pool with the same version (fully upgraded pool)
zfs send -R -D -e -v rpool@beforesend
. When I want to recv it back I get the following BUGDescribe how to reproduce the problem
zfs upgrade with 0.7.0-rc3 zfs send -R -D -e -v <pool>@<snap> zfs revc -v -s -F <pool>
Include any warning/errors/backtraces from the system logs