Today's debug results:
By carefully using the method from zfs_revert described at #6497 and #5389, I managed to delete the latest TXG uberblock, and finally could import the pool.
I'll now start a scrub and sleep. Any more ideas? I doubt that retrying the recv will work, but maybe with other send options...
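For reference, a rough sketch of that check-import-scrub sequence (not the exact commands used; zdb -e -u simply prints the active uberblock of the exported pool):
zdb -e -u tank          # show the active uberblock (ub_txg, ub_timestamp) after the revert
zpool import tank
zpool scrub tank
zpool status -v tank    # re-run to follow scrub progress and any errors found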
@rincebrain could this be another variant of the zfs send bug?
@dioni21 If it's a send bug, I would swear it's not related, since ignore_hole_birth should explicitly disable ever using that kind of information, so even if there are more hole_birth bugs, it shouldn't apply here.
It'd have been nice if we could have somehow kept the mangled block around to compare and see what, if anything, else was hosed, but that's a burnt bridge now.
Given the number of bugs that appear to involve impossible values showing up in structures, I have to wonder if you've got a CPU or RAM problem with bits flipping.
@dioni21 If it's a send bug, I would swear it's not related, since ignore_hole_birth should explicitly disable ever using that kind of information, so even if there are more hole_birth bugs, it shouldn't apply here.
We have talked about hole_birth in the other issue, and I agreed. But this may be another new bug.
I've been told that using dedup is NOT recommended. Maybe most people do not use it, and do not see these bugs?
It'd have been nice if we could have somehow kept the mangled block around to compare and see what, if anything, else was hosed, but that's a burnt bridge now.
Indeed, I still have the second disk from the first copy, on which I am almost sure the very same bug happened, and upgrading the kernel only made it worse.
But what exactly do you want, and how do I extract it?
Here is a dump of the cleaned uberblock, for example:
000000 0000000000bab10c 0000000000001388
000010 0000000000002935 04046ce9920692a2
000020 000000005b53573a 0000000000000008
000030 0000000162322f60 0000000000000008
000040 00000001f85a0d38 0000000000000008
000050 000000029059e350 800b070200030003
000060 0000000000000000 0000000000000000
000070 0000000000000000 0000000000002935
000080 0000000000000128 00000004b32904f3
000090 000008caf06eb551 00083efb5508e826
0000a0 052b2ffd83f49c5d 0000000000001388
0000b0 00000000a11cea11 0000000000000000
0000c0 0000000000000000 0000000000000000
*
000fd0 0000000000000000 0210da7ab10c7a11
000fe0 94e69ffafcfd0351 4bace0fdcdc7c9a8
000ff0 a77f242abc04b785 8e777489ae6c7690
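A few of those words can be decoded by hand, assuming the usual uberblock layout (magic, version, txg, guid_sum, timestamp, then the root blkptr); the offsets here are an interpretation, not taken from the original report:
printf '%d\n' 0x1388            # second word: pool version (5000 = feature flags)
printf '%d\n' 0x2935            # third word: ub_txg of this uberblock
date -d @$((16#5b53573a))       # fifth word: ub_timestamp (GNU date)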
Given the number of bugs that appear to involve impossible values showing up in structures, I have to wonder if you've got a CPU or RAM problem with bits flipping.
If that were the problem, it would not be a repeated bug at the very same point (tank/jonny@2: destination already exists). Also, remember that after removing the --dedup flag from zfs send, I could match md5sums for 2.54 TB of content.
I really thought I had some bad hardware when I had similar issues as #7723 but that was simply a recent kernel bug. I have run memcheck and mprime, 24hrs each, without issues.
@dioni21 I'm not saying it can't be a bug, just that if it is, it's not related to the hole_birth bug, and other than bugs in the not-in-a-release-yet encryption code, I haven't seen any zfs send bugs that aren't filed by you in recent memory.
It's not impossible that nobody else has run into it, since the ZFS code doesn't have secret telemetry phoning home to report who uses what features, but particularly with how you're running into it reliably, I would be quite surprised if it's "just" that nobody who uses send -D reported a bug.
If it were something like #7723, then you'd be in a rather sad position, because that bug appears to have not been bad hardware, but instead, a bad kernel that stomped on memory.
If the bug is like #6439, then dnodesize=legacy might work around it too, and then we know that the codepaths that need examining are ones where send -D is enabled and dnodesize is not legacy, but that makes it still surprising that it's a reliably reproducible race condition.
Okay, I spent a little bit of time and couldn't easily reproduce this last night with those exact pool properties and send flags on a single-disk pool.
I'm trying to think of a good set of instructions to give you for further investigation without you having to share the datasets in question. If I had the send streams, I think what I'd do with them would be trying to send e.g. half of the incremental stream and seeing if it panics, if not, 75%, if so, 25%, and so on, and continue until I ended up with whichever specific record in the stream made it eat paste, and then go tear apart the send stream for all the records related to that object. (I'd probably cheat and start at something like ~1684 MiB into the incremental stream, since that's the last part of your output before it panics, but that's just a premature optimization.)
Another approach would be to do the above until I knew what specific item's records were inducing the panic, then, instead of trying to tear apart the send stream, using zdb to investigate the object(s) in question on the sender side and see how they're "fine" on the source but causing fires when recv'd.
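For that second approach, a minimal sketch of the sender-side inspection, assuming the offending object number has already been identified from the stream records (OBJECT is a placeholder):
zdb -e -dddd hgst/jonny OBJECT   # dump the dnode, its block pointers and indirection levels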
@rincebrain first of all, I want to thank you a lot for continuing to give this your attention!
I have some more data to add from recent tests.
Okay, I spent a little bit of time and couldn't easily reproduce this last night with those exact pool properties and send flags on a single-disk pool.
:cry:
I'm trying to think of a good set of instructions to give you for further investigation without you having to share the datasets in question. If I had the send streams, I think what I'd do with them would be trying to send e.g. half of the incremental stream and seeing if it panics, if not, 75%, if so, 25%, and so on, and continue until I ended up with whichever specific record in the stream made it eat paste, and then go tear apart the send stream for all the records related to that object. (I'd probably cheat and start at something like ~1684 MiB into the incremental stream, since that's the last part of your output before it panics, but that's just a premature optimization.)
Humm. I think it's feasible. I'll have to change mbuffer to dd so that I can see which block gives the error.
But now that you mention it, I noticed that each run stopped at a different point. Those that I have logged:
in @ 2813 KiB/s, out @ 0.0 KiB/s, 1684 MiB total, buffer 44% full
in @ 511 KiB/s, out @ 0.0 KiB/s, 736 MiB total, buffer 43% full
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 1315 MiB total, buffer 63% full
Every time I read from the original pool, not from a dump file.
It is possible that during some of these runs there was a background free operation from the last zfs destroy. For example, right now I am recreating the tank/jonny dataset from zero for another test, and I have this:
zpool get freeing
NAME PROPERTY VALUE SOURCE
hgst freeing 0 -
tank freeing 279G -
I'm not sure what this means, though...
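freeing is just space that is still being reclaimed asynchronously after a destroy; it should drain to zero on its own and can be watched, e.g. (a sketch):
watch -n 60 'zpool get -H -o value freeing tank'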
Another approach would be to do the above until I knew what specific item's records were inducing the panic, then, instead of trying to tear apart the send stream, using zdb to investigate the object(s) in question on the sender side and see how they're "fine" on the source but causing fires when recv'd.
One of the latest tests I've done was running zdb on the new pool, with it offline (exported). I ran the command zdb -e -cc -dd -D -h -s -AAA -bbb -vv -I 320 tank, which created 5.9GB of output but no errors.
I could run zdb overnight on the exported source pool; any suggestions?
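A sketch of what such an overnight run could look like, simply mirroring the flags used above against the exported source pool and capturing the (large) output to a file:
zpool export hgst
nohup zdb -e -cc -dd -D -h -s -AAA -bbb -vv -I 320 hgst > /d3/zdb-hgst.log 2>&1 &   # output path is just an example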
Yeah, I was thinking it might be a background processing task that mangled data, so if you decided to try chopping up the stream to figure out which part breaks it, I'd include something like a zpool sync; sleep 60; to see if it panicked at the point that you cut off the input.
If it were something like #7723, then you'd be in a rather sad position, because that bug appears to have not been bad hardware, but instead, a bad kernel that stomped on memory.
Yes, but I'm sure that my very first run was before upgrading to that kernel. And since I learned about it, I'm running the previously safe kernel again. No more panics due to that.
If the bug is like #6439, then dnodesize=legacy might work around it too, and then we know that the codepaths that need examining are ones where send -D is enabled and dnodesize is not legacy, but that makes it still surprising that it's a reliably reproducible race condition.
I thought about that too. I changed the new dataset to dnodesize=legacy, and removed xattr=sa to be sure. Also, I used a clean (no flags) zfs send command:
# zfs inherit xattr tank
# zfs set dnodesize=legacy tank
# zfs destroy tank/jonny
# zfs create tank/jonny
# zfs set compression=on secondarycache=all dedup=verify tank/jonny
# date;time zfs send hgst/jonny@pre_copy | mbuffer -m 4G -s 128k | time zfs receive -Fvs tank/jonny;date
Not finished yet. Currently on 160G of a 340G source dataset.
Also, a friend insisted that zpool import -F should fix the import, and that the zfs_revert script would not be needed. I did some tests. After the panic condition, without running zfs_revert, I did:
zpool import -Fn tank => panic
zpool import -F tank => panic
zpool import -FX tank => panic
I even removed the l2arc device (an LVM volume on an SSD); no change...
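One less destructive thing that might be worth trying before the rewind options is a read-only import; whether it avoids this particular panic is only a guess:
zpool import -o readonly=on tank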
Another try:
echo 1 >> /sys/module/zfs/parameters/zfs_recover
The panic message on import is now somewhat different, and it scared me a bit:
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000014991dbf has invalid TYPE 165
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000014991dbf has invalid CHECKSUM 91
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000014991dbf has invalid COMPRESS 91
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000014991dbf DVA 0 has invalid VDEV 2823991889
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000014991dbf DVA 1 has invalid VDEV 1384481044
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000014991dbf DVA 2 has invalid VDEV 1084271260
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000014542e0a has invalid TYPE 165
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000014542e0a has invalid CHECKSUM 91
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000014542e0a has invalid COMPRESS 91
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000014542e0a DVA 0 has invalid VDEV 2823991889
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000014542e0a DVA 1 has invalid VDEV 1384481044
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000014542e0a DVA 2 has invalid VDEV 1084271260
Jul 24 22:33:38 nexus kernel: BUG: unable to handle kernel
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000028f45670 has invalid CHECKSUM 0
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000028f45670 has invalid COMPRESS 0
Jul 24 22:33:38 nexus kernel: WARNING: blkptr at 0000000028f45670 DVA 0 has invalid VDEV 150402303
Jul 24 22:33:38 nexus kernel: NULL pointer dereference at 0000000000000110
Jul 24 22:33:38 nexus kernel: IP: ddt_object_lookup.constprop.6+0x7/0x50 [zfs]
Jul 24 22:33:38 nexus kernel: PGD 0 P4D 0
Jul 24 22:33:38 nexus kernel: Oops: 0000 [#1] SMP NOPTI
Jul 24 22:33:38 nexus kernel: Modules linked in: ppp_synctty ppp_async ppp_generic slhc bridge stp llc xt_REDIRECT nf_nat_redirect ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common xt_LOG ip6t_frag nf_conntrack_ipv6 nf_defrag_ipv6 xt_multiport xt_conntrack nf_conntrack libcrc32c xt_mark ip6table_filter ip6_tables it87 hwmon_vid zfs(POE) zunicode(POE) zavl(POE) icp(POE) snd_hda_codec_hdmi zcommon(POE) znvpair(POE) edac_mce_amd spl(OE) crct10dif_pclmul crc32_pclmul snd_hda_codec_realtek ghash_clmulni_intel snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core fam15h_power snd_hwdep k10temp snd_seq snd_seq_device snd_pcm
Jul 24 22:33:38 nexus kernel: snd_timer snd sp5100_tco soundcore i2c_piix4 shpchp acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc raid1 crc32c_intel e100 r8169 mii uas usb_storage nouveau video mxm_wmi wmi i2c_algo_bit drm_kms_helper ttm drm
Jul 24 22:33:38 nexus kernel: CPU: 1 PID: 11036 Comm: z_rd_iss Tainted: P OE 4.16.16-200.fc27.x86_64 #1
Jul 24 22:33:38 nexus kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./970A-D3P, BIOS FD 02/26/2016
Jul 24 22:33:38 nexus kernel: RIP: 0010:ddt_object_lookup.constprop.6+0x7/0x50 [zfs]
Jul 24 22:33:38 nexus kernel: RSP: 0018:ffff999368103c98 EFLAGS: 00010246
Jul 24 22:33:38 nexus kernel: RAX: ffff8adf47b80e00 RBX: ffff8adf47b80e00 RCX: 0000000000000000
Jul 24 22:33:38 nexus kernel: RDX: ffff8adf47b80e00 RSI: 0000000000000000 RDI: 0000000000000070
Jul 24 22:33:38 nexus kernel: RBP: 0000000000000070 R08: ffffb9933fc44980 R09: ffff999368103d80
Jul 24 22:33:38 nexus kernel: R10: 0000000000000000 R11: 0000000000000f01 R12: 00000000ffffffff
Jul 24 22:33:38 nexus kernel: R13: 0000000000080000 R14: 0000000000000070 R15: 0000000000000002
Jul 24 22:33:38 nexus kernel: FS: 0000000000000000(0000) GS:ffff8adf7ec40000(0000) knlGS:0000000000000000
Jul 24 22:33:38 nexus kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 24 22:33:38 nexus kernel: CR2: 0000000000000110 CR3: 0000000813610000 CR4: 00000000000406e0
Jul 24 22:33:38 nexus kernel: Call Trace:
Jul 24 22:33:38 nexus kernel: ddt_repair_start+0x3e/0xb0 [zfs]
Jul 24 22:33:38 nexus kernel: zio_ddt_read_start+0x58/0x1b0 [zfs]
Jul 24 22:33:38 nexus kernel: ? taskq_dispatch_ent+0x12d/0x1b0 [spl]
Jul 24 22:33:38 nexus kernel: ? zio_taskq_member.isra.7.constprop.13+0x70/0x70 [zfs]
Jul 24 22:33:38 nexus kernel: ? __wake_up_common_lock+0x89/0xc0
Jul 24 22:33:38 nexus kernel: zio_execute+0x87/0xe0 [zfs]
Jul 24 22:33:38 nexus kernel: taskq_thread+0x2d1/0x530 [spl]
Jul 24 22:33:38 nexus kernel: ? wake_up_q+0x70/0x70
Jul 24 22:33:38 nexus kernel: ? zio_taskq_member.isra.7.constprop.13+0x70/0x70 [zfs]
Jul 24 22:33:38 nexus kernel: ? taskq_thread_spawn+0x50/0x50 [spl]
Jul 24 22:33:38 nexus kernel: kthread+0x113/0x130
Jul 24 22:33:38 nexus kernel: ? kthread_create_worker_on_cpu+0x70/0x70
Jul 24 22:33:38 nexus kernel: ret_from_fork+0x22/0x40
Jul 24 22:33:38 nexus kernel: Code: 00 00 00 48 c7 c6 d0 1f a8 c0 48 c7 c7 58 98 a9 c0 e8 8e ff 07 00 b8 02 00 00 00 c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 89 f6 <48> 8b b4 f7 a0 00 00 00 48 85 f6 74 13 48 8b bf 90 00 00 00 48
Jul 24 22:33:38 nexus kernel: RIP: ddt_object_lookup.constprop.6+0x7/0x50 [zfs] RSP: ffff999368103c98
Jul 24 22:33:38 nexus kernel: CR2: 0000000000000110
Jul 24 22:33:38 nexus kernel: ---[ end trace b775a1be84906163 ]---
Jul 24 22:33:38 nexus kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
Jul 24 22:33:38 nexus kernel: IP: zio_vdev_child_io+0x33/0xf0 [zfs]
Jul 24 22:33:38 nexus kernel: PGD 0 P4D 0
Jul 24 22:33:38 nexus kernel: Oops: 0000 [#2] SMP NOPTI
Jul 24 22:33:38 nexus kernel: Modules linked in: ppp_synctty ppp_async ppp_generic slhc bridge stp llc xt_REDIRECT nf_nat_redirect ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common xt_LOG ip6t_frag nf_conntrack_ipv6 nf_defrag_ipv6 xt_multiport xt_conntrack nf_conntrack libcrc32c xt_mark ip6table_filter ip6_tables it87 hwmon_vid zfs(POE) zunicode(POE) zavl(POE) icp(POE) snd_hda_codec_hdmi zcommon(POE) znvpair(POE) edac_mce_amd spl(OE) crct10dif_pclmul crc32_pclmul snd_hda_codec_realtek ghash_clmulni_intel snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core fam15h_power snd_hwdep k10temp snd_seq snd_seq_device snd_pcm
Jul 24 22:33:38 nexus kernel: snd_timer snd sp5100_tco soundcore i2c_piix4 shpchp acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc raid1 crc32c_intel e100 r8169 mii uas usb_storage nouveau video mxm_wmi wmi i2c_algo_bit drm_kms_helper ttm drm
Jul 24 22:33:38 nexus kernel: CPU: 4 PID: 10961 Comm: zpool Tainted: P D OE 4.16.16-200.fc27.x86_64 #1
Jul 24 22:33:38 nexus kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./970A-D3P, BIOS FD 02/26/2016
Jul 24 22:33:38 nexus kernel: RIP: 0010:zio_vdev_child_io+0x33/0xf0 [zfs]
Jul 24 22:33:38 nexus kernel: RSP: 0018:ffff999367e8f528 EFLAGS: 00010202
Jul 24 22:33:38 nexus kernel: RAX: ffff8adf4c0b7cd0 RBX: ffff8adf3c9ec900 RCX: 0000000000000000
Jul 24 22:33:38 nexus kernel: RDX: 0000000000000000 RSI: ffff8adf3c9ec970 RDI: 0000000000000001
Jul 24 22:33:38 nexus kernel: RBP: 0000000000000000 R08: ffff8adf4c0b7cd0 R09: 0000000000000200
Jul 24 22:33:38 nexus kernel: R10: ffff8adea85ce098 R11: 0000000000f80000 R12: 0000000000000004
Jul 24 22:33:38 nexus kernel: R13: 0000000000000001 R14: ffff8adea85ce080 R15: 0000000000000001
Jul 24 22:33:38 nexus kernel: FS: 00007f4687a1d7c0(0000) GS:ffff8adf7ed00000(0000) knlGS:0000000000000000
Jul 24 22:33:38 nexus kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 24 22:33:38 nexus kernel: CR2: 0000000000000090 CR3: 00000007fd6be000 CR4: 00000000000406e0
Jul 24 22:33:38 nexus kernel: Call Trace:
Jul 24 22:33:38 nexus kernel: vdev_mirror_io_start+0x136/0x170 [zfs]
Jul 24 22:33:38 nexus kernel: ? vdev_mirror_open+0x120/0x120 [zfs]
Jul 24 22:33:38 nexus kernel: zio_vdev_io_start+0x186/0x330 [zfs]
Jul 24 22:33:38 nexus kernel: ? tsd_get_by_thread+0x2a/0x40 [spl]
Jul 24 22:33:38 nexus kernel: ? taskq_member+0x14/0x20 [spl]
Jul 24 22:33:38 nexus kernel: zio_nowait+0xa3/0x140 [zfs]
Jul 24 22:33:38 nexus kernel: spa_load_verify_cb+0x170/0x1a0 [zfs]
Jul 24 22:33:38 nexus kernel: traverse_visitbp+0x1b7/0x990 [zfs]
Jul 24 22:33:38 nexus kernel: traverse_dnode+0xf1/0x1c0 [zfs]
Jul 24 22:33:38 nexus kernel: traverse_visitbp+0x716/0x990 [zfs]
Jul 24 22:33:38 nexus kernel: traverse_visitbp+0x337/0x990 [zfs]
Jul 24 22:33:38 nexus kernel: message repeated 4 times: [ traverse_visitbp+0x337/0x990 [zfs]]
Jul 24 22:33:38 nexus kernel: traverse_dnode+0xf1/0x1c0 [zfs]
Jul 24 22:33:38 nexus kernel: traverse_visitbp+0x7e3/0x990 [zfs]
Jul 24 22:33:38 nexus kernel: ? mutex_lock+0xe/0x30
Jul 24 22:33:38 nexus kernel: ? spl_kmem_cache_free+0x28/0x1d0 [spl]
Jul 24 22:33:38 nexus kernel: traverse_impl+0x1e6/0x450 [zfs]
Jul 24 22:33:38 nexus kernel: traverse_dataset_resume+0x42/0x50 [zfs]
Jul 24 22:33:38 nexus kernel: ? spa_async_suspend+0x90/0x90 [zfs]
Jul 24 22:33:38 nexus kernel: traverse_pool+0x15e/0x190 [zfs]
Jul 24 22:33:38 nexus kernel: ? spa_async_suspend+0x90/0x90 [zfs]
Jul 24 22:33:38 nexus kernel: spa_load+0x1ca2/0x2120 [zfs]
Jul 24 22:33:38 nexus kernel: spa_load_best+0x58/0x270 [zfs]
Jul 24 22:33:38 nexus kernel: ? zpool_get_rewind_policy+0x188/0x190 [zcommon]
Jul 24 22:33:38 nexus kernel: spa_import+0x20a/0x710 [zfs]
Jul 24 22:33:38 nexus kernel: zfs_ioc_pool_import+0x137/0x150 [zfs]
Jul 24 22:33:38 nexus kernel: zfsdev_ioctl+0x1d8/0x610 [zfs]
Jul 24 22:33:38 nexus kernel: do_vfs_ioctl+0xa4/0x620
Jul 24 22:33:38 nexus kernel: SyS_ioctl+0x74/0x80
Jul 24 22:33:38 nexus kernel: do_syscall_64+0x74/0x180
Jul 24 22:33:38 nexus kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Jul 24 22:33:38 nexus kernel: RIP: 0033:0x7f46855540f7
Jul 24 22:33:38 nexus kernel: RSP: 002b:00007ffd350c2a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jul 24 22:33:38 nexus kernel: RAX: ffffffffffffffda RBX: 00007ffd350c2a70 RCX: 00007f46855540f7
Jul 24 22:33:38 nexus kernel: RDX: 00007ffd350c2a70 RSI: 0000000000005a02 RDI: 0000000000000003
Jul 24 22:33:38 nexus kernel: RBP: 0000561a64d812b0 R08: 0000561a64da5670 R09: 0000000000000078
Jul 24 22:33:38 nexus kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000561a64d86b88
Jul 24 22:33:38 nexus kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Jul 24 22:33:38 nexus kernel: Code: 89 fb 8b 7c 24 18 48 89 d5 83 ff 01 0f 85 b5 00 00 00 48 85 f6 0f 84 ac 00 00 00 81 a3 d8 02 00 00 ff ff bf ff 41 bb 00 00 f8 00 <48> 83 bd 90 00 00 00 00 75 07 48 81 c1 00 00 40 00 8b 83 d0 02
Jul 24 22:33:38 nexus kernel: RIP: zio_vdev_child_io+0x33/0xf0 [zfs] RSP: ffff999367e8f528
Jul 24 22:33:38 nexus kernel: CR2: 0000000000000090
Jul 24 22:33:38 nexus kernel: ---[ end trace b775a1be84906164 ]---
Not good, let's disable it for the next tests:
echo 0 >> /sys/module/zfs/parameters/zfs_recover
That doesn't look especially surprising - it's trying to repair a DDT entry and the bp is NULL, so it tries to reach into the bp and NULL pointer+epsilon dereferences.
I think most of the settings you used on tank/jonny are academic, since zfs send -R will overwrite them with whatever the settings were on the source side, unless you use -x or -o to override anything in received datasets.
I think most of the settings you used on tank/jonny are academic, since zfs send -R will overwrite them with whatever the settings were on the source side, unless you use -x or -o to override anything in received datasets.
Note that in this run I did not use send -R. I wanted to run a send without any arguments, to make sure they are not the problem.
Current settings are:
# zfs get all tank/jonny | grep -v default | grep -v -- -
NAME PROPERTY VALUE SOURCE
tank/jonny sharenfs rw=@192.168.0.0/24,no_subtree_check,no_root_squash,async inherited from tank
tank/jonny compression on local
tank/jonny atime off inherited from tank
tank/jonny secondarycache all local
tank/jonny dedup verify local
tank/jonny dnodesize legacy inherited from tank
tank/jonny relatime off inherited from tank
I thought we were already certain that the presence or absence of the -D flag induces this problem, from #7703?
I thought we were already certain that the presence or absence of the -D flag induces this problem, from #7703?
I think this is another bug. Maybe related to dedup, but not to zfs send -D (--dedup).
Since we found that zfs send -D has a bug, I restarted from a zeroed-out disk (dd if=/dev/zero of=/dev/sdf) and began a new copy. As stated in the first message of this issue:
# zpool create tank -o listsnapshots=on -o ashift=12 -o comment='Jonny 10T Seagate' ata-ST10000NM0016-1TT101_ZA22TBDD-part1 cache /dev/mapper/ssd-cache
# zfs set dedup=verify dnodesize=auto atime=off relatime=off secondarycache=metadata xattr=sa compression=on tank
# date;zfs send --props --large-block --replicate --embed --compressed --replicate hgst@pre_copy | mbuffer -m 4G -s 128k | zfs receive -Fvs tank;date
Thu Jul 19 08:42:29 -03 2018
...
(Humm, maybe the zfs set above was academic... :joy:)
And on the second pass:
date;zfs send --props --large-block --replicate --embed --compressed --replicate -I pre_copy hgst@2 | mbuffer -m 4G -s 128k | zfs receive -Fvs tank;date
Sat Jul 21 12:51:55 -03 2018
Where it stopped after sending two good datasets.
BTW: I recently renamed the @2 snapshot to @step2. I think I saw some bug related to single-character snapshot names. But nothing changed...
Ah, I hadn't seen that you had changed the send command to not include dedup above.
I...don't recall any such bug about snapshot names?
Since I'm guessing that something about the send stream is making the receiver scribble over things wildly, I'd probably suggest at least trying to build a kASAN kernel+modules and seeing if it can tell you anything useful about the memory accesses.
It might not work, since there's definitely a chance it's processing "valid" data and ending up with its pants on its head, but it's worth a try.
Beware, long text below...
Since I'm guessing that something about the send stream is making the receiver scribble over things wildly, I'd probably suggest at least trying to build a kASAN kernel+modules and seeing if it can tell you anything useful about the memory accesses.
I did not know about KASAN; very interesting...
But today I was thinking about moving to the ZFS master code. 0.7 is lagging behind on many features and bug fixes.
For example, the only issue I could find on GitHub with the same error message is #2739. It appears to be related to clones, which I don't have. The issue is closed and marked for the 0.8.0 milestone, but I could not find any patch. Is it solved?
Do you think this is a bad idea? I have been in master sometime ago, but downgraded to stable after finding its yum repo, and hoping for more stability.
Anyway, today's test report:
Yesterday I said:
I noticed that each run stopped at a different point. Those that I have logged
I was not sure whether the reason for this was that the send streams had been generated with mixed options.
So I did some runs with the exact same input. First, creating the input and saving it on an ext4 filesystem:
# date;zfs send -I pre_copy hgst/jonny@step2 | mbuffer -m 4G -s 128k -o zfs-send-hgst_jonny-pre_copy_to_step2.stream ;date
Wed Jul 25 21:45:37 -03 2018
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 3908 MiB total, buffer 0% full
summary: 3908 MiByte in 16min 48.2sec - average of 3969 KiB/s
Wed Jul 25 22:02:25 -03 2018
Now, let's do some receive/panic/revert/reboot/retry cycles:
# date; dd if=/d3/zfs-send-hgst_jonny-pre_copy_to_step2.stream bs=4096 status=progress | zfs receive -v tank/jonny;date
Thu Jul 26 07:49:11 -03 2018
receiving incremental stream of hgst/jonny@step2 into tank/jonny@step2
1340084224 bytes (1.3 GB, 1.2 GiB) copied, 15.0005 s, 89.3 MB/s
Message from syslogd@nexus at Jul 26 07:50:00 ...
kernel:[72552.096303] PANIC: blkptr at 000000004f269913 has invalid TYPE 165
Message from syslogd@nexus at Jul 26 07:50:00 ...
kernel:PANIC: blkptr at 000000004f269913 has invalid TYPE 165
^C
336739+0 records in
336738+0 records out
1379278848 bytes (1.4 GB, 1.3 GiB) copied, 236.332 s, 5.8 MB/s
The CTRL-C was needed to stop dd and get its last status. zfs recv was frozen.
revert hack
reboot
import
rollback tank/jonny@pre_copy
Let's try again:
# date; dd if=/d3/zfs-send-hgst_jonny-pre_copy_to_step2.stream bs=4096 status=progress | zfs receive -v tank/jonny;date
Thu Jul 26 08:01:49 -03 2018
receiving incremental stream of hgst/jonny@step2 into tank/jonny@step2
1379278848 bytes (1.4 GB, 1.3 GiB) copied, 59.7849 s, 23.1 MB/s
Message from syslogd@nexus at Jul 26 08:06:47 ...
kernel:[ 660.411016] PANIC: blkptr at 000000003055bd39 has invalid TYPE 165
Message from syslogd@nexus at Jul 26 08:06:47 ...
kernel:PANIC: blkptr at 000000003055bd39 has invalid TYPE 165
^C
336739+0 records in
336738+0 records out
1379278848 bytes (1.4 GB, 1.3 GiB) copied, 312.546 s, 4.4 MB/s
...
# date; dd if=/d3/zfs-send-hgst_jonny-pre_copy_to_step2.stream bs=4096 status=progress | zfs receive -v tank/jonny;date
Thu Jul 26 08:14:11 -03 2018
receiving incremental stream of hgst/jonny@step2 into tank/jonny@step2
1379278848 bytes (1.4 GB, 1.3 GiB) copied, 43.0477 s, 32.0 MB/s
Message from syslogd@nexus at Jul 26 08:19:04 ...
kernel:[ 585.697510] PANIC: blkptr at 000000004bb3bfc3 has invalid TYPE 165
Message from syslogd@nexus at Jul 26 08:19:04 ...
kernel:PANIC: blkptr at 000000004bb3bfc3 has invalid TYPE 165
^C
336739+0 records in
336738+0 records out
1379278848 bytes (1.4 GB, 1.3 GiB) copied, 6503.44 s, 212 kB/s
There is a long delay between zfs recv stopping and the panic itself. I think that ZFS is processing received data. Is this why you said to use zpool sync?
Yeah, I was thinking it might be a background processing task that mangled data, so if you decided to try chopping up the stream to figure out which part breaks it, I'd include something like a zpool sync; sleep 60; to see if it panicked at the point that you cut off the input.
You also said before:
If I had the send streams, I think what I'd do with them would be trying to send e.g. half of the incremental stream and seeing if it panics, if not, 75%, if so, 25%, and so on, and continue until I ended up with whichever specific record in the stream made it eat paste, and then go tear apart the send stream for all the records related to that object.
Knowing that it always stops around 1.4G, should I chop the input file to 75% (and so on) of this size and try again? I suspect your intention is to see how far the stream can go before returning EEXISTS, right? I could do this tomorrow.
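Something along these lines could work as the bisection step, assuming GNU head and the stream file saved above (the 1050M prefix is just an example, roughly 75% of where it dies):
head -c 1050M /d3/zfs-send-hgst_jonny-pre_copy_to_step2.stream | zfs receive -v tank/jonny
zpool sync tank ; sleep 60               # flush pending txgs before judging, as suggested
zfs rollback -r tank/jonny@pre_copy      # reset before trying the next (larger or smaller) prefix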
But let's continue with the tests I already did:
That doesn't look especially surprising - it's trying to repair a DDT entry and the bp is NULL, so it tries to reach into the bp and NULL pointer+epsilon dereferences.
Suspecting something in the dedup code, I did an incremental transfer with dedup disabled:
zfs set dedup=off tank/jonny
# date; dd if=/d3/zfs-send-hgst_jonny-pre_copy_to_step2.stream bs=4096 status=progress | zfs receive -v tank/jonny;date
Thu Jul 26 10:08:19 -03 2018
receiving incremental stream of hgst/jonny@step2 into tank/jonny@step2
1275166720 bytes (1.3 GB, 1.2 GiB) copied, 11 s, 116 MB/s
Message from syslogd@nexus at Jul 26 10:08:33 ...
kernel:[ 159.986948] PANIC: blkptr at 00000000db484e49 has invalid TYPE 165
Message from syslogd@nexus at Jul 26 10:08:33 ...
kernel:PANIC: blkptr at 00000000db484e49 has invalid TYPE 165
^C
336739+0 records in
336738+0 records out
1379278848 bytes (1.4 GB, 1.3 GiB) copied, 65.969 s, 20.9 MB/s
No luck yet, but the recv processing was much faster without the dedup code. ;-)
And what if I copied the WHOLE dataset without dedup?
# time zfs destroy -r tank/jonny
real 0m54.444s
user 0m0.007s
sys 0m0.008s
# time zfs create tank/jonny
real 5m33.528s
user 0m0.015s
sys 0m0.030s
# time zfs set compression=on secondarycache=all dedup=off tank/jonny
real 0m57.872s
user 0m0.004s
sys 0m0.006s
The last commands took far too long, probably because of background freeing on a SATA disk. So let's wait for it to finish before continuing. I will not be present, so I automated it:
# while [[ $(zpool get -Hp -o value freeing tank) != 0 ]] ; do date ; sleep 60 ; done ; date;time zfs send hgst/jonny@pre_copy | mbuffer -m 4G -s 128k | time zfs receive -Fvs tank/jonny;date
Thu Jul 26 12:00:35 -03 2018
receiving full stream of hgst/jonny@pre_copy into tank/jonny@pre_copy
in @ 43.4 MiB/s, out @ 43.4 MiB/s, 356 GiB total, buffer 0% fullll
summary: 356 GiByte in 1h 29min 06.8sec - average of 68.2 MiB/s
received 356G stream in 5346 seconds (68.2M/sec)
real 89m7.509s
user 0m14.962s
sys 32m13.821s
Thu Jul 26 13:29:42 -03 2018
When I got home, I did the incremental part:
date;cat /d3/zfs-send-hgst_jonny-pre_copy_to_step2.stream | zfs receive -Fvs tank/jonny; date
Thu Jul 26 23:15:12 -03 2018
receiving incremental stream of hgst/jonny@step2 into tank/jonny@step2
cannot restore to tank/jonny@step2: destination already exists
Thu Jul 26 23:15:28 -03 2018
Hey! Error, but no panic??? Wait?
zfs receive -A tank/jonny
Now, PANIC!
Jul 26 23:17:10 nexus kernel: PANIC: blkptr at 000000000e4e385f has invalid TYPE 165
Jul 26 23:17:10 nexus kernel: Showing stack for process 16044
Jul 26 23:17:10 nexus kernel: CPU: 0 PID: 16044 Comm: txg_sync Tainted: P OE 4.16.16-200.fc27.x86_64 #1
Jul 26 23:17:10 nexus kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./970A-D3P, BIOS FD 02/26/2016
Jul 26 23:17:10 nexus kernel: Call Trace:
Jul 26 23:17:10 nexus kernel: dump_stack+0x5c/0x85
Jul 26 23:17:10 nexus kernel: vcmn_err+0xbc/0xf0 [spl]
Jul 26 23:17:10 nexus kernel: ? _cond_resched+0x15/0x40
Jul 26 23:17:10 nexus kernel: ? __kmalloc+0x18a/0x220
Jul 26 23:17:10 nexus kernel: ? __sg_alloc_table+0x7f/0x150
Jul 26 23:17:10 nexus kernel: ? sg_init_table+0x11/0x30
Jul 26 23:17:10 nexus kernel: ? __sg_alloc_table+0x9b/0x150
blablabla...
And lastly, something I did earlier. Only two datasets did not receive the incremental stream. Let's see if only one dataset is problematic.
date; zfs send -I pre_copy hgst/dedup@step2 | mbuffer -m 4G -s 128k -o /d3/zfs-send-hgst_dedup-pre_copy_to_step2.stream
Wed Jul 25 22:31:06 -03 2018
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 38.4 GiB total, buffer 0% full
summary: 38.4 GiByte in 8min 22.0sec - average of 78.3 MiB/s
Hey, no panic! And no need to rollback, since I did not use recv -s. Also, no need to reboot! Quick tests! Let's try again, and see if it always stops at the same point:
# date; dd if=/d3/zfs-send-hgst_dedup-pre_copy_to_step2.stream bs=4096 status=progress | zfs receive -v tank/dedup;date
Wed Jul 25 23:01:29 -03 2018
receiving incremental stream of hgst/dedup@step2 into tank/dedup@step2
4626763776 bytes (4.6 GB, 4.3 GiB) copied, 72.0069 s, 64.3 MB/scannot restore to tank/dedup@step2: destination already exists
Wed Jul 25 23:03:01 -03 2018
# date; dd if=/d3/zfs-send-hgst_dedup-pre_copy_to_step2.stream bs=4096 status=progress | zfs receive -v tank/dedup;date
Wed Jul 25 23:04:07 -03 2018
receiving incremental stream of hgst/dedup@step2 into tank/dedup@step2
4619636736 bytes (4.6 GB, 4.3 GiB) copied, 68.0955 s, 67.8 MB/scannot restore to tank/dedup@step2: destination already exists
Wed Jul 25 23:05:37 -03 2018
# date; dd if=/d3/zfs-send-hgst_dedup-pre_copy_to_step2.stream bs=4096 status=progress | zfs receive -v tank/dedup;date
Wed Jul 25 23:31:28 -03 2018
receiving incremental stream of hgst/dedup@step2 into tank/dedup@step2
4542021632 bytes (4.5 GB, 4.2 GiB) copied, 70.0269 s, 64.9 MB/scannot restore to tank/dedup@step2: destination already exists
Wed Jul 25 23:32:58 -03 2018
# date; dd if=/d3/zfs-send-hgst_dedup-pre_copy_to_step2.stream bs=4096 status=progress | zfs receive -v tank/dedup;date
Again in the morning:
Thu Jul 26 07:02:37 -03 2018
receiving incremental stream of hgst/dedup@step2 into tank/dedup@step2
4483534848 bytes (4.5 GB, 4.2 GiB) copied, 68.0381 s, 65.9 MB/scannot restore to tank/dedup@step2: destination already exists
Thu Jul 26 07:04:06 -03 2018
# date; dd if=/d3/zfs-send-hgst_dedup-pre_copy_to_step2.stream bs=4096 status=progress | zfs receive -v tank/dedup;date
Thu Jul 26 07:09:44 -03 2018
receiving incremental stream of hgst/dedup@step2 into tank/dedup@step2
4603531264 bytes (4.6 GB, 4.3 GiB) copied, 60 s, 76.7 MB/s cannot restore to tank/dedup@step2: destination already exists
Thu Jul 26 07:11:14 -03 2018
# date; dd if=/d3/zfs-send-hgst_dedup-pre_copy_to_step2.stream bs=4096 status=progress | zfs receive -v tank/dedup;date
Thu Jul 26 07:14:08 -03 2018
receiving incremental stream of hgst/dedup@step2 into tank/dedup@step2
4619636736 bytes (4.6 GB, 4.3 GiB) copied, 69.0036 s, 66.9 MB/scannot restore to tank/dedup@step2: destination already exists
Thu Jul 26 07:15:37 -03 2018
So it is not exact, but the problem always occurs near the same point.
The good news is that I can use this stream to do the stream partitioning test, since I do not need to reboot to retry.
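Since this stream reproduces the failure without a panic, one way to start tearing it apart, as suggested earlier, could be zstreamdump (a sketch; the path is the one from the log above):
zstreamdump -v < /d3/zfs-send-hgst_dedup-pre_copy_to_step2.stream | less   # per-record headers; add -d to dump record contents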
The bad news is that maybe there are TWO bugs. One for the EEXISTS, and another for the data corruption.
What's next?
Well, I will stop tests for a while, and wait for your suggestion on the next step:
I love to debug, but this is taking too much time already, and my family wants to use the home primary computer, which is somewhat on hold.
@rincebrain Any suggestion from the above?
I'm inclined to go master and restart send/recv with a clean pool. But running on master could have been the initial problem.
BTW: I already compiled the master modules (commit id fb7307b8922e9eb7da430733d6665e06d2cb0bf5), and gave zdb another chance, this time compiled with AddressSanitizer. Pool tank was exported, and zdb, although from "the future", should be backwards compatible with an older pool. So I thought it could give some info. These are the results:
# /root/zfs/zfs/cmd/zdb/zdb -ebccv -I 2400 tank
Traversing all blocks to verify checksums and verify nothing leaked ...
loading concrete vdev 0, metaslab 144 of 145 ...
13.8G completed ( 6MB/s) estimated time remaining: 121hr 24min 25sec ASAN:DEADLYSIGNAL
=================================================================
==3200==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000030 (pc 0x000000417b6d bp 0x7fd94c035300 sp 0x7ffdb0411150 T0)
==3200==The signal is caused by a READ memory access.
==3200==Hint: address points to the zero page.
#0 0x417b6c in zdb_count_block /root/zfs/zfs/cmd/zdb/zdb.c:3278
#1 0x427056 in zdb_blkptr_cb /root/zfs/zfs/cmd/zdb/zdb.c:3356
#2 0x7fd962ff5ddc in traverse_visitbp ../../module/zfs/dmu_traverse.c:285
#3 0x7fd962ff694a in traverse_visitbp ../../module/zfs/dmu_traverse.c:321
#4 0x7fd962ff7a79 in traverse_dnode ../../module/zfs/dmu_traverse.c:501
#5 0x7fd962ff6550 in traverse_visitbp ../../module/zfs/dmu_traverse.c:357
#6 0x7fd962ff694a in traverse_visitbp ../../module/zfs/dmu_traverse.c:321
#7 0x7fd962ff694a in traverse_visitbp ../../module/zfs/dmu_traverse.c:321
#8 0x7fd962ff694a in traverse_visitbp ../../module/zfs/dmu_traverse.c:321
#9 0x7fd962ff694a in traverse_visitbp ../../module/zfs/dmu_traverse.c:321
#10 0x7fd962ff694a in traverse_visitbp ../../module/zfs/dmu_traverse.c:321
#11 0x7fd962ff7a79 in traverse_dnode ../../module/zfs/dmu_traverse.c:501
#12 0x7fd962ff6cba in traverse_visitbp ../../module/zfs/dmu_traverse.c:398
#13 0x7fd962ff832b in traverse_impl ../../module/zfs/dmu_traverse.c:666
#14 0x7fd962ff8930 in traverse_dataset_resume ../../module/zfs/dmu_traverse.c:694
#15 0x7fd962ff8c38 in traverse_pool ../../module/zfs/dmu_traverse.c:758
#16 0x418c1e in dump_block_stats /root/zfs/zfs/cmd/zdb/zdb.c:4107
#17 0x426824 in dump_zpool /root/zfs/zfs/cmd/zdb/zdb.c:5038
#18 0x408613 in main /root/zfs/zfs/cmd/zdb/zdb.c:5841
#19 0x7fd961e20f29 in __libc_start_main (/lib64/libc.so.6+0x20f29)
#20 0x408d59 in _start (/root/zfs/zfs/cmd/zdb/.libs/lt-zdb+0x408d59)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /root/zfs/zfs/cmd/zdb/zdb.c:3278 in zdb_count_block
==3200==ABORTING
The error occurred at this part of the file:
3264 if (BP_GET_DEDUP(bp)) {
3265 ddt_t *ddt;
3266 ddt_entry_t *dde;
3267
3268 ddt = ddt_select(zcb->zcb_spa, bp);
3269 ddt_enter(ddt);
3270 dde = ddt_lookup(ddt, bp, B_FALSE);
3271
3272 if (dde == NULL) {
3273 refcnt = 0;
3274 } else {
3275 ddt_phys_t *ddp = ddt_phys_select(dde, bp);
3276 ddt_phys_decref(ddp);
3277 refcnt = ddp->ddp_refcnt;
3278 if (ddt_phys_total_refcnt(dde) == 0)
3279 ddt_remove(ddt, dde);
3280 }
3281 ddt_exit(ddt);
3282 }
So, can I conclude that there is a serious problem in my dedup tables?
The problem is: what pointer has the value 0x0030? Nothing in lines 3278/3279 appears to be a pointer operation. Maybe the real error was in line 3277, when dereferencing ddp, which came from ddt_phys_select? But it does not make much sense to me...
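One way to check that hypothesis, assuming the binary carries debug info (the -g -O2 mentioned below), is to ask gdb where ddp_refcnt sits inside ddt_phys_t:
gdb -batch -ex 'print /x (unsigned long) &((ddt_phys_t *) 0)->ddp_refcnt' /root/zfs/zfs/cmd/zdb/.libs/lt-zdb
# if this prints 0x30, the faulting read is consistent with ddp being NULL when its refcnt is touched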
Right now my source pool has a dedupratio of 1.44x, and this is one of the main reasons that I use ZFS.
I mean, if ddt_phys_total_refcnt is inlined/a macro, that'd explain your line numbering.
I would not suggest using master without having good backups, not because I know of a great many outstanding bugs against it, but because a lot of big features have been merged, and some of them have had bugs found in the past.
I think we already knew your DDT was sometimes getting mangled from #7703 - though I thought you had started receiving without dedup being enabled at all to reproduce this.
I mean, if ddt_phys_total_refcnt is inlined/a macro, that'd explain your line numbering.
Compilation options, by default, are -g -O2. IIRC, this does not enable auto-inline, and the source does not request inline.
I would not suggest using master without having good backups, not because I know of a great many outstanding bugs against it, but because a lot of big features have been merged, and some of them have had bugs found in the past.
Yes, but the source pool has already been on master, some time ago. And I'm running out of options... :-(
I think we already knew your DDT was sometimes getting mangled from #7703 - though I thought you had started receiving without dedup being enabled at all to reproduce this.
I did this only for a single dataset. Not for the whole pool.
Ah, you only disabled dedup for one dataset.
So, I have two thoughts.
One is that, given that you need this data to be up and running for others, you should probably just resort to rsyncing and recreating the snapshots, rather than debugging this, as much as I would prefer to find out what's rotten in the state of Denmark.
The other is that, assuming you want to keep debugging this, what you'd need would be the exact instructions where ASAN is reporting a NULL read to be certain what the operation involved is - basically looking at the output of the disas[semble] command from gdb on the zdb binary in question, or if you can convince ASAN to not try to list symbols, and instead list specific offsets into the function (which is probably there in the difference between 0x417b6c and where zdb_count_block "starts", e.g. info line zdb_count_block on the zdb binary will probably say something like starting at 0x417b00 and then you'd want to look at the instructions around the pc location from ASAN's output to figure out what the hell it was actually noticing on fire).
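A sketch of that inspection, using the pc and the binary path from the ASAN report above:
gdb -batch -ex 'info line zdb_count_block' -ex 'disassemble 0x417b6c' /root/zfs/zfs/cmd/zdb/.libs/lt-zdb
addr2line -f -C -e /root/zfs/zfs/cmd/zdb/.libs/lt-zdb 0x417b6c   # map the faulting pc straight to a source statement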
One is that, given that you need this data to be up and running for others, you should probably just resort to rsyncing and recreating the snapshots, rather than debugging this, as much as I would prefer to find out what's rotten in the state of Denmark.
:joy::joy::joy::joy::joy::joy::joy::joy:
The other is that, assuming you want to keep debugging this, what you'd need would be the exact instructions where ASAN is reporting a NULL read to be certain what the operation involved is - basically looking at the output of the disas[semble] command from gdb on the zdb binary in question, or if you can convince ASAN to not try to list symbols, and instead list specific offsets into the function (which is probably there in the difference between 0x417b6c and where zdb_count_block "starts", e.g. info line zdb_count_block on the zdb binary will probably say something like starting at 0x417b00 and then you'd want to look at the instructions around the pc location from ASAN's output to figure out what the hell it was actually noticing on fire).
Since I had some time, I went the other way: cleared the destination pool, disabled dedup on the source pool (I know this will not clear the existing DDT, but I need it so that --replicate does not create the new datasets with dedup), and restarted a copy with the simplest approach:
# zfs send --props --replicate hgst@pre_copy | mbuffer -m 4G | zfs receive -Fvs tank
receiving full stream of hgst@pre_copy into tank@pre_copy
received 2.12M stream in 2 seconds (1.06M/sec)
receiving full stream of hgst/backup-linux@pre_copy into tank/backup-linux@pre_copy
received 334G stream in 8155 seconds (41.9M/sec)
receiving full stream of hgst/jonny@pre_copy into tank/jonny@pre_copy
received 356G stream in 10139 seconds (36.0M/sec)
receiving full stream of hgst/backup-remote@pre_copy into tank/backup-remote@pre_copy
received 87.0G stream in 1363 seconds (65.3M/sec)
receiving full stream of hgst/dedup@pre_copy into tank/dedup@pre_copy
received 234G stream in 14972 seconds (16.0M/sec)
receiving full stream of hgst/backup-windows@pre_copy into tank/backup-windows@pre_copy
received 262G stream in 3269 seconds (82.2M/sec)
receiving full stream of hgst/xbox-bkp@pre_copy into tank/xbox-bkp@pre_copy
received 496G stream in 3813 seconds (133M/sec)
receiving full stream of hgst/videos@pre_copy into tank/videos@pre_copy
received 247G stream in 2342 seconds (108M/sec)
receiving full stream of hgst/data1@pre_copy into tank/data1@pre_copy
received 398G stream in 3244 seconds (126M/sec)
receiving full stream of hgst/xbox@pre_copy into tank/xbox@pre_copy
received 1.26T stream in 9707 seconds (136M/sec)
receiving full stream of hgst/android@pre_copy into tank/android@pre_copy
received 232G stream in 2318 seconds (102M/sec)
receiving full stream of hgst/backup-flavia@pre_copy into tank/backup-flavia@pre_copy
summary: 4058 GiByte in 16h 55min 39.3sec - average of 68.2 MiB/s
received 122G stream in 1616 seconds (77.2M/sec)
In short: no problems. I even ran zdb on this new pool. No problems there either!
# time /root/zfs/zfs/cmd/zdb/zdb -cccbvv -I 2400 tank
Traversing all blocks to verify checksums and verify nothing leaked ...
loading concrete vdev 0, metaslab 144 of 145 ...
3.68T completed ( 90MB/s) estimated time remaining: 0hr 00min 00sec
No leaks (block sum matches space maps exactly)
bp count: 37097387
ganged count: 0
bp logical: 4425229795328 avg: 119286
bp physical: 4036460209152 avg: 108807 compression: 1.10
bp allocated: 4046248386560 avg: 109070 compression: 1.09
bp deduped: 0 ref>1: 0 deduplication: 1.00
SPA allocated: 4046248386560 used: 40.61%
...
real 727m3.087s
user 757m51.787s
sys 357m10.958s
Ok, next step, incremental send:
# zfs send --replicate -I pre_copy hgst@step2 | mbuffer -m 4G | zfs receive -Fvs tank
receiving incremental stream of hgst@step2 into tank@step2
received 7.89G stream in 103 seconds (78.4M/sec)
receiving incremental stream of hgst/backup-linux@step2 into tank/backup-linux@step2
received 312B stream in 1 seconds (312B/sec)
receiving incremental stream of hgst/jonny@step2 into tank/jonny@step2
cannot restore to tank/jonny@step2: destination already exists
mbuffer: error: outputThread: error writing to <stdout> at offset 0x24afe6000: Broken pipe
summary: 9392 MiByte in 3min 43.0sec - average of 42.1 MiB/s
mbuffer: warning: error during output to <stdout>: Broken pipe
Oh GOD! Again?
Also:
# time /root/zfs/zfs/cmd/zdb/zdb -cccbvv -I 2400 tank
error: blkptr at 0x62a0004eedc0 has invalid CHECKSUM 91
Aborted (core dumped)
real 0m1.954s
user 0m0.497s
sys 0m0.776s
Now I can be sure that the problem is in the source pool.
I am stubborn, and will make a new attempt with a new set of snapshots to see if it makes a difference. If not, I will resort to rsync.
Thanks again for your help so far...
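For the record, a rough sketch of what that rsync fallback could look like, assuming default mountpoints and recreating the snapshot by hand afterwards:
rsync -aHAXx --delete /hgst/jonny/ /tank/jonny/
zfs snapshot tank/jonny@step2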
New snapshot, new life. Start from the very beginning.
Clear all snapshots from the source pool. Also, reset (most) zfs parameters to the original setup (dedup, xattr, dnodesize). Create a new snapshot to start zfs send/recv.
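A sketch of that reset, assuming the property values and snapshot names used earlier in this thread:
zfs list -H -t snapshot -o name -r hgst | xargs -n 1 zfs destroy
zfs set dedup=verify hgst
zfs set xattr=sa hgst
zfs set dnodesize=auto hgst
zfs snapshot -r hgst@step2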
Re-add the second disk to the destination pool. It's now or never! Go!
# date ; zfs send --props --replicate hgst@step2 | mbuffer -m 4G -s 128k | zfs receive -Fvs tank ; date
Sat Aug 4 21:01:03 -03 2018
receiving full stream of hgst@step2 into tank@step2
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 1419 MiB total, buffer 0% fullreceived 1.39G stream in 68 seconds (20.9M/sec)
receiving full stream of hgst/backup-linux@step2 into tank/backup-linux@step2
in @ 53.2 MiB/s, out @ 0.0 KiB/s, 333 GiB total, buffer 86% fullreceived 332G stream in 5583 seconds (60.9M/sec)
receiving full stream of hgst/jonny@step2 into tank/jonny@step2
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 694 GiB total, buffer 100% fullreceived 360G stream in 11761 seconds (31.4M/sec)
receiving full stream of hgst/backup-remote@step2 into tank/backup-remote@step2
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 781 GiB total, buffer 89% fullreceived 87.0G stream in 1155 seconds (77.1M/sec)
receiving full stream of hgst/dedup@step2 into tank/dedup@step2
in @ 30.0 MiB/s, out @ 0.0 KiB/s, 1019 GiB total, buffer 100% fullreceived 239G stream in 3450 seconds (70.8M/sec)
receiving full stream of hgst/backup-windows@step2 into tank/backup-windows@step2
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 1281 GiB total, buffer 100% fullreceived 262G stream in 7717 seconds (34.7M/sec)
receiving full stream of hgst/xbox-bkp@step2 into tank/xbox-bkp@step2
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 1777 GiB total, buffer 100% fullreceived 496G stream in 7128 seconds (71.3M/sec)
receiving full stream of hgst/videos@step2 into tank/videos@step2
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 2025 GiB total, buffer 100% fullreceived 247G stream in 3712 seconds (68.2M/sec)
receiving full stream of hgst/data1@step2 into tank/data1@step2
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 2423 GiB total, buffer 100% fullreceived 398G stream in 7795 seconds (52.3M/sec)
receiving full stream of hgst/xbox@step2 into tank/xbox@step2
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 3713 GiB total, buffer 100% fullreceived 1.26T stream in 29111 seconds (45.4M/sec)
receiving full stream of hgst/android@step2 into tank/android@step2
in @ 5118 KiB/s, out @ 0.0 KiB/s, 3949 GiB total, buffer 87% fullreceived 236G stream in 3882 seconds (62.2M/sec)
receiving full stream of hgst/backup-flavia@step2 into tank/backup-flavia@step2
in @ 0.0 KiB/s, out @ 57.2 MiB/s, 4071 GiB total, buffer 7% full
summary: 4071 GiByte in 23h 31min 58.5sec - average of 49.2 MiB/s
received 122G stream in 3296 seconds (37.9M/sec)
Sun Aug 5 20:33:28 -03 2018
So far, so good. Now, the incremental step:
# date ; zfs send --props --replicate -I step2 hgst@step3 | mbuffer -m 4G -s 128k | time zfs receive -Fvs tank ; date
Mon Aug 6 22:33:15 -03 2018
receiving incremental stream of hgst@step3 into tank@step3
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 2831 MiB total, buffer 9% fullreceived 2.76G stream in 79 seconds (35.8M/sec)
receiving incremental stream of hgst/backup-linux@step3 into tank/backup-linux@step3
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 2831 MiB total, buffer 9% fullreceived 312B stream in 1 seconds (312B/sec)
receiving incremental stream of hgst/jonny@step3 into tank/jonny@step3
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 3211 MiB total, buffer 0% fullreceived 380M stream in 9 seconds (42.2M/sec)
receiving incremental stream of hgst/backup-remote@step3 into tank/backup-remote@step3
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 3211 MiB total, buffer 0% fullreceived 312B stream in 1 seconds (312B/sec)
receiving incremental stream of hgst/dedup@step3 into tank/dedup@step3
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 3211 MiB total, buffer 0% fullreceived 312B stream in 1 seconds (312B/sec)
receiving incremental stream of hgst/backup-windows@step3 into tank/backup-windows@step3
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 3211 MiB total, buffer 0% fullreceived 312B stream in 1 seconds (312B/sec)
receiving incremental stream of hgst/xbox-bkp@step3 into tank/xbox-bkp@step3
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 3211 MiB total, buffer 0% fullreceived 312B stream in 1 seconds (312B/sec)
receiving incremental stream of hgst/videos@step3 into tank/videos@step3
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 3211 MiB total, buffer 0% fullreceived 312B stream in 1 seconds (312B/sec)
receiving incremental stream of hgst/data1@step3 into tank/data1@step3
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 3211 MiB total, buffer 0% full
summary: 3211 MiByte in 1min 31.8sec - average of 35.0 MiB/s
received 1.23M stream in 1 seconds (1.23M/sec)
receiving incremental stream of hgst/xbox@step3 into tank/xbox@step3
received 312B stream in 1 seconds (312B/sec)
receiving incremental stream of hgst/android@step3 into tank/android@step3
received 312B stream in 1 seconds (312B/sec)
receiving incremental stream of hgst/backup-flavia@step3 into tank/backup-flavia@step3
received 312B stream in 1 seconds (312B/sec)
Mon Aug 6 22:34:49 -03 2018
Hey, where's the bug? Maybe it was not in the send/recv part, but in the previously generated snapshot?
Some things are still different from the original setup:
- compression was changed from gzip-9 to on
- secondarycache was all on one dataset; I disabled it to reduce SSD usage during repeated xfers
- send args: --dedup --large-block --embed --compressed
Anyway, I think I'll say it's done for now. I'll finish my hard drive migration and hope for no more encounters with this "bug/feature/glitch".
This will likely be fixed by a patch stack to port some recent fixes to the 0.7-release branch. These fixes deal with receiving incremental streams in which dnode slots are recycled and their size changes: dweeezil:zfs-0.7-release-recv-dnode. There was previously a patch from #6576 to fix this problem but it was reverted. The patch stack I referred to above pulls in some more recent work that was mainly geared toward raw send streams, but also included more robust fixes to this problem.
Related issues/PRs: #6366 #6576 #7733
Closing. The fixes for this issue were included in 0.7.11. We will be releasing an 0.7.12 for #7933.
System information
Describe the problem you're observing
Trying to move and clean up data to a new pool, on bigger disks.
Completely zero the new disk, and create a single GPT partition
No problem until here. To ensure this, system I/O has been kept to a minimum, the graphics environment disabled, and the kernel rewound to a stable one (#7703 has affected me too)
Now, let's try to move another incremental step:
Meanwhile:
After this, zfs receive -A tank/jonny never returns. Indeed, any command that would access tank freezes.
To be able to fully reboot I had to disable automatic pool import. Whenever I try to import the new pool, I get this panic:
Not even zpool import -F works; always panic.
Describe how to reproduce the problem
Apparently, this is not the first time in my setup; I think this is what caused the destruction I described in #7703 and #6414.
That's why in this run I have just one disk in the new pool.
Include any warning/errors/backtraces from the system logs
See above...