openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS corruption related to snapshots post-2.0.x upgrade #12014

Open jgoerzen opened 3 years ago

jgoerzen commented 3 years ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version Buster
Linux Kernel 5.10.0-0.bpo.5-amd64
Architecture amd64
ZFS Version 2.0.3-1~bpo10+1
SPL Version 2.0.3-1~bpo10+1

Describe the problem you're observing

Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:03:37 with 0 errors on Mon May  3 16:58:33 2021
config:

    NAME         STATE     READ WRITE CKSUM
    rpool        ONLINE       0     0     0
      nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>

Of note, the <0xeb51> is sometimes a snapshot name; if I zfs destroy the snapshot, it is replaced by this tag.

Bug #11688 implies that zfs destroy on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub without rebooting after seeing this kind of zpool status output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:

[393801.328126] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[393801.328129] PANIC at arc.c:3790:arc_buf_destroy()
[393801.328130] Showing stack for process 363
[393801.328132] CPU: 2 PID: 363 Comm: z_rd_int Tainted: P     U     OE     5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[393801.328133] Hardware name: Dell Inc. XPS 15 7590/0VYV0G, BIOS 1.8.1 07/03/2020
[393801.328134] Call Trace:
[393801.328140]  dump_stack+0x6d/0x88
[393801.328149]  spl_panic+0xd3/0xfb [spl]
[393801.328153]  ? __wake_up_common_lock+0x87/0xc0
[393801.328221]  ? zei_add_range+0x130/0x130 [zfs]
[393801.328225]  ? __cv_broadcast+0x26/0x30 [spl]
[393801.328275]  ? zfs_zevent_post+0x238/0x2a0 [zfs]
[393801.328302]  arc_buf_destroy+0xf3/0x100 [zfs]
[393801.328331]  arc_read_done+0x24d/0x490 [zfs]
[393801.328388]  zio_done+0x43d/0x1020 [zfs]
[393801.328445]  ? zio_vdev_io_assess+0x4d/0x240 [zfs]
[393801.328502]  zio_execute+0x90/0xf0 [zfs]
[393801.328508]  taskq_thread+0x2e7/0x530 [spl]
[393801.328512]  ? wake_up_q+0xa0/0xa0
[393801.328569]  ? zio_taskq_member.isra.11.constprop.17+0x60/0x60 [zfs]
[393801.328574]  ? taskq_thread_spawn+0x50/0x50 [spl]
[393801.328576]  kthread+0x116/0x130
[393801.328578]  ? kthread_park+0x80/0x80
[393801.328581]  ret_from_fork+0x22/0x30

However, I want to stress that this backtrace is not the original cause of the problem; it only appears if I do a scrub without first rebooting.

After that panic, the scrub stalled -- and a second error appeared:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat May  8 08:11:07 2021
    152G scanned at 132M/s, 1.63M issued at 1.41K/s, 172G total
    0B repaired, 0.00% done, no estimated completion time
config:

    NAME         STATE     READ WRITE CKSUM
    rpool        ONLINE       0     0     0
      nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>
        rpool/crypt/debian-1/home/jgoerzen/no-backup@[elided]-hourly-2021-05-07_02.17.01--2d:<0x0>

I have found the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.
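For clarity, the recovery sequence amounts to roughly the following (this is just what works for me; the pool name matches my setup, and booting into single-user mode is elided):

zpool scrub rpool
zpool status -v rpool     # wait for the scrub to finish, then check whether the <0x...> entries are gone
zpool scrub rpool         # if errors remain, scrub again, sometimes with another reboot in between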

I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?

I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.

Describe how to reproduce the problem

I can't reproduce it at will; I have to wait for it to recur on its own.

Include any warning/errors/backtraces from the system logs

See above

Potentially related bugs

jgoerzen commented 3 years ago

Two other interesting tidbits...

When I reboot after this issue occurs, the mounting of the individual zfs datasets is S L O W -- several seconds each, where it normally just flies by. After scrubbing, mounting is back to its normal speed.

The datasets that have snapshot issues vary with each occurrence. Sometimes it's just one, sometimes many. But var is almost always included. (Oddly, its parent, which has almost no activity ever, is also affected from time to time.)

jstenback commented 3 years ago

Same symptoms here, more or less. See also issue #11688.

glueckself commented 3 years ago

I also have the symptom with the corrupted snapshots, without kernel panics so far.

So far it has only affected my Debian system with Linux 5.10 and zfs 2.0.3 (I've turned the server off for today; I can check the exact versions tomorrow). Also, while the system has the 2.0.3 zfs utils + module, the pool is still on the 0.8.6 format. I wasn't able to execute zfs list -r -t all <affected dataset>: it displayed "cannot iterate filesystems" and listed only a few snapshots (instead of the tens it should have). Also, I couldn't destroy the affected snapshots, because it said they no longer existed. I couldn't send the dataset with syncoid at all.

On the corrupted system, after I got the mail from ZED, I manually ran a scrub at first, after which zpool status said that there were no errors. However, the next zpool status, seconds after the first, again said that there were errors. Subsequent scrubs didn't clear the errors.

I've rebooted the server into an Ubuntu 20.10 live with zfs 0.8.4-1ubuntu11 ~(again, sorry that I haven't noted the version, can add it tomorrow)~ and after a scrub the errors were gone. Following scrubs haven't detected errors anymore. zfs list -r -t all ... again displayed a large list of snapshots.

The errors didn't seem to affect the data on the zvols (all 4 affected snapshots are of zvols). The zvols are used as disks for VMs with ext4 on them. ~I will verify them tomorrow.~ EDIT: I checked one of the VM disks; neither fsck nor dpkg -V (which verifies checksums of all files installed from packages) could find any errors (except mismatching checksums for config files I've changed -- that is to be expected).

I have two other Ubuntu 21.04-based systems with zfs-2.0.2-1ubuntu5 which are not affected so far. However, they already have their pools upgraded to the 2.x format. All are snapshotted with sanoid and have their datasets encrypted.

~My next step will be to downgrade zfs back to 0.8.6 on the Debian system and see what happens.~

EDIT: More points I've noted while investigating with 0.8.4-1ubuntu11:

EDIT 2:

dcarosone commented 3 years ago

I'm seeing this too, on Ubuntu 21.04, also using zfs encryption

I have znapzend running, and it makes a lot of snapshots. Sometimes, some of them are bad, and can't be used (for example, attempting to send them to a replica destination fails). I now use the skipIntermediates option, and so at least forward progress is made on the next snapshot interval.

In the most recent case (this morning) I had something like 4300 errors (many more than I'd seen previously). There are no block-level errors (read/write/cksum). They're cleared after destroying the affected snapshots and scrubbing (and maybe a reboot, depending on .. day?)

Warning! Speculation below:

aerusso commented 3 years ago

@jgoerzen Can you

  1. Capture the zpool events -v report when one of these "bad" snapshots is created?
  2. Try to zfs send that snapshot (i.e., to zfs send ... | cat >/dev/null; notice the need to use cat).
  3. Reboot, and try to zfs send the snapshot.

In my case (#11688, which you already reference), I've discovered that rebooting "heals" the snapshot -- at least when using the patchset I mentioned there.
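For step 2, a minimal sketch of what I mean (the dataset/snapshot name below is a placeholder for whichever snapshot shows up as bad):

zpool events -v > /var/tmp/zpool-events-$(date +%s).txt    # capture before the incident rotates out of the buffer
zfs send rpool/some/dataset@bad-snapshot | cat > /dev/null
# reboot, then repeat the same send to see whether it now succeeds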

jgoerzen commented 3 years ago

I'll be glad to. Unfortunately, I rebooted the machine yesterday, so I expect it will be about a week before the problem recurs.

It is interesting to see the discussion today in #11688. The unique factor about the machine that doesn't work for me is that I have encryption enabled. It wouldn't surprise me to see the same thing here, but I will of course wait for it to recur and let you know.

jgoerzen commented 3 years ago

Hello @aerusso,

The problem recurred over the weekend and I noticed it this morning.

Unfortunately, the incident that caused it had apparently already expired out of the zpool events buffer, which only went back less than an hour. However, I did find this in syslog:

Jun 20 01:17:39 athena zed: eid=34569 class=authentication pool='rpool' bookmark=12680:0:0:98
Jun 20 01:17:39 athena zed: eid=34570 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=12680:0:0:242
Jun 20 01:17:40 athena zed: eid=34571 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=12680:0:0:261
...
Jun 20 17:17:39 athena zed: eid=37284 class=authentication pool='rpool' bookmark=19942:0:0:98
Jun 20 17:17:39 athena zed: eid=37285 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=19942:0:0:242
Jun 20 17:17:40 athena zed: eid=37286 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=19942:0:0:261
...
Jun 20 18:17:28 athena zed: eid=37376 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=21921:0:0:2072
Jun 20 18:17:29 athena zed: eid=37377 class=authentication pool='rpool' priority=2 err=5 flags=0x80 bookmark=21921:0:0:2072
Jun 20 18:17:29 athena zed: eid=37378 class=data pool='rpool' priority=2 err=5 flags=0x80 bookmark=21921:0:0:2072
Jun 20 18:17:40 athena zed: eid=37411 class=authentication pool='rpool' bookmark=21923:0:0:0

It should be noted that my hourly snap/send stuff runs at 17 minutes past the hour, so that may explain this timestamp correlation.

zpool status reported:

  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:04:12 with 0 errors on Sun Jun 13 00:28:13 2021
config:

    NAME         STATE     READ WRITE CKSUM
    rpool        ONLINE       0     0     0
      nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x5c81>:<0x0>
        <0x3188>:<0x0>
        rpool/crypt/debian-1@athena-hourly-2021-06-20_23.17.01--2d:<0x0>
        rpool/crypt/debian-1/var@athena-hourly-2021-06-20_23.17.01--2d:<0x0>
        <0x4de6>:<0x0>

Unfortunately, I forgot to attempt a zfs send before rebooting. Those snapshots, though not referenced directly, would have been included in the send -I that was issued. From my logs:

Jun 20 18:17:03 athena simplesnapwrap[4740]: Running: /sbin/zfs send -I rpool/crypt/debian-1/var@__simplesnap_bakfs1_2021-06-20T22:17:02__ rpool/crypt/debian-1/var@__simplesnap_bakfs1_2021-06-20T23:17:03__
Jun 20 18:17:03 athena simplesnap[2466/simplesnapwrap]: internal error: warning: cannot send 'rpool/crypt/debian-1/var@athena-hourly-2021-06-20_23.17.01--2d': Invalid argument

So I think that answers the question.

After a reboot but before a scrub, the zfs send you gave executes fine.

cbreak-black commented 3 years ago

I have similar symptoms on an encrypted, single-SSD Ubuntu 21.04 boot pool, using stock zfs from Ubuntu's repos. Deleting the affected snapshots and scrubbing previously cleared the errors, but on recurrence, repeated scrubbing (without deleting them) caused a deadlock. My system has ECC memory, so it's probably not RAM related.

aerusso commented 3 years ago

@cbreak-black Was there a system restart between the occurrence of the corrupted snapshot and the problems? Restarting has "fixed" this symptom for me (though you will need to scrub twice for the message to disappear, I believe).

I have a suspicion that this may be a version of #10737, which has an MR under way there. The behavior I am experiencing could be explained by that bug (syncoid starts many zfs sends on my machine, some of which are not finished; SSDs do the send much faster, so they are more likely to get deeper into the zfs send before the next command in the pipeline times out; a reboot heals the issue for me; and there's no on-disk corruption, as far as I can tell).

I'm holding off on trying to bisect this issue (at least) until testing that MR. (And all the above is conjecture!)

cbreak-black commented 3 years ago

@aerusso No, without a restart I got into the scrub-hang, and had to restart hard. Afterwards, the scrub finished, and several of the errors vanished. The rest of the errors vanished after deleting the snapshots and scrubbing again.

InsanePrawn commented 3 years ago

Can I join the club too? See #10019 -- note how it's also at 0x0. Sadly, I have since deleted said snapshot and dataset.

aerusso commented 3 years ago

@InsanePrawn I can't seem to find commit 4d5b4a33d in any repository I know of (and apparently neither can GitHub). However, in your report you say this was a "recent git master", and the commit I'm currently betting on being guilty is da92d5cbb38cea3a860b8a6bb8ee21f9129e7d7c, which was committed in November of the previous year, so I can't use your data point to rule out my theory!

Also, it sounds like you didn't have any good way to reproduce the error --- however, you were using a test pool. Compared to my reproduction strategy (which is just: turn my computer on, browse the web, check mail, etc.) it might be easier to narrow in on a test case (or might have been easier a year and a half ago, when this was all fresh). Anyway, if you have any scripts or ideas of what you were doing that caused this, besides "snapshots being created and deleted every couple of minutes", they might be useful too. (I already tried lots of snapshot creations and deletions during fio on several datasets in a VM.)

InsanePrawn commented 3 years ago

Yeah, I don't know why I didn't go look for the commit in my issue. Luckily for us, that server (and pool; it does say "yolo", but it's my private server's root pool -- I just won't cry much if it breaks; the name originally came from running then-unreleased crypto) and the git repo on it still exist. Looks like 4d5b4a33d was two systemd-generator commits by me on top of 610eec452d723bc53ce531095aff9577a2e0dc93.

InsanePrawn commented 3 years ago

FWIW, the dataset the issue appeared on was an essentially empty filesystem (maybe a single small file inside) that had snapshots taken at quick intervals (somewhere between 30s and 5m), without actual fs activity, in parallel with a few (5-15) other similarly empty datasets. Edit: These were being snapshotted and replicated by zrepl, probably in a similar manner to what znapzend does.

The pool is a raidz2 on 3.5" spinning SATA disks. I'm afraid I have nothing more to add in terms of reproduction :/

Edit: Turns out the dataset also still exists, the defective snapshot however does not anymore. I doubt that's helpful?

aerusso commented 3 years ago

@InsanePrawn Does running the zrepl workload reproduce the bug on 2.0.5 (or another recent release?)

I don't think the snapshot is terribly important --- unless you're able to really dig into it with zdb (which I have not developed sufficient expertise to do). Rather, I think it's the workload, hardware setup, and (possibly, but I don't understand the mechanism at all) the dataset itself. Encryption is also a common theme, but that might just affect the presentation (i.e., there's no MAC to fail in the unencrypted, unauthenticated case).

Getting at zpool events -v showing the error would probably tell us something (see mine).
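For anyone who does have that zdb expertise, I imagine the starting point would be something like the following (names are placeholders, and I haven't verified how far this gets on an encrypted dataset):

zdb -dddd rpool/some/dataset@suspect-snapshot    # dump object/dnode metadata for the snapshot
zdb -bbb rpool                                   # walk block pointers across the whole pool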

cbreak-black commented 2 years ago

I've since added redundancy to my pool (it's now a mirror with two devices) and disabled autotrim. The snapshot corruption still happens. I still don't know what is causing it, nor whether the corruption happens when the snapshot is created and is only discovered later (when I try to zfs send the snapshots), or whether snapshots get corrupted some time between creation and sending.

aerusso commented 2 years ago

@cbreak-black Can you enable the all-debug.sh ZEDlet, and put the temporary directory somewhere permanent (i.e., not the default of /tmp/zed.debug.log)?

This will get the output of zpool events -v as it is generated, and will give times, which you can conceivably triangulate with your other logs. There's other information in those logs that is probably useful, too.
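Enabling it looks roughly like this (a sketch; the zedlet's install path varies by distro, e.g. /usr/lib/zfs-linux/zed.d or /usr/libexec/zfs/zed.d):

ln -s /usr/lib/zfs-linux/zed.d/all-debug.sh /etc/zfs/zed.d/
echo 'ZED_DEBUG_LOG="/var/log/zed.debug.log"' >> /etc/zfs/zed.d/zed.rc
systemctl restart zfs-zed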

I'll repeat this here: if anyone gets me a reliable reproducer on a new pool, I have no doubt we'll be able to solve this in short order.

wohali commented 2 years ago

Just mentioning here that we saw this on TrueNAS 12.0-U5 with OpenZFS 2.0.5 as well -- see https://github.com/openzfs/zfs/issues/11688#issuecomment-910764101 for our story.

rincebrain commented 2 years ago

Since I don't see anyone mentioning it here yet, #11679 contains a number of stories about the ARC getting confused when encryption is involved and, in a very similar looking illumos bug linked from there, eating data at least once.

gamanakis commented 2 years ago

@jgoerzen are you using raw send/receive? If yes this is closely related to #12594.

jgoerzen commented 2 years ago

@gamanakis Nope, I'm not using raw (-w).

phreaker0 commented 2 years ago

it's present in v2.1.1 as well:

Okt 09 01:01:14 tux sanoid[2043026]: taking snapshot ssd/container/debian-test@autosnap_2021-10-08_23:01:14_hourly
Okt 09 01:01:16 tux sanoid[2043026]: taking snapshot ssd/container/debian-test@autosnap_2021-10-08_23:01:14_frequently
Okt 09 01:01:16 tux kernel: VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
Okt 09 01:01:16 tux kernel: PANIC at arc.c:3836:arc_buf_destroy()
Okt 09 01:01:16 tux kernel: Showing stack for process 435
Okt 09 01:01:16 tux kernel: CPU: 2 PID: 435 Comm: z_rd_int_1 Tainted: P           OE     5.4.0-84-generic #94-Ubuntu
Okt 09 01:01:16 tux kernel: Hardware name: GIGABYTE GB-BNi7HG4-950/MKHM17P-00, BIOS F1 05/24/2016
Okt 09 01:01:16 tux kernel: Call Trace:
Okt 09 01:01:16 tux kernel:  dump_stack+0x6d/0x8b
Okt 09 01:01:16 tux kernel:  spl_dumpstack+0x29/0x2b [spl]
Okt 09 01:01:16 tux kernel:  spl_panic+0xd4/0xfc [spl]
Okt 09 01:01:16 tux kernel:  ? kfree+0x231/0x250
Okt 09 01:01:16 tux kernel:  ? spl_kmem_free+0x33/0x40 [spl]
Okt 09 01:01:16 tux kernel:  ? kfree+0x231/0x250
Okt 09 01:01:16 tux kernel:  ? zei_add_range+0x140/0x140 [zfs]
Okt 09 01:01:16 tux kernel:  ? spl_kmem_free+0x33/0x40 [spl]
Okt 09 01:01:16 tux kernel:  ? zfs_zevent_drain+0xd3/0xe0 [zfs]
Okt 09 01:01:16 tux kernel:  ? zei_add_range+0x140/0x140 [zfs]
Okt 09 01:01:16 tux kernel:  ? zfs_zevent_post+0x234/0x270 [zfs]
Okt 09 01:01:16 tux kernel:  arc_buf_destroy+0xfa/0x100 [zfs]
Okt 09 01:01:16 tux kernel:  arc_read_done+0x251/0x4a0 [zfs]
Okt 09 01:01:16 tux kernel:  zio_done+0x407/0x1050 [zfs]
Okt 09 01:01:16 tux kernel:  zio_execute+0x93/0xf0 [zfs]
Okt 09 01:01:16 tux kernel:  taskq_thread+0x2fb/0x510 [spl]
Okt 09 01:01:16 tux kernel:  ? wake_up_q+0x70/0x70
Okt 09 01:01:16 tux kernel:  ? zio_taskq_member.isra.0.constprop.0+0x60/0x60 [zfs]
Okt 09 01:01:16 tux kernel:  kthread+0x104/0x140
Okt 09 01:01:16 tux kernel:  ? task_done+0xb0/0xb0 [spl]
Okt 09 01:01:16 tux kernel:  ? kthread_park+0x90/0x90
Okt 09 01:01:16 tux kernel:  ret_from_fork+0x1f/0x40
phreaker0 commented 2 years ago

@aerusso you wrote that https://github.com/openzfs/zfs/commit/da92d5cbb38cea3a860b8a6bb8ee21f9129e7d7c may be the cause of this issue. My workstation at work panics after a couple of days and I need to reset it. Could you provide a branch of 2.1.1 with this commit reverted (the revert causes merge conflicts I can't fix myself) so I can test whether the machine no longer crashes?

aerusso commented 2 years ago

@phreaker0 Unfortunately, the bug that da92d5c introduced (#10737) was fixed by #12299, which I believe is present in all maintained branches now. It does not fix #11688 (which I suspect is the same as this bug).

I'm currently running 0.8.6 on Linux 5.4.y, and am hoping to wait out this bug (I don't have a lot of time right now, or for the foreseeable future). But if you have a reliable reproducer (or a whole lot of time), you could bisect while running 5.4 (or some other pre-5.10 kernel). I can help anyone who wants to do that. If we can find the guilty commit, I have no doubt this can be resolved.

phreaker0 commented 2 years ago

@aerusso https://github.com/openzfs/zfs/pull/12299 (https://github.com/openzfs/zfs/commit/958826be7a3e17f29e1f5e114c76aa2ec3c8a490) is not in the zfs-2.1.1 tag, checked with:

git log | grep 958826be7a3e17f29e1f5e114c76aa2ec3c8a490
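(A more direct containment check, assuming the tag and commit are in your local clone, would be something like:

git tag --contains 958826be7a3e17f29e1f5e114c76aa2ec3c8a490
git merge-base --is-ancestor 958826be7a3e17f29e1f5e114c76aa2ec3c8a490 zfs-2.1.1 && echo contained || echo missing

either of which should show that this particular hash is not in zfs-2.1.1.)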

Cherry-picking the commit onto the 2.1.1 version is possible (minor merge conflict); should I try it?

Regarding the kernel version, I see this bug on a server running Debian buster (4.19.0-18-amd64) and a desktop running Ubuntu focal (5.4). Would upgrading to 5.11 on the desktop machine prevent the error?

I also tried to reproduce the error in a VM but without any luck so far.

rincebrain commented 2 years ago

It's 8415c3c1705ec64ea34a2e90d6810e6fe5232cce on 2.1 and 1885e5ebabe0eeb57055077f1b904dedaa91b506 on 2.0.

phreaker0 commented 2 years ago

@rincebrain thank you for clearing that up. I will just try whether the 5.11 kernel helps to work around this problem.

AndrewJDR commented 2 years ago

I believe I'm also seeing this, with both the corruption of snapshots and kernel errors during the zfs send/receive (see below). If anyone finds a workaround that lets zfs send/receive workloads run again, please do mention it in this thread; I'd really appreciate it. Unfortunately, I can't go back to 0.8.x without wiping everything and starting over, because this pool uses zstd and native encryption. So I'm not sure what to do to have a functional pool again, other than starting over with a new pool or just toughing it out until this gets fixed.

Ubuntu 20.04.2 LTS
5.4.0-81-generic
zfs-2.1.1-0york0~20.04
zfs-kmod-2.1.1-0york0~20.04
[38788.032357] INFO: task receive_writer:2296222 blocked for more than 604 seconds.
[38788.032362]       Tainted: P           OE     5.4.0-81-generic #91-Ubuntu
[38788.032364] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[38788.032368] receive_writer  D    0 2296222      2 0x80004000
[38788.032372] Call Trace:
[38788.032379]  __schedule+0x2e3/0x740
[38788.032386]  schedule+0x42/0xb0
[38788.032399]  spl_panic+0xfa/0xfc [spl]
[38788.032405]  ? _cond_resched+0x19/0x30
[38788.032411]  ? down_read+0x13/0xa0
[38788.032506]  ? dbuf_read+0x1f3/0x580 [zfs]
[38788.032606]  ? dmu_bonus_hold_by_dnode+0x149/0x1b0 [zfs]
[38788.032711]  receive_object+0xb73/0xdc0 [zfs]
[38788.032820]  ? dnode_rele_and_unlock+0xc1/0xe0 [zfs]
[38788.032827]  ? _cond_resched+0x19/0x30
[38788.032832]  ? mutex_lock+0x13/0x40
[38788.032939]  receive_writer_thread+0x1d3/0xae0 [zfs]
[38788.032947]  ? set_user_nice.part.0+0xff/0x660
[38788.032958]  ? spl_kmem_free+0x33/0x40 [spl]
[38788.033061]  ? receive_process_write_record+0x180/0x180 [zfs]
[38788.033075]  thread_generic_wrapper+0x79/0x90 [spl]
[38788.033179]  ? receive_process_write_record+0x180/0x180 [zfs]
[38788.033193]  ? thread_generic_wrapper+0x79/0x90 [spl]
[38788.033199]  kthread+0x104/0x140
[38788.033211]  ? __thread_exit+0x20/0x20 [spl]
[38788.033216]  ? kthread_park+0x90/0x90
[38788.033220]  ret_from_fork+0x35/0x40
aerusso commented 2 years ago

@AndrewJDR

  1. Are the affected datasets encrypted?
  2. Do you get the Permanent errors have been detected in the following files line?
  3. Does the "corruption" go away after rebooting and scrubbing twice? (Don't run the zfs send/receive workload during the scrubs)
  4. Do the permanent errors (not just the hung kernel thread) show up right around a zfs send and/or snapshot? (Use zpool events if you haven't rebooted)
  5. Do you have a reliable way to trigger the permanent errors? (I.e., how long do you have to run the send/snapshot workload to trigger the permanent errors)
AndrewJDR commented 2 years ago

@aerusso

  1. Yes the datasets are encrypted (but the sends / receives are not raw/encrypted)
  2. Yes
  3. The first scrub is in progress. One kernel error occurred during the scrub so far (see below). The scrub appears to still be going. I'll report back when the errors are cleared or after the second scrub, whichever happens first. Unfortunately, I've rebooted.
  4. I'm not sure on that yet, and I may not know unless I can clear the errors and try again. I do remember the particular dataset on which the send/receive had an error, so I can try that specific one again.
Oct 16 07:17:42 kernel: [43987.813668] general protection fault: 0000 [#1] SMP PTI
Oct 16 07:17:42 kernel: [43987.813673] CPU: 5 PID: 3610519 Comm: z_rd_int_1 Tainted: P           OE     5.4.0-81-generic #91-Ubuntu
Oct 16 07:17:42 kernel: [43987.813742] RIP: 0010:zio_remove_child+0x73/0x140 [zfs]
Oct 16 07:17:42 kernel: [43987.813744] Code: 48 89 45 d0 e8 de 4e c0 e9 48 8b 45 d0 49 89 84 24 f0 03 00 00 48 8b 83 50 01 00 00 4c 01 e8 48 8b 08 48 8b 50 08 48 89 51 08 <48> 89 0a 48 b9 00 01 00 00 00 00 ad de 48 ba 22 01 00 00 00 00 ad
Oct 16 07:17:42 kernel: [43987.813746] RSP: 0018:ffffa5220c603cf8 EFLAGS: 00010286
Oct 16 07:17:42 kernel: [43987.813747] RAX: ffff93500b7f55f0 RBX: ffff934e59ed7500 RCX: ffff934e59ed7658
Oct 16 07:17:42 kernel: [43987.813748] RDX: fffe934e59ed7658 RSI: ffff934fe8bf0000 RDI: ffff934fe8bf03c8
Oct 16 07:17:42 kernel: [43987.813749] RBP: ffffa5220c603d28 R08: 0000000000000000 R09: ffffa5220c603c10
Oct 16 07:17:42 kernel: [43987.813750] R10: 0000000000000002 R11: 0000000000000001 R12: ffff934fe8bf0000
Oct 16 07:17:42 kernel: [43987.813751] R13: ffff93500b7f55d0 R14: ffff934e59ed78c8 R15: ffff934fe8bf03c8
Oct 16 07:17:42 kernel: [43987.813753] FS:  0000000000000000(0000) GS:ffff93518f940000(0000) knlGS:0000000000000000
Oct 16 07:17:42 kernel: [43987.813754] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 16 07:17:42 kernel: [43987.813755] CR2: 00005619f3c4bda4 CR3: 00000001b540a001 CR4: 00000000001606e0
Oct 16 07:17:42 kernel: [43987.813756] Call Trace:
Oct 16 07:17:42 kernel: [43987.813820]  zio_done+0x4ca/0x1050 [zfs]
Oct 16 07:17:42 kernel: [43987.813876]  zio_execute+0x93/0xf0 [zfs]
Oct 16 07:17:42 kernel: [43987.813883]  taskq_thread+0x2fb/0x510 [spl]
Oct 16 07:17:42 kernel: [43987.813886]  ? wake_up_q+0x70/0x70
Oct 16 07:17:42 kernel: [43987.813938]  ? zio_taskq_member.isra.0.constprop.0+0x60/0x60 [zfs]
Oct 16 07:17:42 kernel: [43987.813941]  kthread+0x104/0x140
Oct 16 07:17:42 kernel: [43987.813945]  ? task_done+0xb0/0xb0 [spl]
Oct 16 07:17:42 kernel: [43987.813948]  ? kthread_park+0x90/0x90
Oct 16 07:17:42 kernel: [43987.813950]  ret_from_fork+0x35/0x40
Oct 16 07:17:42 kernel: [43987.813952] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter aufs nls_utf8 cifs overlay libarc4 fscache libdes binfmt_misc joydev intel_rapl_msr mei_hdcp intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp snd_soc_rt5640 snd_hda_codec_realtek kvm_intel snd_soc_rl6231 snd_hda_codec_generic ledtrig_audio snd_soc_core snd_hda_codec_hdmi snd_hda_intel snd_compress kvm snd_intel_dspcfg ac97_bus snd_pcm_dmaengine snd_hda_codec input_leds snd_hda_core rapl intel_cstate snd_seq_midi serio_raw mxm_wmi snd_seq_midi_event snd_hwdep mei_me pcoip_host(OE) snd_rawmidi snd_pcm mei snd_seq snd_seq_device snd_timer snd soundcore mac_hid acpi_pad nvidia_uvm(OE) bridge stp llc sch_fq_codel msr parport_pc ppdev lp parport ip_tables x_tables autofs4 zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE)
Oct 16 07:17:42 kernel: [43987.813982]  znvpair(POE) spl(OE) hid_generic usbhid hid nvidia_drm(POE) nvidia_modeset(POE) crct10dif_pclmul crc32_pclmul nvidia(POE) i915 ghash_clmulni_intel i2c_algo_bit aesni_intel drm_kms_helper crypto_simd syscopyarea sysfillrect sysimgblt cryptd fb_sys_fops glue_helper lpc_ich drm ahci alx i2c_i801 libahci mdio wmi video
Oct 16 07:17:42 kernel: [43987.813996] ---[ end trace c2321f7b3ef4dba0 ]---
Oct 16 07:17:42 kernel: [43987.814049] RIP: 0010:zio_remove_child+0x73/0x140 [zfs]
Oct 16 07:17:42 kernel: [43987.814051] Code: 48 89 45 d0 e8 de 4e c0 e9 48 8b 45 d0 49 89 84 24 f0 03 00 00 48 8b 83 50 01 00 00 4c 01 e8 48 8b 08 48 8b 50 08 48 89 51 08 <48> 89 0a 48 b9 00 01 00 00 00 00 ad de 48 ba 22 01 00 00 00 00 ad
Oct 16 07:17:42 kernel: [43987.814052] RSP: 0018:ffffa5220c603cf8 EFLAGS: 00010286
Oct 16 07:17:42 kernel: [43987.814053] RAX: ffff93500b7f55f0 RBX: ffff934e59ed7500 RCX: ffff934e59ed7658
Oct 16 07:17:42 kernel: [43987.814055] RDX: fffe934e59ed7658 RSI: ffff934fe8bf0000 RDI: ffff934fe8bf03c8
Oct 16 07:17:42 kernel: [43987.814056] RBP: ffffa5220c603d28 R08: 0000000000000000 R09: ffffa5220c603c10
Oct 16 07:17:42 kernel: [43987.814057] R10: 0000000000000002 R11: 0000000000000001 R12: ffff934fe8bf0000
Oct 16 07:17:42 kernel: [43987.814058] R13: ffff93500b7f55d0 R14: ffff934e59ed78c8 R15: ffff934fe8bf03c8
Oct 16 07:17:42 kernel: [43987.814059] FS:  0000000000000000(0000) GS:ffff93518f940000(0000) knlGS:0000000000000000
Oct 16 07:17:42 kernel: [43987.814060] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 16 07:17:42 kernel: [43987.814061] CR2: 00005619f3c4bda4 CR3: 00000001b540a001 CR4: 00000000001606e0
aerusso commented 2 years ago

I am not finding that zio_remove_child backtrace anywhere except #11625. I'm now out of my depth, so my suggestion is to make sure your backups are in excellent shape.

putnam commented 2 years ago

I have made a simple file-based repro of this issue (or something seemingly adjacent) in my comment here: https://github.com/openzfs/zfs/issues/12594#issuecomment-945073438

I wasn't aware of the additional kernel issues; so far I have only reproduced the IO error and broken pool state.

aerusso commented 2 years ago

@putnam Unfortunately that reproducer breaks on 0.8.6 for me (as well as on 2.x). That doesn't necessarily mean it isn't the same underlying bug as this (e.g., some other non-bug behavior change may have exposed a preexisting buggy code path). However, it does mean my stupid strategy of just bisecting the change won't work.

(As an aside, the generated corruption is also an ereport.fs.zfs.authentication error.)

AndrewJDR commented 2 years ago

One bit I'd like to clarify: does that reproducer only trigger when using raw sends (-w)? At least in my case, -w is not being used. (But the receiving dataset is encrypted!)

Blackclaws commented 2 years ago

@aerusso I'm also going to chime in here, as I'm running into pretty much exactly the same situation. Running zfs-2.1.4-1 on CentOS/AlmaLinux.

Answers to your questions above:

  1. Are the affected datasets encrypted? Yes
  2. Do you get the Permanent errors have been detected in the following files line? Yes, referencing snapshots, or just a bare reference once the snapshot is deleted.
  3. Does the "corruption" go away after rebooting and scrubbing twice? (Don't run the zfs send/receive workload during the scrubs) Yes, and even better, even before scrubbing you are able to send again without running into the data errors.
  4. Do the permanent errors (not just the hung kernel thread) show up right around a zfs send and/or snapshot? (Use zpool events if you haven't rebooted) Can't say, as I've recently rebooted and they haven't shown up again so far. The problem is that they mostly show up during our automatic backup, which breaks said backup and means we have to restart our servers.
  5. Do you have a reliable way to trigger the permanent errors? (I.e., how long do you have to run the send/snapshot workload to trigger the permanent errors) Unfortunately not; it seems to show up sooner the more I/O is happening on the datasets during send/receive. My guess is that it's somehow related to ongoing I/O on the dataset currently being sent. It happens around once per day, or if we're lucky, every two days.

The issue affects us in such a way that our automatic backups break approximately every day, and we either have to accept longer gaps in backups or reboot daily. This isn't great for our uptime at all, especially since, due to the encryption key, rebooting is a manual process as well. Any sort of fix for this issue would be greatly appreciated.

aerusso commented 2 years ago

@Blackclaws Did you just upgrade to 2.1.4? There's at least one report of the situation improving dramatically after upgrading from 2.1.2 to 2.1.4. I no longer have pools with encryption, so I can't confirm this.

Blackclaws commented 2 years ago

As far as I know I've been running 2.1.4 for a while now and the problem still exists. But I can also monitor for a week and report back.

cyberpower678 commented 2 years ago

I have a few large datasets that arrive corrupted in the exact same way every time. They're encrypted, and being sent with zfs send -Rwp.

mattchrist commented 2 years ago

I believe I was also experiencing this issue. Upgrading from 2.1.2 to 2.1.4 seems to have resolved it for me.

Maltz42 commented 2 years ago

@aerusso Looks like we have a reliable reproducer with cyberpower678's situation.

zfsbot commented 2 years ago

I think a reliable reproducer needs some steps to actually reproduce it, or a test pool image that reliably does so. Saying "encrypted datasets sent with -Rwp" doesn't help, because that's a workflow I've run on a daily basis here with snapshots, and I regularly recover files from them. In other words, that doesn't reliably reproduce it here.

zfsbot commented 2 years ago

I don't see any mention of it, but since it's a laptop, make sure you're not suspending or hibernating with ZFS pools active.

cyberpower678 commented 2 years ago

I think a reliable reproducer needs some steps to actually reproduce it, or a test pool image that reliably does so. Saying "encrypted datasets sent with -Rwp" doesn't help, because that's a workflow I've run on a daily basis here with snapshots, and I regularly recover files from them. In other words, that doesn't reliably reproduce it here.

I never said I had steps to reliably reproduce it. I said that my larger datasets consistently trigger the same corruption issue. I'm waiting for someone to help guide me on what I can do to isolate the trigger so I can produce steps to reliably reproduce the issue, as I have no advanced knowledge of ZFS's internal mechanics -- just how to use it correctly.

tjikkun commented 2 years ago

As far as I know I've been running 2.1.4 for a while now and the problem still exists. But I can also monitor for a week and report back. @Blackclaws Can you share your findings?

Blackclaws commented 2 years ago

Yeah just got back from a one week vacation. Looked at the sync status and it was indeed broken again. Running 2.1.4-1 on Almalinux.

It feels like the frequency did indeed decrease; unfortunately, it didn't completely stop. I'll try to see whether I can get logs of zfs events during each sync and post them when the error occurs.

aerusso commented 2 years ago

@cyberpower678 When you say your "larger datasets consistently trigger" the issue, do you mean you can provide a script that causes the corruption on your pool (and the system is otherwise idle)? If so, would you either be willing to share an image of the pool, or do a git-bisect yourself between 0.8.6 and 2.0?

cyberpower678 commented 2 years ago

@cyberpower678 When you say your "larger datasets consistently trigger" the issue, do you mean you can provide a script that causes the corruption on your pool (and the system is otherwise idle)? If so, would you either be willing to share an image of the pool, or do a git-bisect yourself between 0.8.6 and 2.0?

I am not willing to share an image because it has private data, but when I issue the command zfs send -Rwpv <dataset> | zfs receive -sFv <newdataset>, the new dataset will be corrupted. All of my smaller datasets arrive intact. I have only attempted a few of the larger ones, and the first two I sent both arrived corrupted. As for the git-bisect, I'm not really inclined to do that; I would rather someone more experienced do it.

aerusso commented 2 years ago

@cyberpower678 I entirely sympathize. I would not be willing to share a pool, even encrypted, either. However, this puts us in a deadlock: only you have access to the reproducer. I'll outline what I'd do if I somehow received your (hopefully not too large) pool image:

  1. I'd load it into a qcow image by dd-ing the source into a qemu-nbd device
  2. I'd create a VM as similar to your setup as possible.
  3. I'd snapshot the qcow2 images (pool and VM root)
  4. I'd mount your pool in the VM and run the reproducer
  5. I'd confirm the bug
  6. I'd reset the VM to the qcow2 snapshot
  7. I would repeat 4-6 with Linux 5.4, and your current ZFS (expect corruption)
  8. Repeat 7, but with ZFS 0.8.6 (expect no corruption)
  9. Repeat 7, but with ZFS compiled from git b8a90418f (0.8 branch merge-base; expect no corruption)
  10. I'd do a git bisect to find the commit on master that started the problem

If you're willing to do this (it would be a lot of recompiling), I can walk you through any steps that are not clear.

The end result of this process would be a specific commit that causes corruption.
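To make steps 1-3 a bit more concrete, the setup I have in mind is roughly this (device, size, and file names are placeholders):

qemu-img create -f qcow2 pool-copy.qcow2 500G
sudo modprobe nbd max_part=16
sudo qemu-nbd --connect=/dev/nbd0 pool-copy.qcow2
sudo dd if=/dev/sdX of=/dev/nbd0 bs=16M status=progress    # raw copy of the pool device into the qcow2
sudo qemu-nbd --disconnect /dev/nbd0
qemu-img snapshot -c baseline pool-copy.qcow2              # step 3: snapshot the image so it can be reset later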

Maltz42 commented 2 years ago

@cyberpower678 is that private data encrypted? I'm assuming that aerusso doesn't need the key, if that makes you any more comfortable...

Otherwise, I've never done anything like a git-bisect either, though I think I understand the idea. @aerusso, is it possible for a script to do that in an automated way that doesn't require a lot of expertise?

aerusso commented 2 years ago

IMO, git-bisect is the easy part. The loop is basically going to be:

  1. make clean
  2. make
  3. make deb (or whatever)
  4. reset the VM/ copy the files to the VM/install
  5. run reproducer
  6. git bisect good or git bisect bad depending on the result
  7. repeat

Setting up the VM and waiting for the copy to complete will be the hard part.
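In shell terms, each iteration looks roughly like this (a sketch; the VM-reset and reproducer steps stand in for whatever your setup uses):

git bisect start master zfs-0.8.6     # first the bad ref, then the known-good one
# then, for each commit git checks out:
make clean
sh autogen.sh && ./configure
make -j"$(nproc)" deb                 # or whatever package target fits your distro
# reset the VM to its snapshot, install the freshly built packages inside it, run the reproducer
git bisect good                       # or: git bisect bad, depending on the result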

@cyberpower678 Can you confirm if this happens without the raw send? I know there were some bugs associated with raw sending that have had some progress on them.