openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS corruption related to snapshots post-2.0.x upgrade #12014

Open jgoerzen opened 3 years ago

jgoerzen commented 3 years ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version Buster
Linux Kernel 5.10.0-0.bpo.5-amd64
Architecture amd64
ZFS Version 2.0.3-1~bpo10+1
SPL Version 2.0.3-1~bpo10+1

Describe the problem you're observing

Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:03:37 with 0 errors on Mon May  3 16:58:33 2021
config:

    NAME         STATE     READ WRITE CKSUM
    rpool        ONLINE       0     0     0
      nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>

Of note, the <0xeb51> is sometimes a snapshot name; if I zfs destroy the snapshot, it is replaced by this tag.

Bug #11688 implies that zfs destroy on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub without rebooting after seeing this kind of zpool status output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:
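
For reference, the workaround from #11688 boils down to something like the following (a rough sketch; the snapshot name is a placeholder, and as noted above it did not work here):

zpool status -v rpool        # note the affected snapshot name (or its <0x...> tag)
zfs destroy rpool/some/dataset@some-snapshot    # placeholder name for the flagged snapshot
zpool scrub rpool
zpool status -v rpool        # check whether the error entries have cleared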

[393801.328126] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[393801.328129] PANIC at arc.c:3790:arc_buf_destroy()
[393801.328130] Showing stack for process 363
[393801.328132] CPU: 2 PID: 363 Comm: z_rd_int Tainted: P     U     OE     5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[393801.328133] Hardware name: Dell Inc. XPS 15 7590/0VYV0G, BIOS 1.8.1 07/03/2020
[393801.328134] Call Trace:
[393801.328140]  dump_stack+0x6d/0x88
[393801.328149]  spl_panic+0xd3/0xfb [spl]
[393801.328153]  ? __wake_up_common_lock+0x87/0xc0
[393801.328221]  ? zei_add_range+0x130/0x130 [zfs]
[393801.328225]  ? __cv_broadcast+0x26/0x30 [spl]
[393801.328275]  ? zfs_zevent_post+0x238/0x2a0 [zfs]
[393801.328302]  arc_buf_destroy+0xf3/0x100 [zfs]
[393801.328331]  arc_read_done+0x24d/0x490 [zfs]
[393801.328388]  zio_done+0x43d/0x1020 [zfs]
[393801.328445]  ? zio_vdev_io_assess+0x4d/0x240 [zfs]
[393801.328502]  zio_execute+0x90/0xf0 [zfs]
[393801.328508]  taskq_thread+0x2e7/0x530 [spl]
[393801.328512]  ? wake_up_q+0xa0/0xa0
[393801.328569]  ? zio_taskq_member.isra.11.constprop.17+0x60/0x60 [zfs]
[393801.328574]  ? taskq_thread_spawn+0x50/0x50 [spl]
[393801.328576]  kthread+0x116/0x130
[393801.328578]  ? kthread_park+0x80/0x80
[393801.328581]  ret_from_fork+0x22/0x30

However I want to stress that this backtrace is not the original cause of the problem, and it only appears if I do a scrub without first rebooting.

After that panic, the scrub stalled -- and a second error appeared:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat May  8 08:11:07 2021
    152G scanned at 132M/s, 1.63M issued at 1.41K/s, 172G total
    0B repaired, 0.00% done, no estimated completion time
config:

    NAME         STATE     READ WRITE CKSUM
    rpool        ONLINE       0     0     0
      nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>
        rpool/crypt/debian-1/home/jgoerzen/no-backup@[elided]-hourly-2021-05-07_02.17.01--2d:<0x0>

I have found the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.

I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?

I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.

Describe how to reproduce the problem

I can't at will. I have to wait for a spell.

Include any warning/errors/backtraces from the system logs

See above

Potentially related bugs

HankB commented 2 years ago

I believe I'm seeing the same issue. I've been running Debian on an XPS 13 with ZFS on root since Buster was Testing. Over the years I saw a couple instances where zpool status reported a couple permanent errors, AFAIK in metadata in snapshots. These eventually went away. More recently (and at some point after upgrading to Bookworm) I had been seeing a lot more, up to 64 permanent errors on the root pool. I thought it might be the drive going bad so I replaced it with another brand and the errors continued. The errors may have started building after I implemented entire pool backup using syncoid. At one point I did get the diagnostic in dmesg output:

[764184.301775] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[764184.301778] PANIC at arc.c:3839:arc_buf_destroy()
[764184.301780] Showing stack for process 314
[764184.301781] CPU: 4 PID: 314 Comm: z_rd_int_0 Tainted: P        W  OE     5.18.0-2-amd64 #1  Debian 5.18.5-1
[764184.301784] Hardware name: Dell Inc. XPS 13 9370/0F6P3V, BIOS 1.20.0 06/06/2022
[764184.301785] Call Trace:
[764184.301788]  <TASK>
[764184.301790]  dump_stack_lvl+0x45/0x5a
[764184.301796]  spl_panic+0xd1/0xe9 [spl]
[764184.301805]  ? __slab_free+0xa0/0x2d0
[764184.301808]  ? zfs_zevent_post_cb+0x15/0x30 [zfs]
[764184.301893]  ? zfs_zevent_post+0x20f/0x280 [zfs]
[764184.301951]  ? kfree+0x2c5/0x2e0
[764184.301953]  ? preempt_count_add+0x68/0xa0
[764184.301956]  ? _raw_spin_lock+0x13/0x30
[764184.301959]  ? _raw_spin_unlock+0x15/0x30
[764184.301962]  arc_buf_destroy+0xed/0xf0 [zfs]
[764184.302003]  arc_read_done+0x25e/0x490 [zfs]
[764184.302075]  zio_done+0x3fc/0x1150 [zfs]
[764184.302146]  zio_execute+0x83/0x120 [zfs]
[764184.302217]  taskq_thread+0x2cb/0x4f0 [spl]
[764184.302224]  ? wake_up_q+0x90/0x90
[764184.302227]  ? zio_gang_tree_free+0x60/0x60 [zfs]
[764184.302298]  ? taskq_thread_spawn+0x50/0x50 [spl]
[764184.302303]  kthread+0xe8/0x110
[764184.302306]  ? kthread_complete_and_exit+0x20/0x20
[764184.302307]  ret_from_fork+0x22/0x30
[764184.302312]  </TASK>

At present I've installed (dual boot) Debian Bullseye, which includes these versions:

hbarta@tachi:~$ uname -a
Linux tachi 5.10.0-16-amd64 #1 SMP Debian 5.10.127-2 (2022-07-23) x86_64 GNU/Linux
hbarta@tachi:~$ zfs --version
zfs-2.0.3-9
zfs-kmod-2.0.3-9
hbarta@tachi:~$ 

Bookworm uses the 5.18.0 kernel and ZFS 2.1.5. I'm trying to duplicate this problem (permanent errors) on another laptop with a SATA SSD. At present it is on Debian Bullseye with kernel 5.18.0 and ZFS 2.1.5, both from backports, and has not demonstrated the problem.

It seems to be easy to provoke on my XPS 13 with Debian Bookworm. If there is anything I could do to help track this down, I would like to help.

Thanks!

Edit: I believe I've duplicated the error on a system running on a SATA SSD. There was an initial send to populate my home directory several days ago and none since. I have hourly complete pool backups to another host (using syncoid) and not a lot of other activity on this host. This host is on 5.18.0 and ZFS 2.1.5.

0xxon commented 2 years ago

Just to chime in - I assume I have the same error after upgrading from Ubuntu 20.04 to 22.04. This happened on several (but not all) zpools and rendered the volumes unmountable in Ubuntu 22.04 (IO error on mount).

I use encryption and manual snapshots for backups. I opted to downgrade to Ubuntu 20.04/ZFS 0.8.3, which was able to mount everything - and I am currently running scrubs.

(Screenshot attached: 2022-08-19 at 17:34:42)

0xxon commented 2 years ago

I played around with this a bit.

mat128 commented 2 years ago

@0xxon I suspect #13709 could be helpful in your situation. I went through the same input/output error when mounting encrypted filesystems after upgrading Ubuntu from 20.04 to 22.04. ZFS 0.8.3 to 2.1.4.

phreaker0 commented 1 year ago

I still get the crash with zfs 2.1.9:

[773591.783059] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[773591.783900] PANIC at arc.c:3847:arc_buf_destroy()

I get it if I scrub without rebooting after <0x0> snapshot permanent errors appear (the affected snapshots having been destroyed manually).

Two scrubs after a reboot will clear the errors without a crash.

rincebrain commented 1 year ago

I do so wish people plumbed decryption errors back into the error pipeline more sanely. Oh well. (Just speculating, that's one way you get errors logged without R/W/CKSUM error stat count increases.)

Is the backtrace still the same?

phreaker0 commented 1 year ago

Unfortunately I couldn't ssh into the system anymore because it uses ZFS as the root filesystem. Logs are also not available for the same reason. I only got the two relevant lines from the attached screenshot.

newdayhost commented 1 year ago

Ran into the same issue upgrading the kernel from 5.4.0-121 to 5.15.0-58 on Ubuntu 20.04.5. A double scrub only masked the errors rather than solving the issue, so I reverted back to the older kernel and all is good.

rincebrain commented 1 year ago

It shouldn't mask the errors - scrubbing just checks the checksums. It's also possible for decryption errors to arise, which aren't going to be noticed by scrub since it would need the keys to do that and you don't always have those.

You could give 13709's patch a go. It's certainly helped people in that situation, though without more data I can't speculate if it'd help you. It shouldn't hurt, though.
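
If it helps with diagnosis: those decryption failures do surface in the ZFS event stream even though the vdev READ/WRITE/CKSUM counters stay at zero. A rough sketch of watching for them (the class names match the zed lines quoted later in this thread):

# follow new events as they arrive; MAC/decryption failures show up as
# authentication-class ereports rather than READ/WRITE/CKSUM counter bumps
zpool events -f

# or dump what is already buffered and filter for the relevant classes
zpool events -v | grep -E 'ereport\.fs\.zfs\.(authentication|data)'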

siilike commented 1 year ago

I am experiencing a similar issue with zfs-2.1.9-1~bpo11+1 on kernel 5.10.0-19-amd64.

From time to time ZFS reports "permanent errors" with certain snapshots and receiving them on another system fails with "Input/output error".

After removing the snapshots along with some of the previous snapshots it works again.

This happens on two different datasets, both encrypted, one is also using a special device.

On other systems that use LUKS for encryption the same backup and replication model works flawlessly.

siilike commented 1 year ago

Some more details:

  1. As of now the issue disappeared in one pool. I destroyed the whole (most?) affected dataset and excluded it from future replication to the backup system as it wasn't actually needed (Docker datasets).
  2. On the other pool it still exists and keeps reappearing, even if I delete a week worth of snapshots.
  3. It works fine after deleting the snapshots and scrub clears the errors, but once a few new snapshots are taken the issue reappears.
  4. I tried renaming pool/A/B to pool/A/C, recreating pool/A/B and rsyncing data from pool/A/C to pool/A/B, but now it is still complaining about the only snapshot the new pool/A/B has. Likely the whole pool/A needs to be destroyed.
  5. On the receiving system I am not seeing #14252 or #12001 any more. I did upgrade to zfs-2.1.7-1~bpo11+1, so perhaps that fixed it.

morph027 commented 1 year ago

Are you using a special replication tool? I've noticed that this stopped happening after I switched from sanoid to something else.

siilike commented 1 year ago

I used a cron job to recursively create snapshots. The issue started when I set up pruning the unneeded ones.

I use syncoid for replication with --create-bookmark and --no-sync-snap.

Tried deleting most old bookmarks and all snapshots, but still no luck with the second dataset. It works the first time; by the second snapshot one dataset is bad, by the third another one goes bad, etc.
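
For context, the setup described above amounts to roughly the following (a sketch; dataset and host names are placeholders):

# hourly recursive snapshots from cron; pruning of old snapshots is handled separately
zfs snapshot -r tank/data@auto-$(date +%Y-%m-%d_%H%M)

# replication without syncoid's own sync snapshots, leaving a bookmark behind instead
syncoid --recursive --no-sync-snap --create-bookmark tank/data backupuser@backuphost:backup/data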

morph027 commented 1 year ago

Interesting. Had the same issue using sanoid and syncoid. Went away after using a different solution.

tjmnmk commented 1 year ago

Hello,

we are experiencing the same issue with the 4.19.277 vanilla kernel as well as with the 5.10.165 vanilla kernel from kernel.org.

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 22.04 (jammy)
Kernel Version 4.19.277 / 5.10.165
Architecture amd64
OpenZFS Version 2.1.9 (https://github.com/openzfs/zfs/releases/download/zfs-2.1.9/zfs-2.1.9.tar.gz without any patches)
Znapzend Version 0.21.0
icp_gcm_impl fastest (on 5.10.165) / generic (on 4.19.277)
icp_aes_impl fastest (on 5.10.165) / generic (on 4.19.277)
encryption aes-256-gcm
compression lz4

root@xxx:~# zpool status -v
  pool: pool1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:42:48 with 0 errors on Tue Mar 14 00:22:17 2023
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdb3    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        pool1/encrypted/lxd/containers/proxy2@2023-03-17-200000:<0x0>
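
Since the report above lists icp_gcm_impl / icp_aes_impl values, here is how those ICP module parameters can be checked or pinned at runtime (a sketch, assuming the usual ZFS-on-Linux sysfs paths; requires root):

# the active implementation is shown in square brackets
cat /sys/module/icp/parameters/icp_gcm_impl
cat /sys/module/icp/parameters/icp_aes_impl

# pin the generic implementation, e.g. to take the accelerated GCM path out of the picture
echo generic > /sys/module/icp/parameters/icp_gcm_impl
echo generic > /sys/module/icp/parameters/icp_aes_impl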

scratchings commented 1 year ago

Interesting. Had the same issue using sanoid and syncoid. Went away after using a different solution.

What are you using instead?

J0riz commented 11 months ago

System information

Type Version/Name
Distribution Name CloudLinux
Distribution Version 8.8 (Anatoly Filipchenko)
Kernel Version 4.18.0-477.27.2
Architecture amd64
OpenZFS Version 2.1.13 (with patch to prevent quota bug underflow https://github.com/openzfs/zfs/issues/3789#issuecomment-165605384 and patch to include #include <asm/fpu/xcr.h> https://github.com/openzfs/zfs/issues/12754 )
encryption aes-256-gcm
compression on

Describe the problem you're observing

We see similar behaviour when using encrypted datasets and creating snapshots on the same pool. We see this behaviour with ZFS 2.1.13 using aes-256-gcm encryption.

Reproducing this problem takes days or weeks, and it seems more likely to happen when more data is written to the filesystem. We also automatically take a snapshot every hour, which might raise the chance of hitting it. We get errors saying that unspecified files in snapshots have permanent errors. Rebooting the server and running two scrubs afterwards resolves the error.

Is there anything we can do to help resolve this issue?

zpool status -v
  pool: userdata
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:10:39 with 0 errors on Sat Nov 11 17:15:40 2023
config:
    NAME                                STATE     READ WRITE CKSUM
    userdata                            ONLINE       0     0     0
      mirror-0                          ONLINE       0     0     0
        nvme-KCD61LUL7T68_SERIAL0  ONLINE       0     0     0  block size: 512B configured, 4096B native
        nvme-KCD61LUL7T68_SERIAL1  ONLINE       0     0     0  block size: 512B configured, 4096B native
    spares
      nvme-KCD61LUL7T68_SERIAL2    AVAIL  
errors: Permanent errors have been detected in the following files:
        userdata/var-lve@hostbackup_1hour-2023-11-17-2000:<0x0>
        userdata/var-lve@hostbackup_1hour-2023-11-18-0300:<0x0>

timwhite1 commented 10 months ago

@morph027 did you switch to Znapzend or something else? I'd prefer using native encryption but with the current status I wonder if it is best to encrypt via underlying LUKS volumes to avoid replication failures / data corruption.

tjikkun commented 10 months ago

We have some hope that the root cause of this issue is the same as #15526. We are not sure, because we cannot really reproduce it other than by just letting a system run for a few weeks. When we read that the corruption issue in 2.2 was caused by a long-present underlying issue, we had some hope that solving that would also solve the issue here. We have now been running 2.1.14 for 18 days without triggering the corruption. This doesn't mean anything yet, but we're still hopeful. Has anyone run into this issue on 2.1.14 or 2.2.2?

phreaker0 commented 10 months ago

@tjikkun I'm running 2.2.2 and unfortunately still have the same issue.

rincebrain commented 10 months ago

Are you still seeing any errors with scrub tripping assertions, or is this "just" the snapshots reporting errors?

If the snapshots are always from things that were sent+received, could you possibly look at which snapshots this happened with, and then we can try to see what differs on both sides?

Blackclaws commented 10 months ago

For us it's always the sending side that's (falsely) reporting the corruption; there is no way to compare anything, as the snapshots aren't even sent out. Rebooting the system allows the snapshots to be sent fine again, so I guess it's some sort of corruption that happens in memory only and isn't an actual corruption of the underlying data.

Scrubbing without rebooting does not solve the issue either. Scrubs don't destroy or corrupt anything further, though; it's just that the only thing that fixes this is a reboot.

IvanVolosyuk commented 10 months ago

We have some hope that the root cause of this issue is the same as #15526. We are not sure, because we cannot really reproduce it other than by just letting a system run for a few weeks. When we read that the corruption issue in 2.2 was caused by a long-present underlying issue, we had some hope that solving that would also solve the issue here. We have now been running 2.1.14 for 18 days without triggering the corruption. This doesn't mean anything yet, but we're still hopeful. Has anyone run into this issue on 2.1.14 or 2.2.2?

The mentioned issue never causes scrub errors, as that corruption happens specifically during a read (or an omitted read).

ofthesun9 commented 10 months ago

@rincebrain I still have the issue with zfs 2.2.2

The "corrupted" snapshot will be reported/triggered during a "zfs send".

The log will display something like:

Dec 22 18:08:17 styx syncoid[2047838]: Sending incremental Pool1/ENCR/DATA/pv@autosnap_2023-12-22_14:00:30_hourly ... autosnap_2023-12-22_17:00:32_frequently (~ 4.1 MB):
Dec 22 18:08:19 styx syncoid[2047880]: warning: cannot send 'Pool1/ENCR/DATA/pv@autosnap_2023-12-22_16:15:14_frequently': Input/output error
Dec 22 18:08:19 styx zed[2047929]: eid=28826 class=data pool='Pool1' priority=2 err=5 flags=0x180 bookmark=83312:0:0:2078

A reboot will make the snapshot accessible again. Running a scrub twice will clear the error reported by zpool status.

I noticed that syncoid does a zfs send -nvP -I to get the size of the transfer, and then a zfs send -I to actually perform the transfer, which is the one that eventually fails.

I have added a sleep(5); between the two zfs send calls in the syncoid script to see if something different happens (too soon to be conclusive for the time being).
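
For anyone following along, the two invocations in question look roughly like this (a sketch; the snapshot names and the receive side are placeholders, the dataset name is taken from the log above):

# 1) dry run that syncoid uses to estimate the stream size
zfs send -nvP -I Pool1/ENCR/DATA/pv@autosnap_A Pool1/ENCR/DATA/pv@autosnap_B

# 2) the actual incremental send, which is the step that then fails with an I/O error
zfs send -I Pool1/ENCR/DATA/pv@autosnap_A Pool1/ENCR/DATA/pv@autosnap_B | ssh backuphost zfs receive -F backup/pv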

J0riz commented 10 months ago

Unfortunately, contrary to what my colleague @tjikkun hoped, https://github.com/openzfs/zfs/issues/15526 does not resolve the issue. The issue still exists with ZFS 2.1.14.

The 'corrupt' snapshot indeed never gets sent once it starts being reported as corrupt. Doing a reboot and two scrubs fixes the issue and makes the snapshot okay again; the snapshot also gets correctly sent to the remote server afterwards.

I don't know if sending is related to the cause of this issue, although ZFS starts to notice the corruption during a zfs send attempt.

 zpool status -v
  pool: userdata
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:25:21 with 0 errors on Mon Dec 11 17:30:22 2023
config:

    NAME                        STATE     READ WRITE CKSUM
    userdata                    ONLINE       0     0     0
      mirror-0                  ONLINE       0     0     0
        scsi-SERIAL1  ONLINE       0     0     0
        scsi-SERIAL2  ONLINE       0     0     0
      mirror-1                  ONLINE       0     0     0
        scsi-SERIAL3  ONLINE       0     0     0
        scsi-SERIAL4  ONLINE       0     0     0
    spares
      scsi-SERIAL5    AVAIL   
      scsi-SERIAL6    AVAIL   

errors: Permanent errors have been detected in the following files:

        userdata/usr-local@hostbackup_1hour-2024-01-01-2200:<0x0>

Errors during ZFS send attempts of the corrupt snapshot:

Jan 01 23:43:37 storagehost zed[1175206]: eid=95933 class=authentication pool='userdata' bookmark=91038:0:0:0
Jan 02 00:44:11 storagehost zed[1453743]: eid=96236 class=authentication pool='userdata' bookmark=91038:0:0:0
Jan 02 01:44:12 storagehost zed[1720919]: eid=96446 class=authentication pool='userdata' bookmark=91038:0:0:0
Jan 02 02:43:40 storagehost zed[1973009]: eid=96574 class=authentication pool='userdata' bookmark=91038:0:0:0
Jan 02 03:43:40 storagehost zed[2227976]: eid=96704 class=authentication pool='userdata' bookmark=91038:0:0:0
Jan 02 04:43:51 storagehost zed[2479810]: eid=96897 class=authentication pool='userdata' bookmark=91038:0:0:0
etc...

ofthesun9 commented 10 months ago

I noticed that syncoid does a zfs send -nvP -I to get the size of the transfer, and then a zfs send -I to actually perform the transfer, which is the one that eventually fails.

I have added a sleep(5); between the two zfs send calls in the syncoid script to see if something different happens (too soon to be conclusive for the time being).

Unfortunately, the above attempt failed; the bug happened again this morning.

I had a look at /proc/spl/kstat/zfs/dbgmsg, and among a lot of lines I found the following, which started to be reported at the time of the "zfs send" operation:

1704431303   zio_crypt.c:476:zio_do_crypt_uio(): error 52
1704431303   zio.c:571:zio_decrypt(): error 5
1704431303   zfeature.c:239:feature_get_refcount(): error 95
1704431303   dmu.c:471:dmu_spill_hold_existing(): error 2
1704431303   sa.c:368:sa_attr_op(): error 2
1704431303   zfs_dir.c:1204:zfs_get_xattrdir(): error 2
1704431303   dsl_dir.c:1347:dsl_dir_tempreserve_impl(): error 28
1704431303   zio_crypt.c:476:zio_do_crypt_uio(): error 52
1704431303   zio.c:571:zio_decrypt(): error 5
1704431303   zfeature.c:239:feature_get_refcount(): error 95
1704431303   zio_crypt.c:476:zio_do_crypt_uio(): error 52
1704431303   zio.c:571:zio_decrypt(): error 5
1704431303   zfeature.c:239:feature_get_refcount(): error 95
1704431303   zio_crypt.c:476:zio_do_crypt_uio(): error 52
1704431303   zio.c:571:zio_decrypt(): error 5
1704431303   zfeature.c:239:feature_get_refcount(): error 95
1704431303   zio_crypt.c:476:zio_do_crypt_uio(): error 52
1704431303   zio.c:571:zio_decrypt(): error 5
1704431303   zfeature.c:239:feature_get_refcount(): error 95
1704431303   dmu.c:471:dmu_spill_hold_existing(): error 2
1704431303   sa.c:368:sa_attr_op(): error 2
1704431303   zfs_dir.c:1204:zfs_get_xattrdir(): error 2
1704431303   dsl_dir.c:1347:dsl_dir_tempreserve_impl(): error 28
1704431303   dsl_dir.c:1347:dsl_dir_tempreserve_impl(): error 28
1704431303   dsl_dir.c:1347:dsl_dir_tempreserve_impl(): error 28
1704431303   dmu.c:471:dmu_spill_hold_existing(): error 2
1704431303   sa.c:368:sa_attr_op(): error 2
1704431303   zfs_dir.c:1204:zfs_get_xattrdir(): error 2
1704431303   dmu.c:471:dmu_spill_hold_existing(): error 2
1704431303   sa.c:368:sa_attr_op(): error 2
1704431303   zfs_dir.c:1204:zfs_get_xattrdir(): error 2
1704431303   zio_crypt.c:476:zio_do_crypt_uio(): error 52
1704431303   zio.c:571:zio_decrypt(): error 5
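
For what it's worth, error 52 in these entries appears to be ECKSUM (EBADE on Linux) coming back from the decryption path, i.e. a MAC/authentication failure, which zio_decrypt() then reports upward as EIO (error 5). A small sketch for capturing just those entries around a failing send:

# clear the debug ring buffer, reproduce the failing zfs send, then pull the crypto-related entries
echo 0 > /proc/spl/kstat/zfs/dbgmsg
# ... run the failing zfs send here ...
grep -E 'zio_do_crypt_uio|zio_decrypt' /proc/spl/kstat/zfs/dbgmsg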

J0riz commented 9 months ago

On the same staging server the issue came back after 9 days. We are unable to reproduce the behaviour directly, but if we leave a system running with a ZFS version newer than 0.8.6, it occurs with encrypted snapshots after a few days or weeks.

zpool status -v
  pool: userdata
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:25:22 with 0 errors on Thu Jan 11 17:30:23 2024
config:

    NAME                        STATE     READ WRITE CKSUM
    userdata                    ONLINE       0     0     0
      mirror-0                  ONLINE       0     0     0
        scsi-SERIAL1  ONLINE       0     0     0
        scsi-SERIAL2  ONLINE       0     0     0
      mirror-1                  ONLINE       0     0     0
        scsi-SERIAL3  ONLINE       0     0     0
        scsi-SERIAL4  ONLINE       0     0     0
    spares
      scsi-SERIAL5    AVAIL   
      scsi-SERIAL6    AVAIL   

errors: Permanent errors have been detected in the following files:

        userdata/etc-virtual@hostbackup_1hour-2024-01-12-0700:<0x0>
        userdata/var-container@hostbackup_4hour-2024-01-12-0701:<0x0>

We are unable to read all the data from the local corrupted snapshot, as ZFS is unable to mount it:

[root@storagehost hostbackup_4hour-2024-01-12-0701]# pwd
/var/container/.zfs/snapshot/hostbackup_4hour-2024-01-12-0701
[root@storagehost hostbackup_4hour-2024-01-12-0701]# find .
.

The following is logged in /proc/spl/kstat/zfs/dbgmsg when trying to read data from the corrupted snapshot:

1705048095   zfs_ctldir.c:1140:zfsctl_snapshot_mount(): Unable to automount /var/container/.zfs/snapshot/hostbackup_4hour-2024-01-12-0701 error=256

After a while we are again able to read the data in the snapshot. But nothing in particular seems to be logged in /proc/spl/kstat/zfs/dbgmsg that would explain what resolved it.

[root@storagehost hostbackup_4hour-2024-01-12-0701]# cd /var/container/.zfs/snapshot/hostbackup_4hour-2024-01-12-0701
[root@storagehost hostbackup_4hour-2024-01-12-0701]# ls
folder1 folder2 file1 file2 file3 etc...

As far as we can pinpoint, a snapshot gets flagged as corrupted during or somewhere around a ZFS send to the external backup server.

Jan 12 08:43:56 storagehost sudo[1214307]: backup : TTY=unknown ; PWD=/script/sendsnapshot ; USER=root ; COMMAND=/sbin/zfs hold remote_sync_anchor userdata/var-container@hostbackup_4hour-2024-01-12-0701
Jan 12 08:43:56 storagehost sudo[1214307]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jan 12 08:43:56 storagehost sudo[1214307]: pam_unix(sudo:session): session closed for user root
Jan 12 08:43:56 storagehost sudo[1214556]: backup : TTY=unknown ; PWD=/script/sendsnapshot ; USER=root ; COMMAND=/sbin/zfs send -I userdata/var-container@hostbackup_1hour-2024-01-12-0600 userdata/var-container@hostbackup_4hour-2024-01-12-0701
Jan 12 08:43:56 storagehost sudo[1214556]: pam_unix(sudo:session): session opened for user root by (uid=0)

Rebooting the server and performing two scrubs again fixes the errors.

I added additional journal and /proc/spl/kstat/zfs/dbgmsg logs in the attachment: journal+dbgmsg-logs.txt

gdevenyi commented 9 months ago

Please rename this issue to include the term "encryption"

dcarosone commented 9 months ago

A small note FWIW: I have a system (laptop) that used to produce this issue reasonably regularly (noted somewhere way above). The issue persisted through changes of OS (Ubuntu -> NixOS), obviously changes of ZFS version over time, and a change of the internal SSD (but not the pool; the swap was done via mirror-and-detach).

Anyway, none of this is conclusive, and of course I'm tempting fate by posting this, but I think the change that made the difference was the replication target. I use znapzend to take and send snapshots, to two destinations.

In the original setup, one of the destinations was a remote server, the other was a second pool on USB disks (with LUKS) that I would attach from time to time. I reconfigured that a while ago to use two different remote servers, and around the same time switched to using raw sends. I'm pretty sure I haven't seen the problem since.
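
For reference, a raw send keeps the records encrypted end to end, so the receiving pool never needs the keys loaded; a minimal sketch (dataset and host names are placeholders):

# -w/--raw sends the on-disk (still-encrypted) records; -u on the receive side avoids mounting
zfs send -w -I tank/data@older tank/data@newer | ssh backuphost zfs receive -u backup/data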

vuongtuha commented 1 month ago

Man, I wish you all could confirm that the problem only occurs with encrypted pools (native or LUKS). I'm a little bit nervous about my system.

rincebrain commented 1 month ago

LUKS or unencrypted don't seem to have this problem.

I also would suspect 2.2.5+ won't have this problem, if it's #11679, with #16104 in 2.2.5.

scratchings commented 1 month ago

I've been running 2.2.6 for just over a week now on a machine that was exhibiting the snapshot corruption issue, and haven't had a recurrence yet, but it's too early to be certain.

HankB commented 1 month ago

I think I've probably mentioned this before. I experience "permanent errors" when I do full pool (recursive) backups of my laptop, which is running Debian Bookworm with root on ZFS (using an rpool/bpool split). Over several years all permanent errors have "rolled off" as older snapshots are cleared up. I leave full pool backups disabled and try them once in a while to see if the problem persists. Several months ago it did.

I perform backups of my home filesystem (excluding things like ~/Downloads) and have not seen any corruption.

I have a desktop that performs full pool backups to a locally connected HDD and have never seen corruption on that.

At present my laptop is running 2.2.5 (with pending update to 2.2.6 on next reboot) on Debian Stable and I use syncoid for backing up.

$ zfs --version
zfs-2.2.6-1~bpo12+1
zfs-kmod-2.2.5-1~bpo12+1

I am not nervous but I also have multiple backups including off-site.

scratchings commented 1 month ago

I can confirm that 2.2.6 does not fix this snapshot corruption for me (or indeed the kernel panics and hanging zfs processes on the receiving end).

amano-kenji commented 1 month ago

@scratchings I want to make sure your errors weren't caused by RAM bitflips.

If you disassemble your computer, remove dust, and assemble it again, does the issue disappear? A lot of dust tends to cause errors in random computer parts. Sometimes, dust causes errors in RAM or GPU.

J0riz commented 1 month ago

I can also confirm that with ZFS 2.2.6 the problem unfortunately still exists. On a staging system an encrypted snapshot showed up as 'corrupt' after it was created, although rebooting the server makes the data in the snapshot available again.

We are still unable to reproduce it on demand, although writing more data to the filesystem seems to make the issue more likely to arise. This time we needed to wait about 16 days for the issue to show up.

 zpool status -v -t
  pool: userdata
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:14:47 with 0 errors on Mon Sep 23 13:42:53 2024
config:

    NAME                                STATE     READ WRITE CKSUM
    userdata                            ONLINE       0     0     0
      mirror-0                          ONLINE       0     0     0
        nvme-SERIAL1  ONLINE       0     0     0  block size: 512B configured, 4096B native  (100% trimmed, completed at Mon Sep 23 13:18:42 2024)
        nvme-SERIAL2  ONLINE       0     0     0  block size: 512B configured, 4096B native  (100% trimmed, completed at Mon Sep 23 13:18:40 2024)
    spares
      nvme-SERIAL3  AVAIL     (untrimmed)

errors: Permanent errors have been detected in the following files:

        userdata/root-fs@hostbackup_1hour-2024-10-01-1800:<0x0>

Errors during ZFS send attempts of the corrupt snapshot:

Oct 01 20:46:20 server zed[1254305]: eid=23332 class=authentication pool='userdata' bookmark=21992:0:0:0
Oct 01 21:32:37 server zed[1465726]: eid=23455 class=authentication pool='userdata' bookmark=21992:0:0:1
Oct 01 21:46:15 server zed[1511860]: eid=23485 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:7
Oct 01 21:46:15 server zed[1511869]: eid=23486 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:642
Oct 01 21:46:15 server zed[1511876]: eid=23487 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:653
Oct 01 21:46:15 server zed[1511881]: eid=23488 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:65
Oct 01 21:46:15 server zed[1511884]: eid=23489 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:718
Oct 01 21:46:15 server zed[1511887]: eid=23490 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:673
Oct 01 21:46:15 server zed[1511892]: eid=23491 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:623
Oct 01 21:46:15 server zed[1511896]: eid=23492 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:19
Oct 01 21:46:15 server zed[1511902]: eid=23494 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:64
Oct 01 21:46:15 server zed[1511904]: eid=23493 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:719
Oct 01 21:46:15 server zed[1511905]: eid=23495 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:8
Oct 01 21:46:15 server zed[1511908]: eid=23496 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:643
Oct 01 21:46:15 server zed[1511910]: eid=23497 class=data pool='userdata' priority=2 err=5 flags=0x180 bookmark=21992:0:0:11

You are unable to access the 'corrupt' data inside the snapshot after it happens. All the other snapshots before and after are fine.

[root@server snapshot]# ls -la hostbackup_1hour-2024-10-01-1800
ls: cannot access 'hostbackup_1hour-2024-10-01-1800/.': Object is remote
ls: cannot access 'hostbackup_1hour-2024-10-01-1800/..': Object is remote
total 0
d????????? ? ? ? ?            ? .
d????????? ? ? ? ?            ? ..
[root@server snapshot]# ls -la hostbackup_4hour-2024-10-01-2201
total 74
dr-xr-xr-x  14 root root       24 Oct  1 00:44 .
drwxrwxrwx  40 root root        2 Oct  2 14:00 ..
-rw-r--r--   1 root root        0 Jul 14  2022 1
-rw-r--r--   1 root linksafe    0 Apr 11  2023 2
etc ...

After a reboot, however, I can access the data in the snapshot again, so the data isn't actually corrupt after all.

[root@server snapshot]# ls -la hostbackup_1hour-2024-10-01-1800
total 74
dr-xr-xr-x  14 root root       24 Oct  1 00:44 .
drwxrwxrwx  40 root root        2 Oct  2 14:00 ..
-rw-r--r--   1 root root        0 Jul 14  2022 1
-rw-r--r--   1 root linksafe    0 Apr 11  2023 2
etc ...

Performing two scrubs after the reboot also makes the error go away and makes "errors: No known data errors" show again.

So it is a really annoying bug, as ZFS reports data as corrupt even though it actually isn't. Because of this issue we can't use ZFS 2.x with encrypted data for now; we still use ZFS 0.8.6 for encrypted data. For now I would just recommend others use ZFS 2.x only without encryption.

Just to reiterate: ZFS 0.8.6 doesn't have this issue.

amano-kenji commented 1 month ago

@J0riz If you remove dust from the computer case, does the issue go away? Dust causes electrostatic shock which flips bits in RAM. Random RAM errors can cause software errors.

After thoroughly cleaning my computer case, it seems I stopped seeing various errors.

aerusso commented 1 month ago

@amano-kenji This may have been the root cause of an issue you were having, but it is unlikely that RAM errors would cause this problem in 2.x but not in 0.8.y. Also, it sounds like you may have been experiencing multiple issues (and I'd expect that people reporting here aren't experiencing a cluster of weird/buggy behavior).

I do recommend everyone seeing these problems run memtest to rule out RAM issues, though. That's far more reliable than just cleaning out dust (though it won't help allergies! :-) .

J0riz commented 1 month ago

Although we are getting off track:

All our (staging) systems are running in a Tier 3+ datacenter. All air is filtered and I have never seen dust in our datacenter. We use ECC memory and use rasdaemon to monitor the reliability of all our RAM. We perform memtests on our staging servers regularly. If there were any hardware issue we would have known.

Just to reiterate: ZFS 0.8.6 doesn't have this issue with encrypted snapshots showing up with 'permanent errors'. ZFS 0.8.6 has been running reliably on dozens of systems in our datacenter. Also, downgrading to ZFS 0.8.6 makes this behaviour go away. I'm not aware of some magical de-dust feature in ZFS 0.8.6. That would be nice for my allergies. 🙃

Please, let's get back on topic.

cyberpower678 commented 1 month ago

I just want to follow up that I USED to have these issues, mainly brought on by bringing an older pool to a newer ZFS, where some mess was created because something in 2.x didn't like something from an earlier version. I usually fixed it either by purging the bad snapshot that triggered the corruption (using the various ZFS kernel module flags that suppress errors) and performing a scrub, or, if I couldn't purge the bad snapshot, by copying the data out and then purging and recreating the impacted dataset. It has not resulted in any data corruption/loss for me, and the ZFS pool has been humming along without issue since. So my suspicion is that users still impacted are coming in from older versions of ZFS, and rather than devs trying to figure out every nuance for everyone, which is probably almost impossible, it's probably just easier to recreate the faulty dataset.

For clarity, I'm using encrypted datasets

HankB commented 1 month ago

Just to reiterate: ZFS 0.8.6 doesn't have this issue with encrypted snapshots showing up with 'permanent errors'.

I've been running ZFS with native encryption on a laptop I purchased about 5 years ago. I'm not positive about the version of ZFS I started with but it might have been 0.7.x. At some point I began seeing these "permanent errors" in snapshots and suspected that the NVME SSD was beginning to fail. I purchased a Samsung 980 PRO to replace it, thinking that would likely be the most reliable device I could purchase at the time. The issues continued with the replacement. I found that if I disabled the full pool backups I had been running, the errors eventually went away as the snapshots were deleted. I have not seen this issue with the "normal" backups that just include my user files.

I've re-enabled the full pool backup occasionally to see if the problem persists. It has and when found, I disable that backup and the "permanent errors" eventually go away.

On 2024-09-23 I reinstated the full pool backups to see if "permanent errors" returned. There is now a single permanent error in the pool. The last time I did this, the pool reported over 100 errors before I disabled this backup.

I'm cautiously optimistic that the fixes reported in 2.2.5 and 2.2.6 have reduced but not fully eliminated the error. Creation information for the pool is:

2023-12-04.22:37:29 zpool create -o ashift=12 -o autotrim=on -O encryption=on -O keylocation=prompt -O keyformat=passphrase -O acltype=posixacl -O xattr=sa -O dnodesize=auto -O compression=lz4 -O normalization=formD -O relatime=on -O canmount=off -O mountpoint=/ -R /mnt rpool /dev/disk/by-id/nvme-HP_SSD_EX950_1TB_HBSE49202700837-part4

phreaker0 commented 1 month ago

I'm also running 2.2.6 and still have the same issues with my encrypted pools on several servers.

It typically happens after the server has been running for a couple of days; errors then show up on the backup replication of my SSD root pool to my HDD storage pool.

This is the way I do my hourly replication currently:

#!/bin/bash

# clear any snapshots already flagged with metadata errors, then replicate
/usr/local/bin/clear-zfs-snapshot-errors.sh
/usr/sbin/syncoid --no-resume -r --skip-parent --no-clone-handling --force-delete --exclude="rpool/var/lib/docker" --sendoptions="Lce" rpool storage/backup/rpool
if [ $? -ne 0 ]; then
  # first pass failed: clear the flagged snapshots again and retry without a new sync snapshot
  /usr/local/bin/clear-zfs-snapshot-errors.sh
  /usr/sbin/syncoid --no-resume -r --skip-parent --no-sync-snap --no-clone-handling --force-delete --exclude="rpool/var/lib/docker" --sendoptions="Lce" rpool storage/backup/rpool
  code=$?
  if [ $code -ne 0 ]; then
    # still failing: clear once more so the next hourly run starts clean
    /usr/local/bin/clear-zfs-snapshot-errors.sh
  fi

  exit $code
fi

The script for destroying the snapshots with metadata errors (if those aren't cleared, the ZFS replication won't work because of I/O errors):

#!/bin/bash

# pick the rpool snapshots listed under "Permanent errors" in zpool status -v
# (lines like pool/dataset@snap:<0x...>) and destroy each of them
zpool status -v | grep ':<0x' | grep rpool | sed 's#^\s*##g' | grep '@' | sed 's#:<0x.>$##'  | xargs -n1 --no-run-if-empty zfs destroy
exit 0

If the replication fails I will get the permanent errors. A failing replication looks like this:

Oct 02 16:25:38 craig ssd-replication.sh[674751]: INFO: Sending incremental rpool/ROOT@autosnap_2024-09-29_19:00:24_hourly ... syncoid_craig_2024-10-02:16:25:37-GMT02:00 to storage/backup/rpool/ROOT (~ 165 KB):
Oct 02 16:27:32 craig ssd-replication.sh[674751]: INFO: Sending incremental rpool/ROOT/ubuntu@autosnap_2024-09-29_19:00:13_hourly ... syncoid_craig_2024-10-02:16:27:32-GMT02:00 to storage/backup/rpool/ROOT/ubuntu (~ 867.0 MB):
Oct 02 16:27:37 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-09-30_04:00:06_hourly': Input/output error
Oct 02 16:27:41 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-09-30_10:00:08_hourly': Input/output error
Oct 02 16:27:42 craig ssd-replication.sh[679516]: cannot receive incremental stream: most recent snapshot of storage/backup/rpool/ROOT/ubuntu does not
Oct 02 16:27:42 craig ssd-replication.sh[679516]: match incremental source
Oct 02 16:27:42 craig ssd-replication.sh[679519]: mbuffer: error: outputThread: error writing to <stdout> at offset 0x27c0000: Broken pipe
Oct 02 16:27:42 craig ssd-replication.sh[679519]: mbuffer: warning: error during output to <stdout>: Broken pipe
Oct 02 16:27:42 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-09-30_12:00:09_hourly': signal received
Oct 02 16:27:42 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-09-30_12:30:08_frequently': Broken pipe
Oct 02 16:27:42 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-09-30_12:45:01_frequently': Broken pipe
Oct 02 16:27:42 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-09-30_13:00:12_hourly': Broken pipe
Oct 02 16:27:42 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-09-30_13:00:12_frequently': Broken pipe
Oct 02 16:27:42 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-09-30_13:15:04_frequently': Broken pipe
Oct 02 16:27:42 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-09-30_13:30:07_frequently': Broken pipe
Oct 02 16:27:42 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-09-30_13:45:02_frequently': Input/output error
Oct 02 16:27:42 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-09-30_14:00:13_hourly': Broken pipe
...
Oct 02 16:27:43 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-10-02_14:00:10_hourly': Broken pipe
Oct 02 16:27:43 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-10-02_14:00:10_frequently': Broken pipe
Oct 02 16:27:43 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@autosnap_2024-10-02_14:15:03_frequently': Broken pipe
Oct 02 16:27:43 craig ssd-replication.sh[679518]: warning: cannot send 'rpool/ROOT/ubuntu@syncoid_craig_2024-10-02:16:27:32-GMT02:00': Broken pipe
Oct 02 16:27:43 craig ssd-replication.sh[679518]: cannot send 'rpool/ROOT/ubuntu': I/O error
Oct 02 16:27:43 craig ssd-replication.sh[674751]: CRITICAL ERROR:  zfs send -L -c -e  -I 'rpool/ROOT/ubuntu'@'autosnap_2024-09-29_19:00:13_hourly' 'rpool/ROOT/ubuntu'@'syncoid_craig_2024-10-02:16:27:32-GMT02:00' | mbuffer  -q -s 128k -m 16M | pv -p -t -e -r -b -s 909076296 |  zfs receive  -F 'storage/backup/rpool/ROOT/ubuntu' 2>&1 failed: 256

Running two scrubs will clear the existing errors (provided the affected snapshots were destroyed beforehand). But once errors start showing they will only increase from there, and only a reboot will make sure that errors won't show up again for some time.

So far I haven't experienced any "real" data errors besides losing the affected snapshots.

So my suspicion is that users still impacted are coming in from older versions of ZFS, and rather than devs trying to figure out every nuance for everyone, which is probably almost impossible, it's probably just easier to recreate the faulty dataset.

I think I already tried that, but I will probably do it again to check whether it helps.

rincebrain commented 1 month ago

My suspicion would be that something is unsafely using some piece of metadata in the encrypted dataset being sent, at the same time as something else goes to use it, and you're getting a spurious decrypt/decompress/checksum error from that inconsistent state, and then it goes away on subsequent recheck.

But that's just a guess, it's not like I have one locally that reproduces it. It'd be useful to look at the zpool events messages to see exactly what object produced the error so we can inspect and try to reproduce it, probably.
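
Concretely, that would look something like the following (a sketch; the zed bookmark=... values are objset:object:level:blkid, and the dataset/object numbers below are placeholders):

# dump buffered events with their payloads; zio_objset / zio_object identify the affected block
zpool events -v

# match the objset number to a dataset name (zdb -d prints each dataset with its ID) ...
zdb -d userdata
# ... then inspect that object within the dataset (object number is a placeholder)
zdb -dddd userdata/some-dataset 1234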

Maltz42 commented 1 month ago

I've been running ZFS with native encryption on a laptop I purchased about 5 years ago. I'm not positive about the version of ZFS I started with but it might have been 0.7.x. At some point I began seeing these "permanent errors" in snapshots and suspected that the NVME SSD was beginning to fail. I purchased a Samsung 980 PRO to replace it, thinking that would likely be the most reliable device I could purchase at the time. The issues continued with the replacement. I found that if I disabled the full pool backups I had been running, the errors eventually went away as the snapshots were deleted. I have not seen this issue with the "normal" backups that just include my user files.

It's actually pretty well established when these issues first appeared, which makes their longevity even more puzzling/concerning. Native encryption was rolled out with v0.8.0 (May 2019 - 0.8.x is probably what you started on) and the corruption issues first appeared in 2.0.0 (Late 2020). They've gotten better over time, but it's still been years that these issues have persisted in some form or other.

rincebrain commented 1 month ago

It's nobody's responsibility to fix it.

There aren't, like, assigned areas of the project where it's so-and-so's job to fix this if it breaks, and the original company that contributed the project A) doesn't seem to hit these and B) appears to have stopped contributing some time ago.

The rest of the companies that use OpenZFS, to my knowledge, mostly don't use native encryption, so that leaves random volunteers or any exceptions, to fix these.

That's not intended as an indictment of anyone involved, it's just a statement of "nobody's responsible for making sure it gets done, so it doesn't get done if nobody is responsible for it and not enough people are in the set of {able, motivated by encountering it} to have it happen organically"