brenc opened this issue 4 months ago
I think I triggered this same bug but from cancelling a scrub instead, also on FreeBSD 14.1 with the exact same OpenZFS version.
I cancelled and restarted a scrub; it then seemed to get stuck at a certain point, and the symptoms seem to match. The "issued" number and the completion percentage made no progress over about 12 hours before I gave up. The "scanned" number was going up, but only by about 5-10 MB/s over those same hours, and I was seeing 1-2k scrub read IOPS on the pool but only 10-15 MB/s of reads.
Restarting the scrub just caused it to get stuck at the same point again, and pausing/unpausing seemed to make a little progress before it got stuck again slightly later. After restarting the machine, scrubbing went back to normal.
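(For clarity, the cancel/restart/pause operations above are just the standard zpool ones; "tank" below is a placeholder pool name:)
# cancel a running scrub
zpool scrub -s tank
# start a new scrub
zpool scrub tank
# pause / resume
zpool scrub -p tank
zpool scrub tank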
This is still happening for us. This month I paused the scrub and rebooted, and even though the remote backup process immediately reconnected and started pulling data from the snapshot dirs, the scrub completed. I'm having to reboot once per month.
I seem to be having the same bug, but I've never figured out what triggers it. I have two similar systems with ZFS, and on one of them, the monthly scrub has gotten stuck every single month now for maybe 6-7 months in a row. On the other, it has only happened once -- today.
Both run Debian bookworm with ZFS from backports.
Currently (on both systems):
Kernel: 6.1.0-26-amd64 (Debian 6.1.112-1)
ZFS: zfs-2.2.6-1~bpo12+2
Both are kept updated, so this has happened on older kernel and ZFS versions as well.
Both run ZFS native encryption (on the pool root). One is a 5-disk RAIDZ2 and the other (the one that hung today) is a 2-disk mirror. The second system also has a 1-disk zpool that is still scrubbing properly.
It's possible that it's related to the snapshot dir for me as well, as I use it now and then... but I'm doubtful that I use it every single month, as I don't have any automated tools that use it. There hasn't been a monthly scrub on system 1 (RAIDZ2) that has finished by itself since possibly early this year.
Pausing the scrub and rebooting always solves the issue, and as far as I can remember, I've never needed more than one reboot for a scrub to finish. It never seems to get stuck again, until next month.
The zpool status output sometimes progresses a minuscule amount, like 0.01% per 15+ minutes, but typically stands still. The (average) speed displayed keeps dropping towards 0 MB/s and the remaining time keeps going up, since no real progress is being made. The disks are still working quite hard, but I have no idea what they are actually doing.
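(For reference, I'm just watching the regular zpool status output to track this; something along these lines works, with "laniakea" being my pool name:)
$ watch -n 60 zpool status laniakea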
This is what it looks like when it's working (iowait <2%, r/s >500 for most disks):
serenity@hyperion:~ $ iostat -mxy 30
Linux 6.1.0-23-amd64 (hyperion) 08/11/2024 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.49 0.00 7.35 0.96 0.00 91.19
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
sda 922.60 65.89 1.27 0.14 0.86 73.13 9.43 0.08 0.00 0.00 3.24 8.51 0.00 0.00 0.00 0.00 0.00 0.00 0.33 88.10 0.85 34.25
sdb 904.43 65.88 1.30 0.14 0.89 74.59 9.27 0.08 0.00 0.00 3.48 8.83 0.00 0.00 0.00 0.00 0.00 0.00 0.33 93.00 0.87 35.00
sdc 68.33 65.98 0.00 0.00 41.71 988.76 9.57 0.08 0.00 0.00 1.77 8.56 0.00 0.00 0.00 0.00 0.00 0.00 0.33 46.20 2.88 96.53
sdd 882.97 65.90 1.50 0.17 1.01 76.43 9.43 0.08 0.00 0.00 3.08 8.71 0.00 0.00 0.00 0.00 0.00 0.00 0.33 83.30 0.94 37.19
sde 584.67 66.07 0.37 0.06 2.48 115.71 9.57 0.08 0.03 0.35 3.38 8.70 0.00 0.00 0.00 0.00 0.00 0.00 0.33 90.90 1.51 55.33
avg-cpu: %user %nice %system %iowait %steal %idle
0.35 0.00 6.42 1.52 0.00 91.71
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
sda 745.07 63.06 1.03 0.14 1.46 86.67 10.17 0.09 0.00 0.00 3.79 9.13 0.00 0.00 0.00 0.00 0.00 0.00 0.40 86.92 1.16 43.68
sdb 754.57 63.09 0.80 0.11 1.43 85.62 10.37 0.09 0.00 0.00 2.86 9.02 0.00 0.00 0.00 0.00 0.00 0.00 0.40 66.67 1.13 42.37
sdc 84.13 63.31 0.00 0.00 33.47 770.56 10.63 0.09 0.00 0.00 1.84 8.59 0.00 0.00 0.00 0.00 0.00 0.00 0.40 43.75 2.85 95.44
sdd 664.97 63.07 1.00 0.15 2.01 97.12 10.27 0.09 0.00 0.00 3.29 8.90 0.00 0.00 0.00 0.00 0.00 0.00 0.40 76.08 1.40 50.20
sde 440.00 62.99 0.27 0.06 3.97 146.60 10.83 0.09 0.00 0.00 4.74 8.63 0.00 0.00 0.00 0.00 0.00 0.00 0.40 77.75 1.83 64.47
Below is what it looked like after it had stopped making progress (iowait >10%, r/s <200 for all disks):
avg-cpu: %user %nice %system %iowait %steal %idle
0.56 0.00 3.21 11.95 0.00 84.28
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
sda 161.87 5.57 0.00 0.00 17.10 35.23 4.57 0.05 0.00 0.00 4.91 10.39 0.00 0.00 0.00 0.00 0.00 0.00 0.20 89.83 2.81 93.91
sdb 159.80 5.70 0.00 0.00 17.75 36.53 4.67 0.05 0.00 0.00 4.03 10.06 0.00 0.00 0.00 0.00 0.00 0.00 0.20 77.83 2.87 95.64
sdc 150.60 5.50 0.00 0.00 15.60 37.40 5.13 0.05 0.03 0.65 1.86 9.35 0.00 0.00 0.00 0.00 0.00 0.00 0.20 41.00 2.37 78.81
sdd 162.23 5.72 0.00 0.00 17.85 36.10 5.00 0.05 0.00 0.00 4.00 9.52 0.00 0.00 0.00 0.00 0.00 0.00 0.20 92.67 2.93 97.68
sde 151.87 5.57 0.00 0.00 14.42 37.58 4.80 0.05 0.00 0.00 3.72 10.08 0.00 0.00 0.00 0.00 0.00 0.00 0.20 83.83 2.23 73.96
avg-cpu: %user %nice %system %iowait %steal %idle
0.42 0.00 2.26 11.23 0.00 86.10
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
sda 166.73 5.85 0.00 0.00 17.29 35.92 3.50 0.03 0.00 0.00 3.99 9.07 0.00 0.00 0.00 0.00 0.00 0.00 0.13 84.00 2.91 96.89
sdb 165.30 6.00 0.00 0.00 17.31 37.17 3.53 0.03 0.07 1.85 3.27 8.83 0.00 0.00 0.00 0.00 0.00 0.00 0.13 79.75 2.88 96.15
sdc 163.43 5.78 0.00 0.00 15.55 36.24 3.70 0.03 0.00 0.00 1.54 8.29 0.00 0.00 0.00 0.00 0.00 0.00 0.13 33.50 2.55 84.71
sdd 166.33 5.96 0.00 0.00 17.54 36.69 3.80 0.03 0.00 0.00 3.67 8.39 0.00 0.00 0.00 0.00 0.00 0.00 0.13 92.25 2.94 98.04
sde 163.93 5.72 0.00 0.00 14.48 35.75 3.63 0.03 0.00 0.00 3.55 8.55 0.00 0.00 0.00 0.00 0.00 0.00 0.13 79.50 2.40 79.45
After a reboot, iostat returns to the pre-hang numbers where most disks have 500+ r/s and the scrub proceeds normally, finishing in maybe 12 hours. Without a reboot it won't ever finish.
This is what the second system looks like when the scrub is stuck. The mirror that is stuck is sdb+sdc; sda is a 1-disk zpool that doesn't need redundancy, and the scrub keeps working properly on that one. sdd is the boot SSD.
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 5.48 15.50 0.00 79.02
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
sda 173.00 171.57 0.00 0.00 17.26 1015.51 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.99 98.60
sdb 185.00 12.57 0.00 0.00 14.65 69.56 33.50 0.46 0.00 0.00 2.12 14.03 0.00 0.00 0.00 0.00 0.00 0.00 1.00 65.50 2.85 93.00
sdc 199.00 12.15 0.00 0.00 13.13 62.51 34.00 0.46 0.00 0.00 3.06 13.82 0.00 0.00 0.00 0.00 0.00 0.00 1.00 98.00 2.82 97.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The scrubs haven't ever found any issues while this has happened, and AFAIK the pools keep working as usual during the stuck scrubs.
Properties for the RAIDZ2 pool that has been getting stuck every month:
$ zpool get all laniakea
NAME PROPERTY VALUE SOURCE
laniakea size 18.2T -
laniakea capacity 85% -
laniakea altroot - default
laniakea health ONLINE -
laniakea guid 7937552882459901146 -
laniakea version - default
laniakea bootfs - default
laniakea delegation on default
laniakea autoreplace off default
laniakea cachefile - default
laniakea failmode wait default
laniakea listsnapshots off default
laniakea autoexpand off default
laniakea dedupratio 1.00x -
laniakea free 2.64T -
laniakea allocated 15.6T -
laniakea readonly off -
laniakea ashift 12 local
laniakea comment - default
laniakea expandsize - -
laniakea freeing 0 -
laniakea fragmentation 10% -
laniakea leaked 0 -
laniakea multihost off default
laniakea checkpoint - -
laniakea load_guid 5213226827043222604 -
laniakea autotrim off default
laniakea compatibility off default
laniakea bcloneused 0 -
laniakea bclonesaved 0 -
laniakea bcloneratio 1.00x -
laniakea feature@async_destroy enabled local
laniakea feature@empty_bpobj active local
laniakea feature@lz4_compress active local
laniakea feature@multi_vdev_crash_dump enabled local
laniakea feature@spacemap_histogram active local
laniakea feature@enabled_txg active local
laniakea feature@hole_birth active local
laniakea feature@extensible_dataset active local
laniakea feature@embedded_data active local
laniakea feature@bookmarks enabled local
laniakea feature@filesystem_limits enabled local
laniakea feature@large_blocks enabled local
laniakea feature@large_dnode active local
laniakea feature@sha512 enabled local
laniakea feature@skein enabled local
laniakea feature@edonr enabled local
laniakea feature@userobj_accounting active local
laniakea feature@encryption active local
laniakea feature@project_quota active local
laniakea feature@device_removal enabled local
laniakea feature@obsolete_counts enabled local
laniakea feature@zpool_checkpoint enabled local
laniakea feature@spacemap_v2 active local
laniakea feature@allocation_classes enabled local
laniakea feature@resilver_defer enabled local
laniakea feature@bookmark_v2 enabled local
laniakea feature@redaction_bookmarks enabled local
laniakea feature@redacted_datasets enabled local
laniakea feature@bookmark_written enabled local
laniakea feature@log_spacemap active local
laniakea feature@livelist enabled local
laniakea feature@device_rebuild enabled local
laniakea feature@zstd_compress active local
laniakea feature@draid enabled local
laniakea feature@zilsaxattr active local
laniakea feature@head_errlog active local
laniakea feature@blake3 disabled local
laniakea feature@block_cloning disabled local
laniakea feature@vdev_zaps_v2 disabled local
$ zfs get all laniakea
NAME PROPERTY VALUE SOURCE
laniakea type filesystem -
laniakea creation Fri Aug 18 2:14 2023 -
laniakea used 9.20T -
laniakea available 1.44T -
laniakea referenced 746K -
laniakea compressratio 1.03x -
laniakea mounted yes -
laniakea quota none default
laniakea reservation none default
laniakea recordsize 128K default
laniakea mountpoint /laniakea default
laniakea sharenfs off default
laniakea checksum on default
laniakea compression zstd local
laniakea atime off local
laniakea devices off local
laniakea exec off local
laniakea setuid off local
laniakea readonly off default
laniakea zoned off default
laniakea snapdir hidden local
laniakea aclmode discard default
laniakea aclinherit restricted default
laniakea createtxg 1 -
laniakea canmount on default
laniakea xattr sa local
laniakea copies 1 default
laniakea version 5 -
laniakea utf8only off -
laniakea normalization none -
laniakea casesensitivity sensitive -
laniakea vscan off default
laniakea nbmand off default
laniakea sharesmb off default
laniakea refquota none default
laniakea refreservation none default
laniakea guid 16392863579896121217 -
laniakea primarycache all default
laniakea secondarycache all default
laniakea usedbysnapshots 0B -
laniakea usedbydataset 746K -
laniakea usedbychildren 9.20T -
laniakea usedbyrefreservation 0B -
laniakea logbias latency default
laniakea objsetid 54 -
laniakea dedup off default
laniakea mlslabel none default
laniakea sync standard default
laniakea dnodesize auto local
laniakea refcompressratio 1.23x -
laniakea written 0 -
laniakea logicalused 9.49T -
laniakea logicalreferenced 476K -
laniakea volmode default default
laniakea filesystem_limit none default
laniakea snapshot_limit none default
laniakea filesystem_count none default
laniakea snapshot_count none default
laniakea snapdev hidden default
laniakea acltype off default
laniakea context none default
laniakea fscontext none default
laniakea defcontext none default
laniakea rootcontext none default
laniakea relatime on default
laniakea redundant_metadata all default
laniakea overlay on default
laniakea encryption aes-256-gcm -
laniakea keylocation prompt local
laniakea keyformat passphrase -
laniakea pbkdf2iters 2000000 -
laniakea encryptionroot laniakea -
laniakea keystatus available -
laniakea special_small_blocks 0 default
laniakea snapshots_changed Sun Oct 13 9:02:30 2024 -
laniakea prefetch all default
Update: The scrub on the second machine finished after just running zpool scrub -p followed by zpool scrub to resume, so perhaps a reboot is not necessary at all.
For the time being I added a cron job that pauses and resumes any active scrubs every now and then as an ugly workaround... I guess we'll see how well it works in a month or two.
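For anyone who wants to try the same workaround, a minimal sketch of such a script might look like this (pool names, the sleep interval, and the schedule are just examples, not what I actually run):
#!/bin/sh
# Pause and then resume any scrub that is currently in progress,
# to nudge a wedged scrub along. Pool names are examples.
for pool in laniakea tank; do
    if zpool status "$pool" | grep -q "scrub in progress"; then
        zpool scrub -p "$pool"   # pause
        sleep 60                 # give it a moment
        zpool scrub "$pool"      # resume
    fi
done
# Example /etc/cron.d entry (path is hypothetical), run every 4 hours:
# 0 */4 * * * root /usr/local/sbin/zfs-scrub-kick.sh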
System information
Describe the problem you're observing
We have a remote backup process that uses rsync to back up files from a server using snapshots in the .zfs/snapshot directories. I noticed that after I started using this process, subsequent scrubs (which run once per month on the 1st) never complete. I have been troubleshooting this for months and can confidently say that accessing files from these snapshot dirs is what triggers this bug. If I stop the remote backup process, reboot the server, and then run a scrub, it completes normally in under 12 hours. As soon as I start the remote backup process again, subsequent scrubs never complete even if left running for 5+ days. I have tried pausing and resuming the scrub as well as stopping it completely and restarting it. Nothing seems to correct this issue except a reboot. It always gets to around 70-80% done and then hangs unless I reboot.
Some additional observations: scrub speed appears to be much reduced while the scrub is stuck, whereas a normal scrub will run at ~3 GB/s, yet there is definitely still high read IO on the drives.
The scrub appears to be running even when no progress is made. % done never advances.
ZFS replication (this server pulls from the primary) essentially grinds to a halt as well.
Pool config:
I don't know if it matters, but I'm using encryption. Here are the encryption root settings:
Describe how to reproduce the problem
rsync files from the .zfs/snapshot directories, then run a scrub.
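A minimal sketch of that sequence, with hostnames, pool, dataset, and snapshot names as placeholders:
# on the backup host: read files out of a snapshot directory via rsync
rsync -a server:/tank/data/.zfs/snapshot/daily-2024-10-01/ /backup/data/
# on the server: start a scrub after the snapshot dirs have been accessed
zpool scrub tank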
Include any warning/errors/backtraces from the system logs
There are no relevant logs as far as I can tell.
I believe that's all the information I have available on this right now. Please let me know if you need anything else.