openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

L2ARC failed to rebuild after reboot #11787

Open aaronjwood opened 3 years ago

aaronjwood commented 3 years ago

System information

Type                  Version/Name
Distribution Name     Debian (Proxmox)
Distribution Version  10
Linux Kernel          5.4.103-1-pve
Architecture          x64
ZFS Version           2.0.3-pve2
SPL Version           2.0.3-pve2

Describe the problem you're observing

Had a system up and running for months. L2ARC was a few hundred GB. Rebooted the system and found that the L2ARC was empty.

Describe how to reproduce the problem

Not sure about this. I've rebooted in the past and had my L2ARC preserved. I haven't been able to find any logs around this either. If anyone has suggestions on what I can check I'd be happy to post it here. FWIW /sys/module/zfs/parameters/l2arc_rebuild_enabled is set to 1.
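
The checks I'm aware of so far, as a minimal sketch (assuming the stock sysfs/procfs paths):

    # confirm persistent L2ARC rebuild is enabled (1 = enabled)
    cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
    # per-boot rebuild counters; nonzero *_errors values point to a failed rebuild
    grep l2_rebuild /proc/spl/kstat/zfs/arcstats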

One difference from the last reboot I did where it worked: I had a kernel update this time. Is there some special handling that forces the L2ARC to not rebuild when a new kernel version is used or something?

gamanakis commented 3 years ago

I just made a VM with proxmox-6.4 running 5.4.119 and the whole-disk L2ARC device rebuilds successfully there too (zfs-2.0.4-pve1). I plan to test with a real NVMe device on my server by the end of the month.

joakimlemb commented 3 years ago

I did some testing today and have not been able to reproduce the issue (I made no changes to the ZFS setup). I tested multiple manual reboots and also tested turning the system completely off, waiting 10 minutes, and then powering on again. Every time the L2ARC was rebuilt.

The only thing I can think of is that the last time I rebooted it was due to an APT kernel update, but I don't see how that should interfere with ZFS. Excerpt from /var/log/apt/term.log:

Setting up pve-kernel-5.4.119-1-pve (5.4.119-1) ...
Removing pve-kernel-5.4.106-1-pve (5.4.106-1) ...

aaronjwood commented 2 years ago

Things have been working pretty well for me since using the L2ARC on a specific partition instead of the whole device, though sometimes when I reboot my L2ARC size is wildly different. Look what happened this time (the gap is my reboot): [screenshot: https://user-images.githubusercontent.com/446283/136677293-e8cf7f31-4c0c-49d0-8cbb-5dfc9cecdc30.PNG]

Before the reboot my uncompressed L2ARC size was somewhere between 130-140 GB. How is it possible that the size grew a ton after rebooting? Note that the VM I'm using is running off of a different pool (dedicated NVMe) which has no L2ARC, so the VM itself is not contributing to the size at all.

gamanakis commented 2 years ago

It is possible after reboot for the L2ARC size to increase since it restores all buffers present on the device, including ones that have been invalidated. This is by design.
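
A minimal way to see this, assuming the default kstat paths, is to compare what the rebuild restored against the live L2ARC size right after boot:

    # restored-at-boot vs. current L2ARC sizes (bytes)
    grep -E '^(l2_rebuild_size|l2_rebuild_asize|l2_size|l2_asize) ' /proc/spl/kstat/zfs/arcstats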

aaronjwood commented 2 years ago

Great, thanks for confirming that!

aaronjwood commented 2 years ago

I just had the issue again where the L2ARC didn't rebuild after a reboot.

version:        2.1.1-pve1
srcversion:     26CA510B89F9E1A05415FF6
vermagic:       5.13.19-1-pve SMP mod_unload modversions

This time my L2ARC setup is a bit different:

        cache
          sdo1                                      ONLINE       0     0     0
          sdp1                                      ONLINE       0     0     0
          sdn1                                      ONLINE       0     0     0

Not sure what to check here, the behavior seems sporadic...

gamanakis commented 2 years ago

Could you post the output of cat /proc/spl/kstat/zfs/arcstats?

aaronjwood commented 2 years ago

Sure

13 1 0x01 123 33456 3819723010 4877793644536
name                            type data
hits                            4    16758513
misses                          4    4839220
demand_data_hits                4    4983938
demand_data_misses              4    3700930
demand_metadata_hits            4    11762719
demand_metadata_misses          4    74750
prefetch_data_hits              4    7121
prefetch_data_misses            4    1054043
prefetch_metadata_hits          4    4735
prefetch_metadata_misses        4    9497
mru_hits                        4    6172365
mru_ghost_hits                  4    231877
mfu_hits                        4    10576654
mfu_ghost_hits                  4    132528
deleted                         4    3240807
mutex_miss                      4    884
access_skip                     4    1
evict_skip                      4    3093
evict_not_enough                4    1
evict_l2_cached                 4    31861516288
evict_l2_eligible               4    17000919552
evict_l2_eligible_mfu           4    5533679616
evict_l2_eligible_mru           4    11467239936
evict_l2_ineligible             4    47326208
evict_l2_skip                   4    0
hash_elements                   4    6754248
hash_elements_max               4    7488933
hash_collisions                 4    5733669
hash_chains                     4    1620259
hash_chain_max                  4    9
p                               4    29556899712
c                               4    34359738368
c_min                           4    2104755968
c_max                           4    34359738368
size                            4    34138237744
compressed_size                 4    31164290048
uncompressed_size               4    34767844352
overhead_size                   4    1050404864
hdr_size                        4    1607998064
data_size                       4    31300067840
metadata_size                   4    914627072
dbuf_size                       4    56927232
dnode_size                      4    68893408
bonus_size                      4    21324480
anon_size                       4    6993920
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    26566782976
mru_evictable_data              4    25178240512
mru_evictable_metadata          4    210531328
mru_ghost_size                  4    4929249792
mru_ghost_evictable_data        4    2124873728
mru_ghost_evictable_metadata    4    2804376064
mfu_size                        4    5640918016
mfu_evictable_data              4    4969264640
mfu_evictable_metadata          4    296652288
mfu_ghost_size                  4    9956258816
mfu_ghost_evictable_data        4    9364298240
mfu_ghost_evictable_metadata    4    591960576
l2_hits                         4    42920
l2_misses                       4    2693922
l2_prefetch_asize               4    3776512
l2_mru_asize                    4    32493990912
l2_mfu_asize                    4    7601672192
l2_bufc_data_asize              4    39115636736
l2_bufc_metadata_asize          4    983802880
l2_feeds                        4    4714
l2_rw_clash                     4    0
l2_read_bytes                   4    405105664
l2_write_bytes                  4    57483476480
l2_writes_sent                  4    1884
l2_writes_done                  4    1884
l2_writes_error                 4    0
l2_writes_lock_retry            4    196
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    292
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    43661074432
l2_asize                        4    40099439616
l2_hdr_size                     4    167972640
l2_log_blk_writes               4    6893
l2_log_blk_avg_asize            4    17398
l2_log_blk_asize                4    99170304
l2_log_blk_count                4    6893
l2_data_to_meta_ratio           4    584
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    0
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    9
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
memory_all_bytes                4    67352190976
memory_free_bytes               4    21224951808
memory_available_bytes          3    18871056640
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    2837742896
arc_meta_limit                  4    25769803776
arc_dnode_limit                 4    2576980377
arc_meta_max                    4    3266657552
arc_meta_min                    4    16777216
async_upgrade_sync              4    26438
demand_hit_predictive_prefetch  4    889837
demand_hit_prescient_prefetch   4    0
arc_need_free                   4    0
arc_sys_free                    4    2353895168
arc_raw_size                    4    0
cached_only_in_progress         4    0
abd_chunk_waste_size            4    427008

gamanakis commented 2 years ago

At this point I am sure L2ARC fails to rebuild because it cannot physically read the device header from the device (i.e., it's not reading a header that turns out to be invalid; it actually fails to read from the device). Why, though, eludes me.

h1z1 commented 2 years ago

Could it be something silly like the device itself not being available yet on reboot? Are the reboots proper shutdowns or the result of something else? (A power failure was mentioned.)

aaronjwood commented 2 years ago

I can answer the second part :) Other than the one time where I mentioned it was a power failure, everything else has been from me doing sudo reboot.

gamanakis commented 2 years ago

I was mistaken: the IO error returned is actually 52, EBADE. In OpenZFS on Linux, EBADE is used to signal checksum errors, and zio_wait() here performs the checksum check embedded in the header. So it's not a failure to physically read from the device, but more likely a checksum error (ZIO_CHECKSUM_LABEL). The question is why.
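
A quick way to confirm this failure mode, assuming the internal debug log is enabled (the zfs_dbgmsg_enable module parameter):

    # error 52 (EBADE) here means the embedded header checksum failed
    grep 'L2ARC IO error' /proc/spl/kstat/zfs/dbgmsg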

gamanakis commented 2 years ago

My assumption is that for some reason the header may be written incompletely when the reboot happens, leading to corruption in corner cases. I also noticed that I had used ZIO_PRIORITY_ASYNC_WRITE as the priority in the write command.

@aaronjwood do you have the capacity to try the following patch:

diff --git a/module/zfs/arc.c b/module/zfs/arc.c
index 79e2d4381..dbe029ce7 100644
--- a/module/zfs/arc.c
+++ b/module/zfs/arc.c
@@ -10744,7 +10744,7 @@ l2arc_dev_hdr_update(l2arc_dev_t *dev)

        err = zio_wait(zio_write_phys(NULL, dev->l2ad_vdev,
            VDEV_LABEL_START_SIZE, l2dhdr_asize, abd, ZIO_CHECKSUM_LABEL, NULL,
-           NULL, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE));
+           NULL, ZIO_PRIORITY_SYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE));

        abd_free(abd);

aaronjwood commented 2 years ago

Sorry, unfortunately I am not able to deploy a custom build of ZFS on this environment :(

alexdrl commented 2 years ago

> I was mistaken: the IO error returned is actually 52, EBADE. In OpenZFS on Linux, EBADE is used to signal checksum errors, and zio_wait() here performs the checksum check embedded in the header. So it's not a failure to physically read from the device, but more likely a checksum error (ZIO_CHECKSUM_LABEL). The question is why.

I am having the exact same IO error 52 on TrueNAS SCALE RC.2 after a reboot.

devZer0 commented 1 year ago

On my system with zfs 2.1.6, after a reboot the L2ARC shows "0" in the alloc/capacity column. Before the reboot, 1.5 GB was allocated there. secondarycache=metadata is set.


root@pbs01:~# cat /proc/spl/kstat/zfs/arcstats |grep l2
evict_l2_cached                 4    0
evict_l2_eligible               4    349184
evict_l2_eligible_mfu           4    98304
evict_l2_eligible_mru           4    250880
evict_l2_ineligible             4    6656
evict_l2_skip                   4    0
l2_hits                         4    0
l2_misses                       4    770
l2_prefetch_asize               4    0
l2_mru_asize                    4    0
l2_mfu_asize                    4    0
l2_bufc_data_asize              4    0
l2_bufc_metadata_asize          4    0
l2_feeds                        4    227
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_log_blk_writes               4    0
l2_log_blk_avg_asize            4    0
l2_log_blk_asize                4    0
l2_log_blk_count                4    0
l2_data_to_meta_ratio           4    0
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    0
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    2
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0

shodanshok commented 1 year ago

I was just hit by this issue: after a reboot, L2ARC was not rebuilt. Before the reboot it was at 870/380 GB (size/compressed), while now it is at 23.5/11.5 GB.

# arc_summary.py
L2ARC size (adaptive):                                          23.5 GiB
        Compressed:                                    49.0 %   11.5 GiB
# cat arcstats | grep l2_reb
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    6
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    0
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0

What does l2_rebuild_unsupported 4 6 mean?

EDIT: from here it seems that no correct L2ARC headers were found. This is the first time I have experienced the issue, and it happened at the first reboot after I manually trimmed the pool and then enabled autotrim.

My zfs module options are: options zfs l2arc_noprefetch=0 l2arc_headroom=0 l2arc_write_max=268435456 l2arc_trim_ahead=100
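
For completeness, these persist across reboots via modprobe configuration; a sketch matching the options above:

    # /etc/modprobe.d/zfs.conf
    options zfs l2arc_noprefetch=0 l2arc_headroom=0 l2arc_write_max=268435456 l2arc_trim_ahead=100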

Reading the previous posts seems to confirm we have some sort of "interference" between trim (manual and/or automatic) and persistent L2ARC. Running zdb -l on my cache disks:

# zdb -l /dev/disk/by-path/pci-0000\:01\:00.1-ata-3.0-part5 
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    state: 4
    guid: 11135912476541031070
    labels = 0 1 2 3 
------------------------------------
L2ARC device header
------------------------------------
    magic: 6504978260106102853
    version: 1
    pool_guid: 11455676591258953768
    flags: 1
    start_lbps[0]: 7158409728
    start_lbps[1]: 7103791616
    log_blk_ent: 1022
    start: 4194816
    end: 451118891008
    evict: 4194816
    lb_asize_refcount: 2567168
    lb_count_refcount: 117
    trim_action_time: 1673546173
    trim_state: 4

------------------------------------
L2ARC device log blocks
------------------------------------
log_blk_count:   117 with valid cksum
                 0 with invalid cksum
log_blk_asize:   2567168

# zdb -l /dev/disk/by-path/pci-0000\:01\:00.1-ata-4.0-part5 
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    state: 4
    guid: 16280120629556133767
    labels = 0 1 2 3 
------------------------------------
L2ARC device header
------------------------------------
    magic: 6504978260106102853
    version: 1
    pool_guid: 11455676591258953768
    flags: 1
    start_lbps[0]: 7077579776
    start_lbps[1]: 7022830080
    log_blk_ent: 1022
    start: 4194816
    end: 451118891008
    evict: 4194816
    lb_asize_refcount: 2496000
    lb_count_refcount: 114
    trim_action_time: 1673546173
    trim_state: 4

------------------------------------
L2ARC device log blocks
------------------------------------
log_blk_count:   114 with valid cksum
                 0 with invalid cksum
log_blk_asize:   2496000

ZFS debug log show the following info related to my cache disks:

# cat dbgmsg | grep ata-[34]
1673546157   spa_history.c:307:spa_history_log_sync(): txg 42177271 trim vdev=/dev/disk/by-path/pci-0000:01:00.1-ata-3.0-part5 activated
1673546157   spa_history.c:307:spa_history_log_sync(): txg 42177271 trim vdev=/dev/disk/by-path/pci-0000:01:00.1-ata-4.0-part5 activated
1673546173   spa_history.c:307:spa_history_log_sync(): txg 42177278 trim vdev=/dev/disk/by-path/pci-0000:01:00.1-ata-4.0-part5 complete
1673546173   spa_history.c:307:spa_history_log_sync(): txg 42177278 trim vdev=/dev/disk/by-path/pci-0000:01:00.1-ata-3.0-part5 complete

I'm on Rocky Linux 9.1, kernel 5.14.0-162.6.1.el9_1.0.1.x86_64, ZFS 2.1.7-1

Regards.

rokezu commented 1 year ago

I was having the same issue in a newly configured pool to which I had recently added an L2ARC SSD (a couple of days later). The L2ARC was listed in the zpool iostat output, but showed 0 bytes allocated across reboots.

I'm using Arch Linux, so after adding the L2ARC I regenerated the zpool.cache file with:

zpool set cachefile=/etc/zfs/zpool.cache POOLNAME

Then I regenerated the init image with mkinitcpio (which, I assume, embeds that cache file in the image), restarted, and now there are 26 GB allocated in the L2ARC. Arch uses mkinitcpio to build the initrd like other systems, so it loads the zfs module through that and presumably uses that zpool.cache to import the pool; I haven't checked the internals of how it works.
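
In other words, the sequence was roughly the following (a sketch; the pool name is a placeholder and preset setup varies per install):

    # refresh the cachefile so it records the cache vdev, then rebuild the initramfs
    zpool set cachefile=/etc/zfs/zpool.cache POOLNAME
    mkinitcpio -P    # regenerate all configured initramfs presets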

It seems to me that upon rebooting, at least on Arch, the pool is imported from that cache, and I hadn't added the L2ARC device to the cache until later. But now it is persistent across reboots (or so it seems after a couple of reboots).

Just a couple of cents for somebody stumbling upon this.

clhedrick commented 1 year ago

I had the same problem: a reboot after a crash, and the L2ARC was empty. This was the first reboot after adding it.

arcstats shows l2_rebuild_dh_errors 4 6

I just rebuilt the cachefile, based on a comment here. However zdb -C doesn't show the cache devices, and there is no evidence of them in the cache file. Is this normal?

zdb -l /dev/sdbx1 shows labels; zdb -l /dev/sdbx does not. The cache appears to be working normally.

The log file shows:

Mar 13 15:29:38 eternal.lcsr.rutgers.edu systemd[1]: Starting Import ZFS pools by cache file...
Mar 13 15:29:39 eternal.lcsr.rutgers.edu zpool[5184]: cannot import 'big': one or more devices is currently unavailable
Mar 13 15:30:24 eternal.lcsr.rutgers.edu zpool[5184]: cachefile import failed, retrying
Mar 13 15:30:24 eternal.lcsr.rutgers.edu systemd[1]: Finished Import ZFS pools by cache file.

I wonder if there's a timing issue. If so, it might be a good idea to add the cache devices to the cache file, so that import will wait until they are ready.

richardm1 commented 1 year ago

I've just hit this bug after removing my cache device (a partition on an NVMe drive) and re-adding it.

arc.c:10440:l2arc_dev_hdr_read(): L2ARC IO error (52) while reading device header, vdev guid:

zpool status shows a cache device:

NAME          STATE     READ WRITE CKSUM
 rustpool      ONLINE       0     0     0
   mirror-0    ONLINE       0     0     0
     sda       ONLINE       0     0     0
     sdb       ONLINE       0     0     0
   mirror-1    ONLINE       0     0     0
     sdc       ONLINE       0     0     0
     sdh       ONLINE       0     0     0
 logs
   nvme1n1p2   ONLINE       0     0     0
 cache
   nvme1n1p3   ONLINE       0     0     0

But arc_summary does not:


------------------------------------------------------------------------
ZFS Subsystem Report                            Thu May 18 21:36:37 2023
Linux 6.2.14-1-clear                                            2.1.11-1
Machine: rm-desktop (x86_64)                                    2.1.11-1

L2ARC not detected, skipping section

I've tried export/re-importing the pool, I've rebooted several times, and I've secure-wiped the partition and re-created it as a bare partition with no filesystem. I've turned off l2arc_rebuild_enabled. I'm running out of ideas for this. I guess I could try throwing the whole drive at it?

shodanshok commented 1 year ago

If I am not mistaken, I have sometimes seen arc_summary report "L2ARC not detected" when the L2ARC is totally empty. Have you tried forcing some data into L2ARC (e.g., by reading some large file)? What does zpool iostat -v show?
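
For example, something like this (a sketch; /tank/bigfile is a placeholder, and note that with the default l2arc_noprefetch=1 a purely sequential read may be L2ARC-ineligible):

    # push some data through ARC and watch whether it feeds the cache vdev
    dd if=/tank/bigfile of=/dev/null bs=1M
    zpool iostat -v 5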

richardm1 commented 1 year ago

> If I am not mistaken, I have sometimes seen arc_summary report "L2ARC not detected" when the L2ARC is totally empty. Have you tried forcing some data into L2ARC (e.g., by reading some large file)? What does zpool iostat -v show?

You're a genius. My arcstat alias with custom columns was erroring out:

No L2ARC Here
l2hit%
No L2ARC Here
l2hits
No L2ARC Here
l2miss
No L2ARC Here
l2data
No L2ARC Here
l2mfu
No L2ARC Here
l2mru
No L2ARC Here
l2size
No L2ARC Here
l2asize
No L2ARC Here
l2bytes

Then I fired up a VM and it started working a few seconds later:

hit%  hits  miss  phit  ph%  mh%   mfu   mru  mfug  mrug     c  size  free  avail  arcsz  l2hit%  l2hits  l2miss  l2data  l2mfu  l2mru  l2size  l2asize  l2bytes
   0     0     0     0    0    0     0     0     0     0  3.9G  429M  118G   114G   429M       0       0       0       0   5.5K      0     17K     5.5K        0
  78   370   100     0    0   97   252   117     0     0  3.9G  383M  117G   113G   383M       0       0     100    1.6M   2.1M      0    3.5M     2.1M        0
  62   341   209     0    0   92   171   170     0     0  3.9G  400M  115G   111G   400M       0       0     209    6.2M   7.3M      0     10M     7.3M        0

zpool iostat -v did show a few writes to the cache device beforehand -- apparently it wasn't enough.

rokezu commented 1 year ago

> I was having the same issue in a newly configured pool to which I had recently added an L2ARC SSD (a couple of days later). The L2ARC was listed in the zpool iostat output, but showed 0 bytes allocated across reboots.

I recently had to change the L2ARC SSD drive, and after 10 reboots or so it still refused to hold data (iostat alloc = 0).

I was using the l2arc_headroom=0 ZFS module parameter beforehand, so I changed it to the default and added an explicit rebuild with l2arc_headroom=2 l2arc_rebuild_enabled=1, and now it is working.

Also, before it started to work again, arc_summary.py showed something like these lines at zero (now it shows a few megabytes), which had me wondering whether the L2ARC, while being filled, wasn't being properly used:

L2ARC size (adaptive):                                         153.8 GiB
        Header size:                                  0.0 %    0.0 MiB

andrei-trybukhouski commented 1 year ago

I faced the same issue as richardm1

cat /proc/spl/kstat/zfs/dbgmsg 
L2ARC IO error (52) while reading device header
ws1 andrey # arc_summary -s l2arc

------------------------------------------------------------------------
ZFS Subsystem Report                            Fri Jun 09 12:45:07 2023
Linux 6.1.2zen1-v2                                2.1.5-1ubuntu6~22.04.1
Machine: ws1 (x86_64)                             2.1.5-1ubuntu6~22.04.1

L2ARC not detected, skipping section
ws1 andrey # zpool status
  pool: pool0
 state: ONLINE
config:

    NAME         STATE     READ WRITE CKSUM
    pool0        ONLINE       0     0     0
      sda1       ONLINE       0     0     0
    cache
      nvme0n1p5  ONLINE       0     0     0

errors: No known data errors

I tried everything that was written here; the cache never survived a reboot. I didn't see this error before; it appeared after recreating the pool with ashift=12.

shodanshok commented 1 year ago

@gamanakis Any ideas on what is failing here?

gamanakis commented 1 year ago

There seems to be a problem with the ashift of the L2ARC device not being constant when the pool has an ashift other than 9; see issue #14313. This leads to an inability to read the L2ARC device header and rebuild its contents into ARC.
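
Until a fix lands, a possible workaround is to pin the ashift explicitly when adding the cache device, e.g. (a sketch; substitute your own pool and device names):

    # re-add the cache vdev with an ashift matching the pool (here 12 = 4K sectors)
    zpool remove tank nvme0n1p5
    zpool add -o ashift=12 tank cache nvme0n1p5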

aaronjwood commented 1 year ago

Excellent find! Glad there is some idea of the root of the problem. The seemingly random loss of the l2arc on reboot has been killer :(

gamanakis commented 1 year ago

Just opened #14963 for #14313. Anyone interested, feel free to give it a try.

rokezu commented 1 year ago

Thanks for looking into this. I can only share that I do use ashift=12 for my 4K drives but didn't specify one for the L2ARC device (I didn't catch that in the documentation?), so that could indeed be the reason. I can't test it right now, as my L2ARC is now working after several reboots and tries; I'm not sure what triggered it to rebuild, but it's fine now!

aaronjwood commented 1 year ago

> Thanks for looking into this. I can only share that I do use ashift=12 for my 4K drives but didn't specify one for the L2ARC device (I didn't catch that in the documentation?), so that could indeed be the reason. I can't test it right now, as my L2ARC is now working after several reboots and tries; I'm not sure what triggered it to rebuild, but it's fine now!

Wanted to share the same: I use an ashift of 12 in the pool that my cache device lives in.

FWIW, since experiencing the issue long ago, my cache device has changed from an HP EX950, to a set of 3 SATA SSDs (different brands and sizes), to currently an Intel 670p. No matter the device(s), I always hit the issue in a seemingly random manner.

andrei-trybukhouski commented 1 year ago

Thanks gamanakis, I applied https://github.com/openzfs/zfs/pull/14963 to master, but unfortunately it didn't solve the problem. What was done: rebuilt the code, rebooted, removed the old cache, wiped the cache device, added a new one with zpool add -o ashift=12 pool0 cache /dev/nvme0n1p5 (this is the same old pool with ashift=12), filled the cache, and rebooted. I don't see the L2ARC error anymore, but after a reboot the pool doesn't mount.


ws1 andrey # zpool import pool0
ws1 andrey # zpool list -v
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool0          97G  31.9G  65.1G        -         -     0%    32%  1.00x    ONLINE  -
  sdb4       97.2G  31.9G  65.1G        -         -     0%  32.9%      -    ONLINE
cache            -      -      -        -         -      -      -      -         -
  nvme0n1p5     1G      0  1019M        -         -     0%  0.00%      -    ONLINE 
ws1 andrey # cat /proc/spl/kstat/zfs/dbgmsg
timestamp    message 
1686418516   spa.c:6371:spa_tryimport(): spa_tryimport: importing pool0
1686418516   spa_misc.c:417:spa_load_note(): spa_load($import, config trusted): LOADING
1686418516   vdev.c:161:vdev_dbgmsg(): disk vdev '/dev/sdb4': best uberblock found for spa $import. txg 537
1686418516   spa_misc.c:417:spa_load_note(): spa_load($import, config untrusted): using uberblock with txg=537
1686418516   spa.c:8514:spa_async_request(): spa=$import async request task=4096
1686418516   spa.c:8514:spa_async_request(): spa=$import async request task=2048
1686418516   spa_misc.c:417:spa_load_note(): spa_load($import, config trusted): LOADED
1686418516   spa_misc.c:417:spa_load_note(): spa_load($import, config trusted): UNLOADING
1686418516   spa.c:6223:spa_import(): spa_import: importing pool0
1686418516   spa_misc.c:417:spa_load_note(): spa_load(pool0, config trusted): LOADING
1686418516   vdev.c:161:vdev_dbgmsg(): disk vdev '/dev/sdb4': best uberblock found for spa pool0. txg 537
1686418516   spa_misc.c:417:spa_load_note(): spa_load(pool0, config untrusted): using uberblock with txg=537
1686418516   spa.c:8514:spa_async_request(): spa=pool0 async request task=4096
1686418517   spa_misc.c:417:spa_load_note(): spa_load(pool0, config trusted): read 105 log space maps (105 total blocks - blksz = 131072 bytes) in 475 ms
1686418517   mmp.c:239:mmp_thread_start(): MMP thread started pool 'pool0' gethrtime 56674227960
1686418517   spa.c:8514:spa_async_request(): spa=pool0 async request task=1
1686418517   spa.c:8514:spa_async_request(): spa=pool0 async request task=2048
1686418517   spa_misc.c:417:spa_load_note(): spa_load(pool0, config trusted): LOADED
1686418517   spa.c:8514:spa_async_request(): spa=pool0 async request task=4096
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 1, smp_length 144, unflushed_allocs 159744, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 56678 ms, loading_time 10 ms, ms_max_size 535977984, max size error 535977984, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 2, smp_length 136, unflushed_allocs 684032, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 56688 ms, loading_time 11 ms, ms_max_size 534876160, max size error 534876160, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 3, smp_length 240, unflushed_allocs 139264, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 56699 ms, loading_time 9 ms, ms_max_size 536326144, max size error 536326144, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 4, smp_length 224, unflushed_allocs 65536, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 56708 ms, loading_time 10 ms, ms_max_size 536334336, max size error 536334336, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 5, smp_length 312, unflushed_allocs 45056, unflushed_frees 8192, freed 0, defer 0 + 0, unloaded time 56719 ms, loading_time 6 ms, ms_max_size 536551424, max size error 536547328, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 6, smp_length 336, unflushed_allocs 65536, unflushed_frees 4096, freed 0, defer 0 + 0, unloaded time 56726 ms, loading_time 18 ms, ms_max_size 536588288, max size error 536584192, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 7, smp_length 200, unflushed_allocs 77824, unflushed_frees 4096, freed 0, defer 0 + 0, unloaded time 56745 ms, loading_time 8 ms, ms_max_size 536702976, max size error 536698880, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 8, smp_length 208, unflushed_allocs 237568, unflushed_frees 4096, freed 0, defer 0 + 0, unloaded time 56753 ms, loading_time 13 ms, ms_max_size 535990272, max size error 535986176, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 9, smp_length 560, unflushed_allocs 69632, unflushed_frees 12288, freed 0, defer 0 + 0, unloaded time 56766 ms, loading_time 12 ms, ms_max_size 536260608, max size error 536256512, old_weight 700000000000001, new_weight 700000000000001
1686418517   spa_history.c:306:spa_history_log_sync(): txg 539 open pool version 5000; software version zfs-2.1.99-1994-gad9ba3a8a; uts ws1 6.1.2zen1-v2 #1 SMP PREEMPT_DYNAMIC Mon Jan  2 10:23:58 +03 2023 x86_64
... cut 
1686418523   spa_history.c:293:spa_history_log_sync(): command: zpool import pool0 
ws1 andrey # zdb -l nvme0n1p5
------------------------------------
LABEL 0 
------------------------------------
    version: 5000
    state: 4
    guid: 1555297785859709891
    labels = 0 1 2 3 
------------------------------------
L2ARC device header
------------------------------------
    magic: 6504978260106102853
    version: 1
    pool_guid: 400170765385537083
    flags: 1
    start_lbps[0]: 0
    start_lbps[1]: 0
    log_blk_ent: 0
    start: 4198400
    end: 1073217536
    evict: 4198400
    lb_asize_refcount: 0
    lb_count_refcount: 0
    trim_action_time: 0
    trim_state: 0

------------------------------------
L2ARC device log blocks
------------------------------------
No log blocks to read
ws1 andrey # sudo arc_summary -s l2arc

------------------------------------------------------------------------
ZFS Subsystem Report                            Sat Jun 10 20:51:21 2023
Linux 6.1.2zen1-v2                                2.1.99-1994_gad9ba3a8a
Machine: ws1 (x86_64)                             2.1.99-1994_gad9ba3a8a

L2ARC not detected, skipping section
ws1 andrey # cat /etc/modprobe.d/zfs.conf 
options zfs l2arc_rebuild_enabled=1
options zfs l2arc_headroom=2

I don't see any mention of nvme0n1p5 in the log. Is this normal?

gamanakis commented 1 year ago

Your cache device is 1GB and for persistent L2ARC to work it needs to be greater than 1GB. Try again with a slightly bigger partition.

yhaenggi commented 1 year ago

or set this accordingly: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#l2arc-rebuild-blocks-min-l2size
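
For example (a sketch; the value is in bytes and the default is 1 GiB):

    # lower the minimum L2ARC device size required for persistent-L2ARC rebuild
    echo 536870912 > /sys/module/zfs/parameters/l2arc_rebuild_blocks_min_l2size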

andrei-trybukhouski commented 1 year ago

Thanks a lot! This solved my issue. It seems I can get rid of btrfs and bcache already :)

DavidFair commented 4 months ago

I'm still seeing this issue in v2.2.3 and have to work around it with #14963 when relying on an implicit ashift value.

(Full command for reference: zpool add -o ashift=n media-pool cache /dev/disk/by-id/nvme-<UUID>-part4)

I'm happy to add and remove cache devices while testing (though I can't recreate the pool) if someone has a potential commit they want me to test.

Pool details: the pool and cache devices were created with v2.2.3; the pool was created first with ashift left as the default (0). Cache devices were added/removed later.

zpool get ashift syncthing-pool
NAME        PROPERTY  VALUE   SOURCE
syncthing-pool  ashift    0       default

Running zdb gives the true ashift value:

MOS Configuration:
        version: 5000
        name: 'syncthing-pool'
        state: 0
        txg: 94840
        pool_guid: 11492548645590784878
        errata: 0
        hostid: 2360010168
        hostname: 'pve'
        com.delphix:has_per_vdev_zaps
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 11492548645590784878
            create_txg: 4
            com.klarasystems:vdev_zap_root: 129
            children[0]:
                type: 'raidz'
                id: 0
                guid: 17832893951284810316
                nparity: 1
                metaslab_array: 256
                metaslab_shift: 34
                ashift: 12

ZFS options:

options zfs l2arc_rebuild_enabled=1 zfs_txg_timeout=30 l2arc_trim_ahead=100

(Not sure how to get the cache vdev's ashift; I'm happy to edit and add this.)