openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

L2ARC failed to rebuild after reboot #11787

Open aaronjwood opened 3 years ago

aaronjwood commented 3 years ago

System information

Type                  Version/Name
Distribution Name     Debian (Proxmox)
Distribution Version  10
Linux Kernel          5.4.103-1-pve
Architecture          x64
ZFS Version           2.0.3-pve2
SPL Version           2.0.3-pve2

Describe the problem you're observing

Had a system up and running for months. L2ARC was a few hundred GB. Rebooted the system and found that the L2ARC was empty.

Describe how to reproduce the problem

Not sure about this. I've rebooted in the past and had my L2ARC preserved. I haven't been able to find any logs around this either. If anyone has suggestions on what I can check I'd be happy to post it here. FWIW /sys/module/zfs/parameters/l2arc_rebuild_enabled is set to 1.

One difference from the last reboot I did where it worked: I had a kernel update this time. Is there some special handling that forces the L2ARC to not rebuild when a new kernel version is used or something?

gamanakis commented 3 years ago

> Is there some special handling that forces the L2ARC to not rebuild when a new kernel version is used or something?

No, that is not the case. Curious though what caused this. If it happens again could you post a "cat /proc/spl/kstat/zfs/arcstats" after it fails to rebuild the L2ARC contents? That might give us a clue.

You can also look in zpool history to see if there is anything L2ARC related.
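Concretely, something along these lines should surface both (a minimal sketch; the pool name is a placeholder and the grep patterns are just for narrowing the output):

# Rebuild-related counters in arcstats:
grep l2_rebuild /proc/spl/kstat/zfs/arcstats
# Anything cache/L2ARC related in the pool's command history:
zpool history <pool> | grep -iE 'cache|l2arc'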

aaronjwood commented 3 years ago

> Is there some special handling that forces the L2ARC to not rebuild when a new kernel version is used or something?
>
> No, that is not the case. Curious though what caused this. If it happens again could you post a "cat /proc/spl/kstat/zfs/arcstats" after it fails to rebuild the L2ARC contents? That might give us a clue.
>
> You can also look in zpool history to see if there is anything L2ARC related.

Yeah, will do. Haven't been able to reproduce it yet. I don't see anything relevant in zpool history unfortunately.

aaronjwood commented 3 years ago

Still haven't reproduced this yet but saw something that I thought I should share:

...
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    0
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    1
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0
...

Note the l2_rebuild_dh_errors field. I'm not clear on what this field means or if it's relevant to the issue I hit.

gamanakis commented 3 years ago

This means the header of the L2ARC device (used for rebuilding its contents) was corrupted for some reason.


aaronjwood commented 3 years ago

This would cause the behavior I saw, right? Is there anything else within ZFS that I can look at that might indicate why this header got corrupted?

aaronjwood commented 3 years ago

I just had this happen to me again last night due to a power outage. Once my server came back up I saw my L2ARC was empty again. FWIW l2_rebuild_dh_errors is now sitting at 3.

Anything else I can check here to get more information on why I keep failing to rebuild?

gamanakis commented 3 years ago

Try a "zdb -lll /dev/cache device" . It seems the header of the L2ARC device is corrupted and that is why ZFS cannot read it. Only way to restore normal operation would be removing and re-adding the cache device.


aaronjwood commented 3 years ago
zdb -lll /dev/nvme0n1
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

I can try removing and adding it back. It's interesting you say that because I had done that (for another reason) after upgrading to ZFS 2.0.0. I'll do it again right now and see if I can reproduce the issue again.

aaronjwood commented 3 years ago

Even after re-adding the device:

zdb -lll /dev/nvme0n1
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

Also tried with the ID which is how I actually add it to the pool:

zdb -lll /dev/disk/by-id/nvme-HP_SSD_EX950_2TB_HBSE59340600764
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3
gamanakis commented 3 years ago

This tells us the whole label of the cache device is corrupted and not only the header that is required for persistent L2ARC. I have never seen this before, either in production or in local VM testing.


gamanakis commented 3 years ago

Are you sure there is no underlying hardware problem?

You can try manually zeroing out the label and header of the removed device with "zpool labelclear" and then adding it again.


aaronjwood commented 3 years ago

Hah, wow! Guess I am lucky :) Some more info for context:

The drive seems to work fine as is, regardless of being used as an L2ARC or not. Here's more data about the drive:

=== START OF INFORMATION SECTION ===
Model Number:                       HP SSD EX950 2TB
Serial Number:                      HBSE59340600764
Firmware Version:                   SS0411B
PCI Vendor/Subsystem ID:            0x126f
IEEE OUI Identifier:                0x000000
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,073,818,112 [2.00 TB]
Namespace 1 Utilization:            27,793,907,712 [27.7 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Tue May  4 10:59:34 2021 PDT
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     75 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0450W       -        -    3  3  3  3     2000    2000
 4 -   0.0040W       -        -    4  4  4  4    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        30 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    100,762,616 [51.5 TB]
Data Units Written:                 81,736,109 [41.8 TB]
Host Read Commands:                 842,433,926
Host Write Commands:                1,132,098,062
Controller Busy Time:               21,608
Power Cycles:                       262
Power On Hours:                     7,994
Unsafe Shutdowns:                   103
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
Disk /dev/nvme0n1: 1.8 TiB, 2000073818112 bytes, 3906394176 sectors
Disk model: HP SSD EX950 2TB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 755545AB-39E5-8747-989A-8216460E3F6E

Device              Start        End    Sectors  Size Type
/dev/nvme0n1p1       2048 3906377727 3906375680  1.8T Solaris /usr & Apple ZFS
/dev/nvme0n1p9 3906377728 3906394111      16384    8M Solaris reserved 1

ZFS is the one managing the label, partitions, etc. of the entire disk, right?

I'll try your suggestion above and post back in a bit.

aaronjwood commented 3 years ago
zpool remove vault nvme-HP_SSD_EX950_2TB_HBSE59340600764
zpool labelclear nvme-HP_SSD_EX950_2TB_HBSE59340600764
zpool add vault cache nvme-HP_SSD_EX950_2TB_HBSE59340600764
zdb -lll /dev/disk/by-id/nvme-HP_SSD_EX950_2TB_HBSE59340600764
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

Is ZFS corrupting this?

aaronjwood commented 3 years ago

FWIW this also fails:

zdb -lll /dev/nvme0n1p9

but this works:

zdb -lll /dev/nvme0n1p1

It dumps WAY too much info for me to post here. Let me know if you want me to attach it if you think it'd be useful/relevant.

gamanakis commented 3 years ago

Could you do a zdb -l /dev/nvme0n1p1 and post here?

aaronjwood commented 3 years ago

For sure:

------------------------------------
LABEL 0
------------------------------------
    version: 5000
    state: 4
    guid: 6053299143396366987
    labels = 0 1 2 3
------------------------------------
L2ARC device header
------------------------------------
    magic: 6504978260106102853
    version: 1
    pool_guid: 8691708886196254037
    flags: 1
    start_lbps[0]: 31282282496
    start_lbps[1]: 31148523520
    log_blk_ent: 1022
    start: 4198400
    end: 2000063823872
    evict: 4198400
    lb_asize_refcount: 6148096
    lb_count_refcount: 315
    trim_action_time: 1620151866
    trim_state: 4

------------------------------------
L2ARC device log blocks
------------------------------------
log_blk_count:   315 with valid cksum
         0 with invalid cksum
log_blk_asize:   6148096
gamanakis commented 3 years ago

So the device is still on its first pass (it has not been filled to the end) and has written about 31 GB out of a total capacity of about 2 TB. You also seem to have TRIM for L2ARC enabled, right?

I suspect that the TRIM code may be interfering with the label (trimming it out), although the device is still on its first pass. You also use Proxmox, so ZFS is installed from packages, correct? Is the reported ZFS version (zfs -V) 2.0.3-pve2 as in the original post?

aaronjwood commented 3 years ago

Yeah, it was much more full before the power loss event last night :) Since I've been removing/adding it today it's pretty small right now as you said.

I do have TRIM enabled. I noticed that a TRIM is kicked off whenever I add this drive to a pool. Guessing that is intended behavior from ZFS. My conf for reference:

options zfs zfs_arc_max=40802189312
options zfs l2arc_noprefetch=0
options zfs l2arc_write_max=1074000000
options zfs l2arc_trim_ahead=100

Yes to your last question about Proxmox and ZFS. Since I've reported the issue there have been a few patches that came in:

zfs-2.0.4-pve1
zfs-kmod-2.0.4-pve1
gamanakis commented 3 years ago

Yes, adding a cache device to a pool with l2arc_trim_ahead>0 triggers a TRIM of the whole device.

Notably, your device reports trim_state=4, meaning the TRIM completed successfully. My initial thought was that the TRIM of the cache device was somehow interrupted, leaving it in an incomplete state and leading to re-trimming every time the pool was imported. That doesn't seem to be the case here, though.
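If it helps, the TRIM state and progress of the cache device can also be watched directly; a small sketch, using this thread's pool name as a placeholder:

# -t appends the TRIM status (e.g. trimming progress or completion time) to
# each device line in the status output:
zpool status -t vault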

aaronjwood commented 3 years ago

Interesting. Let me know if there's any other info you want to see, or if there's anything else you want me to try. I'm not quite sure what to do otherwise.

aaronjwood commented 3 years ago

This is a long shot, but there shouldn't be any weirdness in working with the reported (fake) sector size of 512/512, right?

gamanakis commented 3 years ago

I do not think so; the sector size is handled by core ZFS code, and the L2ARC code has barely anything to do with it.

aaronjwood commented 3 years ago

How come we do see the header when doing zdb -l nvme-HP_SSD_EX950_2TB_HBSE59340600764 or zdb -l /dev/nvme0n1p1 but not zdb -l /dev/nvme0n1 or zdb -l /dev/disk/by-id/nvme-HP_SSD_EX950_2TB_HBSE59340600764? Should the header be present on both partitions here?

BTW I tried the process of removing the L2ARC device, labelclear'ing it, and adding it back all while l2arc_trim_ahead was set to 0. Still see the same behavior.

aaronjwood commented 3 years ago

One other thing I noticed: when adding the device to the pool with l2arc_trim_ahead set to 100 I see this until the trimming has completed:

zdb -l nvme-HP_SSD_EX950_2TB_HBSE59340600764
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    state: 4
    guid: 2538846175622296943
    labels = 0 1 2 3
L2ARC device header not found

Once it's finished it sees the header:

zdb -l nvme-HP_SSD_EX950_2TB_HBSE59340600764
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    state: 4
    guid: 2538846175622296943
    labels = 0 1 2 3
------------------------------------
L2ARC device header
------------------------------------
    magic: 6504978260106102853
    version: 1
    pool_guid: 8691708886196254037
    flags: 1
    start_lbps[0]: 15741607936
    start_lbps[1]: 15607631872
    log_blk_ent: 1022
    start: 4198400
    end: 2000063823872
    evict: 4198400
    lb_asize_refcount: 2437120
    lb_count_refcount: 119
    trim_action_time: 1621575040
    trim_state: 4

------------------------------------
L2ARC device log blocks
------------------------------------
log_blk_count:   119 with valid cksum
                 0 with invalid cksum
log_blk_asize:   2437120

Is this expected behavior? The L2ARC device shows as ONLINE in zpool status while it is trimming (when the L2ARC header isn't present).

gamanakis commented 3 years ago

As far as I remember, this is expected behavior. Also, the label of a vdev is present in both partitions on the disk; that is normal.

Could you share all the non-default ZFS module parameters you use?
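One quick way to gather them is to dump every current parameter value and compare against the defaults (a sketch; this prints all parameters, not only the changed ones):

# Prints "path:value" for every ZFS module parameter:
grep . /sys/module/zfs/parameters/*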


aaronjwood commented 3 years ago

Ok, thanks. Here's what I have:

options zfs zfs_arc_max=40802189312
options zfs l2arc_noprefetch=0
options zfs l2arc_write_max=1074000000
options zfs l2arc_trim_ahead=100

FWIW I've also tried upgrading to kernel 5.11 yesterday (https://forum.proxmox.com/threads/kernel-5-11.86225/#post-378338), but the same behavior persists in all of the different scenarios I've gone through above. You can assume going forward that I'm now on Linux server 5.11.17-1-pve #1 SMP PVE 5.11.17-1~bpo10 (Wed, 12 May 2021 12:45:37 +0200) x86_64 GNU/Linux

gamanakis commented 3 years ago

I do not think kernel updates affect this. In your case the label is destroyed before the device is completely filled with data (i.e. l2ad_first=1, seen as flags=1 in the zdb output), which excludes a good chunk of code.

Upon importing the pool the following happens:

1) l2arc_add_vdev() creates the L2ARC-related pointer storing the properties of the device.
2) l2arc_rebuild_vdev() reads the device header; if it is faulty and l2arc_trim_ahead is set, the device is marked to be trimmed.

If the device is marked for trimming or for rebuild, no writes take place until those flags are cleared. Let me take another look.
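For reference, the rebuild path logs to the internal debug log, so one way to see what happens around an import is something like the following (a sketch; zfs_dbgmsg_enable is a standard module parameter, and the grep pattern is just an assumption for narrowing the output):

# Make sure the internal debug log is enabled, import the pool, then look for
# L2ARC rebuild messages:
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
zpool import <pool>
grep -i l2arc /proc/spl/kstat/zfs/dbgmsg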

gamanakis commented 3 years ago

Do you get this problem only after a power outage, or does it also occur after a normal export of the pool or a reboot?

aaronjwood commented 3 years ago

It happens with pretty much every reboot.

gamanakis commented 3 years ago

Let's verify this is TRIM related, as I suspect. Could you try with l2arc_trim_ahead = 0 and see if the L2ARC is preserved?
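For the test, the runtime value can be flipped via sysfs, though since the existing value comes from a modprobe options file it also needs to be changed there so it stays at 0 across the reboot. A minimal sketch:

# Runtime change (does not persist across a reboot on its own):
echo 0 > /sys/module/zfs/parameters/l2arc_trim_ahead
# Persist it by editing the modprobe options, e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs l2arc_trim_ahead=0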

aaronjwood commented 3 years ago

Yeah, I tried that a few days ago actually. Still had the same issue :(

gamanakis commented 3 years ago

I just found that comment. This means it is not TRIM related as I thought. The l2_rebuild_dh_errors counter in arcstats is incremented when ZFS fails to zio_read() the device header from the cache device, most probably because the label of the device is missing. Could you post a zpool status and a cat /proc/spl/kstat/zfs/dbgmsg?

aaronjwood commented 3 years ago
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:04:58 with 0 errors on Sun May  9 00:28:59 2021
config:

        NAME                                           STATE     READ WRITE CKSUM
        rpool                                          ONLINE       0     0     0
          nvme-HP_SSD_EX950_2TB_HBSE59340600778-part3  ONLINE       0     0     0

errors: No known data errors

  pool: vault
 state: ONLINE
  scan: scrub repaired 0B in 11:59:27 with 0 errors on Sun May  9 12:23:30 2021
config:

        NAME                                     STATE     READ WRITE CKSUM
        vault                                    ONLINE       0     0     0
          mirror-0                               ONLINE       0     0     0
            wwn-0x5000cca26ff4644b               ONLINE       0     0     0
            wwn-0x5000cca291c9ead5               ONLINE       0     0     0
          mirror-1                               ONLINE       0     0     0
            wwn-0x5000cca298c17afd               ONLINE       0     0     0
            wwn-0x5000cca264f71b1e               ONLINE       0     0     0
          mirror-2                               ONLINE       0     0     0
            wwn-0x5000cca264c8edc3               ONLINE       0     0     0
            wwn-0x5000cca26fe9939b               ONLINE       0     0     0
          mirror-3                               ONLINE       0     0     0
            wwn-0x50014ee2679b71c0               ONLINE       0     0     0
            wwn-0x50014ee2bcf1412b               ONLINE       0     0     0
          mirror-4                               ONLINE       0     0     0
            wwn-0x5000cca298c27a50               ONLINE       0     0     0
            wwn-0x5000cca29bcb02d8               ONLINE       0     0     0
        cache
          nvme-HP_SSD_EX950_2TB_HBSE59340600764  ONLINE       0     0     0

errors: No known data errors

Attached the dbgmsg since it's too big. dbgmsg.txt

joakimlemb commented 3 years ago

I have the same issue. The common denominator I found compared to @aaronjwood is that I'm also using an NVMe drive and also running the Proxmox version of Debian:

Linux server01 5.4.119-1-pve #1 SMP PVE 5.4.119-1 (Tue, 01 Jun 2021 15:32:00 +0200) x86_64 GNU/Linux
zfs-2.0.4-pve1
zfs-kmod-2.0.4-pve1

Running all zfs settings at default with the following exceptions:

echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
echo $(expr 384 \* 1024 \* 1024) > /sys/module/zfs/parameters/l2arc_write_max
echo $(expr 128 \* 1024 \* 1024) > /sys/module/zfs/parameters/l2arc_write_boost

zpool status:

  pool: disk-stiron1
 state: ONLINE
  scan: scrub repaired 0B in 07:20:55 with 0 errors on Sun Jun 13 07:45:00 2021
config:

        NAME                                                  STATE     READ WRITE CKSUM
        disk-stiron1                                          ONLINE       0     0     0
          ata-ST8000NE0004-1ZF11G_ZA24X4HW                    ONLINE       0     0     0
        cache
          nvme-KINGSTON_SA2000M81000G_50026B7684D2BC31-part1  ONLINE       0     0     0

errors: No known data errors

zdb -l nvme-KINGSTON_SA2000M81000G_50026B7684D2BC31-part1

zdb -l nvme-KINGSTON_SA2000M81000G_50026B7684D2BC31
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    state: 4
    guid: 15604254069511643822
    labels = 0 1 2 3
------------------------------------
L2ARC device header
------------------------------------
    magic: 6504978260106102853
    version: 1
    pool_guid: 10127015420669945674
    flags: 1
    start_lbps[0]: 21990946304
    start_lbps[1]: 21458048512
    log_blk_ent: 1022
    start: 4194816
    end: 330066034688
    evict: 4194816
    lb_asize_refcount: 499200
    lb_count_refcount: 42
    trim_action_time: 0
    trim_state: 0

------------------------------------
L2ARC device log blocks
------------------------------------
log_blk_count:   42 with valid cksum
                 0 with invalid cksum
log_blk_asize:   499200

cat /proc/spl/kstat/zfs/dbgmsg|grep -i "l2|rebuild"

1624013069   arc.c:9888:l2arc_dev_hdr_read(): L2ARC IO error (52) while reading device header, vdev guid: 6508425963602447417
1624013071   arc.c:9888:l2arc_dev_hdr_read(): L2ARC IO error (52) while reading device header, vdev guid: 6508425963602447417
1624013079   spa_history.c:309:spa_history_log_sync(): txg 222523 L2ARC rebuild no valid log blocks 
gamanakis commented 3 years ago

I am keen to find out what is causing this. Would you be able to test a possible fix?

aaronjwood commented 3 years ago

I'm open to it if the risk is low. I have a lot of critical data in both of my pools so if it's something that could possibly corrupt or destroy anything I don't think I'd be comfortable trying it.

gamanakis commented 3 years ago

Let's pinpoint when the corruption is occurring:

1) Wait for ZFS to write some data to the cache device.
2) Make sure the label is valid before exporting the pool, with zdb -l /dev/<cache device>.
3) Export the pool with zpool export <pool>.
4) See if the label is still valid (without importing the pool), again with zdb -l /dev/<cache device>.
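A concrete sketch of those steps, using the pool and device names from earlier in this thread as placeholders:

zdb -l /dev/disk/by-id/nvme-HP_SSD_EX950_2TB_HBSE59340600764   # step 2: label valid?
zpool export vault                                             # step 3
zdb -l /dev/disk/by-id/nvme-HP_SSD_EX950_2TB_HBSE59340600764   # step 4: still valid?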

aaronjwood commented 3 years ago

In my case, step 2 never happens. I have not been able to find any point in time where the label is valid on my drive.

gamanakis commented 3 years ago

I was able to reproduce this in a VM. Seems to occur with 2.1.99 too, and occurs whenever a whole drive is passed as a cache device.

aaronjwood commented 3 years ago

Awesome! Happy to hear that :) Is there something I can do to avoid this bug for now? Is it possible to partition the drive and only pass the partition as a cache device?

Just for my understanding, is it not right to pass the whole drive as a cache device? I thought ZFS needs/wants access to entire disks.

gamanakis commented 3 years ago

This dates back to at least 0.8.3. You can avoid it by partitioning the drive and passing the partition as the cache device. I will see if I can come up with a better solution.


aaronjwood commented 3 years ago

Can confirm this works:

parted -a optimal /dev/disk/by-id/nvme-HP_SSD_EX950_2TB_HBSE59340600764 mkpart primary 0% 100%
zpool add vault cache /dev/nvme1n1p1
zdb -l /dev/nvme1n1p1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    state: 4
    guid: 1221595849551603502
    labels = 0 1 2 3
------------------------------------
L2ARC device header
------------------------------------
    magic: 6504978260106102853
    version: 1
    pool_guid: 8691708886196254037
    flags: 1
    start_lbps[0]: 1349279744
    start_lbps[1]: 1342681088
    log_blk_ent: 1022
    start: 4198400
    end: 2000072212480
    evict: 4198400
    lb_asize_refcount: 2859008
    lb_count_refcount: 165
    trim_action_time: 1624226571
    trim_state: 4

------------------------------------
L2ARC device log blocks
------------------------------------
log_blk_count:   165 with valid cksum
                 0 with invalid cksum
log_blk_asize:   2859008

Will switch back to using the whole disk + using disk IDs as soon as the fix lands in ZFS. Thanks much for the temporary solution, was dying to get this working!

aaronjwood commented 3 years ago

Maybe I spoke too soon here. While everything looked good from zdb's point of view, after I rebooted I saw my L2ARC device was empty again. I thought I may have looked too soon, but I am watching my Grafana charts now and it isn't recovering :( Note that I had ~13 GiB in my L2ARC before I rebooted.

jumbi77 commented 3 years ago

I know it's off-topic, but could you post your Grafana ZFS dashboard JSON (e.g. via a GitHub gist)? Many thanks in advance.

aaronjwood commented 3 years ago

Hmm, I tried filling up my L2ARC to about ~13 GiB again, rebooted, and now it seems it was preserved. No idea why my previous reboot didn't work. Will continue to monitor this and let you know if it happens again...

aaronjwood commented 3 years ago

@jumbi77 my dashboard has a LOT more than what's in that screenshot, but definitely feel free to use it :) https://gist.github.com/aaronjwood/06950a0ade37c0b87a0bc6de53316d61

aaronjwood commented 3 years ago

I just rebooted 3 more times, and between each time I filled the L2ARC drive by another 10 GiB or so. Still seems to be holding. So losing data from my first reboot was caused by me messing something up, not accounting for something, or running into another rare issue/bug. I guess for now things seem solid.

gamanakis commented 3 years ago

In the VM things are interesting. Since at least 0.8.3 the behavior when using a whole disk for L2ARC is as follows:

zdb -l /dev/cache  returns "failed to unpack label" (for all labels)
zdb -l /dev/cache1 returns normal output including the header
zdb -l /dev/cache9 returns "failed to unpack label" (for all labels)

However, the L2ARC contents are preserved between reboots, i.e. persistent L2ARC works as it should when using a whole disk.

aaronjwood commented 3 years ago

Strange. Outside of the zdb output I've posted throughout this issue I always saw my L2ARC size go to 0 after a reboot. You don't see that behavior in your VM?

gamanakis commented 3 years ago

> Strange. Outside of the zdb output I've posted throughout this issue I always saw my L2ARC size go to 0 after a reboot. You don't see that behavior in your VM?

No, in the VM the L2ARC contents are restored correctly after a reboot (running OpenZFS master branch compiled directly from source).