
Cyclic: Silent device failure and self-healing in combination with ZVOL destroy and async_destroy #15150

Open sebveit opened 1 year ago

sebveit commented 1 year ago

TL;DR: Even though I was able to solve the problem, the pool leaked some space. In my case it was only 128 MB, so I can live with that. I still think this is some kind of bug, because ZFS went into a cycle that it could not break out of on its own.

System information

Type                  Version/Name
Distribution Name     FreeBSD
Distribution Version  13.2-RELEASE-p2
Kernel Version        13.2-RELEASE-p2
Architecture          amd64
OpenZFS Version       zfs-2.1.9-FreeBSD_g92e0d9d18

Neither pool (zroot nor zdata) is encrypted (no ZFS native encryption and no geli). The system has ECC RAM.

Describe the problem you're observing

I observed a problem with one of my zvols (Permanent errors have been detected) right after a scrub. Since I no longer need that zvol, I decided to destroy it and started another scrub afterwards, but this time the same error was reported with hex values instead of a dataset name.
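The sequence of operations was roughly the following (a sketch for clarity; the zvol name is only illustrative, the real one is not given here):

# First scrub flags permanent errors on the zvol
zpool scrub zdata
zpool status -v zdata
# The zvol is no longer needed, so destroy it (freeing continues in the background via async_destroy)
zfs destroy zdata/somezvol
# Second scrub: the same permanent error remains, now shown as <hex>:<hex> object IDs
zpool scrub zdata
zpool status -v zdata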

After the scrub finished I rebooted the system, and now I'm seeing a lot of checksum errors in zpool status -vx (even though no program or VM is accessing the pool).

The scrub reportedly repaired 2.73M in 09:42:34 with 0 errors, but I'm still getting the hint mentioned above: Permanent errors have been detected in the following files.

Trying to get more information via zdb fails when issuing the following command: zdb -b zdata

The output of dmesg seems to be fine with no errors right after system reboot. The only error message displayed is related to zdb: pid 93858 (zdb), jid 0, uid 0: exited on signal 6 (core dumped)

Include any warning/errors/backtraces from the system logs

zdb -b zdata

Traversing all blocks to verify nothing leaked ...

loading concrete vdev 0, metaslab 115 of 116 ...
0 == bptree_iterate(spa->spa_meta_objset, spa->spa_dsl_pool->dp_bptree_obj, B_FALSE, count_block_cb, &zcb, NULL) (0x0 == 0x61)
ASSERT at /usr/src/sys/contrib/openzfs/cmd/zdb/zdb.c:6346:dump_block_stats()Abbruch(core dumped)
zpool status -xv
  pool: zdata
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 2.73M in 09:42:34 with 0 errors on Wed Aug  2 19:30:41 2023
config:

    NAME        STATE     READ WRITE CKSUM
    zdata       ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        ada0    ONLINE       0     0   580
        ada1    ONLINE       0     0   580
        ada2    ONLINE       0     0   580
        ada3    ONLINE       0     0   580
        ada4    ONLINE       0     0   580
        ada5    ONLINE       0     0   580
        ada6    ONLINE       0     0   580
        ada7    ONLINE       0     0   580

errors: Permanent errors have been detected in the following files:

        <0xffffffffffffffff>:<0x1>
zpool iostat -v zdata
              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zdata       20.2T  8.79T      7     85  58.9K   503K
  raidz2-0  20.2T  8.79T      7     85  58.9K   503K
    ada0        -      -      0      9  7.78K  63.1K
    ada1        -      -      0      9  7.79K  63.1K
    ada2        -      -      0      8  7.84K  63.3K
    ada3        -      -      0      9  7.77K  63.2K
    ada4        -      -      0     11  6.06K  62.2K
    ada5        -      -      0     11  7.75K  62.5K
    ada6        -      -      0     11  7.78K  62.4K
    ada7        -      -      0     11  6.14K  62.7K
----------  -----  -----  -----  -----  -----  -----
zdb -m zdata

Metaslabs:
    vdev          0      ms_unflushed_phys object 407
    metaslabs   116   offset                spacemap          free        
    ---------------   -------------------   ---------------   ------------
    metaslab      0   offset            0   spacemap   4173   free    42.8G
space map object 4173:
  smp_length = 0x19d990
  smp_alloc = 0x354fb0a000
    Flush data:
    unflushed txg=15381680

    metaslab      1   offset   4000000000   spacemap  37660   free    93.9G

... metaslabs 2 till 114 omitted for clarity ...

space map object 8694:
  smp_length = 0x4f210
  smp_alloc = 0x1749420000
    Flush data:
    unflushed txg=15381677

    metaslab    115   offset 1cc000000000   spacemap  50659   free    53.3G
space map object 50659:
  smp_length = 0x5f398
  smp_alloc = 0x32ab697000
    Flush data:
    unflushed txg=15381677

Log Space Maps in Pool:
Log Spacemap object 497 txg 15381675
space map object 497:
  smp_length = 0x470
  smp_alloc = 0x0
Log Spacemap object 651 txg 15381676
space map object 651:
  smp_length = 0x450
  smp_alloc = 0x9000
Log Spacemap object 652 txg 15381677
space map object 652:
  smp_length = 0x4d0
  smp_alloc = 0xffffffffffff7000
Log Spacemap object 578 txg 15381678
space map object 578:
  smp_length = 0x420
  smp_alloc = 0x0
Log Spacemap object 579 txg 15381679
space map object 579:
  smp_length = 0x4b0
  smp_alloc = 0x0
Log Spacemap object 580 txg 15381680
space map object 580:
  smp_length = 0x430
  smp_alloc = 0x0
Log Spacemap object 498 txg 15381681
space map object 498:
  smp_length = 0x460
  smp_alloc = 0x0
Log Spacemap object 581 txg 15381682
space map object 581:
  smp_length = 0x450
  smp_alloc = 0x0
Log Spacemap object 582 txg 15381683
space map object 582:
  smp_length = 0x380
  smp_alloc = 0x0

Log Space Map Obsolete Entry Statistics:
2        valid entries out of 57       - txg 15381675
2        valid entries out of 53       - txg 15381676
9        valid entries out of 62       - txg 15381677
13       valid entries out of 54       - txg 15381678
34       valid entries out of 61       - txg 15381679
34       valid entries out of 54       - txg 15381680
43       valid entries out of 57       - txg 15381681
51       valid entries out of 56       - txg 15381682
45       valid entries out of 45       - txg 15381683
233      valid entries out of 499      - total
sebveit commented 1 year ago

It turns out an incomplete zvol deletion was stalling the pool, and zdb correctly asserted with (0x0 == 0x61). One of my physical drives (ada3) was silently failing. After I had run several scrubs on the pool, it finally reported the failure via (ada3:ahcich3:0:0:0): CAM status: Command timeout in dmesg. smartctl -x /dev/ada3 showed 73 errors.

I took the failing device offline via zpool offline zdata ada3 and set sysctl vfs.zfs.free_leak_on_eio=1 because of the permanent error regarding the destroyed zvol. After that, zdb ran normally and the affected pool showed no further checksum errors or events.
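In other words, the recovery steps were (a minimal sketch of what I ran; adjust the device name to your setup):

# Take the silently failing disk out of the pool so it can no longer return bad data
zpool offline zdata ada3
# Allow the stalled background destroy to permanently leak the unreadable blocks and finish
sysctl vfs.zfs.free_leak_on_eio=1
# Verify: zdb should now traverse the pool without asserting
zdb -b zdata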

I might have lost some data (128M leaked), but I have to wait for a spare disk before I can figure that out.

As of now the affected pool looks like this:

# zpool status -v zdata
  pool: zdata
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 08:48:29 with 0 errors on Fri Aug  4 17:49:56 2023
config:

    NAME        STATE     READ WRITE CKSUM
    zdata       DEGRADED     0     0     0
      raidz2-0  DEGRADED     0     0     0
        ada0    ONLINE       0     0     0
        ada1    ONLINE       0     0     0
        ada2    ONLINE       0     0     0
        ada3    OFFLINE      0     0     0
        ada4    ONLINE       0     0     0
        ada5    ONLINE       0     0     0
        ada6    ONLINE       0     0     0
        ada7    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xffffffffffffffff>:<0x1>

zdb before (free_leak_on_eio=0):

# zdb -b -A zdata

Traversing all blocks to verify nothing leaked ...

loading concrete vdev 0, metaslab 115 of 116 ...
0 == bptree_iterate(spa->spa_meta_objset, spa->spa_dsl_pool->dp_bptree_obj, B_FALSE, count_block_cb, &zcb, NULL) (0x0 == 0x61)
6.94G completed (1169MB/s) estimated time remaining: 5hr 01min 49sec 
17.8T completed (20581MB/s) estimated time remaining: 0hr 02min 00sec
leaked space: vdev 0, offset 0x14cfbe000, size 294912
…
leaked space: vdev 0, offset 0x1cfdfeb58000, size 11243520
block traversal size 19644664246272 != alloc 22224349704192 (leaked 2579685457920)

    bp count:             113656434
    ganged count:                 0
    bp logical:      14709804564480      avg: 129423
    bp physical:     13932827621888      avg: 122587     compression:   1.06
    bp allocated:    19644664246272      avg: 172842     compression:   0.75
    bp deduped:                   0    ref>1:      0   deduplication:   1.00
    Normal class:    22224285573120     used: 70.31%
    Embedded log class       64143360     used:  0.02%

    additional, non-pointer bps of type 0:      30803
    Dittoed blocks on same vdev: 116017

async_destroy before (free_leak_on_eio=0):

# zpool get all zdata
...
zdata  freeing                        1.67T                         -
...
zdata  leaked                         0                             -
...
zdata  feature@async_destroy          active                        local

zdb after (free_leak_on_eio=1):

# zdb -b zdata

Traversing all blocks to verify nothing leaked ...

loading concrete vdev 0, metaslab 115 of 116 ...
17.8T completed (36129MB/s) estimated time remaining: 0hr 01min 06sec
leaked space: vdev 0, offset 0x40675cf0000, size 61440
leaked space: vdev 0, offset 0x1393e2f8e000, size 188743680
leaked space: vdev 0, offset 0x139407ac5000, size 61440
block traversal size 19644681965568 != alloc 19644870832128 (leaked 188866560)

        bp count:             113656522
        ganged count:                 0
        bp logical:      14709814276608      avg: 129423
        bp physical:     13932836321792      avg: 122587     compression:   1.06
        bp allocated:    19644681965568      avg: 172842     compression:   0.75
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        Normal class:    22105317494784     used: 69.93%
        Embedded log class       64561152     used:  0.02%

        additional, non-pointer bps of type 0:      30809
        Dittoed blocks on same vdev: 116054

async_destroy after (free_leak_on_eio=1):

# zpool get all zdata
...
zdata  freeing                        0                             -
...
zdata  leaked                         128M                          -
...
zdata  feature@async_destroy          enabled                       local
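The same properties can also be queried directly instead of picking them out of zpool get all (a small convenience sketch):

zpool get freeing,leaked,feature@async_destroy zdata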

I was getting massive amounts of checksum errors from zpool status -v zdata. zpool events spammed the same event every 5 seconds, only with a different ena, time and eid:

# zpool events -v
Aug  4 2023 22:18:24.128208128 ereport.fs.zfs.data
        class = "ereport.fs.zfs.data"
        ena = 0x8754996b2f300801
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x1095f287e8ca55f5
        (end detector)
        pool = "zdata"
        pool_guid = 0x1095f287e8ca55f5
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "wait"
        zio_err = 0x61
        zio_flags = 0x808881
        zio_stage = 0x1000000
        zio_pipeline = 0x1080000
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_priority = 0x2
        zio_objset = 0xffffffffffffffff
        zio_object = 0x1
        zio_level = 0x1
        zio_blkid = 0x4a83
        time = 0x64cd5d10 0x7a44d00 
        eid = 0x5566

I hope that ZFS will clear the permanent errors recorded by zpool status -v zdata once I have resilvered the replacement device. Maybe this information helps someone else with similar issues.

This issue might be related to the following issues: https://github.com/openzfs/zfs/issues/3094#issuecomment-74444961, https://github.com/openzfs/zfs/issues/3108, https://github.com/openzfs/zfs/issues/8989#issuecomment-509229197, https://bugs.freebsd.org/bugzilla/show_bug.cgi?format=multiple&id=218954

sebveit commented 1 year ago

Further information on the vfs.zfs.free_leak_on_eio sysctl used in this case can be found at https://man.freebsd.org/cgi/man.cgi?query=zfs&sektion=4&manpath=FreeBSD:

zfs_free_leak_on_eio=0|1 (int)
If destroy encounters an EIO while reading metadata (e.g. indirect blocks), space referenced by the missing metadata can not be freed. Normally this causes the background destroy to become "stalled", as it is unable to make forward progress. While in this stalled state, all remaining space to free from the error-encountering filesystem is "temporarily leaked". Set this flag to cause it to ignore the EIO, permanently leak the space from indirect blocks that can not be read, and continue to free everything else that it can.

The default "stalling" behavior is useful if the storage partially fails (i.e. some but not all I/O operations fail), and then later recovers. In this case, we will be able to continue pool operations while it is partially failed, and when it recovers, we can continue to free the space, with no leaks. Note, however, that this case is actually fairly rare.

Typically pools either

  1. fail completely (but perhaps temporarily, e.g. due to a top-level vdev going offline), or
  2. have localized, permanent errors (e.g. disk returns the wrong data due to bit flip or firmware bug).

In the former case, this setting does not matter because the pool will be suspended and the sync thread will not be able to make forward progress regardless. In the latter, because the error is permanent, the best we can do is leak the minimum amount of space, which is what setting this flag will do. It is therefore reasonable for this flag to normally be set, but we chose the more conservative approach of not setting it, so that there is no possibility of leaking space in the "partial temporary" failure case.
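On FreeBSD the tunable can be inspected and toggled at runtime via sysctl (a minimal sketch; persisting it in /etc/sysctl.conf is shown only as an option and was not done in this case):

# Show the current value (defaults to 0)
sysctl vfs.zfs.free_leak_on_eio
# Enable it for the running system
sysctl vfs.zfs.free_leak_on_eio=1
# Optionally persist it across reboots
echo 'vfs.zfs.free_leak_on_eio=1' >> /etc/sysctl.conf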
sebveit commented 1 year ago

After replacing and resilvering the failed device (/dev/ada3), the pool status returned to normal:

# zpool status -v zdata
  pool: zdata
 state: ONLINE
  scan: resilvered 2.18T in 08:32:56 with 0 errors on Fri Aug 11 22:25:49 2023
config:

    NAME        STATE     READ WRITE CKSUM
    zdata       ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        ada0    ONLINE       0     0     0
        ada1    ONLINE       0     0     0
        ada2    ONLINE       0     0     0
        ada3    ONLINE       0     0     0
        ada4    ONLINE       0     0     0
        ada5    ONLINE       0     0     0
        ada6    ONLINE       0     0     0
        ada7    ONLINE       0     0     0

errors: No known data errors
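The replacement itself was the standard procedure, roughly along these lines (a sketch; it assumes the new disk shows up under the same device name, ada3):

# Resilver onto the newly installed disk
zpool replace zdata ada3
# Watch the resilver progress
zpool status -v zdata
# Once the resilver is done, clear any remaining error records
zpool clear zdata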
sebveit commented 1 year ago

Output of zdb after replacing the physical device:

zdb -b zdata

Traversing all blocks to verify nothing leaked ...

loading concrete vdev 0, metaslab 115 of 116 ...
17.8T completed (38924MB/s) estimated time remaining: 0hr 00min 01sec
leaked space: vdev 0, offset 0x40675cf0000, size 61440
leaked space: vdev 0, offset 0x1393e2f8e000, size 188743680
leaked space: vdev 0, offset 0x139407ac5000, size 61440
block traversal size 19641772400640 != alloc 19641961267200 (leaked 188866560)

    bp count:             113654585
    ganged count:                 0
    bp logical:      14709565895168      avg: 129423
    bp physical:     13930702580224      avg: 122570     compression:   1.06
    bp allocated:    19641772400640      avg: 172819     compression:   0.75
    bp deduped:                   0    ref>1:      0   deduplication:   1.00
    Normal class:    19641900969984     used: 62.14%
    Embedded log class       60334080     used:  0.02%

    additional, non-pointer bps of type 0:      31587
    Dittoed blocks on same vdev: 116041

This command now executes normally, without the assertion failure seen before.

sebveit commented 1 year ago

Even though I was able to solve the problem, the pool leaked some space. In my case it was only 128 MB, so I can live with that. I still think this is some kind of bug, because ZFS went into a cycle that it could not break out of on its own:

  1. The culprit (/dev/ada3) falsely reported that it was working just fine.
  2. Since ZFS wasn't aware of the device failure (it showed as ONLINE in zpool status), it hopelessly kept trying to repair the bad (meta)data (e.g. the permanent error reported by zpool status).
  3. While ZFS was trying to fix the problem, zpool events and /var/log/messages were flooded with the same error regarding the zvol, which caused earlier errors/events to be rotated away. Therefore I was unable to find out which events had led to this situation.

The hopeless act of self-healing and the flooding of the logs with the same error put additional load on the pool.
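A quick way to see how badly the event log is being flooded is to count events per class (a rough sketch, assuming the default one-event-per-line output of zpool events):

# Count occurrences of each event class reported for the pool
zpool events zdata | awk 'NR > 1 { print $NF }' | sort | uniq -c | sort -rn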

I was forced to manually "open the gates" by setting sysctl vfs.zfs.free_leak_on_eio=1, as suggested by the previously quoted zfs(4) man page. After doing so and replacing the failed disk, the pool returned to normal.

You might say that I only had to replace the failed disk and ZFS would have sorted everything out itself. Maybe, but the quoted man page also says:

Typically pools either

  1. fail completely (but perhaps temporarily, e.g. due to a top-level vdev going offline), or
  2. have localized, permanent errors (e.g. disk returns the wrong data due to bit flip or firmware bug).

In the former case, this setting does not matter because the pool will be suspended and the sync thread will not be able to make forward progress regardless. In the latter, because the error is permanent, the best we can do is leak the minimum amount of space, which is what setting this flag will do. It is therefore reasonable for this flag to normally be set, but we chose the more conservative approach of not setting it, so that there is no possibility of leaking space in the "partial temporary" failure case.