openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

CHKSUM-errors on a just attached disk #8688

Open rstrlcpy opened 5 years ago

rstrlcpy commented 5 years ago

System information

Type Version/Name
Distribution Name Ubuntu Server
Distribution Version 18.04.2 LTS
Linux Kernel 4.15.0-48
Architecture x86_64
ZFS Version 0.7.5-1ubuntu16.4
SPL Version 0.7.5-1ubuntu1

Describe the problem you're observing

The issue was initially found on a Nexenta system, but it can be reproduced on Linux as well. We first thought it was a regression introduced by the integration of sequential resilvering, but it can also be reproduced with sequential resilvering disabled, just not as quickly. So it looks like this is an old, general issue.

There is code that handles an unfinished resilver, but it seems it does not fully cover this case, so a scrub can issue READ zios to a just-attached disk, which causes the CHKSUM errors.

Describe how to reproduce the problem

Create 2 RAM disks (the size of each disk is 200MB)

modprobe brd rd_nr=2 rd_size=204800

Create a mirrored pool on the disks

zpool create -o cachefile=none rtest mirror /dev/ram0 /dev/ram1

Run the following 3 scripts in parallel.

while true; do
        dd if=/dev/urandom of=/rtest/f bs=1M count=90
done

while true; do
        zpool scrub rtest
done

while true; do
        zpool detach rtest ram1
        zpool attach rtest ram0 /dev/ram1
        sleep 10
done

And monitor the status of the pool

zpool status rtest 1
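
A minimal sketch of a combined runner for the three loops above, assuming the pool and RAM-disk names used in this reproducer (rtest, /rtest/f, ram0, ram1):

```
#!/bin/sh
# Writer: continuously overwrite a 90M file on the pool.
while true; do dd if=/dev/urandom of=/rtest/f bs=1M count=90 2>/dev/null; done &

# Scrubber: repeatedly kick off scrubs; failures from overlapping scrubs are ignored.
while true; do zpool scrub rtest 2>/dev/null; done &

# Detach/attach: cycle the second mirror leg every 10 seconds.
while true; do
        zpool detach rtest ram1
        zpool attach rtest ram0 /dev/ram1
        sleep 10
done &

# Keep the script in the foreground until interrupted.
wait
```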

Include any warning/errors/backtraces from the system logs

  pool: rtest
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Mon Apr 29 20:39:50 2019
        87.5M scanned out of 88.0M at 87.5M/s, 0h0m to go
        0B repaired, 99.43% done
config:

        NAME        STATE     READ WRITE CKSUM
        rtest       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ram0    ONLINE       0     0     0
            ram1    ONLINE       0     0    10

errors: No known data errors
richardelling commented 5 years ago

Hi Roman, I might have another reproducer for a similar issue. Can you add the ereport?

rstrlcpy commented 5 years ago

ereport? What is it?

richardelling commented 5 years ago

An FMA ereport. On illumos, fmdump -eV; on ZoL, zpool events -v
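
For reference, a minimal sketch of how the checksum ereports could be pulled out on ZoL while the reproducer is running (assuming the standard ereport.fs.zfs.checksum event class):

```
# Dump all events with full payloads and keep the checksum ereports plus their fields
zpool events -v | grep -A 20 'ereport.fs.zfs.checksum'

# Or follow events live in a second terminal while the test loops run
zpool events -f
```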

rstrlcpy commented 5 years ago
# zpool status rtest 1
  pool: rtest
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 0h0m with 0 errors on Tue Apr 30 19:29:46 2019
config:

        NAME        STATE     READ WRITE CKSUM
        rtest       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ram0    ONLINE       0     0     0
            ram1    ONLINE       0     0    47

errors: No known data errors

There are no errors in the report:


``` TIME CLASS Apr 30 2019 19:29:42.878638170 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a226 0x345ef45a eid = 0x205 Apr 30 2019 19:29:42.878638170 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2473" history_internal_name = "scan setup" history_txg = 0x9a9 history_time = 0x5cc8a226 time = 0x5cc8a226 0x345ef45a eid = 0x206 Apr 30 2019 19:29:43.038657548 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9aa history_time = 0x5cc8a227 time = 0x5cc8a227 0x24dde0c eid = 0x207 Apr 30 2019 19:29:43.038657548 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a227 0x24dde0c eid = 0x208 Apr 30 2019 19:29:43.090663852 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a227 0x5676bac eid = 0x209 Apr 30 2019 19:29:43.090663852 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2476" history_internal_name = "scan setup" history_txg = 0x9ac history_time = 0x5cc8a227 time = 0x5cc8a227 0x5676bac eid = 0x20a Apr 30 2019 19:29:43.270685676 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9ad history_time = 0x5cc8a227 time = 0x5cc8a227 0x102255ec eid = 0x20b Apr 30 2019 19:29:43.270685676 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a227 0x102255ec eid = 0x20c Apr 30 2019 19:29:43.322691981 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a227 0x133be38d eid = 0x20d Apr 30 2019 19:29:43.322691981 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2479" history_internal_name = "scan setup" history_txg = 0x9af history_time = 0x5cc8a227 time = 0x5cc8a227 0x133be38d eid = 0x20e Apr 30 2019 19:29:43.490712349 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9b0 history_time = 0x5cc8a227 time = 0x5cc8a227 0x1d3fad1d eid = 0x20f 
Apr 30 2019 19:29:43.490712349 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a227 0x1d3fad1d eid = 0x210 Apr 30 2019 19:29:43.554720109 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a227 0x21105b6d eid = 0x211 Apr 30 2019 19:29:43.562721079 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2486" history_internal_name = "scan setup" history_txg = 0x9b6 history_time = 0x5cc8a227 time = 0x5cc8a227 0x218a7137 eid = 0x212 Apr 30 2019 19:29:43.742742902 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9b7 history_time = 0x5cc8a227 time = 0x5cc8a227 0x2c455b76 eid = 0x213 Apr 30 2019 19:29:43.742742902 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a227 0x2c455b76 eid = 0x214 Apr 30 2019 19:29:43.766745812 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a227 0x2db39cd4 eid = 0x215 Apr 30 2019 19:29:43.766745812 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2489" history_internal_name = "scan setup" history_txg = 0x9b9 history_time = 0x5cc8a227 time = 0x5cc8a227 0x2db39cd4 eid = 0x216 Apr 30 2019 19:29:43.958769091 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9ba history_time = 0x5cc8a227 time = 0x5cc8a227 0x3925a7c3 eid = 0x217 Apr 30 2019 19:29:43.958769091 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a227 0x3925a7c3 eid = 0x218 Apr 30 2019 19:29:44.002774424 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a228 0x2a5598 eid = 0x219 Apr 30 2019 19:29:44.002774424 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2492" history_internal_name = "scan setup" history_txg = 0x9bc history_time = 0x5cc8a228 time = 0x5cc8a228 0x2a5598 eid = 0x21a Apr 30 2019 19:29:44.222801136 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba 
pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9bd history_time = 0x5cc8a228 time = 0x5cc8a228 0xd47acf0 eid = 0x21b Apr 30 2019 19:29:44.222801136 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a228 0xd47acf0 eid = 0x21c Apr 30 2019 19:29:44.250804535 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a228 0xef2f937 eid = 0x21d Apr 30 2019 19:29:44.250804535 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2495" history_internal_name = "scan setup" history_txg = 0x9bf history_time = 0x5cc8a228 time = 0x5cc8a228 0xef2f937 eid = 0x21e Apr 30 2019 19:29:44.438827360 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9c0 history_time = 0x5cc8a228 time = 0x5cc8a228 0x1a27f960 eid = 0x21f Apr 30 2019 19:29:44.438827360 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a228 0x1a27f960 eid = 0x220 Apr 30 2019 19:29:44.474831731 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a228 0x1c4d5b73 eid = 0x221 Apr 30 2019 19:29:44.474831731 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2498" history_internal_name = "scan setup" history_txg = 0x9c2 history_time = 0x5cc8a228 time = 0x5cc8a228 0x1c4d5b73 eid = 0x222 Apr 30 2019 19:29:44.662854556 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9c3 history_time = 0x5cc8a228 time = 0x5cc8a228 0x27825b9c eid = 0x223 Apr 30 2019 19:29:44.662854556 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a228 0x27825b9c eid = 0x224 Apr 30 2019 19:29:44.690857955 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a228 0x292da7e3 eid = 0x225 Apr 30 2019 19:29:44.690857955 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2501" history_internal_name = "scan setup" history_txg = 0x9c5 
history_time = 0x5cc8a228 time = 0x5cc8a228 0x292da7e3 eid = 0x226 Apr 30 2019 19:29:44.874880295 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9c6 history_time = 0x5cc8a228 time = 0x5cc8a228 0x34259d27 eid = 0x227 Apr 30 2019 19:29:44.874880295 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a228 0x34259d27 eid = 0x228 Apr 30 2019 19:29:44.930887093 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a228 0x377c35b5 eid = 0x229 Apr 30 2019 19:29:44.930887093 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2506" history_internal_name = "scan setup" history_txg = 0x9ca history_time = 0x5cc8a228 time = 0x5cc8a228 0x377c35b5 eid = 0x22a Apr 30 2019 19:29:45.122910444 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9cb history_time = 0x5cc8a229 time = 0x5cc8a229 0x75376ec eid = 0x22b Apr 30 2019 19:29:45.122910444 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a229 0x75376ec eid = 0x22c Apr 30 2019 19:29:45.170916288 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a229 0xa2ff9c0 eid = 0x22d Apr 30 2019 19:29:45.170916288 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2509" history_internal_name = "scan setup" history_txg = 0x9cd history_time = 0x5cc8a229 time = 0x5cc8a229 0xa2ff9c0 eid = 0x22e Apr 30 2019 19:29:45.390943072 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9ce history_time = 0x5cc8a229 time = 0x5cc8a229 0x174d5160 eid = 0x22f Apr 30 2019 19:29:45.390943072 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a229 0x174d5160 eid = 0x230 Apr 30 2019 19:29:45.470952811 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a229 0x1c122b6b eid = 0x231 Apr 30 2019 19:29:45.470952811 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" 
pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2516" history_internal_name = "scan setup" history_txg = 0x9d4 history_time = 0x5cc8a229 time = 0x5cc8a229 0x1c122b6b eid = 0x232 Apr 30 2019 19:29:45.670977160 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9d5 history_time = 0x5cc8a229 time = 0x5cc8a229 0x27fe4c88 eid = 0x233 Apr 30 2019 19:29:45.670977160 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a229 0x27fe4c88 eid = 0x234 Apr 30 2019 19:29:45.694980082 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a229 0x296c8df2 eid = 0x235 Apr 30 2019 19:29:45.694980082 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2519" history_internal_name = "scan setup" history_txg = 0x9d7 history_time = 0x5cc8a229 time = 0x5cc8a229 0x296c8df2 eid = 0x236 Apr 30 2019 19:29:45.883002970 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9d8 history_time = 0x5cc8a229 time = 0x5cc8a229 0x34a18e5a eid = 0x237 Apr 30 2019 19:29:45.883002970 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a229 0x34a18e5a eid = 0x238 Apr 30 2019 19:29:45.939009788 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a229 0x37f826fc eid = 0x239 Apr 30 2019 19:29:45.939009788 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2525" history_internal_name = "scan setup" history_txg = 0x9dd history_time = 0x5cc8a229 time = 0x5cc8a229 0x37f826fc eid = 0x23a Apr 30 2019 19:29:46.123032209 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9de history_time = 0x5cc8a22a time = 0x5cc8a22a 0x7555291 eid = 0x23b Apr 30 2019 19:29:46.123032209 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a22a 0x7555291 eid = 0x23c Apr 30 2019 19:29:46.151035622 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" 
pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a22a 0x9009ee6 eid = 0x23d Apr 30 2019 19:29:46.151035622 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2528" history_internal_name = "scan setup" history_txg = 0x9e0 history_time = 0x5cc8a22a time = 0x5cc8a22a 0x9009ee6 eid = 0x23e Apr 30 2019 19:29:46.335058054 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9e1 history_time = 0x5cc8a22a time = 0x5cc8a22a 0x13f89486 eid = 0x23f Apr 30 2019 19:29:46.335058054 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a22a 0x13f89486 eid = 0x240 Apr 30 2019 19:29:46.371062443 sysevent.fs.zfs.scrub_start version = 0x0 class = "sysevent.fs.zfs.scrub_start" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a22a 0x161df6ab eid = 0x241 Apr 30 2019 19:29:46.371062443 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "func=1 mintxg=0 maxtxg=2531" history_internal_name = "scan setup" history_txg = 0x9e3 history_time = 0x5cc8a22a time = 0x5cc8a22a 0x161df6ab eid = 0x242 Apr 30 2019 19:29:46.563085850 sysevent.fs.zfs.history_event version = 0x0 class = "sysevent.fs.zfs.history_event" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 history_hostname = "zol-devel" history_internal_str = "errors=0" history_internal_name = "scan done" history_txg = 0x9e4 history_time = 0x5cc8a22a time = 0x5cc8a22a 0x2190021a eid = 0x243 Apr 30 2019 19:29:46.563085850 sysevent.fs.zfs.scrub_finish version = 0x0 class = "sysevent.fs.zfs.scrub_finish" pool = "rtest" pool_guid = 0x4888d3300934ecba pool_state = 0x0 pool_context = 0x0 time = 0x5cc8a22a 0x2190021a eid = 0x244 ```

richardelling commented 5 years ago

!?! Every checksum error should also generate an ereport. Does your build include 2bbec1c91?
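
One way to check, assuming a git checkout of the source tree the build was produced from (the commit hash is the one referenced above):

```
# Exits 0 and prints "included" if 2bbec1c91 is an ancestor of the built revision
git merge-base --is-ancestor 2bbec1c91 HEAD && echo included || echo missing
```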

rstrlcpy commented 5 years ago

It seems the change is part of the 0.8.x release, but the test system runs 0.7.5. I will try to build a 0.8.x release and repeat the test.

rstrlcpy commented 5 years ago

I've built 0.8.0-rc4. The issue is the same, and there are no related events.

tonyhutter commented 5 years ago

I can reproduce @Ramzec's results with 0.8.0-rc4 on an Ubuntu 18 VM. I also didn't see any checksum events:

hutter@ubuntu18:~/zfs$ ./cmd/zpool/zpool status
  pool: rtest
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Feb 21 20:18:09 2019
    87.3M scanned at 87.3M/s, 22.2M issued at 22.2M/s, 88.5M total
    0B repaired, 25.02% done, 0 days 00:00:02 to go
config:

    NAME        STATE     READ WRITE CKSUM
    rtest       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        ram0    ONLINE       0     0   162
        ram1    ONLINE       0     0    54

errors: No known data errors
hutter@ubuntu18:~/zfs$ ./cmd/zpool/zpool events
TIME                           CLASS
cannot get event: permission denied
hutter@ubuntu18:~/zfs$ sudo ./cmd/zpool/zpool events
[sudo] password for hutter: 
TIME                           CLASS
dropped 285 events
Feb 21 2019 20:18:22.900000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:22.900000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:22.924000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:22.924000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.036000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.036000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:23.056000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:23.056000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.116000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.116000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:23.140000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:23.140000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.244000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.244000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:23.288000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:23.292000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.428000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.428000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:23.468000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:23.468000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.572000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.572000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:23.600000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:23.600000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.720000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.720000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:23.764000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:23.764000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.908000000 sysevent.fs.zfs.history_event
...
rstrlcpy commented 5 years ago

Here is what is known so far: a scrub might issue a ZIO to a just-attached disk, which causes the described CHKSUM errors. To address this I implemented the following fix, which prevents ZFS from issuing a scrub ZIO to a vdev that still has missing data:

diff --git a/usr/src/uts/common/fs/zfs/vdev_mirror.c b/usr/src/uts/common/fs/zfs/vdev_mirror.c
index 4a869c9ea1..4843d2f7c8 100644
--- a/usr/src/uts/common/fs/zfs/vdev_mirror.c
+++ b/usr/src/uts/common/fs/zfs/vdev_mirror.c
@@ -445,6 +445,14 @@ vdev_mirror_io_start(zio_t *zio)
                                        continue;
                                }

+                               /*
+                                * It is possible the vdev was attached recently
+                                * and resilvering has not finished for it yet,
+                                * so skip it to avoid CHKSUM errors.
+                                */
+                               if (vdev_dtl_contains(mc->mc_vd, DTL_MISSING, zio->io_txg, 1))
+                                       continue;
+
                                if (spec_case && c != target)
                                        continue;

I also found two code blocks where ZFS checks whether any vdev requires resilvering. The first one switches the code path from SCRUB to RESILVER if a vdev requires resilvering; the other one does not do that switch. The following change fixes that:

diff --git a/usr/src/uts/common/fs/zfs/dsl_scan.c b/usr/src/uts/common/fs/zfs/dsl_scan.c
index f39afbb4cb..6950822846 100644
--- a/usr/src/uts/common/fs/zfs/dsl_scan.c
+++ b/usr/src/uts/common/fs/zfs/dsl_scan.c
@@ -470,6 +474,7 @@ dsl_scan_setup_sync(void *arg, dmu_tx_t *tx)
                    &scn->scn_phys.scn_min_txg, &scn->scn_phys.scn_max_txg)) {
                        spa_event_notify(spa, NULL, NULL,
                            ESC_ZFS_RESILVER_START);
+                       scn->scn_phys.scn_func = POOL_SCAN_RESILVER;
                } else {
                        spa_event_notify(spa, NULL, NULL, ESC_ZFS_SCRUB_START);
                }
@@ -506,7 +511,8 @@ dsl_scan_setup_sync(void *arg, dmu_tx_t *tx)

        spa_history_log_internal(spa, "scan setup", tx,
            "func=%u mintxg=%llu maxtxg=%llu",
-           *funcp, scn->scn_phys.scn_min_txg, scn->scn_phys.scn_max_txg);
+           scn->scn_phys.scn_func, scn->scn_phys.scn_min_txg,
+           scn->scn_phys.scn_max_txg);
 }

 /* ARGSUSED */

Unfortunately, these changes didn't help. I still saw CHKSUM errors, but on another vdev. Later I found that all of the CHKSUM errors relate to blocks that were created in the TXG that is currently syncing.

Finally I implemented the following solution:

diff --git a/usr/src/uts/common/fs/zfs/dsl_scan.c b/usr/src/uts/common/fs/zfs/dsl_scan.c
index f39afbb4cb..6950822846 100644
--- a/usr/src/uts/common/fs/zfs/dsl_scan.c
+++ b/usr/src/uts/common/fs/zfs/dsl_scan.c
@@ -428,6 +428,10 @@ dsl_scan_setup_check(void *arg, dmu_tx_t *tx)
        if (dsl_scan_is_running(scn))
                return (SET_ERROR(EBUSY));

+       if (dmu_tx_is_syncing(tx) && scn->scn_restart_txg != 0 &&
+           scn->scn_restart_txg > tx->tx_txg)
+               return (SET_ERROR(EAGAIN));
+
        return (0);
 }

With this change, a scrub started by a user will be rescheduled if a vdev still requires resilvering.

With the last change, my test VM ran for more than 2 days without any CHKSUM errors.
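
For anyone soak-testing a similar change, a minimal monitoring sketch, assuming the test pool is named rtest:

```
# Print the full status only when the pool reports errors or an unhealthy state
while true; do
        zpool status -x rtest | grep -q "is healthy" || zpool status rtest
        sleep 60
done
```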

Any ideas, thoughts?

shodanshok commented 5 years ago

@tonyhutter pointed out that I may have had a similar problem, first described on the mailing list. I am using CentOS 7.6 with kernel 3.10.0-957.12.1.el7.x86_64 and ZFS/SPL version 0.7.13-1.

Yesterday I replaced a failed disk, with path pci-0000:02:00.1-ata-2.0. As this chassis is not hot-plug capable, I powered off the machine and replaced the disk. On server restart, with autoreplace=on, zed immediately replaced the disk and a resilver started.

A zpool status issued just after booting showed 1.15K CKSUM errors on the replaced disk - pci-0000:02:00.1-ata-2.0 - which was marked as DEGRADED. This surprised me: the disk was brand new, with absolutely no ZFS labels, so how/why did ZFS try to access it?

A subsequent scrub found no errors, but the CKSUM count remained at 1.15K until I issued a "zpool clear tank". It seems as if ZFS tried to read from the replaced disk (during pool import?), found it "corrupted" (i.e. it contains no valid data) and marked it as degraded.

I failed to replicate the issue in a VM, though; maybe it is timing-sensitive (i.e. there is a race condition)? Thanks.

mscottie commented 4 years ago

I'm seeing very similar behavior since upgrading to 0.8.2-1. One of my zpools, built from 10TB drives that are less than a year old, reported checksum errors on every drive right after the upgrade. (No errors ever before 0.8.2-1.) I was also experiencing the restarting-resilver issue. Doing some digging, I saw a tip about stopping zed. Once I stopped zed, all my problems went away.

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Oct 25 21:28:17 2019
        72.8G scanned at 9.10G/s, 64.9G issued at 8.11G/s, 29.1T total
        322M resilvered, 0.22% done, 0 days 01:01:13 to go
config:

NAME                                   STATE     READ WRITE CKSUM
tank                                   DEGRADED     0     0     0
  raidz2-0                             DEGRADED     0     0     0
    ata-ST10000NM0086-2AA101_ZA28MD7E  DEGRADED     0     0   415  too many errors
    ata-ST10000NM0086-2AA101_ZA28MCNG  DEGRADED     0     0    64  too many errors
    ata-ST10000NM0086-2AA101_ZA25NSFA  DEGRADED     0     0   271  too many errors
    ata-ST10000NM0086-2AA101_ZA28MD0H  DEGRADED     0     0   704  too many errors  (resilvering)
    ata-ST10000NM0086-2AA101_ZA26F2MG  DEGRADED     0     0    34  too many errors  (resilvering)
    ata-ST10000NM0086-2AA101_ZA26E1X0  DEGRADED     0     0    52  too many errors  (resilvering)
    ata-ST10000NM0086-2AA101_ZA26E1SS  DEGRADED     0     0    57  too many errors
    ata-ST10000NM0086-2AA101_ZA28M62Z  ONLINE       0     0     4
    ata-ST10000NM0086-2AA101_ZA26DN39  DEGRADED     0     0    59  too many errors  (resilvering)
    ata-ST10000NM0086-2AA101_ZA26E2M3  DEGRADED     0     0    44  too many errors

CentOS 7.7, kernel 3.10.0-1062.4.1.el7.x86_64.
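
For anyone who wants to try the same workaround, a hedged sketch for a systemd-based distribution such as CentOS 7, assuming the event daemon is packaged as the zfs-zed service:

```
# Temporarily stop the ZFS event daemon, then restart it once the resilver has settled
systemctl stop zfs-zed
zpool status tank          # confirm the resilver is no longer being restarted
systemctl start zfs-zed
```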

stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

mscottie commented 3 years ago

anti-stale-bot-post

bang-uin commented 3 years ago

Hi everyone.

I'm on Proxmox VE 6.3 with ZFS 0.8.5 and kernel 5.4.73, and I think I have hit the same or a similar bug.

At first I thought the RAM or the (brand-new) disks were bad, so I tested each component separately without finding any issue. I then decided to test on a different server to rule out errors related to the storage components, but I still got errors. So I ran more tests on both servers with HDDs, SSDs and RAM disks, and all of them show the same problem.

~ # modprobe brd rd_nr=2 rd_size=10485760
~ # zpool create test2 mirror /dev/ram0 /dev/ram1

~ # zpool status
  pool: test2
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test2       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ram0    ONLINE       0     0     0
            ram1    ONLINE       0     0     0

errors: No known data errors

~ # zfs create -V 5G -o volblocksize=8k test2/testvol
~ # dd if=/dev/urandom of=/dev/zvol/test2/testvol bs=1M status=progress

~ # zpool status
  pool: test2
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test2       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ram0    ONLINE       0     0     0
            ram1    ONLINE       0     0     0

errors: No known data errors

~ # zpool scrub test2
~ # zpool detach test2 ram1
~ # zpool add test2 ram1
~ # zpool remove test2 ram0
~ # zpool attach test2 ram1 ram0

~ # zpool status
  pool: test2
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: resilvered 5.11G in 0 days 00:00:15 with 0 errors on Fri Nov 27 07:57:58 2020
remove: Removal of vdev 0 copied 5.05G in 0h0m, completed on Fri Nov 27 07:57:32 2020
    10.2K memory used for removed device mappings
config:

        NAME          STATE     READ WRITE CKSUM
        test2         DEGRADED     0     0     0
          mirror-1    DEGRADED     0     0     0
            ram1      ONLINE       0     0     0
            ram0      DEGRADED     0     0    18  too many errors

errors: No known data errors

Here are the zpool events from a third test where it happened on the second try or so (not sure anymore :) ). test2_ereport.txt

stuartthebruce commented 3 years ago

I may have seen the same issue under Rocky Linux 8.4 and ZFS 2.1.0:

[root@zfsscratch ~]# zpool replace scratch 36000402009e01d5d415ae51700000000 36000402002e022454a69ead900000000
cannot open '/dev/mapper/36000402002e022454a69ead900000000': Device or resource busy
         mirror-40                              ONLINE       0     0     0
           36000402002e022454a69908c00000000    ONLINE       0     0     0
           spare-1                              ONLINE       0     0     0
             36000402009e01d5d415ae51700000000  ONLINE       2     0     1
             36000402002e022454a69ead900000000  ONLINE       0     0    97
...
       spares
...
         36000402002e022454a69ead900000000      INUSE     currently in use

The resilver finished and I am now running a zpool clear && zpool scrub.

ioguix commented 2 years ago

Hi,

I experienced the same checksum error report after attaching a brand-new disk to an existing single-disk pool. The system runs Debian 11 with ZFS 2.0.6.

An automatic scrub was running when I zpool attach'ed the device. I did not stop it manually, as the zpool-attach(8) manual says:

new_device begins to resilver immediately and any running scrub is cancelled.

The brand new disk is ata-HGST_HUS728T8TALE6L4_VRH4JTEK. Here is the command I ran (from zpool history):

2021-11-14.02:26:53 zpool attach tank ata-TOSHIBA_HDWG480_71Q0A07SFR0H /dev/disk/by-id/ata-HGST_HUS728T8TALE6L4_VRH4JTEK

I checked the status a few minutes after attaching the device; it was already reporting checksum errors:

# zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Nov 14 02:26:42 2021
        1.25T scanned at 5.17G/s, 33.9G issued at 140M/s, 2.43T total
        33.9G resilvered, 1.36% done, 04:59:16 to go
config:

        NAME                                   STATE     READ WRITE CKSUM
        tank                                   ONLINE       0     0     0
          mirror-0                             ONLINE       0     0     0
            ata-TOSHIBA_HDWG480_71Q0A07SFR0H   ONLINE       0     0     0
            ata-HGST_HUS728T8TALE6L4_VRH4JTEK  ONLINE       0     0 2.72K  (resilvering)

errors: No known data errors

The resilvering is now finished; the checksum error count has not changed:

# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 2.44T in 04:17:38 with 0 errors on Sun Nov 14 06:44:20 2021
config:

        NAME                                   STATE     READ WRITE CKSUM
        tank                                   ONLINE       0     0     0
          mirror-0                             ONLINE       0     0     0
            ata-TOSHIBA_HDWG480_71Q0A07SFR0H   ONLINE       0     0     0
            ata-HGST_HUS728T8TALE6L4_VRH4JTEK  ONLINE       0     0 2.72K

errors: No known data errors

The events:

Nov 14 2021 00:24:01.940729556 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        time = 0x61904911 0x381264d4 
        eid = 0xf

Nov 14 2021 00:24:01.940729556 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "larco"
        history_internal_str = "func=1 mintxg=0 maxtxg=307454"
        history_internal_name = "scan setup"
        history_txg = 0x4b0fe
        history_time = 0x61904911
        time = 0x61904911 0x381264d4 
        eid = 0x10

Nov 14 2021 02:26:35.458148864 sysevent.fs.zfs.vdev_attach
        version = 0x0
        class = "sysevent.fs.zfs.vdev_attach"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0xa35d7634b2634316
        vdev_state = "ONLINE" (0x7)
        vdev_path = "/dev/disk/by-id/ata-HGST_HUS728T8TALE6L4_VRH4JTEK-part1"
        vdev_devid = "ata-HGST_HUS728T8TALE6L4_VRH4JTEK-part1"
        time = 0x619065cb 0x1b4ecc00 
        eid = 0x11

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "larco"
        history_internal_str = "errors=0"
        history_internal_name = "scan aborted, restarting"
        history_txg = 0x4b644
        history_time = 0x619065d2
        time = 0x619065d2 0x3328d57c
        eid = 0x12

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "larco"
        history_internal_str = "errors=0"
        history_internal_name = "starting deferred resilver"
        history_txg = 0x4b644
        history_time = 0x619065d2
        time = 0x619065d2 0x3328d57c
        eid = 0x13

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        resilver_type = "healing"
        time = 0x619065d2 0x3328d57c
        history_hostname = "larco"
        history_internal_str = "errors=0"
        history_internal_name = "starting deferred resilver"
        history_txg = 0x4b644
        history_time = 0x619065d2
        time = 0x619065d2 0x3328d57c
        eid = 0x13

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        resilver_type = "healing"
        time = 0x619065d2 0x3328d57c

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        resilver_type = "healing"
        time = 0x619065d2 0x3328d57c
        eid = 0x14

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        eid = 0x14

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "larco"
        history_internal_str = "func=2 mintxg=3 maxtxg=308804"
        history_internal_name = "scan setup"
        history_txg = 0x4b644
        history_time = 0x619065d2
        time = 0x619065d2 0x3328d57c
        eid = 0x15

Nov 14 2021 02:26:46.318391389 sysevent.fs.zfs.config_sync
        version = 0x0
        class = "sysevent.fs.zfs.config_sync"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        time = 0x619065d6 0x12fa445d
        eid = 0x16

Nov 14 2021 02:26:53.542552777 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "larco"
        history_internal_str = "attach vdev=/dev/disk/by-id/ata-HGST_HUS728T8TALE6L4_VRH4JTEK-part1 to vdev=/dev/disk/by-id/ata-TOSHIBA_HDWG480_71Q0A07SFR0H-part1"
        history_internal_name = "vdev attach"
        history_txg = 0x4b646
        history_time = 0x619065dd
        time = 0x619065dd 0x2056b2c9
        eid = 0x17

Nov 14 2021 06:44:20.973882171 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "larco"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x4c11f
        history_time = 0x6190a234
        time = 0x6190a234 0x3a0c433b
        eid = 0x18

Nov 14 2021 06:44:20.973882171 sysevent.fs.zfs.resilver_finish
        version = 0x0
        class = "sysevent.fs.zfs.resilver_finish"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        resilver_type = "healing"
        time = 0x6190a234 0x3a0c433b
        eid = 0x19
ioguix commented 2 years ago

For the record, I cleared and then scrubbed my pool. No errors reported:

# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 04:02:52 with 0 errors on Mon Nov 15 04:13:53 2021
config:

    NAME                                   STATE     READ WRITE CKSUM
    tank                                   ONLINE       0     0     0
      mirror-0                             ONLINE       0     0     0
        ata-TOSHIBA_HDWG480_71Q0A07SFR0H   ONLINE       0     0     0
        ata-HGST_HUS728T8TALE6L4_VRH4JTEK  ONLINE       0     0     0

errors: No known data errors
stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

c89williams commented 1 year ago

I too am experiencing this error. A pool containing one disk had another disk attached to it (forming a mirror), and both resilvering and scrubbing cause checksum errors on the new disk. I can repeat the scrub numerous times and it finds new checksum errors on every scan.

I've reseated the drive onto the backplane a few times with no change. SMART data shows absolutely nothing wrong with either disk. Previous disks connected to this backplane slot have never caused errors. Running Debian 11, ZFS 2.1.4.

  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 16K in 00:22:01 with 0 errors on Fri Jan  6 00:07:10 2023
config:

        NAME                              STATE     READ WRITE CKSUM
        tank                              DEGRADED     0     0     0
          mirror-0                        DEGRADED     0     0     0
            wwn-0x55cd2e404c47175f-part3  ONLINE       0     0     0
            wwn-0x55cd2e404c471471-part3  DEGRADED     0     0   166  too many errors

errors: No known data errors
shodanshok commented 1 year ago

@c89williams rather than re-seating the disk, try zpool clear tank; zpool scrub tank. If no errors are shown after the scrub, then your disk is OK.
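
Spelled out, the suggested check is simply (assuming the pool is named tank):

```
zpool clear tank        # reset the READ/WRITE/CKSUM counters
zpool scrub tank        # re-read and verify every allocated block
zpool status -v tank    # after the scrub finishes, look for new CKSUM errors
```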

c89williams commented 1 year ago

@shodanshok I have tried as you suggested, clearing the error counters and re-scrubbing, but checksum errors continue to occur.

shodanshok commented 1 year ago

@c89williams in this case, I don't think you have the same issue as described here. Have you tried rebooting your machine and re-scrubbing the pool?