openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Error counters aren't increased when an I/O error occurs. #4149

Closed ab-oe closed 5 years ago

ab-oe commented 8 years ago

Hello, I got some I/O errors on one device used in a mirror. I can see these errors in the zpool events output:

Dec 28 2015 15:59:02.888314999 ereport.fs.zfs.io
Dec 28 2015 15:59:12.633185000 ereport.fs.zfs.io
Dec 28 2015 15:59:17.395564998 ereport.fs.zfs.io
Dec 28 2015 15:59:45.869794999 ereport.fs.zfs.io
Dec 28 2015 16:00:15.054379997 ereport.fs.zfs.io
Dec 28 2015 16:00:46.670179999 ereport.fs.zfs.io

but the zpool status command shows no errors on the devices:

  pool: Pool-1
 state: ONLINE
  scan: none requested
config:

    NAME                                          STATE     READ WRITE CKSUM
    Pool-1                                        ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        scsi-SSCST_BIO12341234123412344341ecac    ONLINE       0     0     0
        scsi-0QEMU_QEMU_HARDDISK_drive-scsi0-0-3  ONLINE       0     0     0
errors: No known data errors

The situation changes after running zpool scrub:

  pool: Pool-1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 112K in 0h0m with 0 errors on Mon Dec 28 16:14:28 2015
config:

    NAME                                          STATE     READ WRITE CKSUM
    Pool-1                                        ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        scsi-SSCST_BIO12341234123412344341ecac    ONLINE       8     0   110
        scsi-0QEMU_QEMU_HARDDISK_drive-scsi0-0-3  ONLINE       0     0     0
errors: No known data errors

Is there any I/O error threshold that will put a device with errors into a faulted or unavailable state?

behlendorf commented 8 years ago

@ab-oe it depends on exactly what type of error occurred and what I/O was being issued. These are checksum errors, which mean the drive returned data successfully... it was just the wrong data. Read or write I/O errors, on the other hand, indicate an error from the driver or block layer and will take the device offline. The exact behavior is more complicated than you might initially expect and not particularly well documented. You can configure the ZFS Event Daemon (ZED) to detect these kinds of errors and proactively degrade the device.

ab-oe commented 8 years ago

@behlendorf thank you for your response. Yes, those were I/O errors on the block device. In this particular case they were simulated by error blocks in a dm device exported over iSCSI to confirm this behaviour.

[ 2934.764321] sd 1:0:0:20: [sdaa] Unhandled sense code
[ 2934.764326] sd 1:0:0:20: [sdaa]  
[ 2934.764327] Result: hostbyte=0x00 driverbyte=0x08
[ 2934.764329] sd 1:0:0:20: [sdaa]  
[ 2934.764330] Sense Key : 0x3 [current] 
[ 2934.764332] sd 1:0:0:20: [sdaa]  
[ 2934.764333] ASC=0x11 ASCQ=0x0

but I thought that these errors would increase the vs_write_errors counter.

behlendorf commented 8 years ago

@ab-oe it depends. If the I/O that caused the error was for a prefetch then the counters don't get bumped. It's also possible that the SCSI mid-layer retried the I/O and ZFS was never notified of the failure even though it was reported on the console.
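
For reference, the prefetch case corresponds to an early return in the error path of vdev_stat_update(), along these lines (a simplified sketch of the 0.6.x-era code, not a verbatim excerpt):

    /* Errors from speculative (prefetch) I/O are not charged to the vdev. */
    if (zio->io_flags & ZIO_FLAG_SPECULATIVE)
        return;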

ab-oe commented 8 years ago

@behlendorf it can't be retried successfully by the SCSI mid-layer because it is a permanent I/O error. The device mapper table for that device looks as follows:

0 65534 linear 230:2192 0
65534 1 error 
65535 65533 linear 230:2192 65535
131068 1 error 
131069 65533 linear 230:2192 131069
196602 1 error 
196603 65533 linear 230:2192 196603
262136 1 error 
262137 65533 linear 230:2192 262137
327670 1 error 
327671 65533 linear 230:2192 327671
393204 1 error 
...
20905347 65533 linear 230:2192 20905347
20970880 1 error 
20970881 639 linear 230:2192 20970881

or does ZFS retry the I/O at a different location on the SCSI device?

ZFS is notified about I/O errors because I see I/O error events:

Dec 28 2015 15:59:02.888314999 ereport.fs.zfs.io
Dec 28 2015 15:59:12.633185000 ereport.fs.zfs.io
Dec 28 2015 15:59:17.395564998 ereport.fs.zfs.io
Dec 28 2015 15:59:45.869794999 ereport.fs.zfs.io
Dec 28 2015 16:00:15.054379997 ereport.fs.zfs.io
Dec 28 2015 16:00:46.670179999 ereport.fs.zfs.io

so ZFS also sees the error, but the statistics counters are left untouched...

behlendorf commented 8 years ago

That does sound like a possible bug, but the logic here is complicated so it's hard to say for certain without carefully looking at the exact failure mode. If you'd like to have a look yourself, check out the zio_done()->vdev_stat_update() call path. The zio_done() function is called when an I/O is completed, and this is where the event is posted. It then calls vdev_stat_update(), which is responsible for updating the counters.
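
For orientation, the path looks roughly like this (a condensed sketch of the 0.6.x-era code, with most of the surrounding logic elided):

    /* zio.c -- on completion, update vdev stats and post an I/O ereport */
    static int
    zio_done(zio_t *zio)
    {
        /* ... */
        vdev_stat_update(zio, zio->io_size);

        if (zio->io_error && zio->io_error != ECKSUM &&
            zio->io_vd != NULL && !vdev_is_dead(zio->io_vd))
            zfs_ereport_post(FM_EREPORT_ZFS_IO, zio->io_spa,
                zio->io_vd, zio, 0, 0);
        /* ... */
    }

    /* vdev.c -- bump the per-vdev error counters shown by zpool status */
    void
    vdev_stat_update(zio_t *zio, uint64_t psize)
    {
        vdev_t *vd = zio->io_vd;
        vdev_stat_t *vs = &vd->vdev_stat;

        /* ... early returns, including the EIO/ZIO_FLAG_IO_RETRY check ... */

        mutex_enter(&vd->vdev_stat_lock);
        if (zio->io_type == ZIO_TYPE_READ && !vdev_is_dead(vd)) {
            if (zio->io_error == ECKSUM)
                vs->vs_checksum_errors++;
            else
                vs->vs_read_errors++;
        }
        if (zio->io_type == ZIO_TYPE_WRITE && !vdev_is_dead(vd))
            vs->vs_write_errors++;
        mutex_exit(&vd->vdev_stat_lock);
        /* ... DTL handling for failed writes follows ... */
    }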

ab-oe commented 8 years ago

@behlendorf thank you for the clues. The vdev_stat_update() function returns when ZIO_FLAG_IO_RETRY is not set. The comment suggests that it should return when ZIO_FLAG_IO_RETRY is set. Shouldn't it look like this:

--- vdev.c.orig
+++ vdev.c
@@ -2828,7 +2828,7 @@ vdev_stat_update(zio_t *zio, uint64_t ps
     * innocuous reasons (bus resets, MPxIO link failure, etc).
     */
    if (zio->io_error == EIO &&
-       !(zio->io_flags & ZIO_FLAG_IO_RETRY))
+       (zio->io_flags & ZIO_FLAG_IO_RETRY))
        return;

    /*

After the patch the counters are properly increased for write I/O errors (one error reported per faulty sector), but a resilvering process starts every ~10s.

behlendorf commented 8 years ago

@ab-oe I think this is correct and does what the comment describes. The first I/O error is suppressed from the stats pending a retry, to avoid alerting the user for an innocuous reason. The subsequent retry zio will have ZIO_FLAG_IO_RETRY set and be counted. That said, this logic comes unaltered from Illumos where, unlike Linux, the lower layers won't be retrying the I/O. We do try to set FASTFAIL on Linux, but the lower stack may not honor it.

ab-oe commented 8 years ago

@behlendorf I understand. I thought that ZIO_FLAG_IO_RETRY means that the I/O will be retried. Thank you for explaining this. Currently the problem is that this I/O is never retried for a device with bad blocks. This issue was originally found on a dying HDD: there were some I/O errors, but the pool appeared online and clean, which was not true.

I performed another test to check whether this is ever retried. There are two 128M device mapper devices (named ed0 and ed1) exported over iSCSI. One of them has one bad sector in the middle; the second one is clean.

The device mapper tables:

ed0: 0 131072 linear 8:0 0
ed0: 131072 1 error 
ed0: 131073 131071 linear 8:0 131073
ed1: 0 262144 linear 8:16 0

The zpool structure looks as follows:

zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

    NAME                                        STATE     READ WRITE CKSUM
    tank                                        ONLINE       0     0     0
      mirror-0                                  ONLINE       0     0     0
        scsi-SSCST_BIOc062cda8e06c7b7b12c65f21  ONLINE       0     0     0
        scsi-SSCST_BIOf92cc6518cf9cddc44ecb24b  ONLINE       0     0     0
errors: No known data errors

I fill the pool with random data using: dd if=/dev/urandom of=/tank/file bs=1M

In dmesg the I/O error is reported only once:

[ 1359.601665] sd 1:0:0:0: [sdg] Device not ready
[ 1359.601675] sd 1:0:0:0: [sdg]  
[ 1359.601681] Result: hostbyte=0x00 driverbyte=0x08
[ 1359.601687] sd 1:0:0:0: [sdg]  
[ 1359.601692] Sense Key : 0x2 [current] 
[ 1359.601699] sd 1:0:0:0: [sdg]  
[ 1359.601704] ASC=0x4 ASCQ=0xa
[ 1359.601710] sd 1:0:0:0: [sdg] CDB: 
[ 1359.601715] cdb[0]=0x2a: 2a 00 00 01 ff b0 00 01 00 00
[ 1359.601724] end_request: I/O error, dev sdg, sector 130992

it is also visible in zpool events:

Jan  5 2016 09:24:27.610000000 ereport.fs.zfs.io
        class = "ereport.fs.zfs.io"
        ena = 0x13c8f6d2b3100001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0xf17ca451a8f80f67
                vdev = 0x85b7d435b3144c0b
        (end detector)
        pool = "tank"
        pool_guid = 0xf17ca451a8f80f67
        pool_context = 0x0
        pool_failmode = "wait"
        vdev_guid = 0x85b7d435b3144c0b
        vdev_type = "disk"
        vdev_path = "/dev/disk/by-id/scsi-SSCST_BIOf92cc6518cf9cddc44ecb24b-part1"
        vdev_ashift = 0xc
        vdev_complete_ts = 0x13c8f6d1cfc
        vdev_delta_ts = 0x479eec
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x0
        parent_guid = 0x1f29cacc1ce2901d
        parent_type = "mirror"
        vdev_spare_paths = 
        vdev_spare_guids = 
        zio_err = 0x5
        zio_flags = 0xc0880
        zio_stage = 0x200000
        zio_pipeline = 0x210000
        zio_delay = 0x1
        zio_timestamp = 0x13c8c29cac0
        zio_delta = 0x26e7afa
        zio_offset = 0x3ef6000
        zio_size = 0x20000
        zio_objset = 0x15
        zio_object = 0x8
        zio_level = 0x0
        zio_blkid = 0x149
        time = 0x568b7dbb 0x245bdc80 
        eid = 0x9

then the second dd...

Again the I/O error is reported only once, so there is no retry:

[ 1382.171967] sd 1:0:0:0: [sdg] Device not ready
[ 1382.171970] sd 1:0:0:0: [sdg]  
[ 1382.171971] Result: hostbyte=0x00 driverbyte=0x08
[ 1382.171972] sd 1:0:0:0: [sdg]  
[ 1382.171973] Sense Key : 0x2 [current] 
[ 1382.171974] sd 1:0:0:0: [sdg]  
[ 1382.171975] ASC=0x4 ASCQ=0xa
[ 1382.171976] sd 1:0:0:0: [sdg] CDB: 
[ 1382.171977] cdb[0]=0x2a: 2a 00 00 01 ff b0 00 01 00 00
[ 1382.171980] end_request: I/O error, dev sdg, sector 130992

In zpool events I got another I/O error event:

Jan  5 2016 09:24:50.170000000 ereport.fs.zfs.io
        class = "ereport.fs.zfs.io"
        ena = 0x141d04ac29200001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0xf17ca451a8f80f67
                vdev = 0x85b7d435b3144c0b
        (end detector)
        pool = "tank"
        pool_guid = 0xf17ca451a8f80f67
        pool_context = 0x0
        pool_failmode = "wait"
        vdev_guid = 0x85b7d435b3144c0b
        vdev_type = "disk"
        vdev_path = "/dev/disk/by-id/scsi-SSCST_BIOf92cc6518cf9cddc44ecb24b-part1"
        vdev_ashift = 0xc
        vdev_complete_ts = 0x141d04ab6d8
        vdev_delta_ts = 0x1e3e46
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x0
        parent_guid = 0x1f29cacc1ce2901d
        parent_type = "mirror"
        vdev_spare_paths = 
        vdev_spare_guids = 
        zio_err = 0x5
        zio_flags = 0xc0880
        zio_stage = 0x200000
        zio_pipeline = 0x210000
        zio_delay = 0x1
        zio_timestamp = 0x141cdcdaa6b
        zio_delta = 0x21118f5
        zio_offset = 0x3ef6000
        zio_size = 0x20000
        zio_objset = 0x15
        zio_object = 0x8
        zio_level = 0x0
        zio_blkid = 0x1c6
        time = 0x568b7dd2 0xa21fe80 
        eid = 0xa

but still the error counter is not increased. For the next test I changed the device mapper table to:

ed0: 0 131072 linear 8:0 0
ed0: 131072 131000 error 
ed0: 262072 72 linear 8:0 262072

Now almost half of the device consists of bad sectors.

The test with dd reports a lot of errors in dmesg and zpool events, but the device remained ONLINE for a long time before it became FAULTED, and the error counters stayed clear. Currently I have no idea how to generate a write I/O error that will increase these counters. In my opinion it is better to inform the user that there were errors while doing I/O on his device than to fully suppress these errors. A second approach is to move this part of the code

    if (zio->io_error == EIO &&
       !(zio->io_flags & ZIO_FLAG_IO_RETRY))
        return;

to just after the counter increments, to avoid endless resilvering in the case of a permanent bad sector (see the sketch below).
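
Illustratively, the proposal amounts to reordering vdev_stat_update() along these lines (a hypothetical sketch of the suggestion, not a tested patch):

    /* 1) Always account the failed I/O in the per-vdev counters. */
    mutex_enter(&vd->vdev_stat_lock);
    if (zio->io_type == ZIO_TYPE_READ && !vdev_is_dead(vd)) {
        if (zio->io_error == ECKSUM)
            vs->vs_checksum_errors++;
        else
            vs->vs_read_errors++;
    }
    if (zio->io_type == ZIO_TYPE_WRITE && !vdev_is_dead(vd))
        vs->vs_write_errors++;
    mutex_exit(&vd->vdev_stat_lock);

    /*
     * 2) Only then bail out for a not-yet-retried EIO, so the DTL
     * handling below is skipped and a permanently bad sector does
     * not trigger endless resilvering.
     */
    if (zio->io_error == EIO &&
        !(zio->io_flags & ZIO_FLAG_IO_RETRY))
        return;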

By the way, the comment in the code says that an MPxIO link failure will report an I/O error. I can't agree with that. Only an all-paths-down failure will generate an error; a single path failure will be handled by the multipath driver and won't be passed up the stack.

dswartz commented 8 years ago

I have to say, if it's really working like this, there seems to be no point in having the error counters in zpool status...


JdK28 commented 8 years ago

I've reported a similar issue once, see https://github.com/zfsonlinux/zfs/issues/1814. The following issue is also about the same subject and contains some useful information: https://github.com/zfsonlinux/zfs/issues/2246

As far as I understand, the counters are ONLY increased when errors occur during scrubbing. I think this is the intended behavior; however, I do not have a FreeBSD or illumos system to check the behavior there. Also, bad data is ONLY repaired during a scrub. In 'normal' operation ZFS will leave the bad data on the disks and use the redundancy and checksums to return the correct data. Please correct me if I am wrong. The two issues referenced above have more information.

In my opinion this is not self-healing but rather 'healing when scrubbed'. I would also like the counters to always be updated, as this gives an indication of drives that are about to die. I'd rather not wait for a scrub to tell me my disk is going bad.

behlendorf commented 8 years ago

@ab-oe I'd take a look at zio_vdev_io_assess() next. It's responsible for determining if ZFS should retry a given I/O.

You can use zinject to artificially create certain errors. Take a look at the zinject(8) man page.

that MPxIO link failure will report I/O error

This may have been true under Illumos, which is where the comment originally came from. None of this logic was modified when it was layered on top of Linux's block layer, so things like this should probably be carefully revisited.

kabbey commented 8 years ago

Hi, I discovered that a server I have is behaving like this: zed sees the errors, zpool status shows none, and once a scrub is started the errors are reported by zpool status. Does this affect only a specific version of ZFS, or all of them? I see this with 0.6.5.6, where the errors were not reported until a scrub ran. I found it when I noticed the zed daemon move to the top of the list in top while the I/O processes fell to the bottom; I/O had essentially ground to a "halt". There is a major hardware issue somewhere, as I've seen other errors occasionally but have not found the cause. Hope to find it soon.

Maybe ZFS could report that the server hardware should be checked if the errors did not originate within the ZFS code (a bug) or the disks themselves. Clearly ZFS, via zed, "observes" the effects of the server system feeding errors to ZFS. Is this the correct description?

After beginning the scrub, zed sends continuous messages to the syslog.

JdK28 commented 7 years ago

I checked the zio_vdev_io_assess() function, which contains the following:

    /*
     * If the I/O failed, determine whether we should attempt to retry it.
     *
     * On retry, we cut in line in the issue queue, since we don't want
     * compression/checksumming/etc. work to prevent our (cheap) IO reissue.
     */
    if (zio->io_error && vd == NULL &&
        !(zio->io_flags & (ZIO_FLAG_DONT_RETRY | ZIO_FLAG_IO_RETRY))) {
        ASSERT(!(zio->io_flags & ZIO_FLAG_DONT_QUEUE)); /* not a leaf */
        ASSERT(!(zio->io_flags & ZIO_FLAG_IO_BYPASS));  /* not a leaf */
        zio->io_error = 0;
        zio->io_flags |= ZIO_FLAG_IO_RETRY |
            ZIO_FLAG_DONT_CACHE | ZIO_FLAG_DONT_AGGREGATE;
        zio->io_stage = ZIO_STAGE_VDEV_IO_START >> 1;
        zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE,
            zio_requeue_io_start_cut_in_line);
        return (ZIO_PIPELINE_STOP);
    }

This all looks pretty logical to me. The zio gets the IO_RETRY flag, which causes the error to be suppressed in the vdev_stat_update function. Then the zio is dispatched again, but I guess somehow this doesn't happen, if I read the reply from @ab-oe correctly. I do wonder why vd == NULL. Is this correct for this case? I've compared it to Illumos and there it's the same, so I guess it's correct. Could the Linux block layer be so smart that it just ignores the same request, since it probably already did a retry itself the first time?

JdK28 commented 7 years ago

As mentioned earlier by behlendorf, the Linux block layer does do retries, as opposed to illumos. Would it therefore not be a better idea to disable the retry mechanism in ZFS completely and let Linux handle the retries? Would this affect any possible configuration negatively?

richardelling commented 7 years ago

For the archive: the Solaris/illumos sd and scsi_vhci layers do retries, and have done so for 30+ years. Indeed, for any multipathing layer, retries are their primary function.

behlendorf commented 5 years ago

Duplicate of #4851