openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Blocked I/O with failmode=continue #7990

Open stuartthebruce opened 6 years ago

stuartthebruce commented 6 years ago

System information

Type Version/Name
Distribution Name Scientific Linux
Distribution Version 7.5
Linux Kernel 3.10.0-862.14.4.el7
Architecture x86_64
ZFS Version zfs-0.7.11-1.el7_5
SPL Version 0.7.11-1.el7_5

Describe the problem you're observing

On a zpool with failmode=continue, I/O continues to block, resulting in un-killable application processes.

Describe how to reproduce the problem

zpool create data1 single_HDD
zpool set failmode=continue data1

Start applications performing I/O on the zpool and wait for the HDD to fail. Attempt to kill the application processes and note they end up in the Z state.
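To reproduce this more quickly, without waiting for a real drive to die, roughly the following should trigger the same suspension (a sketch only; device names such as /dev/sdb are examples, not from this report):

    zpool create data1 /dev/sdb
    zpool set failmode=continue data1

    # start a writer on the pool
    dd if=/dev/zero of=/data1/testfile bs=1M &

    # simulate the drive failure by hot-removing the underlying SCSI device;
    # the pool should suspend shortly afterwards
    echo 1 > /sys/block/sdb/device/delete

    # the writer cannot be killed while its I/O is stuck
    kill -9 %1
    jobs -l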

Include any warning/errors/backtraces from the system logs

[root@node2126 ~]# zpool list data1
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data1  3.62T   564K  3.62T         -     0%     0%  1.00x  UNAVAIL  -

[root@node2126 ~]# zpool status data1
  pool: data1
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-JQ
  scan: none requested
config:

    NAME                      STATE     READ WRITE CKSUM
    data1                     UNAVAIL      0     0     0  insufficient replicas
      wwn-0x5000c5009cf653f1  FAULTED      6     0     0  too many errors
errors: List of errors unavailable: pool I/O is currently suspended

errors: 7 data errors, use '-v' for a list

After attempting to kill application pid 33345, it is blocked in the Zombie state and holding kernel resources I need to re-use (in particular, a socket).

[root@node2126 ~]# top
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 33345 hdfs      20   0       0      0      0 Z   0.0  0.0   0:06.59 java

[root@node2126 ~]# lsof | awk '$2 == "33345"'
...
java       33345  35895               hdfs  877u     unix 0xffffa06b792d7000       0t0     146793 socket
java       33345  35895               hdfs  878uW     REG               0,41        31          2 /data1/in_use.lock

What I need is for failmode=continue to not block I/O, so that this process can exit and I can start another one to manage a replacement disk in a new pool without rebooting. I don't necessarily need to be able to destroy the original zpool, though that would be nice, as noted in other open issues.

richardelling commented 6 years ago

I think this is a dup of https://github.com/zfsonlinux/zfs/issues/6649

stuartthebruce commented 6 years ago

I think this is a dup of #6649

That ticket is for failmode=wait whereas this ticket is for failmode=continue.

richardelling commented 6 years ago

Yes, but failmode isn't the issue here. The issue is how to remove a suspended pool from the system.

stuartthebruce commented 6 years ago

I was hoping that failmode=continue would obviate the need to wait for the enhancement to allow the removal of a suspended pool. My immediate need is to simply return an error and not block. I can live with an unusable suspended pool in the system until I need to reboot for another reason.

GregorKopka commented 6 years ago

The issue seems to stem from failmode=continue not aborting existing write requests (as one might expect), but only new ones. From man zpool:

continue Returns EIO to any new write I/O requests but allows reads to any of the remaining healthy devices. Any write requests that have yet to be committed to disk would be blocked.

Also, man zpool doesn't specify exactly what happens to reads from unhealthy devices (which, for a suspended pool, can possibly be all of them).

Thus, while the behaviour seen by the OP is more or less as documented, failmode=continue is IMHO quite useless when it effectively behaves identically to failmode=wait (hanging, unkillable I/O for whatever was in flight when the suspension occurred). It should be made to cleanly abort all outstanding I/O (writes and reads) that cannot complete because the pool went into suspension.

Possibly a general timeout for zios (aborting them with a clean error condition after a long enough period of inactivity) could solve the issue of I/O being stuck in an unkillable state?
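For what it's worth, a quick way to see the stuck tasks this causes (a rough sketch; /data1 is the OP's pool):

    # tasks stuck in uninterruptible sleep (D) or left as zombies (Z)
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^[DZ]/'

    # which of them still hold files open on the affected pool
    lsof +D /data1 2>/dev/null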

dweeezil commented 6 years ago

This is related to the work I'm doing to support the "abandonment" of a pool from which, for example, IO has "hung" because the completions are no longer arriving (due to flaky hardware, bad driver, etc.) and for which I worked up a proof-of-concept at the OpenZFS hackathon this year. This issue is sort-of a different instance of the problem (in which a pool can't be exported).

The work to support abandoning a pool for which IO has hung is going to leverage the similarly-named "continue" mode of the zio deadman. I've got a patch almost ready to post as a PR which fixes some of the problems with zio deadman.

This particular issue will require somewhat different handling but it is something I've planned on addressing as part of the larger "zpool abandon" feature.
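For anyone who wants to experiment in the meantime, the zio deadman mentioned above is controlled through module parameters (a sketch based on 0.8.0-era parameter names; check zfs-module-parameters(5) for your version):

    cat /sys/module/zfs/parameters/zfs_deadman_enabled
    cat /sys/module/zfs/parameters/zfs_deadman_failmode     # wait | continue | panic
    cat /sys/module/zfs/parameters/zfs_deadman_ziotime_ms   # per-zio deadline in ms

    # switch the deadman to its "continue" mode at runtime
    echo continue > /sys/module/zfs/parameters/zfs_deadman_failmode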

stuartthebruce commented 6 years ago

This particular issue will require somewhat different handling but it is something I've planned on addressing as part of the larger "zpool abandon" feature.

@dweeezil most excellent! I have a large number of unreliable HDDs in a Hadoop cluster that I would be willing to use to test a ZFS patch when it is available. I am most interested in the ability to optionally not block on "pool I/O is currently suspended"; however, I am also interested in testing the ability to abandon and destroy a zpool without having to reboot. Many thanks for working on this.

bgly commented 6 years ago

I would also like to test this feature. The deadman continue helps a lot, but sometimes I lose the connection and can't recover it, and I would like to not have to reboot.

bgly commented 6 years ago

WIP - Fix issues with zio deadman "continue" mode #8021

stuartthebruce commented 5 years ago

@dweeezil do you have a rough estimate on when external testing would be helpful?

stuartthebruce commented 5 years ago

Does 0.8.0 change this behavior? Or make it any easier to implement a fix?

stale[bot] commented 4 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

GregorKopka commented 4 years ago

I'm opposing staleness. This might be old, but it's an issue.

behlendorf commented 4 years ago

@GregorKopka I've tagged this issue as a defect. I've also added the "Status: Understood" tag which will prevent the bot from marking it again.

stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

GregorKopka commented 3 years ago

@behlendorf

Looks to me as if this bot has started a rebellion against mankind.

sean- commented 2 years ago

Bumping this issue because this failure mode appears to be reasonably common to trigger with EBS volumes that go on walkabout.

devZer0 commented 2 years ago

Hello, I also came across this issue, as I have a problem with hung KVM VMs when an unreliable single-disk ZFS storage (used only for unimportant/backup tasks) goes nuts.

I would expect that with failmode=continue set, read and write requests return EIO. At least for writes, the manpage explicitly states that EIO is returned.

But that simply does not happen: whatever read or write is issued, both get blocked, leaving the process in an uninterruptible state where it cannot be killed, so this is definitely a bug.
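To make the mismatch with the manpage concrete, here is a rough check one can run on an affected pool (the path is only an example): with failmode=continue, the write below should fail quickly with "Input/output error" (EIO) once the pool is suspended, but instead it hangs.

    dd if=/dev/zero of=/data1/eio-test bs=4k count=1 conv=fsync &
    DD_PID=$!
    sleep 5

    # with the documented behaviour, dd would already have exited with EIO;
    # on affected versions it is still here, stuck in D state
    ps -o pid,stat,cmd -p "$DD_PID"

    kill -9 "$DD_PID"   # has no effect while the I/O remains stuck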