ZFS 'just' hangs - Githubissues

System information

Type	Version/Name
Distribution Name	Debian
Distribution Version	9.1
Linux Kernel	4.13.4
Architecture	x86_64
ZFS Version	0.7.0-97_g5f88d2c8a (ZFS with encryption from tcaputi, plus cytrinox's Debian packaging work: github.com/cytrinox/zfs)
SPL Version	0.7.0-15_g275146c (ditto, see above)

Describe the problem you're observing

ZFS 'just' hangs: no operations were possible, no (kernel) threads did any work, and there was no disk I/O. The command line tools (zpool / zfs / zdb) got stuck as well when issued. The kernel threads eventually triggered 'hung task timeout' warnings in the kernel. Logs and dmesg show nothing else (only the hung task info).
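For anyone hitting a similar hang, here is a minimal sketch of how the blocked tasks and their kernel stacks could be captured before rebooting. It assumes a Linux /proc filesystem, root access, and a kernel that exposes /proc/<pid>/stack; the function names are made up for illustration and are not part of ZFS or any tool mentioned here.

```python
import os

def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' instead of on whitespace.
    """
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state

def blocked_tasks(proc="/proc"):
    """Yield (pid, comm, kernel_stack) for tasks in uninterruptible sleep (state D).

    Only top-level /proc entries are scanned; per-thread entries under
    /proc/<pid>/task are not visited in this sketch.
    """
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except OSError:
            continue  # the task exited while we were looking
        if state != "D":
            continue
        try:
            with open(os.path.join(proc, entry, "stack")) as f:
                stack = f.read()
        except OSError:
            stack = "(stack not readable: needs root and kernel stack trace support)\n"
        yield pid, comm, stack
```

Printing each tuple from `blocked_tasks()` while the hang is in progress would give stack traces to attach to a report like this one. Writing `w` to /proc/sysrq-trigger is another way to dump blocked tasks into the kernel log, provided sysrq is enabled.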

Describe how to reproduce the problem

I had a rather complex situation here, likely not (easily) reproducible. Eventually I rebooted the system. I am now trying to reproduce the problem.

A list and annotations about what was going on:

everything below happened on encrypted datasets/zvols
resilvering to a striped md device (2x4TB disks combined into one 8TB device)
git annex was running, checksumming a media library
the zfs-auto-snapshot service was running, but the zvols are excluded from it
created a zvol
ran qemu-img convert from an old .qcow2 image to a zvol. The first image conversion made the system load skyrocket to 30-40, but it eventually completed overnight. The next image brought ZFS down: the load went up to 70+ with no progress, with the effects described above.
I noticed a lot of pointless write load on the L2ARC, so I tried to offline the cache devices around the time the filesystem got stuck. The load was already above 70 by then, so the filesystem may already have been hung.

Trying to reproduce with no success so far:

the high load condition with 'qemu-img' is reproducible
taking the cache devices offline and online a few times works without problems
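The reproduction attempt described above could be scripted roughly as follows. This is only a sketch that builds and prints command lines without executing anything. The pool name and cache devices are taken from this report, while the zvol name, size, and image path are hypothetical placeholders.

```python
import shlex

def repro_commands(pool, zvol, size, image, cache_devs, cycles=3):
    """Build command lines for one reproduction attempt (nothing is executed).

    Returns two groups meant to run in parallel from separate shells:
    'convert' creates the zvol and writes the image to it, while
    'toggle' takes the cache devices offline and online again.
    """
    target = f"{pool}/{zvol}"
    convert = [
        ["zfs", "create", "-V", size, target],
        ["qemu-img", "convert", "-f", "qcow2", "-O", "raw",
         image, f"/dev/zvol/{target}"],
    ]
    toggle = []
    for _ in range(cycles):
        toggle += [["zpool", "offline", pool, dev] for dev in cache_devs]
        toggle += [["zpool", "online", pool, dev] for dev in cache_devs]
    return {"convert": convert, "toggle": toggle}

# Hypothetical zvol name, size, and image path; pool and devices as in this report.
plan = repro_commands("data", "vm-test", "100G", "old-image.qcow2",
                      ["nvme0n1p6", "nvme1n1p6"], cycles=1)
for group, cmds in plan.items():
    for cmd in cmds:
        print(f"[{group}] {shlex.join(cmd)}")
```

Keeping the two groups separate reflects that the cache devices were taken offline while the conversion was still running, not after it.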

pool configuration:

  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Oct 4 14:58:57 2017
        6,84T scanned out of 16,3T at 13,9M/s, 198h44m to go
        1,37T resilvered, 41,89% done
config:

NAME                                               STATE     READ WRITE CKSUM
data                                               DEGRADED     0     0     0
  raidz2-0                                         DEGRADED     0     0     0
    replacing-0                                    DEGRADED     0     0     0
      /root/spare1                                 OFFLINE      0     0     0
      md-uuid-97addc40:1ac606c6:bbc357af:9c81950d  ONLINE       0     0     0  (resilvering)
    /root/spare2                                   OFFLINE      0     0     0
    sdf                                            ONLINE       0     0     0
    sdg                                            ONLINE       0     0     0
    sdh                                            ONLINE       0     0     0
logs
  mirror-1                                         ONLINE       0     0     0
    nvme0n1p5                                      ONLINE       0     0     0
    nvme1n1p5                                      ONLINE       0     0     0
cache
  nvme0n1p6                                        OFFLINE      0     0     0
  nvme1n1p6                                        OFFLINE      0     0     0
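As a rough cross-check of the scan figures in the status output above, the progress and remaining time can be recomputed from the scanned size, total size, and rate. The sketch below assumes the sizes use binary units (1T = 2^40 bytes) and a decimal comma, as printed above; the helper names are illustrative only.

```python
def parse_size(text):
    """Convert a zpool-style size such as '6,84T' or '13,9M' to bytes.

    Accepts a decimal comma or a decimal point; assumes binary units.
    """
    units = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}
    return float(text[:-1].replace(",", ".")) * units[text[-1]]

def resilver_estimate(scanned, total, rate_per_s):
    """Return (fraction_done, hours_remaining) at a constant scan rate."""
    scanned_b = parse_size(scanned)
    total_b = parse_size(total)
    fraction_done = scanned_b / total_b
    hours_remaining = (total_b - scanned_b) / parse_size(rate_per_s) / 3600
    return fraction_done, hours_remaining

done, hours = resilver_estimate("6,84T", "16,3T", "13,9M")
print(f"{done:.1%} done, about {hours:.0f} hours to go")
```

With the rounded figures shown, this comes out at roughly 42% done and a little under 200 hours remaining, consistent with the reported 41,89% and 198h44m. The resilver was therefore still more than a week from completion while the other workloads were running.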

Include any warning/errors/backtraces from the system logs

Sorry, nothing was logged beyond the hung task warnings mentioned above.

note: Looks to me like some rare race/deadlock problem under high load.

openzfs / zfs

ZFS 'just' hangs #6725
