openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Syncthing and rm (so far) hang on IO #10522

Open lightrush opened 4 years ago

lightrush commented 4 years ago

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 20.04
Linux Kernel 5.4.0
Architecture arm64
ZFS Version 0.8.3-1ubuntu12
SPL Version 0.8.3-1ubuntu12

Describe the problem you're observing

I tried to transfer 3.4TB of data using Syncthing from another device on the network. Syncthing hung on IO after 150GB, 1.2TB, and 1.3TB on 3 out of 3 attempts. I ran a scrub after the first hang. Then I decided to rm -rf the data and retry, but rm hung in the same fashion. The disks are brand-new WD Elements 8TB, non-SMR. They have been ATA secure erased and show no issues in SMART. The zpool is a simple 2-way mirror. Load is 6.73, 6.67, 6.21 while ZFS is hung. The hang appears to persist indefinitely until the machine is rebooted; I had it hung for over 12 hours on one occasion before rebooting. The system stays alive since root isn't on the zpool. There are no errors in zpool status. It looks very deadlock-y.
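
For reference, the checks mentioned above amount to something like the following (device names are assumed; smartctl may need -d sat behind some USB bridges):

smartctl -a /dev/sda                     # SMART health of each mirror member
smartctl -a /dev/sdb
zpool status -v                          # pool state and error counters while hung
uptime                                   # load average during the hang
dmesg | grep 'blocked for more than'     # the hung-task warnings quoted below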

Describe how to reproduce the problem

I don't know if you can reproduce it unless you have the same setup and data set, but it is very repeatable here, so I can run debug versions and collect logs if that would help narrow the issue down. The system is disposable at the moment, so running arbitrary versions of ZFS and reproducing the problem shouldn't be an issue.

Include any warning/errors/backtraces from the system logs

This is what I see in the log:

[ 1692.658503] INFO: task z_wr_iss:7931 blocked for more than 120 seconds.
[ 1692.665285]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[ 1692.672253] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.680286] z_wr_iss        D    0  7931      2 0x00000028
[ 1692.680310] Call trace:
[ 1692.680332]  __switch_to+0x104/0x170
[ 1692.680348]  __schedule+0x30c/0x7c0
[ 1692.680359]  schedule+0x3c/0xb8
[ 1692.680370]  io_schedule+0x20/0x40
[ 1692.680384]  rq_qos_wait+0x100/0x178
[ 1692.680395]  wbt_wait+0xb4/0xf0
[ 1692.680407]  __rq_qos_throttle+0x38/0x50
[ 1692.680419]  blk_mq_make_request+0x128/0x610
[ 1692.680432]  generic_make_request+0xb4/0x2d8
[ 1692.680442]  submit_bio+0x48/0x218
[ 1692.680689]  vdev_disk_io_start+0x670/0x9f0 [zfs]
[ 1692.680920]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[ 1692.681147]  zio_nowait+0xd4/0x170 [zfs]
[ 1692.681372]  vdev_queue_io+0x23c/0x258 [zfs]
[ 1692.681599]  zio_vdev_io_start+0x94/0x2b8 [zfs]
[ 1692.681824]  zio_nowait+0xd4/0x170 [zfs]
[ 1692.682048]  vdev_mirror_io_start+0xa8/0x1b0 [zfs]
[ 1692.682275]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[ 1692.682501]  zio_nowait+0xd4/0x170 [zfs]
[ 1692.682784]  vdev_mirror_io_start+0xa8/0x1b0 [zfs]
[ 1692.683023]  zio_vdev_io_start+0x248/0x2b8 [zfs]
[ 1692.683258]  zio_execute+0xac/0x110 [zfs]
[ 1692.683305]  taskq_thread+0x2f8/0x570 [spl]
[ 1692.683323]  kthread+0xfc/0x128
[ 1692.683342]  ret_from_fork+0x10/0x1c
[ 1692.683354] INFO: task z_wr_iss:7932 blocked for more than 120 seconds.
[ 1692.690140]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[ 1692.697083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.705081] z_wr_iss        D    0  7932      2 0x00000028
[ 1692.705096] Call trace:
[ 1692.705113]  __switch_to+0x104/0x170
[ 1692.705134]  __schedule+0x30c/0x7c0
[ 1692.705145]  schedule+0x3c/0xb8
[ 1692.705156]  io_schedule+0x20/0x40
[ 1692.705169]  rq_qos_wait+0x100/0x178
[ 1692.705181]  wbt_wait+0xb4/0xf0
[ 1692.705192]  __rq_qos_throttle+0x38/0x50
[ 1692.705205]  blk_mq_make_request+0x128/0x610
[ 1692.705218]  generic_make_request+0xb4/0x2d8
[ 1692.705228]  submit_bio+0x48/0x218
[ 1692.705474]  vdev_disk_io_start+0x670/0x9f0 [zfs]
[ 1692.705730]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[ 1692.705972]  zio_nowait+0xd4/0x170 [zfs]
[ 1692.706198]  vdev_mirror_io_start+0xa8/0x1b0 [zfs]
[ 1692.706427]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[ 1692.706719]  zio_nowait+0xd4/0x170 [zfs]
[ 1692.706951]  vdev_mirror_io_start+0xa8/0x1b0 [zfs]
[ 1692.707184]  zio_vdev_io_start+0x248/0x2b8 [zfs]
[ 1692.707415]  zio_execute+0xac/0x110 [zfs]
[ 1692.707461]  taskq_thread+0x2f8/0x570 [spl]
[ 1692.707480]  kthread+0xfc/0x128
[ 1692.707498]  ret_from_fork+0x10/0x1c
[ 1692.707514] INFO: task z_wr_int:7939 blocked for more than 120 seconds.
[ 1692.714304]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[ 1692.721244] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.729248] z_wr_int        D    0  7939      2 0x00000028
[ 1692.729263] Call trace:
[ 1692.729279]  __switch_to+0x104/0x170
[ 1692.729293]  __schedule+0x30c/0x7c0
[ 1692.729303]  schedule+0x3c/0xb8
[ 1692.729314]  io_schedule+0x20/0x40
[ 1692.729326]  rq_qos_wait+0x100/0x178
[ 1692.729338]  wbt_wait+0xb4/0xf0
[ 1692.729349]  __rq_qos_throttle+0x38/0x50
[ 1692.729361]  blk_mq_make_request+0x128/0x610
[ 1692.729373]  generic_make_request+0xb4/0x2d8
[ 1692.729383]  submit_bio+0x48/0x218
[ 1692.729618]  vdev_disk_io_start+0x670/0x9f0 [zfs]
[ 1692.729845]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[ 1692.730070]  zio_nowait+0xd4/0x170 [zfs]
[ 1692.730294]  vdev_queue_io_done+0x1ec/0x2a0 [zfs]
[ 1692.730520]  zio_vdev_io_done+0xec/0x220 [zfs]
[ 1692.730781]  zio_execute+0xac/0x110 [zfs]
[ 1692.730827]  taskq_thread+0x2f8/0x570 [spl]
[ 1692.730843]  kthread+0xfc/0x128
[ 1692.730854]  ret_from_fork+0x10/0x1c
[ 1692.730879] INFO: task txg_sync:9149 blocked for more than 120 seconds.
[ 1692.737639]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[ 1692.744572] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.752572] txg_sync        D    0  9149      2 0x00000028
[ 1692.752586] Call trace:
[ 1692.752600]  __switch_to+0x104/0x170
[ 1692.752613]  __schedule+0x30c/0x7c0
[ 1692.752624]  schedule+0x3c/0xb8
[ 1692.752633]  schedule_timeout+0x9c/0x190
[ 1692.752644]  io_schedule_timeout+0x28/0x48
[ 1692.752684]  __cv_timedwait_common+0x1a8/0x1f8 [spl]
[ 1692.752721]  __cv_timedwait_io+0x3c/0x50 [spl]
[ 1692.752949]  zio_wait+0x130/0x2a0 [zfs]
[ 1692.753175]  dsl_pool_sync+0x3fc/0x498 [zfs]
[ 1692.753399]  spa_sync+0x538/0xe68 [zfs]
[ 1692.753623]  txg_sync_thread+0x2c0/0x468 [zfs]
[ 1692.753662]  thread_generic_wrapper+0x74/0xa0 [spl]
[ 1692.753675]  kthread+0xfc/0x128
[ 1692.753686]  ret_from_fork+0x10/0x1c
[ 1692.753711] INFO: task syncthing:11276 blocked for more than 120 seconds.
[ 1692.760645]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[ 1692.767571] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.775551] syncthing       D    0 11276  11270 0x00000800
[ 1692.775567] Call trace:
[ 1692.775581]  __switch_to+0x104/0x170
[ 1692.775593]  __schedule+0x30c/0x7c0
[ 1692.775604]  schedule+0x3c/0xb8
[ 1692.775642]  cv_wait_common+0x188/0x1b0 [spl]
[ 1692.775679]  __cv_wait+0x30/0x40 [spl]
[ 1692.775910]  zil_commit_impl+0x234/0xd30 [zfs]
[ 1692.776142]  zil_commit+0x48/0x70 [zfs]
[ 1692.776371]  zfs_fsync+0x88/0xf0 [zfs]
[ 1692.776599]  zpl_fsync+0x7c/0xb0 [zfs]
[ 1692.776634]  vfs_fsync_range+0x4c/0x88
[ 1692.776644]  do_fsync+0x48/0x88
[ 1692.776655]  __arm64_sys_fsync+0x24/0x38
[ 1692.776670]  el0_svc_common.constprop.0+0xe0/0x1e8
[ 1692.776678]  el0_svc_handler+0x34/0xa0
[ 1692.776688]  el0_svc+0x10/0x14
[ 1692.776777] INFO: task z_wr_iss_h:26603 blocked for more than 120 seconds.
[ 1692.783794]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[ 1692.790706] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.798669] z_wr_iss_h      D    0 26603      2 0x00000028
[ 1692.798678] Call trace:
[ 1692.798690]  __switch_to+0x104/0x170
[ 1692.798698]  __schedule+0x30c/0x7c0
[ 1692.798702]  schedule+0x3c/0xb8
[ 1692.798706]  io_schedule+0x20/0x40
[ 1692.798712]  rq_qos_wait+0x100/0x178
[ 1692.798717]  wbt_wait+0xb4/0xf0
[ 1692.798722]  __rq_qos_throttle+0x38/0x50
[ 1692.798727]  blk_mq_make_request+0x128/0x610
[ 1692.798733]  generic_make_request+0xb4/0x2d8
[ 1692.798737]  submit_bio+0x48/0x218
[ 1692.798869]  vdev_disk_io_start+0x670/0x9f0 [zfs]
[ 1692.798961]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[ 1692.799053]  zio_nowait+0xd4/0x170 [zfs]
[ 1692.799143]  vdev_mirror_io_start+0xa8/0x1b0 [zfs]
[ 1692.799234]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[ 1692.799324]  zio_nowait+0xd4/0x170 [zfs]
[ 1692.799414]  vdev_mirror_io_start+0xa8/0x1b0 [zfs]
[ 1692.799504]  zio_vdev_io_start+0x248/0x2b8 [zfs]
[ 1692.799594]  zio_execute+0xac/0x110 [zfs]
[ 1692.799613]  taskq_thread+0x2f8/0x570 [spl]
[ 1692.799620]  kthread+0xfc/0x128
[ 1692.799625]  ret_from_fork+0x10/0x1c
[ 1813.491930] INFO: task z_wr_iss:7931 blocked for more than 241 seconds.
[ 1813.498659]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[ 1813.505576] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1813.513539] z_wr_iss        D    0  7931      2 0x00000028
[ 1813.513548] Call trace:
[ 1813.513561]  __switch_to+0x104/0x170
[ 1813.513568]  __schedule+0x30c/0x7c0
[ 1813.513573]  schedule+0x3c/0xb8
[ 1813.513577]  io_schedule+0x20/0x40
[ 1813.513583]  rq_qos_wait+0x100/0x178
[ 1813.513588]  wbt_wait+0xb4/0xf0
[ 1813.513593]  __rq_qos_throttle+0x38/0x50
[ 1813.513598]  blk_mq_make_request+0x128/0x610
[ 1813.513603]  generic_make_request+0xb4/0x2d8
[ 1813.513608]  submit_bio+0x48/0x218
[ 1813.513752]  vdev_disk_io_start+0x670/0x9f0 [zfs]
[ 1813.513844]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[ 1813.513935]  zio_nowait+0xd4/0x170 [zfs]
[ 1813.514025]  vdev_queue_io+0x23c/0x258 [zfs]
[ 1813.514116]  zio_vdev_io_start+0x94/0x2b8 [zfs]
[ 1813.514206]  zio_nowait+0xd4/0x170 [zfs]
[ 1813.514295]  vdev_mirror_io_start+0xa8/0x1b0 [zfs]
[ 1813.514386]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[ 1813.514476]  zio_nowait+0xd4/0x170 [zfs]
[ 1813.514566]  vdev_mirror_io_start+0xa8/0x1b0 [zfs]
[ 1813.514656]  zio_vdev_io_start+0x248/0x2b8 [zfs]
[ 1813.514746]  zio_execute+0xac/0x110 [zfs]
[ 1813.514764]  taskq_thread+0x2f8/0x570 [spl]
[ 1813.514770]  kthread+0xfc/0x128
[ 1813.514775]  ret_from_fork+0x10/0x1c
[ 1813.514780] INFO: task z_wr_iss:7932 blocked for more than 241 seconds.
[ 1813.521515]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[ 1813.528416] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1813.536378] z_wr_iss        D    0  7932      2 0x00000028
[ 1813.536385] Call trace:
[ 1813.536395]  __switch_to+0x104/0x170
[ 1813.536401]  __schedule+0x30c/0x7c0
[ 1813.536406]  schedule+0x3c/0xb8
[ 1813.536410]  io_schedule+0x20/0x40
[ 1813.536416]  rq_qos_wait+0x100/0x178
[ 1813.536421]  wbt_wait+0xb4/0xf0
[ 1813.536426]  __rq_qos_throttle+0x38/0x50
[ 1813.536432]  blk_mq_make_request+0x128/0x610
[ 1813.536437]  generic_make_request+0xb4/0x2d8
[ 1813.536444]  submit_bio+0x48/0x218
[ 1813.536570]  vdev_disk_io_start+0x670/0x9f0 [zfs]
[ 1813.536665]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[ 1813.536757]  zio_nowait+0xd4/0x170 [zfs]
[ 1813.536849]  vdev_mirror_io_start+0xa8/0x1b0 [zfs]
[ 1813.536942]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[ 1813.537035]  zio_nowait+0xd4/0x170 [zfs]
[ 1813.537127]  vdev_mirror_io_start+0xa8/0x1b0 [zfs]
[ 1813.537223]  zio_vdev_io_start+0x248/0x2b8 [zfs]
[ 1813.537318]  zio_execute+0xac/0x110 [zfs]
[ 1813.537337]  taskq_thread+0x2f8/0x570 [spl]
[ 1813.537343]  kthread+0xfc/0x128
[ 1813.537349]  ret_from_fork+0x10/0x1c
[ 1813.537356] INFO: task z_wr_int:7939 blocked for more than 241 seconds.
[ 1813.544093]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[ 1813.550993] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1813.558958] z_wr_int        D    0  7939      2 0x00000028
[ 1813.558965] Call trace:
[ 1813.558973]  __switch_to+0x104/0x170
[ 1813.558980]  __schedule+0x30c/0x7c0
[ 1813.558984]  schedule+0x3c/0xb8
[ 1813.558989]  io_schedule+0x20/0x40
[ 1813.558996]  rq_qos_wait+0x100/0x178
[ 1813.559000]  wbt_wait+0xb4/0xf0
[ 1813.559005]  __rq_qos_throttle+0x38/0x50
[ 1813.559010]  blk_mq_make_request+0x128/0x610
[ 1813.559016]  generic_make_request+0xb4/0x2d8
[ 1813.559020]  submit_bio+0x48/0x218
[ 1813.559141]  vdev_disk_io_start+0x670/0x9f0 [zfs]
[ 1813.559233]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[ 1813.559323]  zio_nowait+0xd4/0x170 [zfs]
[ 1813.559414]  vdev_queue_io_done+0x1ec/0x2a0 [zfs]
[ 1813.559504]  zio_vdev_io_done+0xec/0x220 [zfs]
[ 1813.559594]  zio_execute+0xac/0x110 [zfs]
[ 1813.559611]  taskq_thread+0x2f8/0x570 [spl]
[ 1813.559618]  kthread+0xfc/0x128
[ 1813.559623]  ret_from_fork+0x10/0x1c
[ 1813.559649] INFO: task syncthing:11276 blocked for more than 241 seconds.
[ 1813.566561]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[ 1813.573465] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1813.581425] syncthing       D    0 11276  11270 0x00000800
[ 1813.581431] Call trace:
[ 1813.581439]  __switch_to+0x104/0x170
[ 1813.581445]  __schedule+0x30c/0x7c0
[ 1813.581450]  schedule+0x3c/0xb8
[ 1813.581469]  cv_wait_common+0x188/0x1b0 [spl]
[ 1813.581483]  __cv_wait+0x30/0x40 [spl]
[ 1813.581617]  zil_commit_impl+0x234/0xd30 [zfs]
[ 1813.581716]  zil_commit+0x48/0x70 [zfs]
[ 1813.581812]  zfs_fsync+0x88/0xf0 [zfs]
[ 1813.581910]  zpl_fsync+0x7c/0xb0 [zfs]
[ 1813.581918]  vfs_fsync_range+0x4c/0x88
[ 1813.581922]  do_fsync+0x48/0x88
[ 1813.581926]  __arm64_sys_fsync+0x24/0x38
[ 1813.581931]  el0_svc_common.constprop.0+0xe0/0x1e8
[ 1813.581935]  el0_svc_handler+0x34/0xa0
[ 1813.581940]  el0_svc+0x10/0x14
cfcs commented 4 years ago

This is on a Raspberry Pi with spinning disks connected over USB? I've experienced very similar failures that turned out to be related to the RPi not being able to provide enough power to the USB ports - it may be worth generating disk I/O with dd or shred or through ext4 to see whether this is related to ZFS at all. EDIT: That being said, I guess it would be nice if ZFS would fail the vdev instead of hanging, though.
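
A rough sketch of that kind of non-ZFS I/O test, with hypothetical device and mount names (the raw-device write destroys whatever is on the disk):

dd if=/dev/zero of=/dev/sdX bs=1M count=100000 oflag=direct status=progress   # raw sequential writes, bypassing ZFS
mkfs.ext4 /dev/sdX1 && mount /dev/sdX1 /mnt/test                              # or exercise the same disk through ext4
dd if=/dev/zero of=/mnt/test/big.bin bs=1M count=100000 conv=fsync status=progress
dmesg --follow | grep -iE 'usb|reset|voltage'                                 # watch for USB resets or under-voltage while it runs
vcgencmd get_throttled                                                        # Pi firmware under-voltage/throttling flags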

lightrush commented 4 years ago

Pi 4, yes. The WD Elements are externally powered disks, so power isn't an issue. I've also run ATA secure erase on these disks on this same Pi, which walks the whole disk without an issue. I'm running a very similar setup on another Pi 4 with Ext4 on an LVM RAID mirror. It's been running 24/7 since last July without a squeak. That's the source of the data set I'm trying to copy.

lightrush commented 4 years ago

Currently testing latest master.
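
(Roughly the upstream build procedure, for anyone who wants to repeat this on Ubuntu - a sketch of the documented steps, not necessarily the exact commands I ran:)

sudo apt install build-essential autoconf automake libtool gawk alien fakeroot dkms \
    libblkid-dev uuid-dev libudev-dev libssl-dev zlib1g-dev libaio-dev libattr1-dev \
    libelf-dev linux-headers-$(uname -r) python3 python3-dev python3-setuptools python3-cffi libffi-dev
git clone https://github.com/openzfs/zfs && cd zfs
sh autogen.sh && ./configure
make -j"$(nproc)" && sudo make install && sudo ldconfig
sudo depmod -a && sudo modprobe zfs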

lightrush commented 4 years ago

Latest master managed to complete the Syncthing data transfer. Good.

Then I ran rm -rf on the whole dir and that reproduced the hang. Bad.

[160106.267704] INFO: task z_wr_int:3549 blocked for more than 120 seconds.
[160106.274547]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[160106.281584] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[160106.289674] z_wr_int        D    0  3549      2 0x00000028
[160106.289690] Call trace:
[160106.289718]  __switch_to+0x104/0x170
[160106.289733]  __schedule+0x30c/0x7c0
[160106.289744]  schedule+0x3c/0xb8
[160106.289755]  io_schedule+0x20/0x40
[160106.289768]  rq_qos_wait+0x100/0x178
[160106.289780]  wbt_wait+0xb4/0xf0
[160106.289791]  __rq_qos_throttle+0x38/0x50
[160106.289806]  blk_mq_make_request+0x128/0x610
[160106.289821]  generic_make_request+0xb4/0x2d8
[160106.289831]  submit_bio+0x48/0x218
[160106.290106]  vdev_disk_io_start+0x540/0x900 [zfs]
[160106.290363]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[160106.290644]  zio_nowait+0xd4/0x170 [zfs]
[160106.290908]  vdev_queue_io_done+0x1ec/0x2a0 [zfs]
[160106.291155]  zio_vdev_io_done+0xec/0x220 [zfs]
[160106.291407]  zio_execute+0xac/0x108 [zfs]
[160106.291445]  taskq_thread+0x304/0x580 [spl]
[160106.291463]  kthread+0xfc/0x128
[160106.291475]  ret_from_fork+0x10/0x1c
[160106.291488] INFO: task z_wr_int:3552 blocked for more than 120 seconds.
[160106.298342]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[160106.305351] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[160106.313419] z_wr_int        D    0  3552      2 0x00000028
[160106.313432] Call trace:
[160106.313447]  __switch_to+0x104/0x170
[160106.313461]  __schedule+0x30c/0x7c0
[160106.313471]  schedule+0x3c/0xb8
[160106.313482]  io_schedule+0x20/0x40
[160106.313495]  rq_qos_wait+0x100/0x178
[160106.313506]  wbt_wait+0xb4/0xf0
[160106.313517]  __rq_qos_throttle+0x38/0x50
[160106.313529]  blk_mq_make_request+0x128/0x610
[160106.313541]  generic_make_request+0xb4/0x2d8
[160106.313552]  submit_bio+0x48/0x218
[160106.313801]  vdev_disk_io_start+0x540/0x900 [zfs]
[160106.314040]  zio_vdev_io_start+0xdc/0x2b8 [zfs]
[160106.314279]  zio_nowait+0xd4/0x170 [zfs]
[160106.314518]  vdev_queue_io_done+0x1ec/0x2a0 [zfs]
[160106.314757]  zio_vdev_io_done+0xec/0x220 [zfs]
[160106.314995]  zio_execute+0xac/0x108 [zfs]
[160106.315031]  taskq_thread+0x304/0x580 [spl]
[160106.315044]  kthread+0xfc/0x128
[160106.315055]  ret_from_fork+0x10/0x1c
[160106.315077] INFO: task txg_sync:3781 blocked for more than 120 seconds.
[160106.321914]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[160106.328919] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[160106.336988] txg_sync        D    0  3781      2 0x00000028
[160106.337004] Call trace:
[160106.337016]  __switch_to+0x104/0x170
[160106.337029]  __schedule+0x30c/0x7c0
[160106.337039]  schedule+0x3c/0xb8
[160106.337048]  schedule_timeout+0x9c/0x190
[160106.337059]  io_schedule_timeout+0x28/0x48
[160106.337094]  __cv_timedwait_common+0x1ac/0x1f8 [spl]
[160106.337127]  __cv_timedwait_io+0x3c/0x50 [spl]
[160106.337372]  zio_wait+0x138/0x2b8 [zfs]
[160106.337610]  dsl_pool_sync+0x3fc/0x498 [zfs]
[160106.337847]  spa_sync+0x530/0xeb8 [zfs]
[160106.338085]  txg_sync_thread+0x2d8/0x460 [zfs]
[160106.338120]  thread_generic_wrapper+0x74/0xa0 [spl]
[160106.338132]  kthread+0xfc/0x128
[160106.338143]  ret_from_fork+0x10/0x1c
[160106.338167] INFO: task rm:1342914 blocked for more than 120 seconds.
[160106.344739]       Tainted: P         C OE     5.4.0-1012-raspi #12-Ubuntu
[160106.351745] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[160106.359811] rm              D    0 1342914 1342913 0x00000000
[160106.359824] Call trace:
[160106.359836]  __switch_to+0x104/0x170
[160106.359848]  __schedule+0x30c/0x7c0
[160106.359858]  schedule+0x3c/0xb8
[160106.359869]  io_schedule+0x20/0x40
[160106.359903]  cv_wait_common+0x104/0x1b0 [spl]
[160106.359936]  __cv_wait_io+0x30/0x40 [spl]
[160106.360179]  txg_wait_open+0xa0/0x118 [zfs]
[160106.360416]  dmu_free_long_range+0x414/0x4f0 [zfs]
[160106.360656]  zfs_rmnode+0x2e8/0x3e0 [zfs]
[160106.360896]  zfs_zinactive+0x12c/0x148 [zfs]
[160106.361135]  zfs_inactive+0x78/0x1f8 [zfs]
[160106.361374]  zpl_evict_inode+0x4c/0x70 [zfs]
[160106.361386]  evict+0xcc/0x1c8
[160106.361396]  iput+0x158/0x250
[160106.361407]  do_unlinkat+0x1bc/0x2a0
[160106.361418]  __arm64_sys_unlinkat+0x44/0x70
[160106.361427]  el0_svc_common.constprop.0+0xe0/0x1e8
[160106.361435]  el0_svc_handler+0x34/0xa0
[160106.361445]  el0_svc+0x10/0x14
lightrush commented 4 years ago

Tested on another Pi 4 board to rule out a hardware problem such as bad memory; the hang reproduced.

lightrush commented 4 years ago

I'm retrying the same transfer on the same hardware with the same software but using Ext4 on LUKS on LVM RAID1 instead of ZFS.

lightrush commented 4 years ago

Both the data replication via Syncthing and rm-ing it completed twice on the same hardware using Ext4 on LUKS on LVM on an mdraid mirror.
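
For completeness, one way such a control stack might be assembled - a sketch with hypothetical device and volume names, not the exact commands I used:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1    # mdraid mirror at the bottom
pvcreate /dev/md0 && vgcreate vg0 /dev/md0                                # LVM on top of the md device
lvcreate -l 100%FREE -n data vg0
cryptsetup luksFormat /dev/vg0/data && cryptsetup open /dev/vg0/data data_crypt
mkfs.ext4 /dev/mapper/data_crypt && mount /dev/mapper/data_crypt /mnt/data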

agiz10 commented 4 years ago

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 20.04.1
Linux Kernel Linux 5.4.0-1019-raspi aarch64
Architecture arm64
ZFS Version 0.8.3-1ubuntu12.4
SPL Version 0.8.3-1ubuntu12.4
Hardware RaspberryPi 4b 4GB
System disk USB SSD - ext4 - connected to USB2
Storage Disks 2x 14TB USB WD MyBook in ZFS mirror- connected to 2xUSB3

Describe the problem you're observing

It seems I have identical errors to the OP: random IO freezes that only resolve by rebooting the system. I first tried with a single drive, and the results were the same. I observe the freezes only when doing disk-intensive operations.

This freeze is a bit different from the rest because I was running a scrub, and I only noticed later that txg_sync hung-task messages were already present in the logs before the scrub finished (logs are below). The scrub still shows as finished successfully, and the pool was responding as well. Then I did rm -R * and a lot of the files got deleted until the command froze (the dreaded class=deadman in the logs).

I can provide whatever logs you need or test something to help identify the cause of this problem and squish this bug. The problem is reproducible. I do not know where to go from here.
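
Some of the state that tends to be useful for a hang like this, collected while the pool is stuck (a sketch; the kstat paths are the standard ZFS-on-Linux locations, with the pool name taken from below):

cat /proc/spl/kstat/zfs/dbgmsg           # deadman / slow-zio messages like the ones quoted further down
cat /proc/spl/kstat/zfs/uranpool/txgs    # whether the open txg is still advancing
cat /proc/spl/kstat/zfs/dmu_tx
zpool events -v                          # zed events, including the class=deadman ones
echo w > /proc/sysrq-trigger             # dump blocked-task stacks to dmesg (needs root and sysrq enabled)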

Describe how to reproduce the problem

I get freezes in three scenarios that I have observed so far (a compressed command sequence follows the list):

  1. when I try to delete the contents of a zpool through rm -rf
  2. when I rsync to the pool from another computer over the network
  3. when I delete large amounts of data through SMB from another computer
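
A compressed version of scenarios 1 and 2, with hypothetical host and path names:

rsync -aHv otherhost:/data/ /uranpool/data/   # scenario 2: write a large tree onto the pool over the network
rm -rf /uranpool/data                         # scenario 1: then delete it; the freeze shows up during one of these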

Include any warning/errors/backtraces from the system logs

From the freeze today:

zpool status
pool: uranpool
state: ONLINE
scan: resilvered 8.99T in 1 days 07:54:43 with 0 errors on Fri Oct 16 06:31:13 2020
remove: Removal of vdev 1 copied 32K in 0h0m, completed on Wed Oct 14 22:33:56 2020
    96 memory used for removed device mappings
config:
        NAME                                          STATE     READ WRITE CKSUM
        uranpool                                      ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            usb-WD_My_Book_25EE_1111111111111154-0:0  ONLINE       0     0     0
            usb-WD_My_Book_25EE_2222222222222243-0:0  ONLINE       0     0     0
arcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
19:47:46     0     0      0     0    0     0    0     0    0   1.6G  1.7G
free -m
              total        used        free      shared  buff/cache   available
Mem:           3792        2477         895           4         419        1255
Swap:          6143           4        6139
zpool iostat -v
                                                capacity     operations     bandwidth
pool                                          alloc   free   read  write   read  write
--------------------------------------------  -----  -----  -----  -----  -----  -----
uranpool                                      9.03T  3.69T    149    136  57.5M  58.6M
  mirror                                      9.03T  3.69T    150    137  57.9M  59.1M
    usb-WD_My_Book_25EE_1111111111111154-0:0      -      -    147     14  57.5M   598K
    usb-WD_My_Book_25EE_2222222222222243-0:0      -      -      1    123  9.20K  58.5M
--------------------------------------------  -----  -----  -----  -----  -----  -----
iotop
The other logs are from the current freeze, but this one is from a previous hang where it froze while I tried to delete via SMB. The current hang shows no IO-intensive processes.
Total DISK READ:         0.00 B/s | Total DISK WRITE:        49.51 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:      17.68 K/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 463673 ?sys user          0.00 B      0.00 B  0.00 % 99.99 % smbd --foreground --no-process-group
   2093 ?sys root          0.00 B      0.00 B  0.00 % 99.99 % [txg_sync]
journalctl -b
Oct 15 10:20:09 uran kernel: INFO: task txg_sync:2094 blocked for more than 120 seconds.
Oct 15 10:20:09 uran kernel:       Tainted: P         C OE     5.4.0-1019-raspi #21-Ubuntu
Oct 15 10:20:09 uran kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 10:20:09 uran kernel: txg_sync        D    0  2094      2 0x00000028
Oct 15 10:20:09 uran kernel: Call trace:
Oct 15 10:20:09 uran kernel:  __switch_to+0x104/0x170
Oct 15 10:20:09 uran kernel:  __schedule+0x314/0x810
Oct 15 10:20:09 uran kernel:  schedule+0x48/0xe8
Oct 15 10:20:09 uran kernel:  cv_wait_common+0x12c/0x170 [spl]
Oct 15 10:20:09 uran kernel:  __cv_wait+0x30/0x40 [spl]
Oct 15 10:20:09 uran kernel:  arc_read+0x160/0xdc8 [zfs]
Oct 15 10:20:09 uran kernel:  dsl_scan_visitbp+0x37c/0x900 [zfs]
Oct 15 10:20:09 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:20:09 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:20:09 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:20:09 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:20:09 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:20:09 uran kernel:  dsl_scan_visitbp+0x5ec/0x900 [zfs]
Oct 15 10:20:09 uran kernel:  dsl_scan_visit_rootbp+0xcc/0x110 [zfs]
Oct 15 10:20:09 uran kernel:  dsl_scan_visitds+0x140/0x460 [zfs]
Oct 15 10:20:09 uran kernel:  dsl_scan_visit+0xb4/0x290 [zfs]
Oct 15 10:20:09 uran kernel:  dsl_scan_sync+0x3f8/0x7d8 [zfs]
Oct 15 10:20:09 uran kernel:  spa_sync_iterate_to_convergence+0x124/0x1e8 [zfs]
Oct 15 10:20:09 uran kernel:  spa_sync+0x2ec/0x520 [zfs]
Oct 15 10:20:09 uran kernel:  txg_sync_thread+0x244/0x2a0 [zfs]
Oct 15 10:20:09 uran kernel:  thread_generic_wrapper+0x74/0xa0 [spl]
Oct 15 10:20:09 uran kernel:  kthread+0x104/0x130
Oct 15 10:20:09 uran kernel:  ret_from_fork+0x10/0x1c
Oct 15 10:21:38 uran PackageKit[3376127]: daemon quit
Oct 15 10:21:38 uran systemd[1]: packagekit.service: Succeeded.
Oct 15 10:22:10 uran kernel: INFO: task txg_sync:2094 blocked for more than 241 seconds.
Oct 15 10:22:10 uran kernel:       Tainted: P         C OE     5.4.0-1019-raspi #21-Ubuntu
Oct 15 10:22:10 uran kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 10:22:10 uran kernel: txg_sync        D    0  2094      2 0x00000028
Oct 15 10:22:10 uran kernel: Call trace:
Oct 15 10:22:10 uran kernel:  __switch_to+0x104/0x170
Oct 15 10:22:10 uran kernel:  __schedule+0x314/0x810
Oct 15 10:22:10 uran kernel:  schedule+0x48/0xe8
Oct 15 10:22:10 uran kernel:  cv_wait_common+0x12c/0x170 [spl]
Oct 15 10:22:10 uran kernel:  __cv_wait+0x30/0x40 [spl]
Oct 15 10:22:10 uran kernel:  arc_read+0x160/0xdc8 [zfs]
Oct 15 10:22:10 uran kernel:  dsl_scan_visitbp+0x37c/0x900 [zfs]
Oct 15 10:22:10 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:22:10 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:22:10 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:22:10 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:22:10 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:22:10 uran kernel:  dsl_scan_visitbp+0x5ec/0x900 [zfs]
Oct 15 10:22:10 uran kernel:  dsl_scan_visit_rootbp+0xcc/0x110 [zfs]
Oct 15 10:22:10 uran kernel:  dsl_scan_visitds+0x140/0x460 [zfs]
Oct 15 10:22:10 uran kernel:  dsl_scan_visit+0xb4/0x290 [zfs]
Oct 15 10:22:10 uran kernel:  dsl_scan_sync+0x3f8/0x7d8 [zfs]
Oct 15 10:22:10 uran kernel:  spa_sync_iterate_to_convergence+0x124/0x1e8 [zfs]
Oct 15 10:22:10 uran kernel:  spa_sync+0x2ec/0x520 [zfs]
Oct 15 10:22:10 uran kernel:  txg_sync_thread+0x244/0x2a0 [zfs]
Oct 15 10:22:10 uran kernel:  thread_generic_wrapper+0x74/0xa0 [spl]
Oct 15 10:22:10 uran kernel:  kthread+0x104/0x130
Oct 15 10:22:10 uran kernel:  ret_from_fork+0x10/0x1c
Oct 15 10:24:10 uran kernel: INFO: task txg_sync:2094 blocked for more than 362 seconds.
Oct 15 10:24:10 uran kernel:       Tainted: P         C OE     5.4.0-1019-raspi #21-Ubuntu
Oct 15 10:24:10 uran kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 10:24:10 uran kernel: txg_sync        D    0  2094      2 0x00000028
Oct 15 10:24:10 uran kernel: Call trace:
Oct 15 10:24:10 uran kernel:  __switch_to+0x104/0x170
Oct 15 10:24:10 uran kernel:  __schedule+0x314/0x810
Oct 15 10:24:10 uran kernel:  schedule+0x48/0xe8
Oct 15 10:24:10 uran kernel:  cv_wait_common+0x12c/0x170 [spl]
Oct 15 10:24:10 uran kernel:  __cv_wait+0x30/0x40 [spl]
Oct 15 10:24:10 uran kernel:  arc_read+0x160/0xdc8 [zfs]
Oct 15 10:24:10 uran kernel:  dsl_scan_visitbp+0x37c/0x900 [zfs]
Oct 15 10:24:10 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:24:10 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:24:10 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:24:10 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:24:10 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:24:10 uran kernel:  dsl_scan_visitbp+0x5ec/0x900 [zfs]
Oct 15 10:24:10 uran kernel:  dsl_scan_visit_rootbp+0xcc/0x110 [zfs]
Oct 15 10:24:10 uran kernel:  dsl_scan_visitds+0x140/0x460 [zfs]
Oct 15 10:24:10 uran kernel:  dsl_scan_visit+0xb4/0x290 [zfs]
Oct 15 10:24:10 uran kernel:  dsl_scan_sync+0x3f8/0x7d8 [zfs]
Oct 15 10:24:10 uran kernel:  spa_sync_iterate_to_convergence+0x124/0x1e8 [zfs]
Oct 15 10:24:10 uran kernel:  spa_sync+0x2ec/0x520 [zfs]
Oct 15 10:24:10 uran kernel:  txg_sync_thread+0x244/0x2a0 [zfs]
Oct 15 10:24:10 uran kernel:  thread_generic_wrapper+0x74/0xa0 [spl]
Oct 15 10:24:10 uran kernel:  kthread+0x104/0x130
Oct 15 10:24:10 uran kernel:  ret_from_fork+0x10/0x1c
Oct 15 10:25:01 uran CRON[3383262]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 15 10:25:01 uran CRON[3383263]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Oct 15 10:25:01 uran CRON[3383262]: pam_unix(cron:session): session closed for user root
Oct 15 10:26:11 uran kernel: INFO: task txg_sync:2094 blocked for more than 483 seconds.
Oct 15 10:26:11 uran kernel:       Tainted: P         C OE     5.4.0-1019-raspi #21-Ubuntu
Oct 15 10:26:11 uran kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 10:26:11 uran kernel: txg_sync        D    0  2094      2 0x00000028
Oct 15 10:26:11 uran kernel: Call trace:
Oct 15 10:26:11 uran kernel:  __switch_to+0x104/0x170
Oct 15 10:26:11 uran kernel:  __schedule+0x314/0x810
Oct 15 10:26:11 uran kernel:  schedule+0x48/0xe8
Oct 15 10:26:11 uran kernel:  cv_wait_common+0x12c/0x170 [spl]
Oct 15 10:26:11 uran kernel:  __cv_wait+0x30/0x40 [spl]
Oct 15 10:26:11 uran kernel:  arc_read+0x160/0xdc8 [zfs]
Oct 15 10:26:11 uran kernel:  dsl_scan_visitbp+0x37c/0x900 [zfs]
Oct 15 10:26:11 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:26:11 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:26:11 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:26:11 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:26:11 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:26:11 uran kernel:  dsl_scan_visitbp+0x5ec/0x900 [zfs]
Oct 15 10:26:11 uran kernel:  dsl_scan_visit_rootbp+0xcc/0x110 [zfs]
Oct 15 10:26:11 uran kernel:  dsl_scan_visitds+0x140/0x460 [zfs]
Oct 15 10:26:11 uran kernel:  dsl_scan_visit+0xb4/0x290 [zfs]
Oct 15 10:26:11 uran kernel:  dsl_scan_sync+0x3f8/0x7d8 [zfs]
Oct 15 10:26:11 uran kernel:  spa_sync_iterate_to_convergence+0x124/0x1e8 [zfs]
Oct 15 10:26:11 uran kernel:  spa_sync+0x2ec/0x520 [zfs]
Oct 15 10:26:11 uran kernel:  txg_sync_thread+0x244/0x2a0 [zfs]
Oct 15 10:26:11 uran kernel:  thread_generic_wrapper+0x74/0xa0 [spl]
Oct 15 10:26:11 uran kernel:  kthread+0x104/0x130
Oct 15 10:26:11 uran kernel:  ret_from_fork+0x10/0x1c
Oct 15 10:28:12 uran kernel: INFO: task txg_sync:2094 blocked for more than 604 seconds.
Oct 15 10:28:12 uran kernel:       Tainted: P         C OE     5.4.0-1019-raspi #21-Ubuntu
Oct 15 10:28:12 uran kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 10:28:12 uran kernel: txg_sync        D    0  2094      2 0x00000028
Oct 15 10:28:12 uran kernel: Call trace:
Oct 15 10:28:12 uran kernel:  __switch_to+0x104/0x170
Oct 15 10:28:12 uran kernel:  __schedule+0x314/0x810
Oct 15 10:28:12 uran kernel:  schedule+0x48/0xe8
Oct 15 10:28:12 uran kernel:  cv_wait_common+0x12c/0x170 [spl]
Oct 15 10:28:12 uran kernel:  __cv_wait+0x30/0x40 [spl]
Oct 15 10:28:12 uran kernel:  arc_read+0x160/0xdc8 [zfs]
Oct 15 10:28:12 uran kernel:  dsl_scan_visitbp+0x37c/0x900 [zfs]
Oct 15 10:28:12 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:28:12 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:28:12 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:28:12 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:28:12 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:28:12 uran kernel:  dsl_scan_visitbp+0x5ec/0x900 [zfs]
Oct 15 10:28:12 uran kernel:  dsl_scan_visit_rootbp+0xcc/0x110 [zfs]
Oct 15 10:28:12 uran kernel:  dsl_scan_visitds+0x140/0x460 [zfs]
Oct 15 10:28:12 uran kernel:  dsl_scan_visit+0xb4/0x290 [zfs]
Oct 15 10:28:12 uran kernel:  dsl_scan_sync+0x3f8/0x7d8 [zfs]
Oct 15 10:28:12 uran kernel:  spa_sync_iterate_to_convergence+0x124/0x1e8 [zfs]
Oct 15 10:28:12 uran kernel:  spa_sync+0x2ec/0x520 [zfs]
Oct 15 10:28:12 uran kernel:  txg_sync_thread+0x244/0x2a0 [zfs]
Oct 15 10:28:12 uran kernel:  thread_generic_wrapper+0x74/0xa0 [spl]
Oct 15 10:28:12 uran kernel:  kthread+0x104/0x130
Oct 15 10:28:12 uran kernel:  ret_from_fork+0x10/0x1c
...
Oct 15 10:30:13 uran kernel: INFO: task txg_sync:2094 blocked for more than 724 seconds.
Oct 15 10:30:13 uran kernel:       Tainted: P         C OE     5.4.0-1019-raspi #21-Ubuntu
Oct 15 10:30:13 uran kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 10:30:13 uran kernel: txg_sync        D    0  2094      2 0x00000028
Oct 15 10:30:13 uran kernel: Call trace:
Oct 15 10:30:13 uran kernel:  __switch_to+0x104/0x170
Oct 15 10:30:13 uran kernel:  __schedule+0x314/0x810
Oct 15 10:30:13 uran kernel:  schedule+0x48/0xe8
Oct 15 10:30:13 uran kernel:  cv_wait_common+0x12c/0x170 [spl]
Oct 15 10:30:13 uran kernel:  __cv_wait+0x30/0x40 [spl]
Oct 15 10:30:13 uran kernel:  arc_read+0x160/0xdc8 [zfs]
Oct 15 10:30:13 uran kernel:  dsl_scan_visitbp+0x37c/0x900 [zfs]
Oct 15 10:30:13 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:30:13 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:30:13 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:30:13 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:30:13 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:30:13 uran kernel:  dsl_scan_visitbp+0x5ec/0x900 [zfs]
Oct 15 10:30:13 uran kernel:  dsl_scan_visit_rootbp+0xcc/0x110 [zfs]
Oct 15 10:30:13 uran kernel:  dsl_scan_visitds+0x140/0x460 [zfs]
Oct 15 10:30:13 uran kernel:  dsl_scan_visit+0xb4/0x290 [zfs]
Oct 15 10:30:13 uran kernel:  dsl_scan_sync+0x3f8/0x7d8 [zfs]
Oct 15 10:30:13 uran kernel:  spa_sync_iterate_to_convergence+0x124/0x1e8 [zfs]
Oct 15 10:30:13 uran kernel:  spa_sync+0x2ec/0x520 [zfs]
Oct 15 10:30:13 uran kernel:  txg_sync_thread+0x244/0x2a0 [zfs]
Oct 15 10:30:13 uran kernel:  thread_generic_wrapper+0x74/0xa0 [spl]
Oct 15 10:30:13 uran kernel:  kthread+0x104/0x130
Oct 15 10:30:13 uran kernel:  ret_from_fork+0x10/0x1c
Oct 15 10:32:14 uran kernel: INFO: task txg_sync:2094 blocked for more than 845 seconds.
Oct 15 10:32:14 uran kernel:       Tainted: P         C OE     5.4.0-1019-raspi #21-Ubuntu
Oct 15 10:32:14 uran kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 10:32:14 uran kernel: txg_sync        D    0  2094      2 0x00000028
Oct 15 10:32:14 uran kernel: Call trace:
Oct 15 10:32:14 uran kernel:  __switch_to+0x104/0x170
Oct 15 10:32:14 uran kernel:  __schedule+0x314/0x810
Oct 15 10:32:14 uran kernel:  schedule+0x48/0xe8
Oct 15 10:32:14 uran kernel:  cv_wait_common+0x12c/0x170 [spl]
Oct 15 10:32:14 uran kernel:  __cv_wait+0x30/0x40 [spl]
Oct 15 10:32:14 uran kernel:  arc_read+0x160/0xdc8 [zfs]
Oct 15 10:32:14 uran kernel:  dsl_scan_visitbp+0x37c/0x900 [zfs]
Oct 15 10:32:14 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:32:14 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:32:14 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:32:14 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:32:14 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:32:14 uran kernel:  dsl_scan_visitbp+0x5ec/0x900 [zfs]
Oct 15 10:32:14 uran kernel:  dsl_scan_visit_rootbp+0xcc/0x110 [zfs]
Oct 15 10:32:14 uran kernel:  dsl_scan_visitds+0x140/0x460 [zfs]
Oct 15 10:32:14 uran kernel:  dsl_scan_visit+0xb4/0x290 [zfs]
Oct 15 10:32:14 uran kernel:  dsl_scan_sync+0x3f8/0x7d8 [zfs]
Oct 15 10:32:14 uran kernel:  spa_sync_iterate_to_convergence+0x124/0x1e8 [zfs]
Oct 15 10:32:14 uran kernel:  spa_sync+0x2ec/0x520 [zfs]
Oct 15 10:32:14 uran kernel:  txg_sync_thread+0x244/0x2a0 [zfs]
Oct 15 10:32:14 uran kernel:  thread_generic_wrapper+0x74/0xa0 [spl]
Oct 15 10:32:14 uran kernel:  kthread+0x104/0x130
Oct 15 10:32:14 uran kernel:  ret_from_fork+0x10/0x1c
Oct 15 10:34:15 uran kernel: INFO: task txg_sync:2094 blocked for more than 966 seconds.
Oct 15 10:34:15 uran kernel:       Tainted: P         C OE     5.4.0-1019-raspi #21-Ubuntu
Oct 15 10:34:15 uran kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 10:34:15 uran kernel: txg_sync        D    0  2094      2 0x00000028
Oct 15 10:34:15 uran kernel: Call trace:
Oct 15 10:34:15 uran kernel:  __switch_to+0x104/0x170
Oct 15 10:34:15 uran kernel:  __schedule+0x314/0x810
Oct 15 10:34:15 uran kernel:  schedule+0x48/0xe8
Oct 15 10:34:15 uran kernel:  cv_wait_common+0x12c/0x170 [spl]
Oct 15 10:34:15 uran kernel:  __cv_wait+0x30/0x40 [spl]
Oct 15 10:34:15 uran kernel:  arc_read+0x160/0xdc8 [zfs]
Oct 15 10:34:15 uran kernel:  dsl_scan_visitbp+0x37c/0x900 [zfs]
Oct 15 10:34:15 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:34:15 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:34:15 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:34:15 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:34:15 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:34:15 uran kernel:  dsl_scan_visitbp+0x5ec/0x900 [zfs]
Oct 15 10:34:15 uran kernel:  dsl_scan_visit_rootbp+0xcc/0x110 [zfs]
Oct 15 10:34:15 uran kernel:  dsl_scan_visitds+0x140/0x460 [zfs]
Oct 15 10:34:15 uran kernel:  dsl_scan_visit+0xb4/0x290 [zfs]
Oct 15 10:34:15 uran kernel:  dsl_scan_sync+0x3f8/0x7d8 [zfs]
Oct 15 10:34:15 uran kernel:  spa_sync_iterate_to_convergence+0x124/0x1e8 [zfs]
Oct 15 10:34:15 uran kernel:  spa_sync+0x2ec/0x520 [zfs]
Oct 15 10:34:15 uran kernel:  txg_sync_thread+0x244/0x2a0 [zfs]
Oct 15 10:34:15 uran kernel:  thread_generic_wrapper+0x74/0xa0 [spl]
Oct 15 10:34:15 uran kernel:  kthread+0x104/0x130
Oct 15 10:34:15 uran kernel:  ret_from_fork+0x10/0x1c
Oct 15 10:35:01 uran CRON[3386234]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 15 10:35:01 uran CRON[3386235]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Oct 15 10:35:01 uran CRON[3386234]: pam_unix(cron:session): session closed for user root
Oct 15 10:36:15 uran kernel: INFO: task txg_sync:2094 blocked for more than 1087 seconds.
Oct 15 10:36:15 uran kernel:       Tainted: P         C OE     5.4.0-1019-raspi #21-Ubuntu
Oct 15 10:36:15 uran kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 10:36:15 uran kernel: txg_sync        D    0  2094      2 0x00000028
Oct 15 10:36:15 uran kernel: Call trace:
Oct 15 10:36:15 uran kernel:  __switch_to+0x104/0x170
Oct 15 10:36:15 uran kernel:  __schedule+0x314/0x810
Oct 15 10:36:15 uran kernel:  schedule+0x48/0xe8
Oct 15 10:36:15 uran kernel:  cv_wait_common+0x12c/0x170 [spl]
Oct 15 10:36:15 uran kernel:  __cv_wait+0x30/0x40 [spl]
Oct 15 10:36:15 uran kernel:  arc_read+0x160/0xdc8 [zfs]
Oct 15 10:36:15 uran kernel:  dsl_scan_visitbp+0x37c/0x900 [zfs]
Oct 15 10:36:15 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:36:15 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:36:15 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:36:15 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:36:15 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:36:16 uran kernel:  dsl_scan_visitbp+0x5ec/0x900 [zfs]
Oct 15 10:36:16 uran kernel:  dsl_scan_visit_rootbp+0xcc/0x110 [zfs]
Oct 15 10:36:16 uran kernel:  dsl_scan_visitds+0x140/0x460 [zfs]
Oct 15 10:36:16 uran kernel:  dsl_scan_visit+0xb4/0x290 [zfs]
Oct 15 10:36:16 uran kernel:  dsl_scan_sync+0x3f8/0x7d8 [zfs]
Oct 15 10:36:16 uran kernel:  spa_sync_iterate_to_convergence+0x124/0x1e8 [zfs]
Oct 15 10:36:16 uran kernel:  spa_sync+0x2ec/0x520 [zfs]
Oct 15 10:36:16 uran kernel:  txg_sync_thread+0x244/0x2a0 [zfs]
Oct 15 10:36:16 uran kernel:  thread_generic_wrapper+0x74/0xa0 [spl]
Oct 15 10:36:16 uran kernel:  kthread+0x104/0x130
Oct 15 10:36:16 uran kernel:  ret_from_fork+0x10/0x1c
Oct 15 10:38:16 uran kernel: INFO: task txg_sync:2094 blocked for more than 1208 seconds.
Oct 15 10:38:16 uran kernel:       Tainted: P         C OE     5.4.0-1019-raspi #21-Ubuntu
Oct 15 10:38:16 uran kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 10:38:16 uran kernel: txg_sync        D    0  2094      2 0x00000028
Oct 15 10:38:16 uran kernel: Call trace:
Oct 15 10:38:16 uran kernel:  __switch_to+0x104/0x170
Oct 15 10:38:16 uran kernel:  __schedule+0x314/0x810
Oct 15 10:38:16 uran kernel:  schedule+0x48/0xe8
Oct 15 10:38:16 uran kernel:  cv_wait_common+0x12c/0x170 [spl]
Oct 15 10:38:16 uran kernel:  __cv_wait+0x30/0x40 [spl]
Oct 15 10:38:16 uran kernel:  arc_read+0x160/0xdc8 [zfs]
Oct 15 10:38:16 uran kernel:  dsl_scan_visitbp+0x37c/0x900 [zfs]
Oct 15 10:38:16 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:38:16 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:38:16 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:38:16 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:38:16 uran kernel:  dsl_scan_visitbp+0x218/0x900 [zfs]
Oct 15 10:38:16 uran kernel:  dsl_scan_visitbp+0x5ec/0x900 [zfs]
Oct 15 10:38:16 uran kernel:  dsl_scan_visit_rootbp+0xcc/0x110 [zfs]
Oct 15 10:38:16 uran kernel:  dsl_scan_visitds+0x140/0x460 [zfs]
Oct 15 10:38:16 uran kernel:  dsl_scan_visit+0xb4/0x290 [zfs]
Oct 15 10:38:16 uran kernel:  dsl_scan_sync+0x3f8/0x7d8 [zfs]
Oct 15 10:38:16 uran kernel:  spa_sync_iterate_to_convergence+0x124/0x1e8 [zfs]
Oct 15 10:38:16 uran kernel:  spa_sync+0x2ec/0x520 [zfs]
Oct 15 10:38:16 uran kernel:  txg_sync_thread+0x244/0x2a0 [zfs]
Oct 15 10:38:16 uran kernel:  thread_generic_wrapper+0x74/0xa0 [spl]
Oct 15 10:38:16 uran kernel:  kthread+0x104/0x130
Oct 15 10:38:16 uran kernel:  ret_from_fork+0x10/0x1c
...
Oct 16 06:31:14 uran zed[1687034]: eid=57 class=history_event pool_guid=0x5E9821C46223F9A5
Oct 16 06:31:15 uran zed[1687036]: eid=58 class=resilver_finish pool_guid=0x5E9821C46223F9A5
Oct 16 06:31:16 uran zed[1687043]: error: resilver_finish-notify.sh: eid=58: "mail" not installed
...
Oct 16 12:59:08 uran zed[2763536]: eid=59 class=deadman pool_guid=0x5E9821C46223F9A5 vdev_path=/dev/disk/by-id/usb-WD_My_Book_25EE_2222222222222243-0:0-part1
Oct 16 12:59:09 uran zed[2763538]: eid=60 class=deadman pool_guid=0x5E9821C46223F9A5 vdev_path=/dev/disk/by-id/usb-WD_My_Book_25EE_2222222222222243-0:0-part1
cat /proc/spl/kstat/zfs/dmu_tx
11 1 0x01 12 3264 20045022062 164369774375694
name                            type data
dmu_tx_assigned                 4    4978191
dmu_tx_delay                    4    0
dmu_tx_error                    4    0
dmu_tx_suspended                4    0
dmu_tx_group                    4    0
dmu_tx_memory_reserve           4    0
dmu_tx_memory_reclaim           4    0
dmu_tx_dirty_throttle           4    0
dmu_tx_dirty_delay              4    28002
dmu_tx_dirty_over_max           4    0
dmu_tx_dirty_frees_delay        4    7803
dmu_tx_quota                    4    0
cat /proc/spl/kstat/zfs/arcstats
12 1 0x01 98 26656 20053414543 164422259257589
name                            type data
hits                            4    54090208
misses                          4    2775485
demand_data_hits                4    2
demand_data_misses              4    2
demand_metadata_hits            4    51159488
demand_metadata_misses          4    643447
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    2930718
prefetch_metadata_misses        4    2132036
mru_hits                        4    21274026
mru_ghost_hits                  4    7435
mfu_hits                        4    30053299
mfu_ghost_hits                  4    20430
deleted                         4    2914903
mutex_miss                      4    422
access_skip                     4    307
evict_skip                      4    14622752
evict_not_enough                4    18017
evict_l2_cached                 4    0
evict_l2_eligible               4    73021932544
evict_l2_ineligible             4    198834327552
evict_l2_skip                   4    0
hash_elements                   4    122896
hash_elements_max               4    360413
hash_collisions                 4    1166586
hash_chains                     4    12288
hash_chain_max                  4    8
p                               4    1165417578
c                               4    1781544448
c_min                           4    124267008
c_max                           4    1988272128
size                            4    1724062016
compressed_size                 4    1079760384
uncompressed_size               4    4306244608
overhead_size                   4    207561728
hdr_size                        4    41035192
data_size                       4    743559168
metadata_size                   4    543762944
dbuf_size                       4    81275688
dnode_size                      4    233893344
bonus_size                      4    80535680
anon_size                       4    9024000
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    1155947520
mru_evictable_data              4    743415808
mru_evictable_metadata          4    278911488
mru_ghost_size                  4    568823808
mru_ghost_evictable_data        4    372506624
mru_ghost_evictable_metadata    4    196317184
mfu_size                        4    122350592
mfu_evictable_data              4    0
mfu_evictable_metadata          4    14849536
mfu_ghost_size                  4    427843584
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    427843584
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    928
memory_indirect_count           4    187112
memory_all_bytes                4    3976544256
memory_free_bytes               4    1212428288
memory_available_bytes          3    1150296064
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    35532
arc_meta_used                   4    980502848
arc_meta_limit                  4    1491204096
arc_dnode_limit                 4    149120409
arc_meta_max                    4    2016757248
arc_meta_min                    4    16777216
async_upgrade_sync              4    1118
demand_hit_predictive_prefetch  4    164808
demand_hit_prescient_prefetch   4    3518483
arc_need_free                   4    0
arc_sys_free                    4    62133504
arc_raw_size                    4    0
tail /proc/spl/kstat/zfs/dbgmsg
1602871076   zio.c:1919:zio_deadman_impl(): slow zio[5]: zio=ffff00007fa39d10 timestamp=139014768748992 delta=25433705563767 queued=0 io=0 path=/dev/disk/by-id/usb-WD_My_Book_25EE_2222222222222243-0:0-part1 last=139014860493029 type=2 priority=3 flags=0x180880 stage=0x100000 pipeline=0x1700000 pipeline-trace=0x100001 objset=0 object=0 level=0 blkid=41 offset=2105049370624 size=4096 error=0
1602871076   zio.c:1919:zio_deadman_impl(): slow zio[5]: zio=ffff00007fa3a6c0 timestamp=139014768739863 delta=25433705755729 queued=0 io=0 path=/dev/disk/by-id/usb-WD_My_Book_25EE_2222222222222243-0:0-part1 last=139014860493029 type=2 priority=3 flags=0x180880 stage=0x100000 pipeline=0x1700000 pipeline-trace=0x100001 objset=0 object=0 level=0 blkid=41 offset=7924248633344 size=4096 error=0
1602871076   zio.c:1919:zio_deadman_impl(): slow zio[5]: zio=ffff00006e5b3548 timestamp=139014769038474 delta=25433705642229 queued=0 io=0 path=/dev/disk/by-id/usb-WD_My_Book_25EE_2222222222222243-0:0-part1 last=139014860493029 type=2 priority=3 flags=0x180880 stage=0x100000 pipeline=0x1700000 pipeline-trace=0x100001 objset=0 object=0 level=0 blkid=2 offset=2105049366528 size=4096 error=0
1602871076   zio.c:1919:zio_deadman_impl(): slow zio[5]: zio=ffff00006e5b1360 timestamp=139014769026992 delta=25433705841137 queued=0 io=0 path=/dev/disk/by-id/usb-WD_My_Book_25EE_2222222222222243-0:0-part1 last=139014860493029 type=2 priority=3 flags=0x180880 stage=0x100000 pipeline=0x1700000 pipeline-trace=0x100001 objset=0 object=0 level=0 blkid=2 offset=7924248629248 size=4096 error=0
1602871076   zio.c:1919:zio_deadman_impl(): slow zio[5]: zio=ffff00006e5b1d10 timestamp=139014769012937 delta=25433706041247 queued=0 io=0 path=/dev/disk/by-id/usb-WD_My_Book_25EE_2222222222222243-0:0-part1 last=139014860493029 type=2 priority=3 flags=0x180880 stage=0x100000 pipeline=0x1700000 pipeline-trace=0x100001 objset=0 object=0 level=0 blkid=2 offset=2306569748480 size=4096 error=0
1602871076   zio.c:1967:zio_deadman(): zio_wait waiting for hung I/O to pool 'uranpool'
1602871080   spa_misc.c:605:spa_deadman(): slow spa_sync: started 25437 seconds ago, calls 424
1602871080   vdev.c:4670:vdev_deadman(): slow vdev: /dev/disk/by-id/usb-WD_My_Book_25EE_2222222222222243-0:0-part1 has 2 active IOs
1602871080   zio.c:1919:zio_deadman_impl(): slow zio[0]: zio=ffff0000f272e0e0 timestamp=139014769399714 delta=25437795829857 queued=0 io=139014770094326 path=/dev/disk/by-id/usb-WD_My_Book_25EE_2222222222222243-0:0-part1 last=139014860493029 type=2 priority=3 flags=0x40080c80 stage=0x100000 pipeline=0x1700000 pipeline-trace=0x100001 objset=0 object=0 level=0 blkid=0 offset=2306569768960 size=24576 error=0
1602871080   zio.c:1967:zio_deadman(): spa_deadman waiting for hung I/O to pool 'uranpool'
arc_summary -p1
------------------------------------------------------------------------
ZFS Subsystem Report                            Fri Oct 16 19:59:37 2020
Linux 5.4.0-1019-raspi                                 0.8.3-1ubuntu12.4
Machine: uran (aarch64)                                0.8.3-1ubuntu12.4

WARNING: Pages are deprecated, please use "--section"

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    86.7 %    1.6 GiB
        Target size (adaptive):                        89.6 %    1.7 GiB
        Min size (hard limit):                          6.2 %  118.5 MiB
        Max size (high water):                           16:1    1.9 GiB
        Most Frequently Used (MFU) cache size:          9.6 %  116.7 MiB
        Most Recently Used (MRU) cache size:           90.4 %    1.1 GiB
        Metadata cache size (hard limit):              75.0 %    1.4 GiB
        Metadata cache size (current):                 65.8 %  935.1 MiB
        Dnode cache size (hard limit):                 10.0 %  142.2 MiB
        Dnode cache size (current):                   156.8 %  223.1 MiB

ARC hash breakdown:
        Elements max:                                             360.4k
        Elements current:                              34.1 %     122.9k
        Collisions:                                                 1.2M
        Chain max:                                                     8
        Chains:                                                    12.3k

ARC misc:
        Deleted:                                                    2.9M
        Mutex misses:                                                422
        Eviction skips:                                            14.6M
cat /proc/meminfo
MemTotal:        3883344 kB
MemFree:          929584 kB
MemAvailable:    1299796 kB
Buffers:           38140 kB
Cached:           295272 kB
SwapCached:          620 kB
Active:           228036 kB
Inactive:         157544 kB
Active(anon):      19768 kB
Inactive(anon):    42800 kB
Active(file):     208268 kB
Inactive(file):   114744 kB
Unevictable:       17244 kB
Mlocked:           17244 kB
SwapTotal:       6291452 kB
SwapFree:        6286580 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         69140 kB
Mapped:            55508 kB
Shmem:              4104 kB
KReclaimable:      97364 kB
Slab:            1282460 kB
SReclaimable:      97364 kB
SUnreclaim:      1185096 kB
KernelStack:        4032 kB
PageTables:         2540 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8233124 kB
Committed_AS:     285468 kB
VmallocTotal:   135290159040 kB
VmallocUsed:      145128 kB
VmallocChunk:          0 kB
Percpu:             2240 kB
CmaTotal:          65536 kB
CmaFree:           62656 kB
cat /proc/spl/kstat/zfs/uranpool/txgs
It is not moving anywhere; it is stuck at the same entries. Here are the last 4:
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
...
1270487  139013638627326  C     6889984      77824        3244032      19       237      578093481    7074         71130        286069963
1270488  139014216720807  S     0            0            0            0        0        525613741    5981         57056        0
1270489  139014742334548  W     0            0            0            0        0        505985648    8592         0            0
1270490  139015248320196  O     0            0            0            0        0        0            0            0            0
morkol-dot commented 3 years ago

I have the same issue (RPi 4b 8GB with a Western Digital 2.5" external drive) and was also able to replicate it with a similar RPi 4 and a similar Western Digital 2.5" external drive, both with additional external power supplies (powered USB hub). Might it be an issue caused by that specific disk / controller? I don't have the issue with a similar Seagate 2.5" drive.
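
If the USB-SATA bridge is the suspect, one way to compare the WD and Seagate setups (a sketch; the device name is hypothetical):

lsusb                              # vendor/product IDs of the two bridge chips
lsusb -t                           # shows whether the uas or usb-storage driver is bound to each disk
smartctl -i -d sat /dev/sdX        # -d sat is often needed to talk SMART through USB enclosures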

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.