openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

deadlock in autotrim #15217

Open · clhedrick opened this issue 1 year ago

clhedrick commented 1 year ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu |
| Distribution Version | 22.04 LTS |
| Kernel Version | 5.15.0-76-generic |
| Architecture | x86_64 |
| OpenZFS Version | zfs-2.1.5-1ubuntu6~22.04.1 |

Describe the problem you're observing

We installed a new pool. 2 mirrored pairs (4 disks), Samsung MZ7L37T6HBLA-00A07 (7.68 TB SATA SSD).

The pool was set to autotrim.
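For reference, enabling the property looks like the following; the pool name `tank` is a stand-in, since the report does not give the actual pool name:

```shell
# Hypothetical pool name "tank" (not stated in the report).
zpool set autotrim=on tank   # enable automatic TRIM of freed space
zpool get autotrim tank      # confirm the property value
```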

Copied a bunch of data into it. While copying, the pool hung:

Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937841] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937854] task:txg_sync        state:D stack:    0 pid:811157 ppid:     2 flags:0x00004000
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937857] Call Trace:
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937858]  <TASK>
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937861]  __schedule+0x24e/0x590
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937866]  schedule+0x69/0x110
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937868]  schedule_timeout+0x87/0x140
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937872]  ? zio_issue_async+0x12/0x20 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937942]  ? __bpf_trace_tick_stop+0x20/0x20
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937945]  io_schedule_timeout+0x51/0x80
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937947]  __cv_timedwait_common+0x12c/0x170 [spl]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937952]  ? wait_woken+0x70/0x70
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937954]  __cv_timedwait_io+0x19/0x20 [spl]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937958]  zio_wait+0x116/0x220 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938008]  dsl_pool_sync+0xb6/0x400 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938052]  ? __mod_timer+0x214/0x400
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938054]  spa_sync_iterate_to_convergence+0xe0/0x1f0 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938101]  spa_sync+0x2dc/0x5b0 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938142]  txg_sync_thread+0x266/0x2f0 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938190]  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938232]  thread_generic_wrapper+0x64/0x80 [spl]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938237]  ? __thread_exit+0x20/0x20 [spl]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938242]  kthread+0x12a/0x150
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938244]  ? set_kthread_struct+0x50/0x50
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938245]  ret_from_fork+0x22/0x30
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938248]  </TASK>
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938254] INFO: task vdev_autotrim:1940159 blocked for more than 120 seconds.
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938266]       Tainted: P           O      5.15.0-76-generic #83-Ubuntu

and similar traces followed for other tasks.

Turned off autotrim. The problem went away.

I ran "zpool trim" manually while doing the same copy. It finished without a problem.

I've disabled autotrim and will use Ubuntu's normal monthly scheduled trim instead.
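The workaround described above amounts to something like the following sketch (again with the stand-in pool name `tank`):

```shell
# Hypothetical pool name "tank". Disable automatic TRIM...
zpool set autotrim=off tank
# ...and fall back to a one-off (or periodically scheduled) manual TRIM.
zpool trim tank
# Progress of a running TRIM can be watched per vdev:
zpool status -t tank
```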

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

hpannenb commented 11 months ago

Same issue here on a Proxmox 8.1 machine (Debian 12 based):

Turning on autotrim on two pools increases the system load due to vdev_autotrim kernel threads stuck in "blocked" (D) state (see e.g. https://forum.proxmox.com/threads/proxmox-ve-8-1-released.136960/page-5#post-609592 ).

[ 8218.231630] INFO: task vdev_autotrim:133705 blocked for more than 362 seconds.
[ 8218.232450]       Tainted: P          IO       6.5.11-6-pve #1
[ 8218.233156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8218.233570] task:vdev_autotrim   state:D stack:0     pid:133705 ppid:2      flags:0x00004000
[ 8218.234422] Call Trace:
[ 8218.234558]  <TASK>
[ 8218.235053]  __schedule+0x3fd/0x1450
[ 8218.235275]  ? __wake_up_common_lock+0x8b/0xd0
[ 8218.235913]  schedule+0x63/0x110
[ 8218.236482]  cv_wait_common+0x109/0x140 [spl]
[ 8218.237144]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 8218.237437]  __cv_wait+0x15/0x30 [spl]
[ 8218.237649]  vdev_autotrim_thread+0x797/0x9a0 [zfs]
[ 8218.238532]  ? __pfx_vdev_autotrim_thread+0x10/0x10 [zfs]
[ 8218.239042]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 8218.239339]  thread_generic_wrapper+0x5f/0x70 [spl]
[ 8218.240000]  kthread+0xf2/0x120
[ 8218.240580]  ? __pfx_kthread+0x10/0x10
[ 8218.240844]  ret_from_fork+0x47/0x70
[ 8218.241050]  ? __pfx_kthread+0x10/0x10
[ 8218.241249]  ret_from_fork_asm+0x1b/0x30
[ 8218.241451]  </TASK>
[ 8218.241583] INFO: task vdev_autotrim:133957 blocked for more than 362 seconds.
[ 8218.242346]       Tainted: P          IO       6.5.11-6-pve #1
[ 8218.243038] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8218.243458] task:vdev_autotrim   state:D stack:0     pid:133957 ppid:2      flags:0x00004000
[ 8218.244294] Call Trace:
[ 8218.244440]  <TASK>
[ 8218.244926]  __schedule+0x3fd/0x1450
[ 8218.245146]  ? __wake_up_common_lock+0x8b/0xd0
[ 8218.245772]  schedule+0x63/0x110
[ 8218.246504]  cv_wait_common+0x109/0x140 [spl]
[ 8218.247378]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 8218.247698]  __cv_wait+0x15/0x30 [spl]
[ 8218.247960]  vdev_autotrim_thread+0x797/0x9a0 [zfs]
[ 8218.248969]  ? __pfx_vdev_autotrim_thread+0x10/0x10 [zfs]
[ 8218.249616]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 8218.249938]  thread_generic_wrapper+0x5f/0x70 [spl]
[ 8218.250584]  kthread+0xf2/0x120
[ 8218.251172]  ? __pfx_kthread+0x10/0x10
[ 8218.251396]  ret_from_fork+0x47/0x70
[ 8218.251608]  ? __pfx_kthread+0x10/0x10
[ 8218.251808]  ret_from_fork_asm+0x1b/0x30
[ 8218.252014]  </TASK>
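As a quick way to confirm that threads like these are stuck in uninterruptible sleep (state D, which is what inflates the load average), something along these lines can be used; the grep pattern is an assumption based on the messages above:

```shell
# List vdev_autotrim kernel threads with their scheduler state;
# "D" means uninterruptible (disk) sleep.
ps -eo pid,stat,comm | awk '$3 == "vdev_autotrim"'

# Pull the matching hung-task reports out of the kernel log.
# "|| true" keeps the pipeline from failing when there are no matches
# or when dmesg access is restricted.
dmesg 2>/dev/null | grep -E 'task vdev_autotrim:[0-9]+ blocked' || true
```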