Closed talex5 closed 3 years ago
[1325170.432411] INFO: task ocluster-worker:3232955 blocked for more than 120 seconds.
[1325170.440783] Not tainted 5.4.0-56-generic #62-Ubuntu
[1325170.447046] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1325170.455796] ocluster-worker D 0 3232955 1 0x00004000
[1325170.455802] Call Trace:
[1325170.455813] __schedule+0x2e3/0x740
[1325170.455818] ? __switch_to_asm+0x34/0x70
[1325170.455825] ? __switch_to_asm+0x40/0x70
[1325170.455829] schedule+0x42/0xb0
[1325170.455833] schedule_preempt_disabled+0xe/0x10
[1325170.455836] __mutex_lock.isra.0+0x182/0x4f0
[1325170.455840] __mutex_lock_slowpath+0x13/0x20
[1325170.455843] mutex_lock+0x2e/0x40
[1325170.455888] btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
[1325170.455907] ? __next_timer_interrupt+0xe0/0xe0
[1325170.455957] flush_space+0x60b/0x770 [btrfs]
[1325170.456003] ? __clear_extent_bit+0x201/0x4a0 [btrfs]
[1325170.456059] ? get_alloc_profile+0x7f/0x180 [btrfs]
[1325170.456112] priority_reclaim_metadata_space.isra.0+0xbf/0x1e0 [btrfs]
[1325170.456154] btrfs_reserve_metadata_bytes+0x608/0x950 [btrfs]
[1325170.456193] ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[1325170.456231] btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
[1325170.456266] evict_refill_and_join+0x39/0xd0 [btrfs]
[1325170.456299] btrfs_evict_inode+0x417/0x4c0 [btrfs]
[1325170.456306] evict+0xd2/0x1b0
[1325170.456309] iput+0x148/0x210
[1325170.456314] dentry_unlink_inode+0xc6/0x110
[1325170.456317] d_delete+0x76/0x80
[1325170.456321] vfs_rmdir+0x179/0x1a0
[1325170.456325] do_rmdir+0x18c/0x1c0
[1325170.456332] __x64_sys_rmdir+0x17/0x20
[1325170.456337] do_syscall_64+0x57/0x190
[1325170.456341] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1325170.456348] RIP: 0033:0x7f613970ee9b
[1325170.456357] Code: Bad RIP value.
[1325170.456359] RSP: 002b:00007f5d497f9e28 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
[1325170.456367] RAX: ffffffffffffffda RBX: 000055bf722616b0 RCX: 00007f613970ee9b
[1325170.456376] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055bf72261718
[1325170.456384] RBP: 000055bf722616d8 R08: 0000000000000000 R09: 0000000000000000
[1325170.456392] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff0790c36e
[1325170.456403] R13: 00007fff0790c36f R14: 00007fff0790c400 R15: 00007f5d497f9fc0
[1325170.456418] INFO: task ocluster-worker:3232958 blocked for more than 120 seconds.
[1325170.464845] Not tainted 5.4.0-56-generic #62-Ubuntu
[1325170.471160] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1325170.479952] ocluster-worker D 0 3232958 1 0x00004000
[1325170.479956] Call Trace:
[1325170.479961] __schedule+0x2e3/0x740
[1325170.480001] ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[1325170.480020] schedule+0x42/0xb0
[1325170.480031] schedule_preempt_disabled+0xe/0x10
[1325170.480035] __mutex_lock.isra.0+0x182/0x4f0
[1325170.480038] ? ktime_get+0x3e/0xa0
[1325170.480042] __mutex_lock_slowpath+0x13/0x20
[1325170.480046] mutex_lock+0x2e/0x40
[1325170.480079] btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
[1325170.480126] flush_space+0x60b/0x770 [btrfs]
[1325170.480171] ? __clear_extent_bit+0x201/0x4a0 [btrfs]
[1325170.480216] ? get_alloc_profile+0x7f/0x180 [btrfs]
[1325170.480253] priority_reclaim_metadata_space.isra.0+0xbf/0x1e0 [btrfs]
[1325170.480297] btrfs_reserve_metadata_bytes+0x608/0x950 [btrfs]
[1325170.480337] ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[1325170.480373] btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
[1325170.480407] evict_refill_and_join+0x39/0xd0 [btrfs]
[1325170.480454] btrfs_evict_inode+0x417/0x4c0 [btrfs]
[1325170.480469] evict+0xd2/0x1b0
[1325170.480473] iput+0x148/0x210
[1325170.480480] dentry_unlink_inode+0xc6/0x110
[1325170.480483] d_delete+0x76/0x80
[1325170.480487] vfs_rmdir+0x179/0x1a0
[1325170.480493] do_rmdir+0x18c/0x1c0
[1325170.480498] __x64_sys_rmdir+0x17/0x20
[1325170.480502] do_syscall_64+0x57/0x190
[1325170.480506] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1325170.480517] RIP: 0033:0x7f613970ee9b
[1325170.480526] Code: Bad RIP value.
[1325170.480535] RSP: 002b:00007f5d3b7fde28 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
[1325170.480551] RAX: ffffffffffffffda RBX: 000055bf72a15b60 RCX: 00007f613970ee9b
[1325170.480562] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055bf72a15bc8
[1325170.480564] RBP: 000055bf72a15b88 R08: 0000000000000000 R09: 0000000000000004
[1325170.480565] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff0790c34e
[1325170.480567] R13: 00007fff0790c34f R14: 00007fff0790c3e0 R15: 00007f5d3b7fdfc0
Some PPC lock-ups today.
prothoe:
[5723372.036165] INFO: task kworker/u160:18:3878172 blocked for more than 120 seconds.
[5723372.036200] Not tainted 5.4.0-53-generic #59-Ubuntu
[5723372.036336] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5723372.036355] kworker/u160:18 D 0 3878172 2 0x00000808
[5723372.036392] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[5723372.036393] Call Trace:
[5723372.036397] [c000000a65f872d0] [c000000a65f87380] 0xc000000a65f87380 (unreliable)
[5723372.036403] [c000000a65f874b0] [c000000000021a1c] __switch_to+0x2dc/0x450
[5723372.036407] [c000000a65f87520] [c000000000eebd7c] __schedule+0x2ec/0x930
[5723372.036409] [c000000a65f87600] [c000000000eec418] schedule+0x58/0x130
[5723372.036412] [c000000a65f87630] [c000000000eeca30] schedule_preempt_disabled+0x20/0x30
[5723372.036415] [c000000a65f87650] [c000000000eeed48] __mutex_lock.isra.0+0x2b8/0x790
[5723372.036443] [c000000a65f87710] [c008000001e112c0] btrfs_start_delalloc_roots+0x78/0x3f0 [btrfs]
[5723372.036471] [c000000a65f877c0] [c008000001eb72ec] flush_space+0x694/0xa80 [btrfs]
[5723372.036498] [c000000a65f878b0] [c008000001eb7b0c] priority_reclaim_metadata_space+0x234/0x438 [btrfs]
[5723372.036525] [c000000a65f87920] [c008000001eb96b0] btrfs_reserve_metadata_bytes+0x758/0xb98 [btrfs]
[5723372.036552] [c000000a65f87a30] [c008000001eba644] btrfs_block_rsv_refill+0xdc/0x160 [btrfs]
[5723372.036579] [c000000a65f87a70] [c008000001df8258] evict_refill_and_join+0x50/0x158 [btrfs]
[5723372.036609] [c000000a65f87af0] [c008000001e09b18] btrfs_evict_inode+0x510/0x620 [btrfs]
[5723372.036612] [c000000a65f87b80] [c0000000004aa330] evict+0x100/0x250
[5723372.036640] [c000000a65f87bc0] [c008000001df8db4] btrfs_run_delalloc_work+0x4c/0xc0 [btrfs]
[5723372.036668] [c000000a65f87bf0] [c008000001e4288c] btrfs_work_helper+0x144/0x5f8 [btrfs]
[5723372.036672] [c000000a65f87c80] [c00000000016877c] process_one_work+0x29c/0x550
[5723372.036674] [c000000a65f87d10] [c000000000168ac8] worker_thread+0x98/0x5f0
[5723372.036677] [c000000a65f87db0] [c000000000172de4] kthread+0x1a4/0x1b0
[5723372.036680] [c000000a65f87e20] [c00000000000b648] ret_from_kernel_thread+0x5c/0x74
pisto:
[4551456.239823] INFO: task rm:649833 blocked for more than 120 seconds.
[4551456.239872] Not tainted 5.4.0-54-generic #60-Ubuntu
[4551456.239880] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4551456.239891] rm D 0 649833 649832 0x00040400
[4551456.239893] Call Trace:
[4551456.239894] [c000002aeb9234d0] [c000002aeb923500] 0xc000002aeb923500 (unreliable)
[4551456.239896] [c000002aeb9236b0] [c000000000021a1c] __switch_to+0x2dc/0x450
[4551456.239897] [c000002aeb923720] [c000000000eebd7c] __schedule+0x2ec/0x930
[4551456.239899] [c000002aeb923800] [c000000000eec418] schedule+0x58/0x130
[4551456.239900] [c000002aeb923830] [c0000000004ccaa4] wb_wait_for_completion+0xa4/0xf0
[4551456.239902] [c000002aeb923890] [c0000000004ccf84] __writeback_inodes_sb_nr+0xb4/0xf0
[4551456.239917] [c000002aeb923910] [c0080000049271b8] flush_space+0x560/0xa80 [btrfs]
[4551456.239933] [c000002aeb923a00] [c008000004927b0c] priority_reclaim_metadata_space+0x234/0x438 [btrfs]
[4551456.239949] [c000002aeb923a70] [c0080000049296b0] btrfs_reserve_metadata_bytes+0x758/0xb98 [btrfs]
[4551456.239964] [c000002aeb923b80] [c00800000492a644] btrfs_block_rsv_refill+0xdc/0x160 [btrfs]
[4551456.239980] [c000002aeb923bc0] [c008000004868258] evict_refill_and_join+0x50/0x158 [btrfs]
[4551456.239996] [c000002aeb923c40] [c008000004879b18] btrfs_evict_inode+0x510/0x620 [btrfs]
[4551456.239997] [c000002aeb923cd0] [c0000000004aa330] evict+0x100/0x250
[4551456.239999] [c000002aeb923d10] [c0000000004a07b8] dentry_unlink_inode+0x148/0x210
[4551456.240000] [c000002aeb923d40] [c00000000048e9e8] vfs_rmdir+0x228/0x270
[4551456.240002] [c000002aeb923d80] [c0000000004965a0] do_rmdir+0x270/0x2a0
[4551456.240003] [c000002aeb923e20] [c00000000000b278] system_call+0x5c/0x68
Feb 02 22:08:09 hipp kernel: INFO: task kworker/u144:3:1661147 blocked for more than 120 seconds.
Feb 02 22:08:09 hipp kernel: Not tainted 5.4.0-54-generic #60-Ubuntu
Feb 02 22:08:09 hipp kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 02 22:08:09 hipp kernel: kworker/u144:3 D 0 1661147 2 0x80004000
Feb 02 22:08:09 hipp kernel: Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
Feb 02 22:08:09 hipp kernel: Call Trace:
Feb 02 22:08:09 hipp kernel: __schedule+0x2e3/0x740
Feb 02 22:08:09 hipp kernel: ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
Feb 02 22:08:09 hipp kernel: schedule+0x42/0xb0
Feb 02 22:08:09 hipp kernel: schedule_preempt_disabled+0xe/0x10
Feb 02 22:08:09 hipp kernel: __mutex_lock.isra.0+0x182/0x4f0
Feb 02 22:08:09 hipp kernel: ? ktime_get+0x3e/0xa0
Feb 02 22:08:09 hipp kernel: __mutex_lock_slowpath+0x13/0x20
Feb 02 22:08:09 hipp kernel: mutex_lock+0x2e/0x40
Feb 02 22:08:09 hipp kernel: btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
Feb 02 22:08:09 hipp kernel: flush_space+0x60b/0x770 [btrfs]
Feb 02 22:08:09 hipp kernel: ? extent_io_tree_init+0x30/0x40 [btrfs]
Feb 02 22:08:09 hipp kernel: ? __clear_extent_bit+0x201/0x4a0 [btrfs]
Feb 02 22:08:09 hipp kernel: priority_reclaim_metadata_space.isra.0+0x18b/0x220 [btrfs]
Feb 02 22:08:09 hipp kernel: ? can_overcommit.part.0+0x5f/0xc0 [btrfs]
Feb 02 22:08:09 hipp kernel: btrfs_reserve_metadata_bytes+0x5ea/0x970 [btrfs]
Feb 02 22:08:09 hipp kernel: ? btrfs_truncate_inode_items+0x35e/0xdd0 [btrfs]
Feb 02 22:08:09 hipp kernel: ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
Feb 02 22:08:09 hipp kernel: btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
Feb 02 22:08:09 hipp kernel: evict_refill_and_join+0x39/0xd0 [btrfs]
Feb 02 22:08:09 hipp kernel: btrfs_evict_inode+0x417/0x4c0 [btrfs]
Feb 02 22:08:09 hipp kernel: evict+0xd2/0x1b0
Feb 02 22:08:09 hipp kernel: iput+0x148/0x210
Feb 02 22:08:09 hipp kernel: btrfs_run_delalloc_work+0x31/0x50 [btrfs]
Feb 02 22:08:09 hipp kernel: btrfs_work_helper+0xc1/0x3a0 [btrfs]
Feb 02 22:08:09 hipp kernel: process_one_work+0x1eb/0x3b0
Feb 02 22:08:09 hipp kernel: worker_thread+0x4d/0x400
Feb 02 22:08:09 hipp kernel: kthread+0x104/0x140
Feb 02 22:08:09 hipp kernel: ? process_one_work+0x3b0/0x3b0
Feb 02 22:08:09 hipp kernel: ? kthread_park+0x90/0x90
Feb 02 22:08:09 hipp kernel: ret_from_fork+0x35/0x40
[836031.482376] INFO: task ocluster-worker:168425 blocked for more than 120 seconds.
[836031.484793] Not tainted 5.4.0-65-generic #73-Ubuntu
[836031.487131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[836031.489458] ocluster-worker D 0 168425 1 0x00004000
[836031.489463] Call Trace:
[836031.489475] __schedule+0x2e3/0x740
[836031.489480] ? __switch_to_asm+0x34/0x70
[836031.489483] ? __switch_to_asm+0x40/0x70
[836031.489487] schedule+0x42/0xb0
[836031.489490] schedule_preempt_disabled+0xe/0x10
[836031.489493] __mutex_lock.isra.0+0x182/0x4f0
[836031.489498] __mutex_lock_slowpath+0x13/0x20
[836031.489500] mutex_lock+0x2e/0x40
[836031.489542] btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
[836031.489547] ? __next_timer_interrupt+0xe0/0xe0
[836031.489589] flush_space+0x60b/0x770 [btrfs]
[836031.489629] ? get_alloc_profile+0x7f/0x180 [btrfs]
[836031.489667] priority_reclaim_metadata_space.isra.0+0xbf/0x1e0 [btrfs]
[836031.489705] btrfs_reserve_metadata_bytes+0x608/0x950 [btrfs]
[836031.489742] ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[836031.489778] btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
[836031.489810] evict_refill_and_join+0x39/0xd0 [btrfs]
[836031.489843] btrfs_evict_inode+0x417/0x4c0 [btrfs]
[836031.489850] evict+0xd2/0x1b0
[836031.489853] iput+0x148/0x210
[836031.489857] dentry_unlink_inode+0xc6/0x110
[836031.489860] d_delete+0x76/0x80
[836031.489864] vfs_rmdir+0x179/0x1a0
[836031.489867] do_rmdir+0x18c/0x1c0
[836031.489870] __x64_sys_rmdir+0x17/0x20
[836031.489874] do_syscall_64+0x57/0x190
[836031.489878] entry_SYSCALL_64_after_hwframe+0x44/0xa9
After ~30 hours of testing on pisto it looks like upgrading the linux kernel to at least 5.11.3 is the solution.
Just the kernel, or the increasing the swap as well?
just the kernel
That's great! How are you installing the new kernel? We should add the commands to the Ansible playbook.
x86-bm-12:
[712061.425998] ocamlc.opt D 0 3058109 3058074 0x00000320
[712061.426002] Call Trace:
[712061.426012] __schedule+0x2e3/0x740
[712061.426038] ? btrfs_get_token_32+0x63/0xf0 [btrfs]
[712061.426040] schedule+0x42/0xb0
[712061.426044] rwsem_down_write_slowpath+0x244/0x4d0
[712061.426046] down_write+0x41/0x50
[712061.426050] do_last+0x2b1/0x900
[712061.426053] ? __alloc_file+0x94/0x110
[712061.426056] ? update_curr+0x112/0x1e0
[712061.426057] path_openat+0x8d/0x290
[712061.426059] do_filp_open+0x91/0x100
[712061.426063] ? __seccomp_filter+0x7e/0x690
[712061.426066] ? __alloc_fd+0x46/0x150
[712061.426067] do_sys_open+0x17e/0x290
[712061.426069] __x64_sys_openat+0x20/0x30
[712061.426071] do_syscall_64+0x57/0x190
[712061.426073] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[712061.426075] RIP: 0033:0x7fa7980a8c8b
[712061.426082] Code: Bad RIP value.
[712061.426084] RSP: 002b:00007ffea4549c30 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[712061.426086] RAX: ffffffffffffffda RBX: 0000000000000241 RCX: 00007fa7980a8c8b
[712061.426087] RDX: 0000000000080241 RSI: 0000562c068e3a50 RDI: 00000000ffffff9c
[712061.426088] RBP: 00000000000001b6 R08: 0000000000000004 R09: 0000562c06364c00
[712061.426089] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000562c068e3a50
[712061.426090] R13: 0000000000000000 R14: 0000562c068e2260 R15: 00007fa797d3d280
[712182.255934] ocluster-worker D 0 8507 1 0x00000000
[712182.255939] Call Trace:
[712182.255954] __schedule+0x2e3/0x740
[712182.255957] schedule+0x42/0xb0
[712182.255960] rwsem_down_read_slowpath+0x16c/0x4a0
[712182.255963] down_read+0x85/0xa0
[712182.255967] lookup_slow+0x2d/0x60
[712182.255968] walk_component+0x1da/0x360
[712182.255970] ? link_path_walk.part.0+0x6d/0x550
[712182.255972] path_lookupat.isra.0+0x80/0x230
[712182.255974] filename_lookup+0xae/0x170
[712182.255976] ? __switch_to_asm+0x40/0x70
[712182.255979] ? __check_object_size+0x13f/0x150
[712182.255982] ? strncpy_from_user+0x4c/0x150
[712182.256009] user_path_at_empty+0x3a/0x50
[712182.256010] vfs_statx+0x7d/0xe0
[712182.256012] ? __switch_to_asm+0x40/0x70
[712182.256013] ? __switch_to_asm+0x34/0x70
[712182.256015] __do_sys_newstat+0x3e/0x80
[712182.256016] ? schedule+0x42/0xb0
[712182.256019] ? exit_to_usermode_loop+0x8f/0x160
[712182.256021] __x64_sys_newstat+0x16/0x20
[712182.256022] do_syscall_64+0x57/0x190
[712182.256024] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[712182.256026] RIP: 0033:0x7f87b637b49a
[712182.256035] Code: Bad RIP value.
Another 3 machines stuck today:
[4758684.432422] Call Trace:
[4758684.432428] __schedule+0x2e3/0x740
[4758684.432434] schedule+0x42/0xb0
[4758684.432443] schedule_preempt_disabled+0xe/0x10
[4758684.432452] __mutex_lock.isra.0+0x182/0x4f0
[4758684.432462] ? ktime_get+0x3e/0xa0
[4758684.432472] __mutex_lock_slowpath+0x13/0x20
[4758684.432481] mutex_lock+0x2e/0x40
[4758684.432522] btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
[4758684.432565] flush_space+0x60b/0x770 [btrfs]
[4758684.432611] ? __clear_extent_bit+0x201/0x4a0 [btrfs]
[4758684.432652] ? get_alloc_profile+0x7f/0x180 [btrfs]
[4758684.432688] priority_reclaim_metadata_space.isra.0+0xbf/0x1e0 [btrfs]
[4758684.432722] btrfs_reserve_metadata_bytes+0x608/0x950 [btrfs]
[4758684.432795] ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[4758684.432869] btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
[4758684.432921] evict_refill_and_join+0x39/0xd0 [btrfs]
[4758684.432966] btrfs_evict_inode+0x417/0x4c0 [btrfs]
[4758684.432973] evict+0xd2/0x1b0
[4758684.432985] iput+0x148/0x210
[4758684.432995] do_unlinkat+0x1c5/0x2d0
[4758684.433005] __x64_sys_unlink+0x23/0x30
[4758684.433017] do_syscall_64+0x57/0x190
[4758684.433026] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[11926567.453889] btrfs D 0 3096469 3288919 0x00000000
[11926567.453891] Call Trace:
[11926567.453894] __schedule+0x2e3/0x740
[11926567.453901] schedule+0x42/0xb0
[11926567.453909] schedule_timeout+0x10e/0x160
[11926567.453918] ? wake_up_process+0x15/0x20
[11926567.453926] ? insert_work+0x70/0x80
[11926567.453933] ? __queue_work+0x14c/0x3f0
[11926567.453940] wait_for_completion+0xb1/0x120
[11926567.453947] ? wake_up_q+0x70/0x70
[11926567.453972] start_delalloc_inodes+0x245/0x300 [btrfs]
[11926567.454005] btrfs_start_delalloc_snapshot+0x2a/0x40 [btrfs]
[11926567.454034] create_snapshot+0xf9/0x360 [btrfs]
[11926567.454059] ? btrfs_check_dir_item_collision+0x8b/0x130 [btrfs]
[11926567.454087] btrfs_mksubvol+0x138/0x200 [btrfs]
[11926567.454113] btrfs_ioctl_snap_create_transid+0x176/0x180 [btrfs]
[11926567.454145] btrfs_ioctl_snap_create_v2+0x108/0x190 [btrfs]
[11926567.454171] btrfs_ioctl+0x113f/0x20d0 [btrfs]
[11926567.454181] ? do_anonymous_page+0x2e6/0x650
[11926567.454190] ? __handle_mm_fault+0x760/0x7a0
[11926567.454204] do_vfs_ioctl+0x407/0x670
[11926567.454206] ? do_vfs_ioctl+0x407/0x670
[11926567.454215] ? do_user_addr_fault+0x216/0x450
[11926567.454224] ksys_ioctl+0x67/0x90
[11926567.454231] __x64_sys_ioctl+0x1a/0x20
[11926567.454237] do_syscall_64+0x57/0x190
[11926567.454245] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[5659024.048043] opam D 0 2942301 2941300 0x00004320
[5659024.048052] Call Trace:
[5659024.048067] __schedule+0x2e3/0x740
[5659024.048073] ? __switch_to_asm+0x34/0x70
[5659024.048077] ? __switch_to_asm+0x40/0x70
[5659024.048080] schedule+0x42/0xb0
[5659024.048084] schedule_preempt_disabled+0xe/0x10
[5659024.048087] __mutex_lock.isra.0+0x182/0x4f0
[5659024.048092] __mutex_lock_slowpath+0x13/0x20
[5659024.048095] mutex_lock+0x2e/0x40
[5659024.048142] btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
[5659024.048156] ? __next_timer_interrupt+0xe0/0xe0
[5659024.048201] flush_space+0x60b/0x770 [btrfs]
[5659024.048244] ? __clear_extent_bit+0x201/0x4a0 [btrfs]
[5659024.048286] ? get_alloc_profile+0x7f/0x180 [btrfs]
[5659024.048328] priority_reclaim_metadata_space.isra.0+0xbf/0x1e0 [btrfs]
[5659024.048369] btrfs_reserve_metadata_bytes+0x608/0x950 [btrfs]
[5659024.048408] ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[5659024.048446] btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
[5659024.048479] evict_refill_and_join+0x39/0xd0 [btrfs]
[5659024.048514] btrfs_evict_inode+0x417/0x4c0 [btrfs]
[5659024.048520] evict+0xd2/0x1b0
[5659024.048524] iput+0x148/0x210
[5659024.048527] do_unlinkat+0x1c5/0x2d0
[5659024.048530] __x64_sys_unlink+0x23/0x30
[5659024.048535] do_syscall_64+0x57/0x190
[5659024.048539] entry_SYSCALL_64_after_hwframe+0x44/0xa9
After ~30 hours of testing on pisto it looks like upgrading the linux kernel to at least 5.11.3 is the solution.
I bisected the issue in this order:
The changelog for linux 5.8 contains numerous fixes for Btrfs (as is every version) but I'm not sure which one is at play here: https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.8
Given Ubuntu 20.10 ships with linux 5.8 I suspect that upgrading all the workers to that version should be enough to get rid of these failures. We might want to add a check to obuilder when starting to ensure it is running on a new enough kernel (>= 5.8)
linux 5.8.18 does not seem to fail after 3+ hours of testing (I am leaving it run overnight to be sure)
I can confirm, still no failures after 16 hours of testing :tada: (it usually only takes ~2)
An issue to log Btrfs deadlock stack-traces: