ocurrent / obuilder

Experimental "docker build" alternative using btrfs/zfs snapshots
Apache License 2.0
60 stars 17 forks source link

Btrfs bug tracking #47

Closed talex5 closed 3 years ago

talex5 commented 3 years ago

An issue to log Btrfs deadlock stack-traces:

[1036019.465941] INFO: task ocluster-worker:3435002 blocked for more than 120 seconds.
[1036019.473678]       Not tainted 5.4.0-54-generic #60-Ubuntu
[1036019.480023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1036019.488751] ocluster-worker D    0 3435002      1 0x00000004
[1036019.488758] Call Trace:
[1036019.488772]  __schedule+0x2e3/0x740
[1036019.488776]  schedule+0x42/0xb0
[1036019.488780]  rwsem_down_read_slowpath+0x16c/0x4a0
[1036019.488786]  down_read+0x85/0xa0
[1036019.488791]  lookup_slow+0x2d/0x60
[1036019.488795]  walk_component+0x1da/0x360
[1036019.488798]  ? link_path_walk.part.0+0x6d/0x550
[1036019.488803]  path_lookupat.isra.0+0x80/0x230
[1036019.488806]  ? __switch_to_asm+0x40/0x70
[1036019.488826]  ? __switch_to_asm+0x34/0x70
[1036019.488836]  ? __switch_to_asm+0x40/0x70
[1036019.488847]  ? __switch_to_asm+0x34/0x70
[1036019.488856]  ? __switch_to_asm+0x40/0x70
[1036019.488867]  filename_lookup+0xae/0x170
[1036019.488880]  ? __check_object_size+0x13f/0x150
[1036019.488892]  ? strncpy_from_user+0x4c/0x150
[1036019.488902]  user_path_at_empty+0x3a/0x50
[1036019.488906]  vfs_statx+0x7d/0xe0
[1036019.488918]  __do_sys_newlstat+0x3e/0x80
[1036019.488933]  ? do_user_addr_fault+0x216/0x450
[1036019.488941]  ? __do_page_fault+0x58/0x90
[1036019.488944]  __x64_sys_newlstat+0x16/0x20
[1036019.488950]  do_syscall_64+0x57/0x190
[1036019.488954]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1036019.488959] RIP: 0033:0x7f9d5fc6c6ea
[1036019.488968] Code: Bad RIP value.
[1036019.488970] RSP: 002b:00007ffebcce6ec8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
[1036019.488973] RAX: ffffffffffffffda RBX: 00005599c1807060 RCX: 00007f9d5fc6c6ea
[1036019.488975] RDX: 00007ffebcce6f20 RSI: 00007ffebcce6f20 RDI: 00005599c1807060
[1036019.488977] RBP: 0000000000000000 R08: 0000000000000001 R09: 00007f9d5f5c8f10
[1036019.488978] R10: 00005599bec07d60 R11: 0000000000000246 R12: 00005599bffac4e0
[1036019.488980] R13: 00007ffebcce6f20 R14: 00005599c17302a0 R15: 00007f9d5f5c8828
talex5 commented 3 years ago
[1325170.432411] INFO: task ocluster-worker:3232955 blocked for more than 120 seconds.
[1325170.440783]       Not tainted 5.4.0-56-generic #62-Ubuntu
[1325170.447046] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1325170.455796] ocluster-worker D    0 3232955      1 0x00004000
[1325170.455802] Call Trace:
[1325170.455813]  __schedule+0x2e3/0x740
[1325170.455818]  ? __switch_to_asm+0x34/0x70
[1325170.455825]  ? __switch_to_asm+0x40/0x70
[1325170.455829]  schedule+0x42/0xb0
[1325170.455833]  schedule_preempt_disabled+0xe/0x10
[1325170.455836]  __mutex_lock.isra.0+0x182/0x4f0
[1325170.455840]  __mutex_lock_slowpath+0x13/0x20
[1325170.455843]  mutex_lock+0x2e/0x40
[1325170.455888]  btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
[1325170.455907]  ? __next_timer_interrupt+0xe0/0xe0
[1325170.455957]  flush_space+0x60b/0x770 [btrfs]
[1325170.456003]  ? __clear_extent_bit+0x201/0x4a0 [btrfs]
[1325170.456059]  ? get_alloc_profile+0x7f/0x180 [btrfs]
[1325170.456112]  priority_reclaim_metadata_space.isra.0+0xbf/0x1e0 [btrfs]
[1325170.456154]  btrfs_reserve_metadata_bytes+0x608/0x950 [btrfs]
[1325170.456193]  ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[1325170.456231]  btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
[1325170.456266]  evict_refill_and_join+0x39/0xd0 [btrfs]
[1325170.456299]  btrfs_evict_inode+0x417/0x4c0 [btrfs]
[1325170.456306]  evict+0xd2/0x1b0
[1325170.456309]  iput+0x148/0x210
[1325170.456314]  dentry_unlink_inode+0xc6/0x110
[1325170.456317]  d_delete+0x76/0x80
[1325170.456321]  vfs_rmdir+0x179/0x1a0
[1325170.456325]  do_rmdir+0x18c/0x1c0
[1325170.456332]  __x64_sys_rmdir+0x17/0x20
[1325170.456337]  do_syscall_64+0x57/0x190
[1325170.456341]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1325170.456348] RIP: 0033:0x7f613970ee9b
[1325170.456357] Code: Bad RIP value.
[1325170.456359] RSP: 002b:00007f5d497f9e28 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
[1325170.456367] RAX: ffffffffffffffda RBX: 000055bf722616b0 RCX: 00007f613970ee9b
[1325170.456376] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055bf72261718
[1325170.456384] RBP: 000055bf722616d8 R08: 0000000000000000 R09: 0000000000000000
[1325170.456392] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff0790c36e
[1325170.456403] R13: 00007fff0790c36f R14: 00007fff0790c400 R15: 00007f5d497f9fc0
[1325170.456418] INFO: task ocluster-worker:3232958 blocked for more than 120 seconds.
[1325170.464845]       Not tainted 5.4.0-56-generic #62-Ubuntu
[1325170.471160] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1325170.479952] ocluster-worker D    0 3232958      1 0x00004000
[1325170.479956] Call Trace:
[1325170.479961]  __schedule+0x2e3/0x740
[1325170.480001]  ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[1325170.480020]  schedule+0x42/0xb0
[1325170.480031]  schedule_preempt_disabled+0xe/0x10
[1325170.480035]  __mutex_lock.isra.0+0x182/0x4f0
[1325170.480038]  ? ktime_get+0x3e/0xa0
[1325170.480042]  __mutex_lock_slowpath+0x13/0x20
[1325170.480046]  mutex_lock+0x2e/0x40
[1325170.480079]  btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
[1325170.480126]  flush_space+0x60b/0x770 [btrfs]
[1325170.480171]  ? __clear_extent_bit+0x201/0x4a0 [btrfs]
[1325170.480216]  ? get_alloc_profile+0x7f/0x180 [btrfs]
[1325170.480253]  priority_reclaim_metadata_space.isra.0+0xbf/0x1e0 [btrfs]
[1325170.480297]  btrfs_reserve_metadata_bytes+0x608/0x950 [btrfs]
[1325170.480337]  ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[1325170.480373]  btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
[1325170.480407]  evict_refill_and_join+0x39/0xd0 [btrfs]
[1325170.480454]  btrfs_evict_inode+0x417/0x4c0 [btrfs]
[1325170.480469]  evict+0xd2/0x1b0
[1325170.480473]  iput+0x148/0x210
[1325170.480480]  dentry_unlink_inode+0xc6/0x110
[1325170.480483]  d_delete+0x76/0x80
[1325170.480487]  vfs_rmdir+0x179/0x1a0
[1325170.480493]  do_rmdir+0x18c/0x1c0
[1325170.480498]  __x64_sys_rmdir+0x17/0x20
[1325170.480502]  do_syscall_64+0x57/0x190
[1325170.480506]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1325170.480517] RIP: 0033:0x7f613970ee9b
[1325170.480526] Code: Bad RIP value.
[1325170.480535] RSP: 002b:00007f5d3b7fde28 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
[1325170.480551] RAX: ffffffffffffffda RBX: 000055bf72a15b60 RCX: 00007f613970ee9b
[1325170.480562] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055bf72a15bc8
[1325170.480564] RBP: 000055bf72a15b88 R08: 0000000000000000 R09: 0000000000000004
[1325170.480565] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff0790c34e
[1325170.480567] R13: 00007fff0790c34f R14: 00007fff0790c3e0 R15: 00007f5d3b7fdfc0
talex5 commented 3 years ago

Some PPC lock-ups today.

prothoe:

[5723372.036165] INFO: task kworker/u160:18:3878172 blocked for more than 120 seconds.
[5723372.036200]       Not tainted 5.4.0-53-generic #59-Ubuntu
[5723372.036336] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5723372.036355] kworker/u160:18 D    0 3878172      2 0x00000808
[5723372.036392] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[5723372.036393] Call Trace:
[5723372.036397] [c000000a65f872d0] [c000000a65f87380] 0xc000000a65f87380 (unreliable)
[5723372.036403] [c000000a65f874b0] [c000000000021a1c] __switch_to+0x2dc/0x450
[5723372.036407] [c000000a65f87520] [c000000000eebd7c] __schedule+0x2ec/0x930
[5723372.036409] [c000000a65f87600] [c000000000eec418] schedule+0x58/0x130
[5723372.036412] [c000000a65f87630] [c000000000eeca30] schedule_preempt_disabled+0x20/0x30
[5723372.036415] [c000000a65f87650] [c000000000eeed48] __mutex_lock.isra.0+0x2b8/0x790
[5723372.036443] [c000000a65f87710] [c008000001e112c0] btrfs_start_delalloc_roots+0x78/0x3f0 [btrfs]
[5723372.036471] [c000000a65f877c0] [c008000001eb72ec] flush_space+0x694/0xa80 [btrfs]
[5723372.036498] [c000000a65f878b0] [c008000001eb7b0c] priority_reclaim_metadata_space+0x234/0x438 [btrfs]
[5723372.036525] [c000000a65f87920] [c008000001eb96b0] btrfs_reserve_metadata_bytes+0x758/0xb98 [btrfs]
[5723372.036552] [c000000a65f87a30] [c008000001eba644] btrfs_block_rsv_refill+0xdc/0x160 [btrfs]
[5723372.036579] [c000000a65f87a70] [c008000001df8258] evict_refill_and_join+0x50/0x158 [btrfs]
[5723372.036609] [c000000a65f87af0] [c008000001e09b18] btrfs_evict_inode+0x510/0x620 [btrfs]
[5723372.036612] [c000000a65f87b80] [c0000000004aa330] evict+0x100/0x250
[5723372.036640] [c000000a65f87bc0] [c008000001df8db4] btrfs_run_delalloc_work+0x4c/0xc0 [btrfs]
[5723372.036668] [c000000a65f87bf0] [c008000001e4288c] btrfs_work_helper+0x144/0x5f8 [btrfs]
[5723372.036672] [c000000a65f87c80] [c00000000016877c] process_one_work+0x29c/0x550
[5723372.036674] [c000000a65f87d10] [c000000000168ac8] worker_thread+0x98/0x5f0
[5723372.036677] [c000000a65f87db0] [c000000000172de4] kthread+0x1a4/0x1b0
[5723372.036680] [c000000a65f87e20] [c00000000000b648] ret_from_kernel_thread+0x5c/0x74

pisto:

[4551456.239823] INFO: task rm:649833 blocked for more than 120 seconds.
[4551456.239872]       Not tainted 5.4.0-54-generic #60-Ubuntu
[4551456.239880] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4551456.239891] rm              D    0 649833 649832 0x00040400
[4551456.239893] Call Trace:
[4551456.239894] [c000002aeb9234d0] [c000002aeb923500] 0xc000002aeb923500 (unreliable)
[4551456.239896] [c000002aeb9236b0] [c000000000021a1c] __switch_to+0x2dc/0x450
[4551456.239897] [c000002aeb923720] [c000000000eebd7c] __schedule+0x2ec/0x930
[4551456.239899] [c000002aeb923800] [c000000000eec418] schedule+0x58/0x130
[4551456.239900] [c000002aeb923830] [c0000000004ccaa4] wb_wait_for_completion+0xa4/0xf0
[4551456.239902] [c000002aeb923890] [c0000000004ccf84] __writeback_inodes_sb_nr+0xb4/0xf0
[4551456.239917] [c000002aeb923910] [c0080000049271b8] flush_space+0x560/0xa80 [btrfs]
[4551456.239933] [c000002aeb923a00] [c008000004927b0c] priority_reclaim_metadata_space+0x234/0x438 [btrfs]
[4551456.239949] [c000002aeb923a70] [c0080000049296b0] btrfs_reserve_metadata_bytes+0x758/0xb98 [btrfs]
[4551456.239964] [c000002aeb923b80] [c00800000492a644] btrfs_block_rsv_refill+0xdc/0x160 [btrfs]
[4551456.239980] [c000002aeb923bc0] [c008000004868258] evict_refill_and_join+0x50/0x158 [btrfs]
[4551456.239996] [c000002aeb923c40] [c008000004879b18] btrfs_evict_inode+0x510/0x620 [btrfs]
[4551456.239997] [c000002aeb923cd0] [c0000000004aa330] evict+0x100/0x250
[4551456.239999] [c000002aeb923d10] [c0000000004a07b8] dentry_unlink_inode+0x148/0x210
[4551456.240000] [c000002aeb923d40] [c00000000048e9e8] vfs_rmdir+0x228/0x270
[4551456.240002] [c000002aeb923d80] [c0000000004965a0] do_rmdir+0x270/0x2a0
[4551456.240003] [c000002aeb923e20] [c00000000000b278] system_call+0x5c/0x68
talex5 commented 3 years ago
Feb 02 22:08:09 hipp kernel: INFO: task kworker/u144:3:1661147 blocked for more than 120 seconds.
Feb 02 22:08:09 hipp kernel:       Not tainted 5.4.0-54-generic #60-Ubuntu
Feb 02 22:08:09 hipp kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 02 22:08:09 hipp kernel: kworker/u144:3  D    0 1661147      2 0x80004000
Feb 02 22:08:09 hipp kernel: Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
Feb 02 22:08:09 hipp kernel: Call Trace:
Feb 02 22:08:09 hipp kernel:  __schedule+0x2e3/0x740
Feb 02 22:08:09 hipp kernel:  ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
Feb 02 22:08:09 hipp kernel:  schedule+0x42/0xb0
Feb 02 22:08:09 hipp kernel:  schedule_preempt_disabled+0xe/0x10
Feb 02 22:08:09 hipp kernel:  __mutex_lock.isra.0+0x182/0x4f0
Feb 02 22:08:09 hipp kernel:  ? ktime_get+0x3e/0xa0
Feb 02 22:08:09 hipp kernel:  __mutex_lock_slowpath+0x13/0x20
Feb 02 22:08:09 hipp kernel:  mutex_lock+0x2e/0x40
Feb 02 22:08:09 hipp kernel:  btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
Feb 02 22:08:09 hipp kernel:  flush_space+0x60b/0x770 [btrfs]
Feb 02 22:08:09 hipp kernel:  ? extent_io_tree_init+0x30/0x40 [btrfs]
Feb 02 22:08:09 hipp kernel:  ? __clear_extent_bit+0x201/0x4a0 [btrfs]
Feb 02 22:08:09 hipp kernel:  priority_reclaim_metadata_space.isra.0+0x18b/0x220 [btrfs]
Feb 02 22:08:09 hipp kernel:  ? can_overcommit.part.0+0x5f/0xc0 [btrfs]
Feb 02 22:08:09 hipp kernel:  btrfs_reserve_metadata_bytes+0x5ea/0x970 [btrfs]
Feb 02 22:08:09 hipp kernel:  ? btrfs_truncate_inode_items+0x35e/0xdd0 [btrfs]
Feb 02 22:08:09 hipp kernel:  ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
Feb 02 22:08:09 hipp kernel:  btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
Feb 02 22:08:09 hipp kernel:  evict_refill_and_join+0x39/0xd0 [btrfs]
Feb 02 22:08:09 hipp kernel:  btrfs_evict_inode+0x417/0x4c0 [btrfs]
Feb 02 22:08:09 hipp kernel:  evict+0xd2/0x1b0
Feb 02 22:08:09 hipp kernel:  iput+0x148/0x210
Feb 02 22:08:09 hipp kernel:  btrfs_run_delalloc_work+0x31/0x50 [btrfs]
Feb 02 22:08:09 hipp kernel:  btrfs_work_helper+0xc1/0x3a0 [btrfs]
Feb 02 22:08:09 hipp kernel:  process_one_work+0x1eb/0x3b0
Feb 02 22:08:09 hipp kernel:  worker_thread+0x4d/0x400
Feb 02 22:08:09 hipp kernel:  kthread+0x104/0x140
Feb 02 22:08:09 hipp kernel:  ? process_one_work+0x3b0/0x3b0
Feb 02 22:08:09 hipp kernel:  ? kthread_park+0x90/0x90
Feb 02 22:08:09 hipp kernel:  ret_from_fork+0x35/0x40
talex5 commented 3 years ago
[836031.482376] INFO: task ocluster-worker:168425 blocked for more than 120 seconds.
[836031.484793]       Not tainted 5.4.0-65-generic #73-Ubuntu
[836031.487131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[836031.489458] ocluster-worker D    0 168425      1 0x00004000
[836031.489463] Call Trace:
[836031.489475]  __schedule+0x2e3/0x740
[836031.489480]  ? __switch_to_asm+0x34/0x70
[836031.489483]  ? __switch_to_asm+0x40/0x70
[836031.489487]  schedule+0x42/0xb0
[836031.489490]  schedule_preempt_disabled+0xe/0x10
[836031.489493]  __mutex_lock.isra.0+0x182/0x4f0
[836031.489498]  __mutex_lock_slowpath+0x13/0x20
[836031.489500]  mutex_lock+0x2e/0x40
[836031.489542]  btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
[836031.489547]  ? __next_timer_interrupt+0xe0/0xe0
[836031.489589]  flush_space+0x60b/0x770 [btrfs]
[836031.489629]  ? get_alloc_profile+0x7f/0x180 [btrfs]
[836031.489667]  priority_reclaim_metadata_space.isra.0+0xbf/0x1e0 [btrfs]
[836031.489705]  btrfs_reserve_metadata_bytes+0x608/0x950 [btrfs]
[836031.489742]  ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[836031.489778]  btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
[836031.489810]  evict_refill_and_join+0x39/0xd0 [btrfs]
[836031.489843]  btrfs_evict_inode+0x417/0x4c0 [btrfs]
[836031.489850]  evict+0xd2/0x1b0
[836031.489853]  iput+0x148/0x210
[836031.489857]  dentry_unlink_inode+0xc6/0x110
[836031.489860]  d_delete+0x76/0x80
[836031.489864]  vfs_rmdir+0x179/0x1a0
[836031.489867]  do_rmdir+0x18c/0x1c0
[836031.489870]  __x64_sys_rmdir+0x17/0x20
[836031.489874]  do_syscall_64+0x57/0x190
[836031.489878]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
kit-ty-kate commented 3 years ago

After ~30 hours of testing on pisto it looks like upgrading the linux kernel to at least 5.11.3 is the solution.

dra27 commented 3 years ago

Just the kernel, or the increasing the swap as well?

kit-ty-kate commented 3 years ago

just the kernel

talex5 commented 3 years ago

That's great! How are you installing the new kernel? We should add the commands to the Ansible playbook.

talex5 commented 3 years ago

x86-bm-12:

[712061.425998] ocamlc.opt      D    0 3058109 3058074 0x00000320
[712061.426002] Call Trace:
[712061.426012]  __schedule+0x2e3/0x740
[712061.426038]  ? btrfs_get_token_32+0x63/0xf0 [btrfs]
[712061.426040]  schedule+0x42/0xb0
[712061.426044]  rwsem_down_write_slowpath+0x244/0x4d0
[712061.426046]  down_write+0x41/0x50
[712061.426050]  do_last+0x2b1/0x900
[712061.426053]  ? __alloc_file+0x94/0x110
[712061.426056]  ? update_curr+0x112/0x1e0
[712061.426057]  path_openat+0x8d/0x290
[712061.426059]  do_filp_open+0x91/0x100
[712061.426063]  ? __seccomp_filter+0x7e/0x690
[712061.426066]  ? __alloc_fd+0x46/0x150
[712061.426067]  do_sys_open+0x17e/0x290
[712061.426069]  __x64_sys_openat+0x20/0x30
[712061.426071]  do_syscall_64+0x57/0x190
[712061.426073]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[712061.426075] RIP: 0033:0x7fa7980a8c8b
[712061.426082] Code: Bad RIP value.
[712061.426084] RSP: 002b:00007ffea4549c30 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[712061.426086] RAX: ffffffffffffffda RBX: 0000000000000241 RCX: 00007fa7980a8c8b
[712061.426087] RDX: 0000000000080241 RSI: 0000562c068e3a50 RDI: 00000000ffffff9c
[712061.426088] RBP: 00000000000001b6 R08: 0000000000000004 R09: 0000562c06364c00
[712061.426089] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000562c068e3a50
[712061.426090] R13: 0000000000000000 R14: 0000562c068e2260 R15: 00007fa797d3d280
[712182.255934] ocluster-worker D    0  8507      1 0x00000000
[712182.255939] Call Trace:
[712182.255954]  __schedule+0x2e3/0x740
[712182.255957]  schedule+0x42/0xb0
[712182.255960]  rwsem_down_read_slowpath+0x16c/0x4a0
[712182.255963]  down_read+0x85/0xa0
[712182.255967]  lookup_slow+0x2d/0x60
[712182.255968]  walk_component+0x1da/0x360
[712182.255970]  ? link_path_walk.part.0+0x6d/0x550
[712182.255972]  path_lookupat.isra.0+0x80/0x230
[712182.255974]  filename_lookup+0xae/0x170
[712182.255976]  ? __switch_to_asm+0x40/0x70
[712182.255979]  ? __check_object_size+0x13f/0x150
[712182.255982]  ? strncpy_from_user+0x4c/0x150
[712182.256009]  user_path_at_empty+0x3a/0x50
[712182.256010]  vfs_statx+0x7d/0xe0
[712182.256012]  ? __switch_to_asm+0x40/0x70
[712182.256013]  ? __switch_to_asm+0x34/0x70
[712182.256015]  __do_sys_newstat+0x3e/0x80
[712182.256016]  ? schedule+0x42/0xb0
[712182.256019]  ? exit_to_usermode_loop+0x8f/0x160
[712182.256021]  __x64_sys_newstat+0x16/0x20
[712182.256022]  do_syscall_64+0x57/0x190
[712182.256024]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[712182.256026] RIP: 0033:0x7f87b637b49a
[712182.256035] Code: Bad RIP value.
talex5 commented 3 years ago

Another 3 machines stuck today:

[4758684.432422] Call Trace:
[4758684.432428]  __schedule+0x2e3/0x740
[4758684.432434]  schedule+0x42/0xb0
[4758684.432443]  schedule_preempt_disabled+0xe/0x10
[4758684.432452]  __mutex_lock.isra.0+0x182/0x4f0
[4758684.432462]  ? ktime_get+0x3e/0xa0
[4758684.432472]  __mutex_lock_slowpath+0x13/0x20
[4758684.432481]  mutex_lock+0x2e/0x40
[4758684.432522]  btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
[4758684.432565]  flush_space+0x60b/0x770 [btrfs]
[4758684.432611]  ? __clear_extent_bit+0x201/0x4a0 [btrfs]
[4758684.432652]  ? get_alloc_profile+0x7f/0x180 [btrfs]
[4758684.432688]  priority_reclaim_metadata_space.isra.0+0xbf/0x1e0 [btrfs]
[4758684.432722]  btrfs_reserve_metadata_bytes+0x608/0x950 [btrfs]
[4758684.432795]  ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[4758684.432869]  btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
[4758684.432921]  evict_refill_and_join+0x39/0xd0 [btrfs]
[4758684.432966]  btrfs_evict_inode+0x417/0x4c0 [btrfs]
[4758684.432973]  evict+0xd2/0x1b0
[4758684.432985]  iput+0x148/0x210
[4758684.432995]  do_unlinkat+0x1c5/0x2d0
[4758684.433005]  __x64_sys_unlink+0x23/0x30
[4758684.433017]  do_syscall_64+0x57/0x190
[4758684.433026]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[11926567.453889] btrfs           D    0 3096469 3288919 0x00000000
[11926567.453891] Call Trace:
[11926567.453894]  __schedule+0x2e3/0x740
[11926567.453901]  schedule+0x42/0xb0
[11926567.453909]  schedule_timeout+0x10e/0x160
[11926567.453918]  ? wake_up_process+0x15/0x20
[11926567.453926]  ? insert_work+0x70/0x80
[11926567.453933]  ? __queue_work+0x14c/0x3f0
[11926567.453940]  wait_for_completion+0xb1/0x120
[11926567.453947]  ? wake_up_q+0x70/0x70
[11926567.453972]  start_delalloc_inodes+0x245/0x300 [btrfs]
[11926567.454005]  btrfs_start_delalloc_snapshot+0x2a/0x40 [btrfs]
[11926567.454034]  create_snapshot+0xf9/0x360 [btrfs]
[11926567.454059]  ? btrfs_check_dir_item_collision+0x8b/0x130 [btrfs]
[11926567.454087]  btrfs_mksubvol+0x138/0x200 [btrfs]
[11926567.454113]  btrfs_ioctl_snap_create_transid+0x176/0x180 [btrfs]
[11926567.454145]  btrfs_ioctl_snap_create_v2+0x108/0x190 [btrfs]
[11926567.454171]  btrfs_ioctl+0x113f/0x20d0 [btrfs]
[11926567.454181]  ? do_anonymous_page+0x2e6/0x650
[11926567.454190]  ? __handle_mm_fault+0x760/0x7a0
[11926567.454204]  do_vfs_ioctl+0x407/0x670
[11926567.454206]  ? do_vfs_ioctl+0x407/0x670
[11926567.454215]  ? do_user_addr_fault+0x216/0x450
[11926567.454224]  ksys_ioctl+0x67/0x90
[11926567.454231]  __x64_sys_ioctl+0x1a/0x20
[11926567.454237]  do_syscall_64+0x57/0x190
[11926567.454245]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[5659024.048043] opam            D    0 2942301 2941300 0x00004320
[5659024.048052] Call Trace:
[5659024.048067]  __schedule+0x2e3/0x740
[5659024.048073]  ? __switch_to_asm+0x34/0x70
[5659024.048077]  ? __switch_to_asm+0x40/0x70
[5659024.048080]  schedule+0x42/0xb0
[5659024.048084]  schedule_preempt_disabled+0xe/0x10
[5659024.048087]  __mutex_lock.isra.0+0x182/0x4f0
[5659024.048092]  __mutex_lock_slowpath+0x13/0x20
[5659024.048095]  mutex_lock+0x2e/0x40
[5659024.048142]  btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
[5659024.048156]  ? __next_timer_interrupt+0xe0/0xe0
[5659024.048201]  flush_space+0x60b/0x770 [btrfs]
[5659024.048244]  ? __clear_extent_bit+0x201/0x4a0 [btrfs]
[5659024.048286]  ? get_alloc_profile+0x7f/0x180 [btrfs]
[5659024.048328]  priority_reclaim_metadata_space.isra.0+0xbf/0x1e0 [btrfs]
[5659024.048369]  btrfs_reserve_metadata_bytes+0x608/0x950 [btrfs]
[5659024.048408]  ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[5659024.048446]  btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
[5659024.048479]  evict_refill_and_join+0x39/0xd0 [btrfs]
[5659024.048514]  btrfs_evict_inode+0x417/0x4c0 [btrfs]
[5659024.048520]  evict+0xd2/0x1b0
[5659024.048524]  iput+0x148/0x210
[5659024.048527]  do_unlinkat+0x1c5/0x2d0
[5659024.048530]  __x64_sys_unlink+0x23/0x30
[5659024.048535]  do_syscall_64+0x57/0x190
[5659024.048539]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
kit-ty-kate commented 3 years ago

After ~30 hours of testing on pisto it looks like upgrading the linux kernel to at least 5.11.3 is the solution.

I bisected the issue in this order:

The changelog for linux 5.8 contains numerous fixes for Btrfs (as is every version) but I'm not sure which one is at play here: https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.8

Given Ubuntu 20.10 ships with linux 5.8 I suspect that upgrading all the workers to that version should be enough to get rid of these failures. We might want to add a check to obuilder when starting to ensure it is running on a new enough kernel (>= 5.8)

kit-ty-kate commented 3 years ago

linux 5.8.18 does not seem to fail after 3+ hours of testing (I am leaving it run overnight to be sure)

I can confirm, still no failures after 16 hours of testing :tada: (it usually only takes ~2)