Open jpsollie opened 1 week ago
Apparently it's not only NFS: a simple "ls .snapshots" (.snapshots is the directory where my snapshots live) is enough to trigger the bug:
[ 267.673698] [ C22] rcu: INFO: rcu_sched self-detected stall on CPU
[ 267.673701] [ C22] rcu: 22-....: (2099 ticks this GP) idle=fb04/1/0x4000000000000000 softirq=1207/1207 fqs=1043
[ 267.673706] [ C22] rcu: (t=2100 jiffies g=2049 q=339 ncpus=64)
[ 267.673708] [ C22] CPU: 22 UID: 0 PID: 4929 Comm: ls Not tainted 6.12.0release+ #4
[ 267.673709] [ C22] Hardware name: Micro-Star International Co., Ltd. MS-7C59/Creator TRX40 (MS-7C59), BIOS 1.96 04/25/2023
[ 267.673710] [ C22] RIP: 0010:rhashtable_insert_slow+0x349/0x500
[ 267.673714] [ C22] Code: 95 ff 31 c0 eb 10 49 8b 45 30 48 85 c0 49 0f 44 c6 eb 03 4c 89 f0 f0 41 80 24 24 fe f7 44 24 38 00 02 00 00 74 01 fb 48 85 c0 <0f> 84 91 00 00 00 48 3d 00 f0 ff ff 0f 86 15 fd ff ff e9 80 00 00
[ 267.673716] [ C22] RSP: 0018:ffff8882155b3870 EFLAGS: 00000286
[ 267.673717] [ C22] RAX: ffff88810462d800 RBX: ffff88810e571551 RCX: 0000000000000011
[ 267.673718] [ C22] RDX: 0000000000000011 RSI: ffff888259978000 RDI: ffff8882155b3888
[ 267.673719] [ C22] RBP: ffff88823cbc7958 R08: 000000008e0492de R09: 0000000000000000
[ 267.673720] [ C22] R10: 000000008e0492de R11: ffff88821c326680 R12: ffff88810e571550
[ 267.673721] [ C22] R13: ffff88810e571400 R14: fffffffffffffff5 R15: ffff88821c326680
[ 267.673721] [ C22] FS: 00007fb9de145740(0000) GS:ffff889fbe580000(0000) knlGS:0000000000000000
[ 267.673722] [ C22] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 267.673723] [ C22] CR2: 000055da12573040 CR3: 000000021542d000 CR4: 00000000003506f0
[ 267.673724] [ C22] Call Trace:
[ 267.673725] [ C22] <IRQ>
[ 267.673726] [ C22] ? rcu_dump_cpu_stacks+0xdf/0x130
[ 267.673729] [ C22] ? print_cpu_stall+0x173/0x2c0
[ 267.673730] [ C22] ? rcu_sched_clock_irq+0x221/0x5d0
[ 267.673731] [ C22] ? update_process_times+0x71/0xa0
[ 267.673733] [ C22] ? tick_nohz_handler+0xbb/0x110
[ 267.673735] [ C22] ? tick_setup_sched_timer+0x180/0x180
[ 267.673736] [ C22] ? __hrtimer_run_queues+0xf1/0x250
[ 267.673737] [ C22] ? hrtimer_interrupt+0xf0/0x390
[ 267.673738] [ C22] ? __sysvec_apic_timer_interrupt+0x47/0x140
[ 267.673740] [ C22] ? sysvec_apic_timer_interrupt+0x63/0x80
[ 267.673741] [ C22] </IRQ>
[ 267.673742] [ C22] <TASK>
[ 267.673742] [ C22] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 267.673744] [ C22] ? rhashtable_insert_slow+0x349/0x500
[ 267.673745] [ C22] bch2_inode_hash_insert+0x309/0x480
[ 267.673747] [ C22] bch2_lookup+0x567/0x730
[ 267.673749] [ C22] ? __memcg_slab_post_alloc_hook+0x2ca/0x370
[ 267.673751] [ C22] ? __d_alloc.llvm.3704585759423112158+0x2d/0x1b0
[ 267.673753] [ C22] ? kmem_cache_alloc_lru_noprof+0xe9/0x1b0
[ 267.673754] [ C22] ? __d_alloc.llvm.3704585759423112158+0x14e/0x1b0
[ 267.673756] [ C22] __lookup_slow+0xcb/0x130
[ 267.673758] [ C22] lookup_slow+0x33/0x50
[ 267.673759] [ C22] walk_component+0xcc/0xe0
[ 267.673761] [ C22] path_lookupat+0x4d/0xf0
[ 267.673762] [ C22] filename_lookup+0xa8/0x140
[ 267.673764] [ C22] vfs_statx+0x64/0xf0
[ 267.673765] [ C22] __se_sys_statx+0xc9/0x180
[ 267.673766] [ C22] ? touch_atime+0x23/0x1b0
[ 267.673767] [ C22] ? bch2_vfs_readdir+0xc2/0x160
[ 267.673769] [ C22] ? iterate_dir+0x92/0x120
[ 267.673770] [ C22] ? __se_sys_getdents64+0xa4/0xd0
[ 267.673771] [ C22] ? filldir+0x180/0x180
[ 267.673772] [ C22] do_syscall_64+0x52/0xf0
[ 267.673774] [ C22] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 267.673775] [ C22] RIP: 0033:0x7fb9de26884e
[ 267.673777] [ C22] Code: 05 0c 00 ba ff ff ff ff 64 c7 00 16 00 00 00 e9 9a fd ff ff e8 d3 88 01 00 0f 1f 00 f3 0f 1e fa 41 89 ca b8 4c 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2a 89 c1 85 c0 74 0f 48 8b 05 a5 05 0c 00 64
[ 267.673778] [ C22] RSP: 002b:00007ffff17b71f8 EFLAGS: 00000246 ORIG_RAX: 000000000000014c
[ 267.673779] [ C22] RAX: ffffffffffffffda RBX: 000055da12573be8 RCX: 00007fb9de26884e
[ 267.673780] [ C22] RDX: 0000000000000900 RSI: 00007ffff17b7340 RDI: 00000000ffffff9c
[ 267.673780] [ C22] RBP: 00007ffff17b7330 R08: 00007ffff17b7210 R09: 00007ffff17b73e0
[ 267.673781] [ C22] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000002
[ 267.673781] [ C22] R13: 000055da12573bd0 R14: 0000000000000000 R15: 0000000000000000
[ 267.673782] [ C22] </TASK>
[ 330.701892] [ C22] rcu: INFO: rcu_sched self-detected stall on CPU
[ 330.701895] [ C22] rcu: 22-....: (8402 ticks this GP) idle=fb04/1/0x4000000000000000 softirq=1207/1207 fqs=4193
[ 330.701900] [ C22] rcu: (t=8403 jiffies g=2049 q=2717 ncpus=64)
[ 330.701902] [ C22] CPU: 22 UID: 0 PID: 4929 Comm: ls Not tainted 6.12.0release+ #4
[ 330.701903] [ C22] Hardware name: Micro-Star International Co., Ltd. MS-7C59/Creator TRX40 (MS-7C59), BIOS 1.96 04/25/2023
[ 330.701904] [ C22] RIP: 0010:rhashtable_insert_slow+0x349/0x500
[ 330.701907] [ C22] Code: 95 ff 31 c0 eb 10 49 8b 45 30 48 85 c0 49 0f 44 c6 eb 03 4c 89 f0 f0 41 80 24 24 fe f7 44 24 38 00 02 00 00 74 01 fb 48 85 c0 <0f> 84 91 00 00 00 48 3d 00 f0 ff ff 0f 86 15 fd ff ff e9 80 00 00
[ 330.701908] [ C22] RSP: 0018:ffff8882155b3870 EFLAGS: 00000286
[ 330.701910] [ C22] RAX: ffff88810462d800 RBX: ffff88810e571551 RCX: 0000000000000011
[ 330.701911] [ C22] RDX: 0000000000000011 RSI: ffff888259978000 RDI: ffff8882155b3888
[ 330.701911] [ C22] RBP: ffff88823cbc7958 R08: 000000008e0492de R09: 0000000000000000
[ 330.701912] [ C22] R10: 000000008e0492de R11: ffff88821c326680 R12: ffff88810e571550
[ 330.701913] [ C22] R13: ffff88810e571400 R14: fffffffffffffff5 R15: ffff88821c326680
[ 330.701913] [ C22] FS: 00007fb9de145740(0000) GS:ffff889fbe580000(0000) knlGS:0000000000000000
[ 330.701914] [ C22] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 330.701915] [ C22] CR2: 000055da12573040 CR3: 000000021542d000 CR4: 00000000003506f0
[ 330.701916] [ C22] Call Trace:
[ 330.701916] [ C22] <IRQ>
[ 330.701917] [ C22] ? rcu_dump_cpu_stacks+0xdf/0x130
[ 330.701919] [ C22] ? print_cpu_stall+0x173/0x2c0
[ 330.701920] [ C22] ? rcu_sched_clock_irq+0x221/0x5d0
[ 330.701921] [ C22] ? update_process_times+0x71/0xa0
[ 330.701923] [ C22] ? tick_nohz_handler+0xbb/0x110
[ 330.701925] [ C22] ? tick_setup_sched_timer+0x180/0x180
[ 330.701926] [ C22] ? __hrtimer_run_queues+0xf1/0x250
[ 330.701927] [ C22] ? hrtimer_interrupt+0xf0/0x390
[ 330.701928] [ C22] ? __sysvec_apic_timer_interrupt+0x47/0x140
[ 330.701930] [ C22] ? sysvec_apic_timer_interrupt+0x63/0x80
[ 330.701931] [ C22] </IRQ>
[ 330.701931] [ C22] <TASK>
[ 330.701931] [ C22] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 330.701933] [ C22] ? rhashtable_insert_slow+0x349/0x500
[ 330.701934] [ C22] bch2_inode_hash_insert+0x309/0x480
[ 330.701936] [ C22] bch2_lookup+0x567/0x730
[ 330.701938] [ C22] ? __memcg_slab_post_alloc_hook+0x2ca/0x370
[ 330.701940] [ C22] ? __d_alloc.llvm.3704585759423112158+0x2d/0x1b0
[ 330.701941] [ C22] ? kmem_cache_alloc_lru_noprof+0xe9/0x1b0
[ 330.701943] [ C22] ? __d_alloc.llvm.3704585759423112158+0x14e/0x1b0
[ 330.701945] [ C22] __lookup_slow+0xcb/0x130
[ 330.701947] [ C22] lookup_slow+0x33/0x50
[ 330.701948] [ C22] walk_component+0xcc/0xe0
[ 330.701949] [ C22] path_lookupat+0x4d/0xf0
[ 330.701950] [ C22] filename_lookup+0xa8/0x140
[ 330.701952] [ C22] vfs_statx+0x64/0xf0
[ 330.701953] [ C22] __se_sys_statx+0xc9/0x180
[ 330.701954] [ C22] ? touch_atime+0x23/0x1b0
[ 330.701956] [ C22] ? bch2_vfs_readdir+0xc2/0x160
[ 330.701957] [ C22] ? iterate_dir+0x92/0x120
[ 330.701958] [ C22] ? __se_sys_getdents64+0xa4/0xd0
[ 330.701959] [ C22] ? filldir+0x180/0x180
[ 330.701960] [ C22] do_syscall_64+0x52/0xf0
[ 330.701962] [ C22] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 330.701963] [ C22] RIP: 0033:0x7fb9de26884e
[ 330.701964] [ C22] Code: 05 0c 00 ba ff ff ff ff 64 c7 00 16 00 00 00 e9 9a fd ff ff e8 d3 88 01 00 0f 1f 00 f3 0f 1e fa 41 89 ca b8 4c 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2a 89 c1 85 c0 74 0f 48 8b 05 a5 05 0c 00 64
[ 330.701965] [ C22] RSP: 002b:00007ffff17b71f8 EFLAGS: 00000246 ORIG_RAX: 000000000000014c
[ 330.701966] [ C22] RAX: ffffffffffffffda RBX: 000055da12573be8 RCX: 00007fb9de26884e
[ 330.701967] [ C22] RDX: 0000000000000900 RSI: 00007ffff17b7340 RDI: 00000000ffffff9c
[ 330.701968] [ C22] RBP: 00007ffff17b7330 R08: 00007ffff17b7210 R09: 00007ffff17b73e0
[ 330.701968] [ C22] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000002
[ 330.701969] [ C22] R13: 000055da12573bd0 R14: 0000000000000000 R15: 0000000000000000
[ 330.701970] [ C22] </TASK>
[ 393.730278] [ C22] rcu: INFO: rcu_sched self-detected stall on CPU
[ 393.730281] [ C22] rcu: 22-....: (14705 ticks this GP) idle=fb04/1/0x4000000000000000 softirq=1207/1207 fqs=7344
[ 393.730285] [ C22] rcu: (t=14706 jiffies g=2049 q=5902 ncpus=64)
[ 393.730287] [ C22] CPU: 22 UID: 0 PID: 4929 Comm: ls Not tainted 6.12.0release+ #4
[ 393.730289] [ C22] Hardware name: Micro-Star International Co., Ltd. MS-7C59/Creator TRX40 (MS-7C59), BIOS 1.96 04/25/2023
[ 393.730289] [ C22] RIP: 0010:rhashtable_insert_slow+0x349/0x500
[ 393.730292] [ C22] Code: 95 ff 31 c0 eb 10 49 8b 45 30 48 85 c0 49 0f 44 c6 eb 03 4c 89 f0 f0 41 80 24 24 fe f7 44 24 38 00 02 00 00 74 01 fb 48 85 c0 <0f> 84 91 00 00 00 48 3d 00 f0 ff ff 0f 86 15 fd ff ff e9 80 00 00
[ 393.730293] [ C22] RSP: 0018:ffff8882155b3870 EFLAGS: 00000286
[ 393.730294] [ C22] RAX: ffff88810462d800 RBX: ffff88810e571551 RCX: 0000000000000011
[ 393.730295] [ C22] RDX: 0000000000000011 RSI: ffff888259978000 RDI: ffff8882155b3888
[ 393.730296] [ C22] RBP: ffff88823cbc7958 R08: 000000008e0492de R09: 0000000000000000
[ 393.730297] [ C22] R10: 000000008e0492de R11: ffff88821c326680 R12: ffff88810e571550
[ 393.730297] [ C22] R13: ffff88810e571400 R14: fffffffffffffff5 R15: ffff88821c326680
[ 393.730298] [ C22] FS: 00007fb9de145740(0000) GS:ffff889fbe580000(0000) knlGS:0000000000000000
[ 393.730299] [ C22] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 393.730300] [ C22] CR2: 000055da12573040 CR3: 000000021542d000 CR4: 00000000003506f0
[ 393.730300] [ C22] Call Trace:
[ 393.730302] [ C22] <IRQ>
[ 393.730302] [ C22] ? rcu_dump_cpu_stacks+0xdf/0x130
[ 393.730304] [ C22] ? print_cpu_stall+0x173/0x2c0
[ 393.730306] [ C22] ? rcu_sched_clock_irq+0x221/0x5d0
[ 393.730307] [ C22] ? update_process_times+0x71/0xa0
[ 393.730308] [ C22] ? tick_nohz_handler+0xbb/0x110
[ 393.730310] [ C22] ? tick_setup_sched_timer+0x180/0x180
[ 393.730311] [ C22] ? __hrtimer_run_queues+0xf1/0x250
[ 393.730312] [ C22] ? hrtimer_interrupt+0xf0/0x390
[ 393.730313] [ C22] ? __sysvec_apic_timer_interrupt+0x47/0x140
[ 393.730315] [ C22] ? sysvec_apic_timer_interrupt+0x63/0x80
[ 393.730316] [ C22] </IRQ>
[ 393.730316] [ C22] <TASK>
[ 393.730316] [ C22] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 393.730318] [ C22] ? rhashtable_insert_slow+0x349/0x500
[ 393.730319] [ C22] bch2_inode_hash_insert+0x309/0x480
[ 393.730321] [ C22] bch2_lookup+0x567/0x730
[ 393.730322] [ C22] ? __memcg_slab_post_alloc_hook+0x2ca/0x370
[ 393.730324] [ C22] ? __d_alloc.llvm.3704585759423112158+0x2d/0x1b0
[ 393.730326] [ C22] ? kmem_cache_alloc_lru_noprof+0xe9/0x1b0
[ 393.730328] [ C22] ? __d_alloc.llvm.3704585759423112158+0x14e/0x1b0
[ 393.730329] [ C22] __lookup_slow+0xcb/0x130
[ 393.730331] [ C22] lookup_slow+0x33/0x50
[ 393.730333] [ C22] walk_component+0xcc/0xe0
[ 393.730334] [ C22] path_lookupat+0x4d/0xf0
[ 393.730335] [ C22] filename_lookup+0xa8/0x140
[ 393.730337] [ C22] vfs_statx+0x64/0xf0
[ 393.730338] [ C22] __se_sys_statx+0xc9/0x180
[ 393.730339] [ C22] ? touch_atime+0x23/0x1b0
[ 393.730340] [ C22] ? bch2_vfs_readdir+0xc2/0x160
[ 393.730342] [ C22] ? iterate_dir+0x92/0x120
[ 393.730343] [ C22] ? __se_sys_getdents64+0xa4/0xd0
[ 393.730344] [ C22] ? filldir+0x180/0x180
[ 393.730345] [ C22] do_syscall_64+0x52/0xf0
[ 393.730346] [ C22] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 393.730347] [ C22] RIP: 0033:0x7fb9de26884e
[ 393.730348] [ C22] Code: 05 0c 00 ba ff ff ff ff 64 c7 00 16 00 00 00 e9 9a fd ff ff e8 d3 88 01 00 0f 1f 00 f3 0f 1e fa 41 89 ca b8 4c 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2a 89 c1 85 c0 74 0f 48 8b 05 a5 05 0c 00 64
[ 393.730349] [ C22] RSP: 002b:00007ffff17b71f8 EFLAGS: 00000246 ORIG_RAX: 000000000000014c
[ 393.730350] [ C22] RAX: ffffffffffffffda RBX: 000055da12573be8 RCX: 00007fb9de26884e
[ 393.730351] [ C22] RDX: 0000000000000900 RSI: 00007ffff17b7340 RDI: 00000000ffffff9c
[ 393.730352] [ C22] RBP: 00007ffff17b7330 R08: 00007ffff17b7210 R09: 00007ffff17b73e0
[ 393.730352] [ C22] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000002
[ 393.730353] [ C22] R13: 000055da12573bd0 R14: 0000000000000000 R15: 0000000000000000
[ 393.730353] [ C22] </TASK>
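One way to check that the three reports above describe a single ongoing stall, rather than three separate ones, is to compare the PID, the kernel RIP and the grace-period number (g=) across them. Below is a minimal Python sketch, assuming the dmesg line format shown above; the helper names parse_stalls and same_stall are made up for illustration and are not part of any existing tool.

```python
import re

# Three lines copied from each stall report in the dmesg output above.
LOG = """\
[ 267.673706] [ C22] rcu: (t=2100 jiffies g=2049 q=339 ncpus=64)
[ 267.673708] [ C22] CPU: 22 UID: 0 PID: 4929 Comm: ls Not tainted 6.12.0release+ #4
[ 267.673710] [ C22] RIP: 0010:rhashtable_insert_slow+0x349/0x500
[ 330.701900] [ C22] rcu: (t=8403 jiffies g=2049 q=2717 ncpus=64)
[ 330.701902] [ C22] CPU: 22 UID: 0 PID: 4929 Comm: ls Not tainted 6.12.0release+ #4
[ 330.701904] [ C22] RIP: 0010:rhashtable_insert_slow+0x349/0x500
[ 393.730285] [ C22] rcu: (t=14706 jiffies g=2049 q=5902 ncpus=64)
[ 393.730287] [ C22] CPU: 22 UID: 0 PID: 4929 Comm: ls Not tainted 6.12.0release+ #4
[ 393.730289] [ C22] RIP: 0010:rhashtable_insert_slow+0x349/0x500
"""

# Start of a report: timestamp, stall duration in jiffies, grace-period number.
STALL = re.compile(r"^\[\s*(\d+\.\d+)\].*rcu: \(t=(\d+) jiffies g=(\d+)")
# Task that was running on the stalled CPU.
TASK = re.compile(r"PID: (\d+) Comm: (\S+)")
# Kernel-mode RIP only (CS selector 0010); the user-space RIP (0033) is skipped.
KERNEL_RIP = re.compile(r"RIP: 0010:(\S+)")


def parse_stalls(log):
    """Return one dict per stall report found in the log text."""
    reports = []
    for line in log.splitlines():
        m = STALL.search(line)
        if m:
            reports.append({
                "time": float(m.group(1)),
                "jiffies": int(m.group(2)),
                "gp": int(m.group(3)),
            })
            continue
        if not reports:
            continue  # ignore anything before the first stall report
        cur = reports[-1]
        m = TASK.search(line)
        if m:
            cur.setdefault("pid", int(m.group(1)))
            cur.setdefault("comm", m.group(2))
            continue
        m = KERNEL_RIP.search(line)
        if m:
            cur.setdefault("rip", m.group(1))
    return reports


def same_stall(reports):
    """True if every report has the same grace period, PID and kernel RIP."""
    keys = {(r.get("gp"), r.get("pid"), r.get("rip")) for r in reports}
    return len(keys) == 1


if __name__ == "__main__":
    reports = parse_stalls(LOG)
    for r in reports:
        print(r)
    print("same stall throughout:", same_stall(reports))
```

The same function can be fed the full output of dmesg instead of the shortened LOG sample, since lines that match none of the three patterns are ignored.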
To be honest, I have no clue what caused this. I transferred a set of 20 x 200 MB files to my multi-tiered FS. During the transfer, I was already pretty annoyed that it took so long, and that the background target had to wake up (which takes some time) before the NFS transfer could start (whereas the SSD foreground/promote targets should be perfectly able to hold the data for a while). But after the transfer completed, I saw in dmesg on the target that it wasn't healthy:
After a while, I was unable to log in again: no SSH, no Samba mounts, no native terminal, nothing. I also wasn't able to unmount the NFS share on the target anymore. The magic SysRq+B keyboard combination made my PC reboot, but apart from that nothing was possible anymore.