snajpa opened 4 weeks ago
(gdb) l *step_into+0x57e
0xffffffff813b9bfe is in step_into (./include/linux/audit.h:375).
370 }
371 static inline void audit_inode(struct filename *name,
372 const struct dentry *dentry,
373 unsigned int aflags) {
374 if (unlikely(!audit_dummy_context()))
375 __audit_inode(name, dentry, aflags);
376 }
377 static inline void audit_file(struct file *file)
378 {
379 if (unlikely(!audit_dummy_context()))
step_into() fs/namei.c#L1874 -> pick_link() fs/namei.c#L1775 -> may_follow_link() fs/namei.c#L1125 -> audit_inode() include/linux/audit.h#L375
OK, this really isn't the first time we've seen this issue, but it was harder to correlate across a myriad of config changes, like turning SELinux off in the kernel config.
This was with OpenZFS @ 8f2f6cd2a; step_into ends a bit earlier here because there's SELinux in the mix:
general protection fault, probably for non-canonical address 0x11c5fee1a757ad96: 0000 [#5037] PREEMPT SMP NOPTI
CPU: 120 PID: 1957892 Comm: systemctl Tainted: G D OE 6.6.21 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.0.3 01/15/2021
In memory cgroup /osctl/pool.tank/group.default/user.3639/ct.24657/user-owned/lxc.payload.24657/system.slice/snapd.service
RIP: 0010:selinux_inode_follow_link+0x96/0xd0
Code: 00 00 8b 5c 01 04 e8 49 fd ff ff 85 c0 75 4c 48 8b 45 38 48 85 c0 74 12 48 63 15 ad af d7 00 48 01 d0 48 3d 00 f0 ff ff 77 16 <0f> b7 50 20 8b 70 1c 49 89 e0 b9 02 00 00 00 89 df e8 04 6e ff ff
RSP: 0018:ffff9cb5b2633be0 EFLAGS: 00010203
RAX: 11c5fee1a757ad76 RBX: 0000000000000001 RCX: ffff8a654183dbe0
RDX: 0000000000000010 RSI: 0000000000000000 RDI: ffff8956997b3218
RBP: ffff8956997b3218 R08: 0000000000000051 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000002 R12: ffff8956997b3218
R13: ffff8982b8f0a600 R14: ffff9cb5b2633e18 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8a8c7ee00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe4fb3a0270 CR3: 00000121fc2b4000 CR4: 0000000000350ee0
Call Trace:
<TASK>
? die_addr+0x36/0x90
? exc_general_protection+0x1af/0x3d0
? asm_exc_general_protection+0x26/0x30
? selinux_inode_follow_link+0x96/0xd0
security_inode_follow_link+0x45/0x70
step_into+0x157/0x760
? srso_return_thunk+0x5/0x10
link_path_walk.part.0.constprop.0+0x246/0x380
? srso_return_thunk+0x5/0x10
? path_init+0x290/0x3c0
path_openat+0xa7/0x1160
? srso_return_thunk+0x5/0x10
do_filp_open+0xb3/0x160
do_sys_openat2+0xab/0xe0
__x64_sys_openat+0x6e/0xa0
do_syscall_64+0x3e/0x90
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7fe4fb694b18
Code: f9 41 89 f0 41 83 e2 40 75 30 89 f0 25 00 00 41 00 3d 00 00 41 00 74 22 44 89 c2 4c 89 ce bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 30 c3 0f 1f 80 00 00 00 00 48 8d 44 24 08 c7
RSP: 002b:00007ffd37c03348 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007fe4fb694b18
RDX: 0000000000080000 RSI: 00007ffd37c033b0 RDI: 00000000ffffff9c
RBP: 00007ffd37c033a0 R08: 0000000000080000 R09: 00007ffd37c033b0
R10: 0000000000000000 R11: 0000000000000287 R12: 00007ffd37c035b7
R13: 00007ffd37c035d0 R14: 00007ffd37c033b0 R15: 00007fe4fb6a92e0
</TASK>
---[ end trace 0000000000000000 ]---
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 0 P4D 0
Oops: 0010 [#1] PREEMPT SMP NOPTI
CPU: 77 PID: 2451139 Comm: ps Tainted: G OE K 6.8.8-2.1 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R7525/0PYVT1, BIOS 2.14.1 12/17/2023
In memory cgroup /osctl/pool.tank/group.default/user.859/ct.6786/user-owned/lxc.payload.6786/system.slice/cron.service
RIP: 0010:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0018:ffffb94a7f353ca0 EFLAGS: 00010246
RAX: ffffffffc18c9b40 RBX: ffffb94a7f353db0 RCX: 0000000000000007
RDX: ffffb94a7f353e28 RSI: ffffa18cd24ed028 RDI: 0000000000000000
RBP: ffffb94a7f353cf8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000002 R12: 0000000000000000
R13: ffffa18cd24ed028 R14: ffffb94a7f353e28 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffa102bf940000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000008a842c000 CR4: 0000000000350ef0
Call Trace:
<TASK>
? __die+0x23/0x70
? page_fault_oops+0x181/0x490
? exc_page_fault+0x68/0x150
? asm_exc_page_fault+0x26/0x30
step_into+0x5be/0x760
path_openat+0x132/0x1180
do_filp_open+0xb3/0x160
do_sys_openat2+0xab/0xe0
__x64_sys_openat+0x6e/0xa0
do_syscall_64+0xab/0x1b0
entry_SYSCALL_64_after_hwframe+0x79/0x81
RIP: 0033:0x7fe4cb378b18
Code: f9 41 89 f0 41 83 e2 40 75 30 89 f0 25 00 00 41 00 3d 00 00 41 00 74 22 44 89 c2 4c 89 ce bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 30 c3 0f 1f 80 00 00 00 00 48 8d 44 24 08 c7
RSP: 002b:00007ffd845bc7d8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe4cb378b18
RDX: 0000000000080000 RSI: 00007fe4cb351b30 RDI: 00000000ffffff9c
RBP: 00007ffd845bc830 R08: 0000000000080000 R09: 00007fe4cb351b30
R10: 0000000000000000 R11: 0000000000000287 R12: 00007ffd845bc897
R13: 00007ffd845bc8b0 R14: 00007fe4cb351b30 R15: 00007fe4cb38c040
</TASK>
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
(gdb) l *step_into+0x5be
0xffffffff813aebae is in step_into (fs/namei.c:1804).
1799 if (!res) {
1800 const char * (*get)(struct dentry *, struct inode *,
1801 struct delayed_call *);
1802 get = inode->i_op->get_link;
1803 if (nd->flags & LOOKUP_RCU) {
1804 res = get(NULL, inode, &last->done);
1805 if (res == ERR_PTR(-ECHILD) && try_to_unlazy(nd))
1806 res = get(link->dentry, inode, &last->done);
1807 } else {
1808 res = get(link->dentry, inode, &last->done);
Interestingly, while this happens, arc_evict and arc_prune are both active at 70-100% CPU each.
This is a list of processes in D state with their stacks after a few processes died with ^ similar oopses: node6-dstakz.txt
OK, now that we have those two ideas invalidated, I'm wondering what this iput is doing there, I see no justification for it: https://github.com/vpsfreecz/zfs/blob/vpsadminos-master-dev/module/os/linux/zfs/zpl_xattr.c#L205
@behlendorf if you have a nanosecond, could you take a look in case I'm missing something, is that iput really needed there? I still have ~6hrs before the template build test shows anything :-D
@snajpa I only took a quick look, but the iput() here looks correct to me. The zfs_lookup() call above will result in a reference being taken on the returned dxzp, which needs to be released after we're done using it in zpl_xattr_readdir().
oh, right, I see zfs_dirlook does that, I didn't look that far :) thank you!
Turns out it's because the OS is allowed to freely deallocate the znode even when there's an SA hold, and maybe under other conditions which I haven't checked for yet.
This patch seems to mitigate:
diff --git a/module/os/linux/zfs/zfs_znode_os.c b/module/os/linux/zfs/zfs_znode_os.c
index a153d0850..87ff4facb 100644
--- a/module/os/linux/zfs/zfs_znode_os.c
+++ b/module/os/linux/zfs/zfs_znode_os.c
@@ -186,7 +186,7 @@ zfs_znode_init(void)
znode_cache = kmem_cache_create("zfs_znode_cache",
sizeof (znode_t), 0, zfs_znode_cache_constructor,
zfs_znode_cache_destructor, NULL, NULL, NULL,
- KMC_SLAB | KMC_RECLAIMABLE);
+ KMC_SLAB);
ASSERT(znode_hold_cache == NULL);
znode_hold_cache = kmem_cache_create("zfs_znode_hold_cache",
ping @amotin: it seems that maybe marking these as evictable isn't that harmless, but perhaps it's also enabling some extra unexpected eviction, at least for znodes?
I'll investigate more tomorrow, but now I have confirmation that the template build completes in 4 hrs less than it took for the bug to really be hit with those traces :) => way faster + bug-free build now
it seems that maybe marking these as evictable isn't that harmless, but perhaps it's also enabling some extra unexpected eviction, at least for znodes?
@snajpa I don't believe the presence of KMC_RECLAIMABLE should do more than move the allocation into a different memory region with more active reuse, and your change just hides it back deeper, which is not a solution for anything.
So I think maybe I got it... now with the zpl_drop_inode fix (which I'm only about 80% sure is needed, I will revalidate in the following days) and the d_tmpfile fix (which for sure has an impact), all that's left is to fix the problem when the superblock is going away. Killing the sb looks like a more complicated multi-step process, and it so happens that the shrinker is deallocated while zfs_prune doesn't know about it. I'm not sure what to do about that; I'd rather avoid the messy sb state detection plus the locking, but I don't know how yet. For any serious sb state detection, access to the sb_lock spinlock seems to be needed; s_active and s_flags (SB_DYING or SB_ACTIVE) don't seem to be enough on a first read. Dunno, needs more work, but I thought I could report the current progress.
I'll also update the two previous PRs with the current state. Might get some sleep first though. Rough 2 weeks :D
Also thank you @amotin, you're absolutely right; it just took longer for the problems to arrive, about 5000 more painful seconds of waiting for the first thing to go wrong with a first stack trace :D
I love this :D squash one bug, uncover another one :-D You know what, when I'm done with this I'll just send one PR and close the two currently-WIP ones; it's shaping up to be several small commits, which seems like less reviewer overhead.
We really need to figure out low memory test scenarios...
System information
Describe the problem you're observing
Turns out the problems logged in #16594 always occurred early after the machine had booted, but that wasn't what was causing the crashes; this seems to be it. After we applied the fix for the Bad Page State bug (https://github.com/openzfs/zfs/commit/b052035990594408899fa32fd4ad6603b75b6c6d), this is what we're left with, at least it's without noise now :D
Something about this step_into rings a bell though, I think I've seen it this year already, grepping through the log archive now.
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs