openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.6k stars 1.75k forks source link

General Protection Fault while looking up a file #16608

Open snajpa opened 4 weeks ago

snajpa commented 4 weeks ago

System information

Type Version/Name
Distribution Name vpsAdminOS
Distribution Version staging
Kernel Version 6.10.12
Architecture x86_64
OpenZFS Version e8cbb5952

Describe the problem you're observing

Turns out the problems logged in #16594 were always early after the machine has booted, but that wasn't the problem causing the crashes, this seems to be it. After we applied fix for the Bad Page State bug (https://github.com/openzfs/zfs/commit/b052035990594408899fa32fd4ad6603b75b6c6d), this is what we're left with, at least it's without noise now :D

Something about this step_into rings a bell though, I think I've seen it this year already, grepping through the log archive now.

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

Oops: general protection fault, probably for non-canonical address 0x7b954b40ed8e20d3: 0000 [#1] PREEMPT SMP NOPTI
CPU: 100 PID: 4140414 Comm: apache2 Not tainted 6.10.12 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R7515/07PXPY, BIOS 2.14.1 12/17/2023
In memory cgroup /osctl/pool.tank/group.default/user.844/ct.6662/user-owned/lxc.payload.6662/system.slice/apache2.service
RIP: 0010:step_into+0x57e/0x760
Code: 83 38 00 0f 84 6b 01 00 00 48 c7 c6 8a 4e 38 8d bf a6 06 00 00 49 c7 c4 f3 ff ff ff e8 fb 27 df ff e9 8a fd ff ff 49 8b 45 30 <4c> 8b 78 08 f6 43 38 40 0f 84 ce 00 00 00 49 83 c6 10 4c 89 ee 31
RSP: 0018:ffffbc951d09fc48 EFLAGS: 00010246
RAX: 7b954b40ed8e20cb RBX: ffffbc951d09fce0 RCX: 000000006700b063
RDX: 0000000000000001 RSI: ffff9c19b3ed72e8 RDI: ffff9c1beb762900
RBP: ffffbc951d09fc98 R08: 0000000000000001 R09: 0000000000000065
R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000
R13: ffff9c19b3ed72e8 R14: ffffbc951d09fd48 R15: 0000000000000000
FS:  00007fd7957f2bc0(0000) GS:ffff9c543fb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005569e774408c CR3: 0000005febc76005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? die_addr+0x32/0x80
 ? exc_general_protection+0x1b9/0x3c0
 ? asm_exc_general_protection+0x22/0x30
 ? step_into+0x57e/0x760
 ? step_into+0x158/0x760
 ? srso_alias_return_thunk+0x5/0xfbef5
 path_lookupat+0x6a/0x1a0
 filename_lookup+0xdf/0x1d0
 vfs_statx+0x92/0x180
 __do_sys_newstat+0x66/0xb0
 do_syscall_64+0x9a/0x1a0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fd79410f725
Code: 00 00 00 75 05 48 83 c4 18 c3 e8 d6 53 02 00 66 0f 1f 44 00 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 04 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 31 b7 2d 00 f7 d8 64 89
RSP: 002b:00007fff31054fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd79410f725
RDX: 00007fff31055070 RSI: 00007fff31055070 RDI: 00007fd7941b5108
RBP: 0000000000000000 R08: 000000000000000a R09: 00007fd79416ede0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fd7941b5108
R13: 0000000000000001 R14: 00007fff31055420 R15: 0000000000007d6d
 </TASK>
---[ end trace 0000000000000000 ]---
RIP: 0010:step_into+0x57e/0x760
Code: 83 38 00 0f 84 6b 01 00 00 48 c7 c6 8a 4e 38 8d bf a6 06 00 00 49 c7 c4 f3 ff ff ff e8 fb 27 df ff e9 8a fd ff ff 49 8b 45 30 <4c> 8b 78 08 f6 43 38 40 0f 84 ce 00 00 00 49 83 c6 10 4c 89 ee 31
RSP: 0018:ffffbc951d09fc48 EFLAGS: 00010246
RAX: 7b954b40ed8e20cb RBX: ffffbc951d09fce0 RCX: 000000006700b063
RDX: 0000000000000001 RSI: ffff9c19b3ed72e8 RDI: ffff9c1beb762900
RBP: ffffbc951d09fc98 R08: 0000000000000001 R09: 0000000000000065
R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000
R13: ffff9c19b3ed72e8 R14: ffffbc951d09fd48 R15: 0000000000000000
FS:  00007fd7957f2bc0(0000) GS:ffff9c543fb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005569e774408c CR3: 0000005febc76005 CR4: 0000000000770ef0

BUG: kernel NULL pointer dereference, address: 0000000000000470
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: Oops: 0000 [#2] PREEMPT SMP NOPTI
CPU: 115 PID: 91359 Comm: php-cgi Tainted: G      D            6.10.12 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R7515/07PXPY, BIOS 2.14.1 12/17/2023
In memory cgroup /osctl/pool.tank/group.default/user.777/ct.6175/user-owned/lxc.payload.6175/system.slice/apache2.service
RIP: 0010:link_path_walk.part.0.constprop.0+0x329/0x370
Code: ff ff 48 8b 54 24 10 4c 8b 6c 24 18 e9 ab fe ff ff ba ec ff ff ff e9 b1 fd ff ff 48 8b 43 30 48 8b 3c 24 8b 50 04 48 8b 40 38 <48> 8b b0 70 04 00 00 e8 ab d3 03 00 31 d2 89 83 e0 00 00 00 48 8b
RSP: 0018:ffffbc94cbde3c48 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffbc94cbde3ce0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 61c8864680b583eb RDI: ffffffff8db8dee0
RBP: 8080808080808080 R08: ffffffffb3bebdbb R09: 0000000000000000
R10: ffff9cae862e4cc0 R11: 0000000000000002 R12: fefefefefefefeff
R13: ffff9cb8d78e706a R14: d0d0d0d0d0d0d0d0 R15: 2f2f2f2f2f2f2f2f
FS:  00007f914b140500(0000) GS:ffff9cd3becc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000470 CR3: 000000622b43c003 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __die+0x1f/0x60
 ? page_fault_oops+0x17d/0x550
 ? exc_page_fault+0x67/0x140
 ? asm_exc_page_fault+0x22/0x30
 ? link_path_walk.part.0.constprop.0+0x329/0x370
 ? link_path_walk.part.0.constprop.0+0xaa/0x370
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? path_init+0x304/0x420
 path_lookupat+0x3e/0x1a0
 filename_lookup+0xdf/0x1d0
 vfs_statx+0x92/0x180
 __do_sys_newlstat+0x63/0xb0
 do_syscall_64+0x9a/0x1a0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f914ad16a26
Code: 34 0e 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 41 89 f8 48 89 f7 48 89 d6 41 83 f8 01 77 29 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 c3 90 48 8b 15 39 34 0e 00 f7 d8 64 89 02
RSP: 002b:00007ffe45056dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
RAX: ffffffffffffffda RBX: 0000000000000046 RCX: 00007f914ad16a26
RDX: 00007ffe45056e30 RSI: 00007ffe45056e30 RDI: 00007ffe450570f0
RBP: 00007ffe45056f00 R08: 0000000000000001 R09: 0000000000000001
R10: 000000000000004a R11: 0000000000000246 R12: 00007ffe450570f0
R13: 0000000000000001 R14: 0000000000000001 R15: 00007ffe45056f20
 </TASK>
CR2: 0000000000000470
---[ end trace 0000000000000000 ]---
RIP: 0010:step_into+0x57e/0x760
Code: 83 38 00 0f 84 6b 01 00 00 48 c7 c6 8a 4e 38 8d bf a6 06 00 00 49 c7 c4 f3 ff ff ff e8 fb 27 df ff e9 8a fd ff ff 49 8b 45 30 <4c> 8b 78 08 f6 43 38 40 0f 84 ce 00 00 00 49 83 c6 10 4c 89 ee 31
RSP: 0018:ffffbc951d09fc48 EFLAGS: 00010246
RAX: 7b954b40ed8e20cb RBX: ffffbc951d09fce0 RCX: 000000006700b063
RDX: 0000000000000001 RSI: ffff9c19b3ed72e8 RDI: ffff9c1beb762900
RBP: ffffbc951d09fc98 R08: 0000000000000001 R09: 0000000000000065
R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000
R13: ffff9c19b3ed72e8 R14: ffffbc951d09fd48 R15: 0000000000000000
FS:  00007f914b140500(0000) GS:ffff9cd3becc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000470 CR3: 000000622b43c003 CR4: 0000000000770ef0
PKRU: 55555554
snajpa commented 4 weeks ago
(gdb) l *step_into+0x57e
0xffffffff813b9bfe is in step_into (./include/linux/audit.h:375).
370     }
371     static inline void audit_inode(struct filename *name,
372                                     const struct dentry *dentry,
373                                     unsigned int aflags) {
374             if (unlikely(!audit_dummy_context()))
375                     __audit_inode(name, dentry, aflags);
376     }
377     static inline void audit_file(struct file *file)
378     {
379             if (unlikely(!audit_dummy_context()))

step_into() fs/namei.c#L1874 -> pick_link() fs/namei.c#L1775 -> may_follow_link() fs/namei.c#L1125 -> audit_inode() include/linux/audit.h#L375

snajpa commented 4 weeks ago

OK, this really is not the first time we're seeing the issue, but it was harder to correlate after a myriad of config changes, like turning off selinux in the kernel config.

This was with OpenZFS @ 8f2f6cd2a:

step_into ends a bit earlier b/c there's selinux in the mix

general protection fault, probably for non-canonical address 0x11c5fee1a757ad96: 0000 [#5037] PREEMPT SMP NOPTI
CPU: 120 PID: 1957892 Comm: systemctl Tainted: G      D    OE      6.6.21 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.0.3 01/15/2021
In memory cgroup /osctl/pool.tank/group.default/user.3639/ct.24657/user-owned/lxc.payload.24657/system.slice/snapd.service
RIP: 0010:selinux_inode_follow_link+0x96/0xd0
Code: 00 00 8b 5c 01 04 e8 49 fd ff ff 85 c0 75 4c 48 8b 45 38 48 85 c0 74 12 48 63 15 ad af d7 00 48 01 d0 48 3d 00 f0 ff ff 77 16 <0f> b7 50 20 8b 70 1c 49 89 e0 b9 02 00 00 00 89 df e8 04 6e ff ff
RSP: 0018:ffff9cb5b2633be0 EFLAGS: 00010203
RAX: 11c5fee1a757ad76 RBX: 0000000000000001 RCX: ffff8a654183dbe0
RDX: 0000000000000010 RSI: 0000000000000000 RDI: ffff8956997b3218
RBP: ffff8956997b3218 R08: 0000000000000051 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000002 R12: ffff8956997b3218
R13: ffff8982b8f0a600 R14: ffff9cb5b2633e18 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8a8c7ee00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe4fb3a0270 CR3: 00000121fc2b4000 CR4: 0000000000350ee0
Call Trace:
 <TASK>
 ? die_addr+0x36/0x90
 ? exc_general_protection+0x1af/0x3d0
 ? asm_exc_general_protection+0x26/0x30
 ? selinux_inode_follow_link+0x96/0xd0
 security_inode_follow_link+0x45/0x70
 step_into+0x157/0x760
 ? srso_return_thunk+0x5/0x10
 link_path_walk.part.0.constprop.0+0x246/0x380
 ? srso_return_thunk+0x5/0x10
 ? path_init+0x290/0x3c0
 path_openat+0xa7/0x1160
 ? srso_return_thunk+0x5/0x10
 do_filp_open+0xb3/0x160 
 do_sys_openat2+0xab/0xe0
 __x64_sys_openat+0x6e/0xa0
 do_syscall_64+0x3e/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7fe4fb694b18
Code: f9 41 89 f0 41 83 e2 40 75 30 89 f0 25 00 00 41 00 3d 00 00 41 00 74 22 44 89 c2 4c 89 ce bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 30 c3 0f 1f 80 00 00 00 00 48 8d 44 24 08 c7
RSP: 002b:00007ffd37c03348 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007fe4fb694b18
RDX: 0000000000080000 RSI: 00007ffd37c033b0 RDI: 00000000ffffff9c
RBP: 00007ffd37c033a0 R08: 0000000000080000 R09: 00007ffd37c033b0
R10: 0000000000000000 R11: 0000000000000287 R12: 00007ffd37c035b7
R13: 00007ffd37c035d0 R14: 00007ffd37c033b0 R15: 00007fe4fb6a92e0
 </TASK>
---[ end trace 0000000000000000 ]---
snajpa commented 4 weeks ago
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 0 P4D 0
Oops: 0010 [#1] PREEMPT SMP NOPTI
CPU: 77 PID: 2451139 Comm: ps Tainted: G           OE K    6.8.8-2.1 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R7525/0PYVT1, BIOS 2.14.1 12/17/2023
In memory cgroup /osctl/pool.tank/group.default/user.859/ct.6786/user-owned/lxc.payload.6786/system.slice/cron.service
RIP: 0010:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0018:ffffb94a7f353ca0 EFLAGS: 00010246
RAX: ffffffffc18c9b40 RBX: ffffb94a7f353db0 RCX: 0000000000000007
RDX: ffffb94a7f353e28 RSI: ffffa18cd24ed028 RDI: 0000000000000000
RBP: ffffb94a7f353cf8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000002 R12: 0000000000000000
R13: ffffa18cd24ed028 R14: ffffb94a7f353e28 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffa102bf940000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000008a842c000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 ? __die+0x23/0x70
 ? page_fault_oops+0x181/0x490
 ? exc_page_fault+0x68/0x150
 ? asm_exc_page_fault+0x26/0x30
 step_into+0x5be/0x760
 path_openat+0x132/0x1180
 do_filp_open+0xb3/0x160
 do_sys_openat2+0xab/0xe0
 __x64_sys_openat+0x6e/0xa0
 do_syscall_64+0xab/0x1b0
 entry_SYSCALL_64_after_hwframe+0x79/0x81
RIP: 0033:0x7fe4cb378b18
Code: f9 41 89 f0 41 83 e2 40 75 30 89 f0 25 00 00 41 00 3d 00 00 41 00 74 22 44 89 c2 4c 89 ce bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 30 c3 0f 1f 80 00 00 00 00 48 8d 44 24 08 c7
RSP: 002b:00007ffd845bc7d8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe4cb378b18
RDX: 0000000000080000 RSI: 00007fe4cb351b30 RDI: 00000000ffffff9c
RBP: 00007ffd845bc830 R08: 0000000000080000 R09: 00007fe4cb351b30
R10: 0000000000000000 R11: 0000000000000287 R12: 00007ffd845bc897
R13: 00007ffd845bc8b0 R14: 00007fe4cb351b30 R15: 00007fe4cb38c040
 </TASK>
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
(gdb) l *step_into+0x5be
0xffffffff813aebae is in step_into (fs/namei.c:1804).
1799            if (!res) {
1800                    const char * (*get)(struct dentry *, struct inode *,
1801                                    struct delayed_call *);
1802                    get = inode->i_op->get_link;
1803                    if (nd->flags & LOOKUP_RCU) {
1804                            res = get(NULL, inode, &last->done);
1805                            if (res == ERR_PTR(-ECHILD) && try_to_unlazy(nd))
1806                                    res = get(link->dentry, inode, &last->done);
1807                    } else {
1808                            res = get(link->dentry, inode, &last->done);
snajpa commented 3 weeks ago

Interestingly, while this happens, arc_evict and arc_prune are both active at 70-100% cpu each.

This is a list of processes in D state with their stacks after a few processes died with ^ similar oopses: node6-dstakz.txt

snajpa commented 3 weeks ago

OK now that we have those two ideas invalidated, now I'm thinking what's this iput doing there, I see no justification for it: https://github.com/vpsfreecz/zfs/blob/vpsadminos-master-dev/module/os/linux/zfs/zpl_xattr.c#L205

@behlendorf if you have a nanosecond, could you take a look if I'm missing something, is that iput really needed there? I still have ~6hrs before the template build test shows anything :-D

behlendorf commented 3 weeks ago

@snajpa I only took a quick look but the iput() here looks correct to me. The zfs_lookup() call about will result in a reference being taken on the returned dxzp which needs to be release after we're done using it in zpl_xattr_readdir().

snajpa commented 3 weeks ago

oh, right, I see zfs_dirlook does that, didn't look that far :) thank you!

snajpa commented 2 weeks ago

Turns out it's because the OS is allowed to freely deallocate the znode even if there's a SA hold and maybe other conditions which I didn't yet check for.

This patch seems to mitigate:

diff --git a/module/os/linux/zfs/zfs_znode_os.c b/module/os/linux/zfs/zfs_znode_os.c
index a153d0850..87ff4facb 100644
--- a/module/os/linux/zfs/zfs_znode_os.c
+++ b/module/os/linux/zfs/zfs_znode_os.c
@@ -186,7 +186,7 @@ zfs_znode_init(void)
        znode_cache = kmem_cache_create("zfs_znode_cache",
            sizeof (znode_t), 0, zfs_znode_cache_constructor,
            zfs_znode_cache_destructor, NULL, NULL, NULL,
-           KMC_SLAB | KMC_RECLAIMABLE);
+           KMC_SLAB);

        ASSERT(znode_hold_cache == NULL);
        znode_hold_cache = kmem_cache_create("zfs_znode_hold_cache",

ping @amotin it seems that maybe it's not that harmless marking as evictable, but perhaps it's also enabling some extra unexpected eviction, at least for znodes?

I'll investigate more tomorrow, but now I have a confirmation that the template build completes in 4 hrs less than it took for the bug to really be hit with those traces :) => way faster + bug free build now

amotin commented 2 weeks ago

it seems that maybe it's not that harmless marking as evictable, but perhaps it's also enabling some extra unexpected eviction, at least for znodes?

@snajpa I don't believe KMC_RECLAIMABLE presence should do more than moving the allocation into a different memory region with more active reuse, and your change just hides it back deeper, which is not a solution for anything.

snajpa commented 2 weeks ago

So I think maybe I got it... now with the zpl_drop_inode fix (which I'm only about 80% sure it's needed, will revalidate in following days) and the d_tmpfile fix (which for sure has an impact), all that I need is to fix the problem when superblock is going away, the sb killing looks like it's a more complicated multi-step process and it so happens that the shrinker is deallocated while zfs_prune doesn't know about it; I'm not sure what to do about that, I'd rather avoid the messy sb state detection + the locking, but don't know how yet - for any serious sb state detection, access to sb_lock spinlock seems to be needed, s_active and s_flags (SB_DYING or SB_ACTIVE) don't seem to be enough on a first read. Dunno, needs more work, but thought I could report the current progress.

I'll also update the two previous PR with current state. Might get some sleep first though. Rough 2 weeks :D

snajpa commented 2 weeks ago

Also thank you @amotin you're absolutely right; it just took longer for problems to arrive, about 5000 more painful seconds of waiting for the first thing to go wrong with a first stack trace :D

snajpa commented 1 week ago

I love this :D squash one bug, uncover another one :-D you know what when I'm done with this I'll just send one PR and I'll close the two currently WIP, it's shaping up to be several small commits, seems like less reviewer overhead

snajpa commented 1 week ago

We really need to figure out low memory test scenarios...

snajpa commented 1 week ago

Current status: https://github.com/vpsfreecz/zfs/compare/0ba322e5b02a...9190abd7ca6c