ZFS kernel panic VERIFY3(sa.sa_magic == SA_MAGIC) in zpl_get_file_info() on Ubuntu 22.04

siebenmann commented 5 months ago

System information

Type	Version/Name
Distribution Name	Ubuntu
Distribution Version	22.04
Kernel Version	5.15.0-88-generic
Architecture	x86_64
OpenZFS Version	v2.1.5-1ubuntu6~22.04.1

Describe the problem you're observing

We operate a number of ZFS fileservers exporting filesystems via NFS; these are the descendants of what were originally Solaris fileservers set up in 2008, with filesystems moved from generation to generation with 'zfs send | zfs recv'. As a result, we wound up with some filesystems that were still on ZFS filesystem version 4. In early December of 2023, we did 'zfs upgrade' to upgrade all of those to ZFS filesystem version 5. Since then, two fileservers have experienced a ZFS kernel panic due to an assertion failure, one once and one fileserver twice. These panics can't be provoked by simply reading the filesystems (we do this regularly for backups, and we did this for testing). We are not using encryption.

It's possible that this is related to #12659, #13144, and/or #13937, however our situation seems somewhat different than all of these (eg, lack of encryption).

Include any warning/errors/backtraces from the system logs

This is from our most recent panic and crash (and automated reboot because we set ZFS panics to be kernel panics and we reboot on panics), which happened the second time on the same fileserver. Past panics have had different values for the actual sa.sa_magic field: 1446876386, 1682687197, and 1682672016. The stack traces are otherwise consistent with each other (well, from zpl_get_file_info() backward).

[8956269.023527] VERIFY3(sa.sa_magic == SA_MAGIC) failed (1690920705 == 3100762)
[8956269.030785] PANIC at zfs_quota.c:89:zpl_get_file_info()
[8956269.036291] Kernel panic - not syncing: VERIFY3(sa.sa_magic == SA_MAGIC) failed (1690920705 == 3100762)
[8956269.046001] CPU: 4 PID: 4569 Comm: dp_sync_taskq Tainted: P           O      5.15.0-88-generic #98-Ubuntu
[8956269.055873] Hardware name: Supermicro Super Server/X11SPH-nCTF, BIOS 2.0 11/29/2017
[8956269.063830] Call Trace:
[8956269.066536]  <TASK>
[8956269.068894]  show_stack+0x52/0x5c
[8956269.072480]  dump_stack_lvl+0x4a/0x63
[8956269.076413]  dump_stack+0x10/0x16
[8956269.080001]  panic+0x15c/0x334
[8956269.083327]  spl_panic+0xcc/0xe9 [spl]
[8956269.087361]  ? zio_execute_stack_check.constprop.0+0x20/0x20 [zfs]
[8956269.094042]  ? __wake_up+0x13/0x20
[8956269.097708]  ? taskq_dispatch_ent+0xec/0x150 [spl]
[8956269.102778]  ? select_idle_sibling+0x2b/0xa60
[8956269.107407]  ? wake_affine+0x74/0x310
[8956269.111332]  ? newidle_balance+0x12e/0x470
[8956269.115692]  zpl_get_file_info+0x1fe/0x240 [zfs]
[8956269.120781]  dmu_objset_userquota_get_ids+0x382/0x490 [zfs]
[8956269.126761]  dnode_sync+0x557/0x620 [zfs]
[8956269.131179]  ? do_raw_spin_unlock+0x9/0x10 [zfs]
[8956269.136233]  dmu_objset_sync_dnodes+0x68/0x90 [zfs]
[8956269.141526]  sync_dnodes_task+0x2b/0x50 [zfs]
[8956269.146294]  taskq_thread+0x21e/0x400 [spl]
[8956269.150756]  ? wake_up_q+0x90/0x90
[8956269.154428]  ? param_set_taskq_kick+0xf0/0xf0 [spl]
[8956269.159581]  kthread+0x127/0x150
[8956269.163073]  ? set_kthread_struct+0x50/0x50
[8956269.167518]  ret_from_fork+0x1f/0x30
[8956269.171362]  </TASK>
[8956269.173874] Kernel Offset: 0x24800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[8956269.246646] Rebooting in 10 seconds..

rincebrain commented 5 months ago

I am curious if you're using Lustre or not, since one of those seemed to be Lustre calling the interface wrong. (I'm assuming not, or you'd have mentioned it, but it seemed like a reasonable question to be explicit about.)

It reminds me of the illumos bug where zfs upgrade on a clone pre-v5 with native encryption would panic - I know you're not using native encryption, but I'm wondering if it's still doing something unsafe with quotas all the same.

Those SA_MAGIC values being similarly wild makes me wonder if it got the same wild value from something, e.g. a buffer that it shouldn't have, and wrote it out happily, and now reading it back is going "oh no".

I suppose, in theory, if this is reading quota metadata and it's calculated quota data, we could do the same thing I added in the native encryption failure to decrypt quota metadata case and just trigger regenerating it, but that stacktrace suggests it's finding an insane SA, which I don't think the quota stuff uses directly, it just reads it to try and figure out how much space it's using, I imagine...

So it'd probably be useful to backport the changes to ASSERTs from the future that let you print custom info when they trigger, and then make it print the object it's on when it panics like that, so that you can then go poking around with zdb.

Alternately, one imagines asking zdb to walk the relevant dataset will panic in a similar way if you use flags that will make zdb attempt to do similar things...assuming it's an actual on-disk problem and not an in-memory bug, which seeing what object id it's complaining about would be informative for.

If it's a dnode update race, then if Ubuntu hasn't cherrypicked it, some of the fixes around that might be useful.

siebenmann commented 5 months ago

We're not using Lustre; it's all straight NFS (originally v3, now mostly v4). We routinely ask ZFS for all user quota information, so I think raw quota information must be good. Based on the code it looks like the objectid isn't directly available at the VERIFY3, as zpl_get_file_info() is only passed the bonustype and the hypothetical data for the SA (to reach this VERIFY3 we must be working on a SA, not a dnode). Only dmu_objset_userquota_get_ids() has the dnode itself. So in theory if I can find a zdb operation that will walk a ZFS filesystem reading all SAs for all dnodes, it should trigger this failure in user space.

The infrequency of this panic is puzzling to me. Regardless of whether this is on-disk corruption or (consistent) in-memory corruption, almost all of the ZFS filesystem activity is NFS activity, and NFS clients should retry any operation that fails because the fileserver crashes and restarts, which should trigger it again.

I doubt Ubuntu 22.04 has cherry-picked very much into their ZFS, and I can't spot signs of it in their confusing git repository of ZFS.

rincebrain commented 5 months ago

They've written their own data loss-inducing patches before, so I wouldn't recommend trusting them not to break things, in general.

For debugging, I'd probably smuggle the object id in through one of the fields of the zfs_file_info_t, since it's just going to get overwritten later on in normal operation, so things that don't trip the assert shouldn't care.

siebenmann commented 5 months ago

If I'm reading the code correctly, using zdb to search for this requires using zdb -Gddd .... and then looking at the end of the output for the output from module/zfs/sa.c's sa_build_index(), which I believe will look like Buffer Header: %x != SA_MAGIC:%x object=%#llx. It appears that the SA_MAGIC value is only verified in zpl_get_file_info() and here, and this is called from sa_handle_get_from_db() and then sa_handle_get(), which zdb calls for every znode. However I'm not sure I understand what is going on in these two places and why one panics and the other doesn't.

rincebrain commented 5 months ago

If zdb doesn't panic, or the object isn't the same every time this breaks, my guess would be that it's mangling something in memory (and panicking before it makes it out to disk).

You could try a kASAN kernel, see if someone's reaching out and mangling something they shouldn't be, potentially.

I don't recall or see many changes that might be immediately obviously relevant, though I wouldn't be astonished if this is somehow a race in dnode_sync.

You could also try 2.1.15 or 2.2.4 and see if they play any better here, potentially.

siebenmann commented 5 months ago

Here is what I think I see about what is happening in the code, to try to keep track of this.

the panic happens when a dnode's bonus data is supposed to be system attributes, but does not have SA_MAGIC in the right place.
because the call chain is going through dnode_sync(), we are looking at the new dnode (and bonus data) that is supposed to be written out as part of the transaction. As a corollary, this panic is likely a good thing for us because it is probably preventing corrupted SA bonus data being written to disk.
the obvious thing that starts happening when you update a filesystem from ZFS filesystem version 4 to ZFS filesystem version 5 is that filesystem objects start getting their system attributes converted from the old v4 form to the new, better v5 form if they're being updated anyway.

One possible theory is that there is a bug in the attribute update code, one that probably only triggers some of the time (perhaps depending on what else is already in memory).

snajpa commented 11 hours ago

can you try to put together a reproducer? I'd look into this if I can reproduce it locally

siebenmann commented 8 hours ago

Unfortunately we cannot reproduce this issue under any sort of controlled conditions (and we've tried). This panic happens only rarely and is extremely unpredictable. We don't know the trigger conditions, but they're clearly quite rare, with only a small number of crashes ever happening and generally months between crashes (I think we haven't had any since I filed this report).

snajpa commented 7 hours ago

mhm got it... would it be possible for you to try current git master to rule out the possibility of this being fixed there?

siebenmann commented 7 hours ago

These servers are our production ZFS fileservers, so we're not in a position to run current git master on them in order to test things.

snajpa commented 7 hours ago

ok :) one last question, when you say you tried to figure out a reproducer, did you also try creating the test pool with ZoL 0.6.x or the original OI/OpenSolaris?

siebenmann commented 4 hours ago

All of the pools involved in the crashes are new, only the filesystems themselves are old. Authentic versions of test filesystems probably can't be created now except by pre-2012 or so Solaris installs, and we don't have any OI/OpenSolaris machines any more to even start trying that. We did not try to deliberately make version 4 filesystems, upgrade them, and then test them, since we couldn't find any way to reproduce the crashes on our existing authentic filesystems upgraded from version 4 to version 5. We did try poking at existing filesystems in various ways that did not cause crashes.

openzfs / zfs