openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS kernel panic VERIFY3(sa.sa_magic == SA_MAGIC) in zpl_get_file_info() on Ubuntu 22.04 #16276

Open. siebenmann opened this issue 4 months ago

siebenmann commented 4 months ago

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  22.04
Kernel Version        5.15.0-88-generic
Architecture          x86_64
OpenZFS Version       v2.1.5-1ubuntu6~22.04.1

Describe the problem you're observing

We operate a number of ZFS fileservers exporting filesystems via NFS; these are the descendants of what were originally Solaris fileservers set up in 2008, with filesystems moved from generation to generation with 'zfs send | zfs recv'. As a result, we wound up with some filesystems that were still on ZFS filesystem version 4. In early December of 2023 we ran 'zfs upgrade' to bring all of those to ZFS filesystem version 5. Since then, two fileservers have experienced a ZFS kernel panic due to an assertion failure, one of them once and the other twice. These panics can't be provoked by simply reading the filesystems (we read them regularly for backups, and we did so deliberately for testing). We are not using encryption.

It's possible that this is related to #12659, #13144, and/or #13937; however, our situation seems somewhat different from all of those (e.g., we are not using encryption).

Include any warning/errors/backtraces from the system logs

This is from our most recent panic and crash (followed by an automated reboot, since we set ZFS panics to be kernel panics and reboot on panics); it was the second panic on this particular fileserver. Past panics have had different values in the actual sa.sa_magic field: 1446876386, 1682687197, and 1682672016. The stack traces are otherwise consistent with each other (at least from zpl_get_file_info() backward).

[8956269.023527] VERIFY3(sa.sa_magic == SA_MAGIC) failed (1690920705 == 3100762)
[8956269.030785] PANIC at zfs_quota.c:89:zpl_get_file_info()
[8956269.036291] Kernel panic - not syncing: VERIFY3(sa.sa_magic == SA_MAGIC) failed (1690920705 == 3100762)
[8956269.046001] CPU: 4 PID: 4569 Comm: dp_sync_taskq Tainted: P           O      5.15.0-88-generic #98-Ubuntu
[8956269.055873] Hardware name: Supermicro Super Server/X11SPH-nCTF, BIOS 2.0 11/29/2017
[8956269.063830] Call Trace:
[8956269.066536]  <TASK>
[8956269.068894]  show_stack+0x52/0x5c
[8956269.072480]  dump_stack_lvl+0x4a/0x63
[8956269.076413]  dump_stack+0x10/0x16
[8956269.080001]  panic+0x15c/0x334
[8956269.083327]  spl_panic+0xcc/0xe9 [spl]
[8956269.087361]  ? zio_execute_stack_check.constprop.0+0x20/0x20 [zfs]
[8956269.094042]  ? __wake_up+0x13/0x20
[8956269.097708]  ? taskq_dispatch_ent+0xec/0x150 [spl]
[8956269.102778]  ? select_idle_sibling+0x2b/0xa60
[8956269.107407]  ? wake_affine+0x74/0x310
[8956269.111332]  ? newidle_balance+0x12e/0x470
[8956269.115692]  zpl_get_file_info+0x1fe/0x240 [zfs]
[8956269.120781]  dmu_objset_userquota_get_ids+0x382/0x490 [zfs]
[8956269.126761]  dnode_sync+0x557/0x620 [zfs]
[8956269.131179]  ? do_raw_spin_unlock+0x9/0x10 [zfs]
[8956269.136233]  dmu_objset_sync_dnodes+0x68/0x90 [zfs]
[8956269.141526]  sync_dnodes_task+0x2b/0x50 [zfs]
[8956269.146294]  taskq_thread+0x21e/0x400 [spl]
[8956269.150756]  ? wake_up_q+0x90/0x90
[8956269.154428]  ? param_set_taskq_kick+0xf0/0xf0 [spl]
[8956269.159581]  kthread+0x127/0x150
[8956269.163073]  ? set_kthread_struct+0x50/0x50
[8956269.167518]  ret_from_fork+0x1f/0x30
[8956269.171362]  </TASK>
[8956269.173874] Kernel Offset: 0x24800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[8956269.246646] Rebooting in 10 seconds..
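
For reference, the constant on the right-hand side of the VERIFY3 message, 3100762, is SA_MAGIC (0x2F505A) in decimal; the left-hand number is whatever was actually sitting in the sa_magic field. A minimal userland check of that arithmetic, using the values from our panics (this is not ZFS code):

/* Decode the numbers from the VERIFY3 message (userland sketch only). */
#include <stdio.h>
#include <stdint.h>

#define SA_MAGIC 0x2F505AU	/* 3100762 decimal, matching the VERIFY3 output */

int
main(void)
{
	/* sa_magic values observed across our panics */
	uint32_t seen[] = { 1690920705U, 1682687197U, 1682672016U, 1446876386U };

	printf("SA_MAGIC = %u (0x%X)\n", SA_MAGIC, SA_MAGIC);
	for (size_t i = 0; i < sizeof (seen) / sizeof (seen[0]); i++)
		printf("observed sa_magic %u = 0x%08X\n", seen[i], seen[i]);
	return (0);
}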
rincebrain commented 4 months ago

I am curious whether you're using Lustre or not, since one of those linked issues seemed to be Lustre calling the interface wrong. (I'm assuming not, or you'd have mentioned it, but it seemed like a reasonable question to be explicit about.)

It reminds me of the illumos bug where running zfs upgrade on a pre-v5 clone with native encryption would panic; I know you're not using native encryption, but I'm wondering if it's still doing something unsafe with quotas all the same.

Those SA_MAGIC values being similarly wild makes me wonder if it got the same wild value from something, e.g. a buffer that it shouldn't have, and wrote it out happily, and now reading it back is going "oh no".

I suppose, in theory, if this is reading quota metadata, and that's calculated data anyway, we could do the same thing I added for the native-encryption case where quota metadata fails to decrypt, and just trigger regenerating it. But that stack trace suggests it's finding an insane SA, which I don't think the quota code uses directly; it just reads it to try to figure out how much space the object is using, I imagine...

So it'd probably be useful to backport the newer ASSERT changes that let you print custom info when they trigger, and then make this one print the object it's on when it panics, so that you can go poking around with zdb afterwards.
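
As a sketch rather than an actual patch, I'm thinking of something along these lines; the object number would still have to be plumbed down from the caller, since zpl_get_file_info() doesn't currently get it, and the names here are approximate:

/*
 * Sketch only: open-code the check instead of the bare VERIFY3, so the
 * panic message carries some context.  "obj" stands in for an object
 * number plumbed down from the caller; zpl_get_file_info() does not
 * currently receive it.
 */
if (sa.sa_magic != SA_MAGIC) {
	cmn_err(CE_PANIC, "zpl_get_file_info: sa_magic %u (0x%x) != "
	    "SA_MAGIC (0x%x), bonustype %d, obj %llu",
	    sa.sa_magic, sa.sa_magic, SA_MAGIC, (int)bonustype,
	    (u_longlong_t)obj);
}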

Alternately, one imagines that asking zdb to walk the relevant dataset would fail in a similar way, if you use flags that make zdb attempt to do similar things... assuming it's an actual on-disk problem and not an in-memory bug, which is exactly what seeing which object id it's complaining about would help determine.

If it's a dnode update race, then some of the fixes in that area might be useful, assuming Ubuntu hasn't already cherry-picked them.

siebenmann commented 4 months ago

We're not using Lustre; it's all straight NFS (originally v3, now mostly v4). We routinely ask ZFS for all user quota information, so I think the raw quota information must be good. Based on the code, it looks like the object id isn't directly available at the VERIFY3: zpl_get_file_info() is only passed the bonustype and the purported SA data (to reach this VERIFY3 we must be working on an SA, not a dnode). Only dmu_objset_userquota_get_ids() has the dnode itself. So in theory, if I can find a zdb operation that walks a ZFS filesystem reading all SAs for all dnodes, it should trigger this failure in user space.
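
For concreteness, this is roughly the shape of the interface as I read it; it's a paraphrase from memory rather than a verbatim copy of the sources, and the field names may be off:

/*
 * Approximate shape of the pieces involved (paraphrased, not verbatim).
 * The per-objset-type callback only ever sees the bonus type and the
 * raw bonus/spill data, so by the time the VERIFY3 fires the object
 * number is no longer in scope; only the caller in the sync path still
 * has the dnode.
 */
typedef struct zfs_file_info {
	uint64_t zfi_user;		/* owning uid */
	uint64_t zfi_group;		/* owning gid */
	uint64_t zfi_project;		/* project id */
	uint64_t zfi_generation;	/* znode generation */
} zfs_file_info_t;

/* the callback where the VERIFY3 lives */
int zpl_get_file_info(dmu_object_type_t bonustype, const void *data,
    zfs_file_info_t *zfi);

/* the caller in the sync path; the only place that still has the dnode */
void dmu_objset_userquota_get_ids(dnode_t *dn, boolean_t before,
    dmu_tx_t *tx);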

The infrequency of this panic is puzzling to me. Regardless of whether this is on-disk corruption or (consistent) in-memory corruption, almost all of the activity on these ZFS filesystems is NFS activity, and NFS clients retry any operation that fails because the fileserver crashed and restarted, so whatever triggered the panic should be retried after the reboot and trigger it again.

I doubt Ubuntu 22.04 has cherry-picked very much into its ZFS, and I can't spot signs of it in their (confusing) ZFS git repository.

rincebrain commented 4 months ago

They've written their own data-loss-inducing patches before, so in general I wouldn't recommend trusting them not to break things.

For debugging, I'd probably smuggle the object id in through one of the fields of the zfs_file_info_t, since it's just going to get overwritten later on in normal operation, so things that don't trip the assert shouldn't care.
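
As a sketch of what that plumbing could look like, assuming the 2.1.x structure described above (field and variable names are mine, not the tree's):

/*
 * Debug-only plumbing sketch: carry the object number down into the
 * callback so the failure path can report it.  The extra field is
 * scratch space; callers that never trip the check don't care.
 */

/* 1. add a scratch field to zfs_file_info_t */
	uint64_t zfi_obj;	/* debug: object number, filled in by caller */

/* 2. in dmu_objset_userquota_get_ids(), before invoking the callback */
	zfi.zfi_obj = dn->dn_object;

/* 3. in zpl_get_file_info(), replace the bare VERIFY3 */
	if (sa.sa_magic != SA_MAGIC) {
		cmn_err(CE_PANIC, "sa_magic %u (0x%x) != SA_MAGIC on "
		    "obj %llu (bonustype %d)", sa.sa_magic, sa.sa_magic,
		    (u_longlong_t)zfi->zfi_obj, (int)bonustype);
	}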

siebenmann commented 4 months ago

If I'm reading the code correctly, using zdb to search for this requires running zdb -Gddd .... and then looking near the end of the output for the message from module/zfs/sa.c's sa_build_index(), which I believe will look like Buffer Header: %x != SA_MAGIC:%x object=%#llx. It appears that the SA_MAGIC value is only verified in zpl_get_file_info() and in sa_build_index(); sa_build_index() is called from sa_handle_get_from_db() and, above that, sa_handle_get(), which zdb calls for every znode. However, I'm not sure I understand what is going on in these two places and why one panics while the other doesn't.
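
Paraphrasing the two checks from memory (so the details may be off), the difference seems to be that the sa.c path has an error return available while the quota sync path does not:

/*
 * Paraphrase of the two sa_magic checks, not verbatim source.
 *
 * module/zfs/sa.c, sa_build_index(): log and return an error, which is
 * why zdb can report the object and keep going.
 */
	if (sa_hdr->sa_magic != SA_MAGIC &&
	    sa_hdr->sa_magic != BSWAP_32(SA_MAGIC)) {
		zfs_dbgmsg("Buffer Header: %x != SA_MAGIC:%x object=%#llx",
		    sa_hdr->sa_magic, SA_MAGIC,
		    (u_longlong_t)hdl->sa_bonus->db_object);
		return (SET_ERROR(EIO));
	}

/*
 * zfs_quota.c, zpl_get_file_info(): no error path here, so a bad magic
 * is fatal.
 */
	VERIFY3U(sa.sa_magic, ==, SA_MAGIC);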

rincebrain commented 4 months ago

If zdb doesn't panic, or the object isn't the same every time this breaks, my guess would be that it's mangling something in memory (and panicking before it makes it out to disk).

You could try a kASAN kernel and see if something is reaching out and mangling memory it shouldn't be, potentially.

I don't recall or see many changes that seem immediately and obviously relevant, though I wouldn't be astonished if this is somehow a race in dnode_sync.

You could also try 2.1.15 or 2.2.4 and see if they play any better here, potentially.

siebenmann commented 4 months ago

To try to keep track of this, here is what I think I see happening in the code.

One possible theory is that there is a bug in the attribute update code, one that probably only triggers some of the time (perhaps depending on what else is already in memory).