Open arthurfabre opened 7 months ago
That trace points next to no distance at all into the function, and that function hasn't churned at all since 2020, so I'm pretty sure your pointer maps to 1254 in:
1249 /*
1250 * Allocate directly from a Linux slab. All optimizations are left
1251 * to the underlying cache we only need to guarantee that KM_SLEEP
1252 * callers will never fail.
1253 */
1254 if (skc->skc_flags & KMC_SLAB) {
1255 struct kmem_cache *slc = skc->skc_linux_cache;
1256 do {
1257 obj = kmem_cache_alloc(slc, kmem_flags_convert(flags));
1258 } while ((obj == NULL) && !(flags & KM_NOSLEEP));
which would require that the cache object passed in is NULL to break that way.
And if we look up the line in zfs_znode_alloc:
zp = kmem_cache_alloc(znode_cache, KM_SLEEP);
And znode_cache should be initialized in zfs_znode_init long before anyone ever wants zfs_znode_alloc, so either someone's calling that in a strange order, or the request to create that cache is failing, which I don't think should happen.
My blind guess would be, from that stack trace, doing the import is causing something to decide the pool is imported and trigger traditional mount -t zfs style calls, and it's racing the pool import and losing. (Other option would be that zpool import itself is doing the mounts, but I'd be somewhat surprised if that's breaking reliably and nobody else noticed it being broken yet.)
You could try doing zpool import -N [pool] and see if that fails the same way (since -N should preclude zpool import itself triggering any mounts...)
e: The last couple dozen lines of /proc/spl/kstat/zfs/dbgmsg would also potentially be informative for what it was doing.
Also, any module parameters set to non-default values? (I'm trying to speculate why this might be breaking for you specifically and not anyone else so far...)
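For anyone who wants to check a parameter programmatically rather than by eye, here is a minimal sketch in C. It assumes the parameter is exposed as a plain integer in a sysfs-style file; the helper name read_module_param is made up for illustration and is not part of SPL or ZFS.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of SPL/ZFS): read an integer module
 * parameter from a sysfs-style file.
 * Returns 0 on success and stores the value in *out; returns -1 if the
 * file cannot be opened or does not start with an integer.
 */
int read_module_param(const char *path, long long *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    /* The file is assumed to hold a single decimal value. */
    int matched = fscanf(f, "%lld", out);
    fclose(f);

    return (matched == 1) ? 0 : -1;
}
```

On a typical Linux system the path would presumably be something like /sys/module/spl/parameters/spl_kmem_cache_slab_limit, but treat that as an assumption and check what your kernel actually exposes.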
Thanks for looking into this!
any module parameters set to non-default values?
Winner winner chicken dinner! I completely forgot I had:
spl_kmem_cache_slab_limit=0
Removing it resolves the issue. I think setting it to 0 broke recently with 29ea6fa:
1. KMC_KVMEM for spl_kmem_cache_create() (I think).
2. size > spl_kmem_cache_slab_limit will always be true.
3. spl_kmem_cache_create returns NULL.
4. spl_kmem_cache_alloc is called with a NULL spl_kmem_cache_t *.
That seems a bit foot-gunny, I think if spl_kmem_cache_slab_limit is set lower than what any KMC_KVMEM cache tries to allocate this issue will happen.
But I don't have an actual reason to set it to 0, I think it's leftover debugging from #11574. Happy to close this from my end.
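To make that chain concrete, here is a small userspace model in C. It is only a sketch of my reading of the failure, not the SPL code: every model_* name and MODEL_KMC_KVMEM is a hypothetical stand-in, and the failure condition in model_cache_create is an assumption taken from the description above rather than from the real source.

```c
#include <stddef.h>
#include <stdlib.h>

#define MODEL_KMC_KVMEM 0x1u /* hypothetical stand-in for KMC_KVMEM */

/* Hypothetical stand-in for spl_kmem_cache_t. */
struct model_cache {
    unsigned int flags;
    size_t obj_size;
};

/* Stand-in for the spl_kmem_cache_slab_limit module parameter.
 * The initial value is arbitrary and only matters for this model. */
size_t model_slab_limit = 16384;

/* Model of cache creation: assumed to fail when a KVMEM cache is
 * larger than the slab limit, as described in the steps above. */
struct model_cache *model_cache_create(size_t size, unsigned int flags)
{
    if ((flags & MODEL_KMC_KVMEM) && size > model_slab_limit)
        return NULL;

    struct model_cache *cache = malloc(sizeof(*cache));
    if (cache == NULL)
        return NULL;
    cache->flags = flags;
    cache->obj_size = size;
    return cache;
}

/* Model of allocation. The NULL guard is added for this model only;
 * the excerpt quoted earlier in the thread reads skc->skc_flags right
 * away, which is where a NULL cache would fault. */
void *model_cache_alloc(struct model_cache *cache)
{
    if (cache == NULL)
        return NULL;
    return malloc(cache->obj_size);
}
```

With model_slab_limit set to 0, any non-zero size exceeds the limit, so model_cache_create returns NULL for every KVMEM cache and the caller ends up handing NULL to the allocator.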
Edit: removed my completely wrong earlier interpretation.
System information
Describe the problem you're observing
After upgrading from OpenZFS 2.2.2-4 to OpenZFS 2.2.3-2, zpool import $poolname hangs forever, and kernel logs show a NULL pointer dereference.
Downgrading to OpenZFS 2.2.2-4 (by manually installing packages from Debian's archive) resolves the issue.
Describe how to reproduce the problem
I can consistently reproduce this with zpool import $poolname (after masking zfs.target on boot with systemd.mask=zfs.target to be able to boot).
Include any warning/errors/backtraces from the system logs
Disassembly of spl.ko with: