Open asomers opened 1 month ago
I saw a very similar stack on Linux with 6.10 when I tried a docker pull of n times more than what could fit in RAM with the zfs docker driver. It went away after upgrading to 6.11, but it was exactly arc_write_done, the same path. It looked like some kind of race with dbuf_evict
- https://github.com/vpsfreecz/zfs/pull/1#issuecomment-2458598745
Looks like we'll need to hunt this one down :D It might be responsible for some of the unexplained crashes on nodes where it's impractical to take a full memory dump.
System information
Describe the problem you're observing
Our servers occasionally crash due to what looks like in-memory corruption of the block pointer. The crashes are not repeatable. That is, the same server never crashes in exactly the same way twice. So we don't think that on-disk corruption is to blame. But they happen often enough that we can't dismiss them as one-off events. We see these crashes approximately once for every 10 PiB of ZFS writes. ZFS encryption is NOT in use on any of the affected servers and they all have ECC RAM.
We would be very grateful to anybody with any insights into this problem.
Describe how to reproduce the problem
Unfortunately, we don't have any reproduction case. There is no known trigger. The crashing servers have a complicated workload that involves writes coming from userspace processes, from zfs recv, and a little bit from ctld.
Include any warning/errors/backtraces from the system logs
A selection of panic messages:
And a representative stack trace