Open atkinsam opened 4 months ago
@prakashsurya I think you wrote this originally, and @amotin did some work in the ZIL more recently. Do either of you have thoughts on what might be happening here? It does look like zil_lwb_flush_vdevs_done() is called whenever the zio completes, regardless of whether it has an error, but it isn't checking zio->io_error. If the flush zio fails, maybe we should retry it or wait for a txg sync?
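For reference, a minimal sketch of the "wait for a txg sync" option; as the reply below points out, zil_commit_impl() already contains a check of roughly this shape, so this is only an illustration using the zcw_zio_error / zl_dmu_pool names from module/zfs/zil.c:

```c
	/*
	 * Illustration only (not the upstream code); zcw and zilog are
	 * assumed to be the waiter and zilog in scope in the commit path.
	 * If the waiter observed a zio error on its lwb, don't trust the
	 * ZIL write path to have made the record durable; fall back to
	 * waiting for the open txg to sync before acknowledging.
	 */
	if (zcw->zcw_zio_error != 0)
		txg_wait_synced(zilog->zl_dmu_pool, 0);
```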
As I see it, in case of error zil_lwb_flush_vdevs_done() sets zcw->zcw_zio_error, which should be checked by zil_commit_impl(), which then calls txg_wait_synced() before returning. But I suspect there may be a race on parallel workloads: while one thread is waiting for the transaction commit, a parallel one may see lwb_state == LWB_STATE_FLUSH_DONE and skip waiting.
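To make the suspected interleaving concrete, a rough sketch of the scenario (not actual code):

```c
/*
 * Suspected interleaving (sketch only):
 *
 *   Thread A (zil_commit)                Thread B (zil_commit)
 *   ---------------------                ---------------------
 *   lwb write/flush zio fails
 *   zil_lwb_flush_vdevs_done():
 *     zcw->zcw_zio_error = EIO
 *     lwb_state = LWB_STATE_FLUSH_DONE
 *   zil_commit_impl():
 *     sees zcw_zio_error != 0
 *     txg_wait_synced(...)  <-- blocked
 *                                        sees zl_last_lwb_opened in
 *                                        LWB_STATE_FLUSH_DONE, never
 *                                        looks at the error, and skips
 *                                        waiting -> the sync write is
 *                                        acked although the lwb failed
 *                                        and the txg has not synced yet.
 */
```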
I need to run right now, but I wonder if we could reuse the lwb_error field, currently used only for allocation errors, to signal write errors as well, and also check it in zil_prune_commit_list() to block the optimization. Though I may need to look deeper to recall the whole error-handling path there; it was a while ago.
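Roughly, the check I have in mind would look something like this (a sketch; it assumes lwb_error were also set on write/flush failures, and the surrounding zil_prune_commit_list() logic is paraphrased from memory, so details will not match upstream exactly):

```c
	/*
	 * Sketch only: in zil_prune_commit_list(), only skip a waiter when
	 * the last opened lwb reached FLUSH_DONE *and* completed without
	 * error.  itx is the TX_COMMIT itx being examined by the pruning
	 * loop; lwb_error is assumed to cover write/flush errors here, not
	 * just the allocation errors it covers today.
	 */
	lwb_t *last = zilog->zl_last_lwb_opened;

	if (last != NULL &&
	    last->lwb_state == LWB_STATE_FLUSH_DONE &&
	    last->lwb_error == 0) {
		/* Everything before this itx is stable on disk; safe to skip. */
		zil_commit_waiter_skip(itx->itx_private);
	} else {
		/*
		 * The previous lwb is still in flight or failed: keep the itx
		 * on the commit list so the normal error path (and the
		 * txg_wait_synced() fallback) is taken.
		 */
	}
```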
I think that disabling zil_prune_commit_list() may work around only part of the problem. I think it helps because the new LWB write it produces also fails, and that error is what you end up handling, not the original one. It should not help with transient write errors, which may leave the ZIL unreplayable after the failed LWB.

If the previous LWB's ZIO is still running when the next is issued, it will propagate its errors to all the following ZIOs via parent-child relations. But if the LWB is in the LWB_STATE_FLUSH_DONE state, there is no longer a ZIO for zil_lwb_set_zio_dependency() to call zio_add_child() on to propagate the error, so all the following ZIOs up to the TXG commit completion may complete successfully, yet will never be replayable in case of a crash. It also seems not exactly trivial to decide how far along the LWB chain to propagate the error, since at the edge of TXGs zil_sync() will free the stale part of the ZIL chain, and that may vary depending on what ZIOs are still running, etc. It needs much deeper thinking, but my head is currently in several other projects.
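To illustrate the propagation gap (a sketch, not the real zil_lwb_set_zio_dependency(); the lwb/zio field names are approximate):

```c
/*
 * Illustration only.  While the previous lwb's zio still exists, making
 * it a child of the next lwb's zio lets its io_error reach every later
 * lwb and its waiters.
 */
static void
zil_lwb_set_zio_dependency_sketch(zilog_t *zilog, lwb_t *lwb)
{
	lwb_t *prev = list_prev(&zilog->zl_lwb_list, lwb);

	if (prev != NULL && prev->lwb_state != LWB_STATE_FLUSH_DONE) {
		/* prev's write/flush errors propagate into lwb's zio. */
		zio_add_child(lwb->lwb_root_zio, prev->lwb_root_zio);
	}
	/*
	 * If prev is already LWB_STATE_FLUSH_DONE, its zios are gone and
	 * there is nothing to add as a child, so a failure in prev is
	 * invisible here: lwb (and everything after it, up to the txg
	 * commit) can complete "successfully" yet never be replayable,
	 * because the on-disk ZIL chain is broken at prev.
	 */
}
```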
@robn I wonder if you'd be interested as part of your https://github.com/openzfs/zfs/pull/16314 work.
System information
Reproduced with:
Describe the problem you're observing
During some disk availability testing, specifically a test involving preempt-and-abort'ing NVMe reservations on the disks of an imported zpool, we observed ZFS occasionally acknowledging sync writes that were never durably written to disk.
Describe how to reproduce the problem
So far, we have only observed this problem when using NVMe reservations and preempt-and-abort'ing the reservations on the disks of an imported zpool. We have two identical disks in the zpool and no SLOG device. The zpool is created with the two disks in a non-redundant configuration (no raidz, mirror, etc.). We run a fio random-write workload against the pool, preempt-and-abort the disks, crash the file server, and then have fio verify that all ack'd writes are durable.
We observe this issue more frequently when putting these two disks behind an LVM logical volume:
But we have also observed the issue without LVM in the mix. Our assumption is that LVM impacts the frequency of this because it sets REQ_FAILFAST_MASK, increasing the probability of us hitting this race.

We added some probes throughout the code and found that the zil_commit_waiter for our ack'd write is always entering this condition:

https://github.com/openzfs/zfs/blob/2566592045780e7be7afc899c2496b1ae3af4f4d/module/zfs/zil.c#L2784-L2793

and subsequently getting marked as done in zil_commit_waiter_skip. It looks like the zl_last_lwb_opened is getting moved to LWB_STATE_FLUSH_DONE even though it has a non-zero zio error, and the zio error on that previous lwb is never checked when this code decides to mark the waiter as done (it looks like the zios could be freed by this point anyway).

We tried removing the call to zil_prune_commit_list in zil_commit_writer:

https://github.com/openzfs/zfs/blob/2566592045780e7be7afc899c2496b1ae3af4f4d/module/zfs/zil.c#L3144
When we remove this call, our reproducer has never again failed, through thousands of runs.
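For completeness, the change we tested amounts to roughly the following in zil_commit_writer() (a sketch against the commit linked above; argument lists are simplified and may differ from the exact source):

```c
	/*
	 * Experiment only: skip the pruning optimization so that every
	 * commit itx gets an lwb issued for it.  A failed previous lwb then
	 * surfaces as an error on the new lwb's zio instead of the waiter
	 * being silently skipped.  As noted in the comments above, this
	 * appears to mask the symptom rather than fix the underlying
	 * error-propagation gap.
	 */
	zil_get_commit_list(zilog);
#if 0
	zil_prune_commit_list(zilog);
#endif
	zil_process_commit_list(zilog);
```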