openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zil_prune_commit_list causes write to be acknowledged without being written durably #16333

Open · atkinsam opened this issue 1 month ago

atkinsam commented 1 month ago

System information

Reproduced with:

Describe the problem you're observing

During some disk availability testing, specifically a test involving preempt-and-abort'ing NVMe reservations on disks on an imported zpool, we observed ZFS occasionally acknowledge sync writes without those writes being durably written to disk.

Describe how to reproduce the problem

So far, we have only observed this problem when using NVMe reservations, specifically when preempt-and-abort'ing reservations on disks in an imported zpool. We have two identical disks in the zpool and no SLOG device. The zpool is created with the two disks in a non-redundant configuration (no raidz, mirror, etc.). We run a fio random-write workload against the pool, preempt-and-abort the reservations on the disks, crash the file server, and then have fio verify that all ack'd writes are durable.
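For reference, the preempt-and-abort step can be driven with nvme-cli along roughly these lines; the device name, namespace ID, reservation keys, and reservation type below are placeholders, not the exact values from our setup:

# Register a key for this host on the namespace (key value is a placeholder)
nvme resv-register /dev/nvme1n1 --namespace-id=1 --nrkey=0xbeef --rrega=0
# Preempt and abort the current holder's registration; racqa=2 selects "preempt and abort"
# (rtype and the keys depend on how the reservation was originally taken)
nvme resv-acquire /dev/nvme1n1 --namespace-id=1 --crkey=0xbeef --prkey=0xcafe --rtype=1 --racqa=2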

We observe this issue more frequently when putting these two disks behind an LVM logical volume:

pvcreate /dev/sdf
pvcreate /dev/sdg
vgcreate zfsvg /dev/sdf /dev/sdg
lvcreate --type raid0 -n zfslv -l '100%VG' zfsvg -I 128k -i 2
zpool create -o ashift=12 -O recordsize=128K scratch /dev/zfsvg/zfslv

But we have also observed the issue without LVM in the mix. Our assumption is that LVM affects how often we hit this because it sets REQ_FAILFAST_MASK, increasing the probability of hitting this race.
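The write workload itself is conceptually along these lines (an illustrative sketch rather than our exact fio job; the mountpoint, file name, and sizes are placeholders):

# Phase 1: sync random writes with verification checksums embedded in each block
fio --name=durability --filename=/scratch/testfile --size=8G --bs=128k \
    --rw=randwrite --ioengine=psync --sync=1 --verify=crc32c
# ... preempt-and-abort the reservations, crash the file server, re-import the pool ...
# Phase 2: verify-only pass over the data written (and acknowledged) before the crash
fio --name=durability --filename=/scratch/testfile --size=8G --bs=128k \
    --rw=randwrite --ioengine=psync --verify=crc32c --verify_only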

We added some probes throughout the code and found that the zil_commit_waiter for our ack'd write is always entering this condition:

https://github.com/openzfs/zfs/blob/2566592045780e7be7afc899c2496b1ae3af4f4d/module/zfs/zil.c#L2784-L2793

and subsequently getting marked as done in zil_commit_waiter_skip. It looks like zl_last_lwb_opened is being moved to LWB_STATE_FLUSH_DONE even though it has a non-zero zio error, and the zio error on that previous lwb is never checked when this code decides to mark the waiter as done (it looks like the zios could have been freed by this point anyway).
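(One way to watch this path fire without patching the code, in case anyone else wants to confirm the same behavior: a bpftrace probe along these lines should work, assuming zil_commit_waiter_skip is not inlined and is resolvable from kallsyms. This is not necessarily how we probed it.)

bpftrace -e 'kprobe:zil_commit_waiter_skip { printf("commit waiter skipped (pid %d, comm %s)\n", pid, comm); @stacks[kstack] = count(); }'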

We tried removing the call to zil_prune_commit_list in zil_commit_writer:

https://github.com/openzfs/zfs/blob/2566592045780e7be7afc899c2496b1ae3af4f4d/module/zfs/zil.c#L3144

With this call removed, our reproducer has not failed once across thousands of runs.

ahrens commented 1 month ago

@prakashsurya I think you wrote this originally, and @amotin did some work in the ZIL more recently. Do either of you have thoughts on what might be happening here? It does look like zil_lwb_flush_vdevs_done() is called whenever the zio completes, regardless of whether it has an error, but it isn't checking zio->io_error. If the flush zio fails, maybe we should retry it or wait for a txg sync?

amotin commented 1 month ago

As far as I can see, in case of error zil_lwb_flush_vdevs_done() sets zcw->zcw_zio_error, which should be checked by zil_commit_impl(), which then calls txg_wait_synced() before returning. But I suspect there may be a race under parallel workloads: while one thread is waiting for the transaction commit, a parallel one may see lwb_state == LWB_STATE_FLUSH_DONE and skip waiting.

I need to run right now, but I wonder if we could reuse the lwb_error field, currently used only for allocation errors, to signal write errors as well, and also check it in zil_prune_commit_list() to block the optimization. Though I may need to look deeper to recall all the error handling paths there; it was a while ago.

amotin commented 1 month ago

I think that disabling zil_prune_commit_list() may work around only part of the problem. I think it helps because the new LWB write it produces also fails, and that error is what you end up handling, not the original one. It should not help with transient write errors, which may leave the ZIL unreplayable past the failed LWB. If the previous LWB's ZIO is still running when the next one is issued, it will propagate its errors to all the following ZIOs via parent-child relations. But if the LWB is already in LWB_STATE_FLUSH_DONE, there is no longer a ZIO for zil_lwb_set_zio_dependency() to call zio_add_child() on to propagate the error, so all the following ZIOs up to the TXG commit completion may complete successfully, but they will never be replayed in case of a crash.

It also seems not exactly trivial to decide how far down the LWB chain to propagate the error, since at the edge of TXGs zil_sync() will free the stale part of the ZIL chain, and that may vary depending on which ZIOs are still running, etc. It needs much deeper thinking, but my head is in several other projects right now.
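(Side note, not a fix: whether the on-disk ZIL chain is still walkable past a failed lwb can be spot-checked with zdb's intent-log dump; the pool name here is the one from the reproducer above.)

# Dump ZIL header and log records for each dataset in the pool
zdb -iv scratch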

@robn I wonder if you'd be interested in this as part of your https://github.com/openzfs/zfs/pull/16314 work.