Add a resilver_txg_start for faster resilvering of multiple disks

Describe the feature you would like to see added to OpenZFS

When a disk is added to a pool that already has a resilver in progress, the current txg for the resilver should also be noted for the new drive as a new resilver_txg_start (or similarly named) value, so that the resulting deferred resilver can end early for that disk.

Basically, when the deferred resilver runs, it can stop resilvering to that drive once it reaches the transaction identified by resilver_txg_start, because the drive should already be up to date on all later transactions (they were resilvered during the earlier pass).

If all disks being resilvered have reached their resilver_txg_start then there are two options. The resilver can end early, rather than continuing all the way to the end for a second time. Alternatively, the behaviour for disks that have already been resilvered could effectively become a scrub (compare against what is already on disk first, instead of writing the data). The latter option arguably isn't necessary, but a second check shouldn't really hurt. Either way, the goal is to reduce unnecessary writing of data that is already present.

For disks added to a pool that is not already resilvering there should be no difference – they will resilver as normal, and if a deferred resilver is triggered by a second disk, the first disk should be skipped (as it will lack a resilver_txg_start value to resilver up to).
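As a rough sketch of the rule described above (plain Python, not OpenZFS code): the `Disk` class and both function names are hypothetical, and it assumes the deferred resilver works through transactions in txg order, which is how I understand it from the outside.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Disk:
    name: str
    # Hypothetical value from this proposal: the txg the running resilver had
    # reached when this disk was added. None means no resilver was in progress.
    resilver_txg_start: Optional[int] = None


def deferred_pass_needs_write(disk: Disk, txg: int) -> bool:
    """Should the deferred resilver write this txg to this disk?"""
    if disk.resilver_txg_start is None:
        # The disk was covered in full by the first resilver, so skip it.
        return False
    # Only transactions before the recorded start were missed.
    return txg < disk.resilver_txg_start


def deferred_pass_end(disks: Iterable[Disk]) -> Optional[int]:
    """Txg at which the deferred resilver could stop, or None if nothing to do."""
    starts = [d.resilver_txg_start for d in disks
              if d.resilver_txg_start is not None]
    return max(starts) if starts else None
```

With one disk added before any resilver and two more added while it was at txg 400 and txg 650, the deferred pass would skip the first disk entirely and could stop at txg 650.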

How will this feature improve OpenZFS?

Currently when a disk is added to a pool for resilvering (usually via zpool attach or zpool replace) it is given a resilver_txg value (seen via zdb) which tracks its resilvering progress, so that earlier transactions can be skipped as already resilvered. This is also how drives that are temporarily unavailable (offline'd, lost connection etc.) can be resilvered by only copying changes since they went missing, as resilver_txg is set to the last txg they received.

However, when a drive is added to a pool that is already resilvering, there appears to be no such tracking of the transactions it has missed. As a result, the deferred resilver that adding another drive triggers will resilver the entire drive as if it had never been a part of the pool, which is a massive waste of time (plus additional wear on the disk). It also seems to sometimes resilver a drive that wasn't added during an existing resilver. For example, if you add two new disks to the pool, one at a time, the first will be resilvered twice in its entirety, while the second will be partially resilvered (from the current resilver_txg), then fully resilvered again.
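To put rough numbers on the two-disk case, here is a toy calculation in Python. It assumes every txg costs the same to copy and that the first disk is attached right at the start of the resilver. The function names and the figures (1000 txgs, second disk joining at txg 400) are made up purely for illustration.

```python
def writes_current(total_txgs: int, join_txg: int) -> int:
    """Txgs written under the behaviour I observed."""
    first_disk = 2 * total_txgs                         # resilvered in full, twice
    second_disk = (total_txgs - join_txg) + total_txgs  # partial pass, then full pass
    return first_disk + second_disk


def writes_proposed(total_txgs: int, join_txg: int) -> int:
    """Txgs written if the deferred pass stops at resilver_txg_start."""
    first_disk = total_txgs                             # skipped by the deferred pass
    second_disk = (total_txgs - join_txg) + join_txg    # partial pass, then only the gap
    return first_disk + second_disk


total, joined_at = 1000, 400
print(writes_current(total, joined_at), writes_proposed(total, joined_at))
```

In this model the proposed behaviour never writes any txg to a disk more than once, so the total is always just the number of disks multiplied by the number of txgs.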

This is also an issue when a disk is added then detached, leaving a stalled resilver. I had this happen recently when I discovered (to my dismay) that a disk I was adding as a replacement was SMR, which I hadn't realised. I detached the disk, but this left a stalled resilver. I then added a CMR replacement instead, but that replacement has been resilvered twice (once partially, then again from the beginning). While I could have forced the resilver to restart using zpool resilver, I realised this too late, and it seems like a very unintuitive and unnecessary thing to have to do (since ZFS should know which part of the resilvering was missed).

Adding this additional case for resilvering should cover all cases that can be optimised (to avoid resilvering data that doesn't need to be), as a disk should either be outdated (new transactions since it was last seen) or new (missed earlier transactions).

Add a resilver_txg_start for faster resilvering of multiple disks #16774