openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Add ability to cause pool to refresh/rewrite all available blocks on underlying physical devices #10591

Open eharris opened 4 years ago

eharris commented 4 years ago

Describe the problem you're observing

Since the primary goal of ZFS is data integrity and data protection, one obvious scenario that isn't yet handled is preventative maintenance to avoid simple magnetic "bit-rot" (which can eventually result in unrecoverable read errors) in pools over time. Please note that this is NOT the same as a scrub/resilver: a scrub only repairs data errors after they have already manifested, which is both riskier than preventing them in the first place and only possible if the pool has redundancy.

This is a feature request to add the ability to do preventative maintenance on high-density magnetic drives (perhaps using PRML or SMR or similar technologies) that are susceptible to degradation of the magnetic signal over time due to ambient magnetic fields, without any underlying mechanical or electrical failure of the storage device. I am looking for a way to cause a pool to refresh/re-write every block on the underlying physical storage device(s), even unused blocks (which are also not checked by a scrub), without requiring hacky and potentially dangerous workarounds (see below for details).

My intent would be to use this capability on a regular basis (perhaps every 6 months or a year) to ensure that drives won't suffer from bit-rot on infrequently written portions of the magnetic media. The ability to throttle the "rewrite" process, or to restrict it to certain parts of a pool, would be nice, but for me that is a secondary concern.
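
Purely to illustrate the kind of interface I have in mind, something along these lines (the subcommand name and options here are invented for illustration and do not correspond to any existing zpool functionality):

```sh
# Hypothetical interface sketch -- subcommand and options are invented for illustration.
zpool rewrite tank                # rewrite every block (used and free) on all member devices
zpool rewrite -r 50M tank         # hypothetical throttle to ~50 MB/s per device
zpool rewrite tank raidz2-0       # hypothetically restrict the pass to one vdev
```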

(I did find a possibly related feature in #8230 when I was looking to make sure this issue hadn't already been raised)

Example case:

A several-year-old zpool of 8x 12 TB magnetic hard drives, with large portions of the data in the pool fairly stagnant/unchanging over the entire lifetime of the pool. Some data may be in snapshots that have existed since the original pool was created 5 or more years ago. The underlying hard drives have started to report a few read errors (detected either via SMART long self-tests, which are regularly executed, or via regular scrubs), but exhibit no other signs of device failure, including no increase in the number of reallocated sectors.

A zpool scrub may detect and repair/resilver some unreadable blocks, but it does nothing to prevent the signal in other blocks from degrading far enough to cause future read failures, which are statistically likely to accrue with increasing frequency as the signal of non-refreshed sectors continues to degrade over time.

In my case, scrub did find and successfully resilver some blocks, but some it did not seem to find. SMART tests on the drive report that the number of unreadable blocks decreased after a successful scrub, but they were not eliminated. After that first scrub, and after an additional SMART long self-test that completed with read errors still reported, zpool status continues to report no errors detected on the pool. I can only surmise that the remaining errors are in sectors on the physical devices that are unused or otherwise reserved; since the zpool isn't using them, the scrub never attempted to read them, even though SMART detected them.
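
(For reference, the kind of SMART checking I'm describing looks roughly like this; /dev/sdg is just an example device name:)

```sh
# Start an extended (long) offline self-test on one member disk.
smartctl -t long /dev/sdg

# Afterwards, review the self-test log and the pending/uncorrectable sector counters.
smartctl -l selftest /dev/sdg
smartctl -A /dev/sdg | grep -Ei 'pending|uncorrect'
```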

Workarounds

One potential workaround is to take the pool offline (or make it readonly) and then copy the underlying drives onto themselves using dd or similar tools (e.g. dd conv=noerror,sync if=/dev/sdg of=/dev/sdg). This only works cleanly as long as no read errors are encountered, although if the devices are part of a pool that has redundancy, read errors may still be correctable later by a zpool scrub. It also may not accomplish the desired result at all if the controller tries to be "smart" and skips re-writing sectors that already contain the same data (whether for performance reasons or due to underlying technology choices such as SMR). In my example case above, running this process over the entire pool would take about 2 weeks, an unacceptably long time to have the pool unavailable or readonly.
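
A rough sketch of this dd-based workaround, assuming a pool named tank and example device names (again, this is the hacky and potentially dangerous path, not a recommendation):

```sh
# Take the pool out of service before touching the raw devices.
zpool export tank

# Rewrite each member disk onto itself. conv=noerror,sync continues past read
# errors and pads unreadable blocks with zeros (recoverable later by a scrub
# only if the pool has redundancy); iflag=fullblock avoids short reads.
for dev in /dev/sdg /dev/sdh /dev/sdi; do
    dd if="$dev" of="$dev" bs=1M conv=noerror,sync iflag=fullblock status=progress
done

# Bring the pool back and let a scrub repair anything that was zeroed out.
zpool import tank
zpool scrub tank
```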

Another potential workaround is to add a spare device (zpool add tank spare /dev/new) and then initiate a replace on each underlying device in turn (zpool replace tank /dev/old /dev/new). This has the benefit that it can be done without taking the pool offline, but it has its own set of issues. It requires making physical changes to the storage hardware (prone to mistakes), can only be done on one device at a time (assuming you don't have multiple spares), may consume a spare that is needed if another drive in the pool starts developing errors while the replace is running, and so on. Since it is a manual process, it also carries the risk that devices could accidentally be missed, and an extra spare of the necessary size may not be readily available. It also causes a potentially undesirable rearrangement of devices in the pool unless you do it twice (replace with the spare, then replace back onto the original drive), which still runs the risk that unused portions of the same disk do not get rewritten, and impacts the performance of the pool for a longer time.
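
Sketched with a pool named tank and example device names, that procedure looks roughly like this, repeated for each member disk in turn:

```sh
# Add a spare, then migrate one member disk onto it while the pool stays online.
zpool add tank spare /dev/sdx
zpool replace tank /dev/sdg /dev/sdx
zpool status tank                   # wait for the resilver to complete

# Optionally replace back onto the original drive to keep the pool layout
# unchanged (note: this still may not rewrite unused areas of that disk).
zpool replace tank /dev/sdx /dev/sdg
```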

To me, neither of these workarounds is very acceptable; it would be much better (IMHO) if ZFS provided this capability itself, and it would also pretty clearly be safer if ZFS provided the mechanism to do it.

kernelOfTruth commented 3 years ago

Looks like it could partially be a bumpy road related to the TRIM command and e.g. WD Red SMR drives:

https://ubunlog.com/en/the-use-of-zfs-is-causing-data-loss-on-some-western-digital-discs/

https://www.truenas.com/community/threads/update-wd-red-smr-drive-compatibility-with-zfs.88413/

Mitigating the DM-SMR issues with ZFS

Both iXsystems and Western Digital treat data loss as a serious event. Given these findings, we cannot recommend the use of these WD Red DM-SMR drives in a FreeNAS or TrueNAS system. However, if you do find that you have these drives in an existing system and cannot replace them, there are some ways to potentially mitigate the DM-SMR issues:

1. Disable TRIM on pools with the DM-SMR drives. Disabling TRIM reduces the risk of the drives entering the unresponsive state where I/O cannot be completed. Disabling TRIM will have a negative impact on long-term drive performance but will enable the drives to operate more safely. On FreeNAS 11.3 this is done with setting “Sysctl” “vfs.zfs.trim.enabled=0” via tuneables. On TrueNAS 12.0, TRIM is disabled by default, but can be enabled via the pool webUI in TrueNAS 12.0 U1. You can check via the CLI that “zpool get autotrim” returns the value “off”.
2. If possible, use smaller VDEVs. Mirrors are best and VDEVs with less than 4 drives are better. These actions will increase I/O sizes and reduce the resilver times significantly.
3. Use a ZFS dataset record size and ZVOL block size that is large enough to force larger than 64K writes to each drive. If you have a ZFS RaidZ VDEV of <5 data drives, use 256K or higher. If your VDEV has more drives, use 512K or higher.
4. Within ZFS, there is a parameter which is called “allocators” which determines roughly the parallelism of WRITES to a drive. By setting the sysctl “vfs.zfs.spa_allocators=1” via tuneables in the webUI, the randomness of WRITES is reduced and this improves the performance of the SMR drives.
5. Upgrading to TrueNAS 12.0 with OpenZFS 2.0 may be beneficial. There are algorithmic changes which change TRIM from sending immediate (synchronous) TRIM commands to sending background asynchronous TRIM commands where smaller TRIMs may be skipped. AutoTRIM can be disabled in favor of manual or scheduled TRIM tasks, but these manual TRIM tasks may overwhelm a DM-SMR drive. In addition, TrueNAS 12.0 includes Asynchronous Copy-on-Write (CoW), which reduces the number of smaller WRITES in some access patterns. These improvements may improve performance of DM-SMR drives but have yet to be validated. In the meantime, it is recommended that TRIM be disabled.
It should be noted that the above recommendations have not been tested in a large population of systems to see whether they are sufficient to avoid future issues. They are provided as technical advice to assist in a transition away from these DM-SMR drives. iXsystems recommends that WD Red DM-SMR drives be avoided for use with ZFS wherever possible.
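
For reference, checking and disabling autotrim as in recommendation 1 above looks roughly like this on an OpenZFS 2.x system (the pool name is an example):

```sh
# Confirm that automatic TRIM is off for the pool (expected value: "off").
zpool get autotrim tank

# Disable it explicitly if needed.
zpool set autotrim=off tank
```
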
kernelOfTruth commented 3 years ago

The following could be helpful:

https://github.com/Seagate/SMR_FS-EXT4

Both the full kernel source and individual patches are available: https://github.com/Seagate/SMR_FS-EXT4/commits/master

It might give some insights.

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

eharris commented 2 years ago

Still hoping for this

Artoria2e5 commented 1 year ago

psssst… isn't conv=noerror,sync enough to ignore read errors? Maybe with iflag=fullblock for more assurance. Block size and speed tradeoff is pretty bad, admittedly.

(ddrescue has a --same-file option. doesn't defeat the offlining point though!)
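
If that --same-file option is available in your ddrescue build, the in-place pass might look something like this (device and mapfile names are examples, and the pool still has to be offline or read-only):

```sh
# Read every block and write it back to the same device; the mapfile records
# progress and bad areas so an interrupted pass can be resumed.
ddrescue --same-file --force /dev/sdg /dev/sdg /root/sdg-refresh.map
```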

sveken commented 5 months ago

This could be useful for single disk external hard drives that are used for backups once every 6-12 months.

eharris commented 5 months ago

I would argue that this is useful for any magnetic storage drives in any pool, not just ones that are rarely used or for backups.

amotin commented 5 months ago

SSDs must be doing this refresh at the firmware level; they should just need some periodic power-on time, and rewriting from the outside would only wear out the flash faster. I haven't even heard of such a refresh for HDDs, but if we assume it exists, I'd expect the combination of SMART tests and ZFS scrubs to give the firmware enough chances to do its job. For the drive-managed SMR HDDs most people use, doing this in software would be a pain for both software and firmware, while the firmware should also be able to do it better, if it is really needed. IMHO, all that software can do is use TRIM/UNMAP to report empty ranges, allowing the firmware to better manage the ranges that are important.

eharris commented 5 months ago

@amotin as pointed out in the OP, the point of this feature is to prevent magnetic media data errors from occurring. SMART tests and scrubs are only about detection, not prevention. TRIM (in the context of this ticket) is irrelevant, because it does nothing to prevent bit-rot of existing data; it merely marks unused areas to improve performance when writing new data, and it is also only available on a very small subset of magnetic drives.

amotin commented 5 months ago

@eharris All modern drives, both SSDs and HDDs, are able and expected to detect and correct small data errors without ever reporting them to software. AFAIK SSD firmware does this routinely, refreshing data that is approaching a dangerous level of errors or that has simply been accessed too many times. I don't know whether HDD firmware actually refreshes unstable sectors or only recovers them each time on read, but who am I to tell HDD vendors what to do? TRIM makes it easier for the firmware to do a refresh, by not refreshing what is not needed and by providing more free space to work with.