behlendorf opened this issue 10 years ago
Because data integrity is always the first priority for ZFS, it would be desirable if we could quickly and cheaply detect when a disk starts mishandling its writes. Specifically, the case where a disk accepts the data and writes it to disk, but in the process damages it, rendering it useless. I've seen this happen for a variety of reasons with a variety of both cheap and expensive hardware. But the bottom line is that when this does happen, the sooner ZFS notices and takes the drive offline the better.
The problem is that the running workload may only need to read known-good blocks from disk, or recently written data which can be serviced entirely from the ARC. In that case the system can potentially operate for a long time without detecting the problem, all the while causing more damage.
For this to be noticed relatively quickly, ZFS must regularly read back recently written data from the disk. These read operations will have some performance impact, so the tricky bit here is to minimize that cost as much as possible.
One way this might be accomplished is to extend the label sync code to periodically (or always) read-verify the label after it is written. This has several advantages.
- Improved data-integrity due to immediate label verification.
- Verification reads will be self limited by txg syncs.
- Performance impact is minimized by performing the read immediately after the label is written, thereby minimizing the seek cost.
- In addition to validating the checksum, a full data comparison against the known-good label could be performed.
Another, more generic approach would be to add an optional verify stage into the pipeline. Regardless of what is done, I wanted to get this filed as a possible improvement.
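A minimal sketch of what the label read-verify could look like, assuming the label image is simply written and read back with pwrite()/pread() against the vdev; the function name and the 256 KiB size used here are illustrative, not the actual label sync code:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define LABEL_SIZE	(256 * 1024)	/* illustrative label size */

/*
 * Write a label image at the given offset, then immediately read it back
 * and compare it byte-for-byte against the in-memory copy.  Returns 0 on
 * success, -1 on an I/O error, and 1 if the on-disk data does not match.
 */
static int
label_write_verify(int fd, const void *label, off_t offset)
{
	void *rbuf;
	int rc = 0;

	if (pwrite(fd, label, LABEL_SIZE, offset) != LABEL_SIZE)
		return (-1);
	if (fsync(fd) != 0)			/* flush the write cache */
		return (-1);

	if ((rbuf = malloc(LABEL_SIZE)) == NULL)
		return (-1);
	if (pread(fd, rbuf, LABEL_SIZE, offset) != LABEL_SIZE)
		rc = -1;
	else if (memcmp(label, rbuf, LABEL_SIZE) != 0)
		rc = 1;				/* the device damaged the write */

	free(rbuf);
	return (rc);
}
```

As discussed further down the thread, the read-back may well be satisfied from the drive's cache rather than the medium, so this primarily verifies the transport path.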
Brian, this is a really good idea! I work at a fault-tolerant computer company, and when we first built our own HW disk controllers we had to add a 'verify after write' mode to avoid these kinds of issues. One especially insidious one can bite a user of mirrored disks (because they think they are safe...):
- Write to disk A (okay).
- Write to disk B (okay).
- Some time later, block Q written on disk A goes bad, but no one notices since it is rarely read.
- Disk B fails.
- Admin replaces disk B and kicks off a rebuild.
- Read of block Q fails with a URE.
- User is screwed.
Would it be reasonable to add this as a pool-wide property?
> Because data-integrity is always the first priority for ZFS it would be desirable if we could quickly and cheaply detect when a disk starts mishandling its writes.
Dang, I thought ZFS already did this!
@FransUrbo For reads which have to go to disk this will already happen. But there can be long-running workloads where you just never end up having to read from the disk after it's cached. It's those cases where this could happen.
@dswartz I could see an argument for doing this as either a pool or a dataset property, but it would definitely need to be configurable somehow. As for the damaged mirror case you described above, that's exactly why people should be encouraged to regularly scrub their pool.
Agreed!
Hi guys, just a suggestion.
Since the scenario being discussed is really an ARC/L2ARC issue, and that is always a relatively small amount of data, wouldn't it make sense to add a new feature called "revalidate ARC" or something similar?
It could run every 5 minutes, or some configurable interval. This way the entire pool or filesystem doesn't need to be re-read over and over again (which is the same function as a scrub anyway); it just needs to validate that the data in the cache isn't different from the data on disk. If it is, a deep inspection or anything else can occur.
This has the added benefit of being scalable and measurable, since the ARC size is never more than a few percent of the total zpool size. And a system can back off at times of high load; for example, this could be set to never take more than X ms per run or per slice of the ARC.
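A rough sketch of the time-budget idea, assuming some other code maintains a list of recently written blocks to re-check; verify_block(), the list structure, and the budget value are all hypothetical, not existing ZFS interfaces:

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical descriptor for a recently written block to re-read. */
typedef struct reverify_entry {
	struct reverify_entry *next;
	/* device, offset, size, expected checksum, ... */
} reverify_entry_t;

/* Hypothetical: re-read the block and compare it to its known checksum. */
extern bool verify_block(reverify_entry_t *e);

static long long
now_ms(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000);
}

/*
 * Verify queued blocks, but never spend more than budget_ms per pass.
 * Returns where the pass stopped so the next run (say, 5 minutes later)
 * can resume instead of starting over; NULL means the list was covered.
 */
static reverify_entry_t *
reverify_pass(reverify_entry_t *head, long long budget_ms, int *bad)
{
	long long deadline = now_ms() + budget_ms;
	reverify_entry_t *e = head;

	while (e != NULL && now_ms() < deadline) {
		if (!verify_block(e))
			(*bad)++;	/* flag for deeper inspection */
		e = e->next;
	}
	return (e);
}
```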
Actually, I'm not really concerned with the integrity of the data in the L2ARC. The checksum will always be validated on read, and if it's ever wrong ZFS just fetches the correct data from disk. This is pretty optimal. What worries me is when ZFS thinks good data was written to the primary pool but the disk/controller threw it on the floor or otherwise damaged it. This may go undetected for a very long time. The idea here is to help us detect misbehaving hardware more quickly and as cheaply as possible.
I'd suggest using some sort of tail scrub here. A configurable scrub which only scrubs the data that has come in since the last scrub or tail scrub would be immensely useful in read-heavy or WORM datasets. I would personally run an hourly or 4x-daily tail scrub depending on the drives, and then only do weekly full scrubs instead of daily.
Just to clarify my suggestion: I wasn't talking about validating the L2ARC, I was suggesting that we use the L2ARC as a reference list of pointers to verify on disk (so a new journal or equivalent would not need to be created), to catch on-disk corruption while reads could conceivably be handled by the ARC for an extended period of time.
Creating a long journal with a tail-scrub function would also work if the amount of data written is relatively small AND the L2ARC is relatively big. However, if you want to re-read every byte written without doing a scrub, the earlier suggestion of a new sync= value like "paranoid" might be the right approach, since for admittedly read-heavy zvols the penalty will be very small and the ZIL will cushion the write/read impact anyway.
For those of us using ESXi against ZFS volumes, we care a lot about read/write latency and would love a mode where we can leave sync=standard and not impact NFS synchronous behavior. So if there were a way to do this and still return from the write once the data has been committed, but while the verify is still underway, that would be awesome.
To address aarcane's approach: when a full (or incremental) scrub is run, a snapshot pointer could be set, and a scrub that only covers the delta data since that last pointer would accomplish the tail function. This could be run and re-run as often as requested to keep the incremental scrub up to date. The full scrub could be done whenever it suits the user's own data strategy.
I think either or both of these approaches are relatively lightweight to code and could significantly re-use existing functions.
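To make the incremental idea concrete, here is a sketch of the core filter such a "tail scrub" would need, assuming (as is the case for real ZFS block pointers) that every block records the transaction group it was born in; the structure and function here are illustrative, not the actual scan code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative block pointer: ZFS block pointers record a birth TXG. */
typedef struct blkptr_lite {
	uint64_t birth_txg;
	/* DVAs, checksum, ... */
} blkptr_lite_t;

/*
 * A tail scrub walks the pool exactly like a normal scrub but only reads
 * blocks born after the TXG recorded by the previous (full or tail)
 * scrub, so old, already-verified data is skipped entirely.
 */
static bool
tail_scrub_wants_block(const blkptr_lite_t *bp,
    uint64_t last_scrub_txg, uint64_t current_txg)
{
	return (bp->birth_txg > last_scrub_txg &&
	    bp->birth_txg <= current_txg);
}
```

On completion the scrub would record current_txg as the new last_scrub_txg. This is essentially the same TXG-range mechanism the existing resilver code uses for devices that were offline for a while, as noted later in the thread.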
Yes, it seems the difficult part isn't actually "hard": immediately auto-scrubbing (reading back and verifying) all newly written blocks, to ensure at the very least a successful initial write, is simple, almost trivial? A killer feature indeed... Could this feature be added back to milestone 1.0? See also #2832, "Verify checksum of the ZFS module text and rodata before each transaction group commit".
> Another more generic approach would be to add an optional verify stage in to the pipeline.
Does this mean a write response to the client would be delayed until the write has been verified?
FWIW, Seagate SAS and SCSI drives have already implemented this in hardware under the name "Idle Read After Write", and one student apparently built a software prototype (pages 68-73; no code, just analysis and reports).
FWIW, the SCSI protocol has an operation, VERIFY, that is intended to verify that the data written to the medium matches the request. AFAIK, nobody uses this, likely because it will be verrryyyyy slllllloooowwww. From a ZFS perspective, in order to verify that data is on the medium, the write() to the vdev would need to be replaced by a verify(). If you cannot do this, then it is not possible to actually verify the data, because you'll be reading it back from the block device's write cache.
Also, for Flash SSDs, the garbage collection can move data around. So it is not guaranteed that the medium you wrote to is the medium you read from.
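For reference, a minimal sketch of issuing that SCSI VERIFY command from user space on Linux through the SG_IO ioctl, called with a file descriptor for the raw device node; with BYTCHK=0 the drive checks the medium itself and no data crosses the bus. This only illustrates the command, not how ZFS would integrate it:

```c
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/*
 * Issue VERIFY(10) for nblocks logical blocks starting at lba.
 * Returns 0 if the drive reports GOOD status, 1 otherwise (for example
 * CHECK CONDITION on an unreadable block), and -1 if the ioctl fails.
 */
static int
scsi_verify10(int fd, unsigned int lba, unsigned short nblocks)
{
	unsigned char cdb[10] = { 0x2f };	/* VERIFY(10), BYTCHK=0 */
	unsigned char sense[32];
	struct sg_io_hdr io;

	cdb[2] = (lba >> 24) & 0xff;
	cdb[3] = (lba >> 16) & 0xff;
	cdb[4] = (lba >> 8) & 0xff;
	cdb[5] = lba & 0xff;
	cdb[7] = (nblocks >> 8) & 0xff;
	cdb[8] = nblocks & 0xff;

	memset(&io, 0, sizeof (io));
	io.interface_id = 'S';
	io.cmd_len = sizeof (cdb);
	io.cmdp = cdb;
	io.dxfer_direction = SG_DXFER_NONE;	/* no data phase with BYTCHK=0 */
	io.sbp = sense;
	io.mx_sb_len = sizeof (sense);
	io.timeout = 60000;			/* milliseconds */

	if (ioctl(fd, SG_IO, &io) < 0)
		return (-1);
	return (io.status == 0 ? 0 : 1);
}
```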
> So it is not guaranteed that the medium you wrote to is the medium you read from.
That counts as one additional reason to be able to perform a quick incremental scrub.
> Does this mean a write response to the client would be delayed until the write has been verified?
For a synchronous write yes, for a normal asynchronous write no (it would happen as part of the txg sync). The advantage of doing the verify as an optional pipeline stage would be that we still have the original data and can immediately reissue the known-bad write. That may not be possible later if this is a non-redundant pool configuration and the data has already been evicted from the (L2)ARC.
The downsides, as @richardelling pointed out, are that it's likely an immediate read would be served from the device's internal cache and not the physical media. That would let us verify the transport layer, which is nice, but not the physical media itself. The SCSI protocol's VERIFY operation, when available, does seem like the best way to verify the data, but that's really going to hurt performance.
That said, I do think this could be a useful feature. I wouldn't want to enable a verify pipeline stage by default for all writes. But being able to optionally enable it for a dataset could be useful in situations where verified correctness is more important than raw performance.
Enabling verification for all label writes by default, however, I think could be reasonable. I'd expect that to have minimal performance impact, and it should uncover misbehaving hardware more quickly.
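A minimal sketch of such an optional verify-and-retry step, assuming the original write buffer is still in memory so the I/O can simply be reissued on a mismatch; checksum_of() and the helper name are hypothetical stand-ins, not the actual ZIO pipeline API:

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical checksum over a buffer (ZFS would use the block's own). */
extern uint64_t checksum_of(const void *buf, size_t len);

/*
 * Optional verify stage: after the write completes, read the block back
 * and recompute its checksum.  Because the original data is still in
 * memory, a failed verify can simply reissue the write instead of being
 * discovered much later as an unrecoverable error.
 */
static int
write_with_verify(int fd, const void *buf, size_t len, off_t off, int retries)
{
	uint64_t want = checksum_of(buf, len);
	void *rbuf = malloc(len);

	if (rbuf == NULL)
		return (-1);

	for (int i = 0; i <= retries; i++) {
		if (pwrite(fd, buf, len, off) != (ssize_t)len)
			break;
		if (pread(fd, rbuf, len, off) == (ssize_t)len &&
		    checksum_of(rbuf, len) == want) {
			free(rbuf);
			return (0);		/* verified */
		}
		/* mismatch: data is still in buf, so reissue the write */
	}
	free(rbuf);
	return (-1);
}
```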
I've got a system which very occasionally introduces silent corruption on my drives, and it breaks data even with a ZFS mirror (raid1). I could definitely use this feature right now. Is this something that people are still looking at?
I use copies=3 on a single NVMe, as hardware for this use case is limited. Write latency isn't an issue for most of my datasets, so a read-after-write verify in the pipeline, to check the transport and the drive write cache, would be great. A tail scrub would quickly and cheaply restore redundant copies when at least one write (at its eventual SSD location) is good. Having those two features and tracking the stats would save me a world of pain.
Enhancement: using SSD idle time since write as a proxy for "data written to its final location", a tail scrub could scrub just those older writes, or scrub the whole tail but only mark older writes as scrubbed, so newer writes are scrubbed again after enough idle time has passed.
@elahn I think with copies=3 you are over-focusing on one specific kind of data corruption while neglecting others. In my practice I have more than once seen corruption affecting multiple data copies on mirror pools, probably caused by memory corruption.

I don't mind if somebody implements a more selective scrub covering only the later TXGs. Actually, IIRC ZFS can already scrub a specified range of TXGs from the beginning, since that mechanism is used to resilver vdevs that were offline for a while; I think it is just missing the interfaces at this point.

But specifically for SSDs, I am not sure there is such a thing as a "final location" for the data, since the translation layer may periodically regenerate the data or move it between the SLC cache and normal storage, and if you expect corruption on the original writes, I am not sure why the same cannot happen during a regeneration process you cannot control. I'd say you may have higher chances of suffering from memory corruption, whole-NVMe failure, or flash wear-out, compounded by tripling the write volume.