Sector recovery ability

RubenKelevra commented 7 years ago

Currently the block size of a zfs-device is usualy somewhere between 512/4096 Byte and 128 kByte. zfs is often used on hdd storages, which have an internal sector size of 512 or 4096 Byte.

These sectors are secured with data integrity algorithm, which can detect errors a might fix simple bitflips. Else these algorithms are mostly able to detect that one sector is unable to be read.

In this case, one sector of 512 Byte might 'contaminate' the checksum for a whole block of zfs-data.

There's a simple solution in zfs today, copied data. All data is written twice on the same disk, which cut the write-performance in half (under best circumstances). And reduces the maximum storage size to less than 50%.

Else there's the possibility for a raid setup.

What about a simple solution to fix this issue? XOR is very cheap and some bits more per block-size is the effort worth I guess.

We could easily attach the XOR for all sectors to the end of one zfs-block followed by a simple checkum of the XOR sum.

If a sector is unreadable, the XOR is read, the checksum of it is verified, the missing sector is fixed in ram and the zfs-block checksum is verified.

zviratko commented 7 years ago

1) you can't read a sector and "reconstruct it" this way, if bit-rot occurs and passes drive firmware EC codes, it is complete garbage. So you'd need a larger chunk of data. 2) you need to know which bits are broken, this is not something a simple XOR solves 3) you need a large number of EC bits per sector 4) bit-rot usually occurs during reads, reading the sector again might return the correct data (or fail), so it would be easier to just retry

I'm not an expert, but this makes no sense :)

behlendorf commented 7 years ago

If a sector is unreadable by ZFS we'll get an IO error from the device for the whole ZFS block. If the data isn't what's expected we'll fail the checksum. In either case data will be reconstructed if the pool is configured with enough redundancy. Trying to do anything of a sector level doesn't really fit with the design of ZFS.

RubenKelevra commented 7 years ago

@zviratko 1) EC codes are added per sector, so you get a sector or not. Garbage is most of the time not the case. 2) you know the missing/broken bits because they are returned as readerror, since you know which are broken you can simply XOR the missing sektor 3) nope, XOR is enough. 4) yes, your right, most of the time a simple reread several hundert times is enough to flip the bit once back and fix the wrong EC to successfully return data instead of a read-error. This is what simple data recovery programs do.

So yes, this do make a lot of sense. :)

@behlendorf well, er ... This is a feature request, this is why this is not the case at the moment. The question is, is it a good idea to add this feature.

Currently we're talking about ZFS as data-storage system for large storage arrays, but ZFS can do much more and pushing hard into the desktop computing segment, where data redundancy is not a such big word.

In laptops and desktop computers a network or detachable drive backup is much better. They are usually better protected from being lost (laptops) or being destroyed (e.g. electric shock). Especially on laptops a raid 1 or even raid z1 solution is most of the time not possible due weight restrictions and/or economical reasons because you need that much space or that fast space, which would cost you twice the amount of money just to have it.

I don't know how your laptop is currently running, but mine has two devices, one fast and one large which are single drives.

Data loss occurs most of the time slowly, so some sectors with bitflips is the first indicator on current hard drives.

If we talk about a 128k record size zfs on a 4k byte sector hdd the amount of storage loss for a dual xor recovery would be just 8k plus overhead. I have no idea how much overhead this would cause - but I'm sure you can answer this.

The idea is to fix the first occurs of drive failures before they even affect any data without much speedloss or capacity reduction - and I'm sure this would be a neat addon on single drive operations with ZFS - especially since no other filesystem has this (as far as I know) :)

So please reconsider the closing of this ticket and open it for a longer discussion again.

behlendorf commented 7 years ago

It's an interesting idea. I'm happy to reopen the issue for further discussion and mark it as a feature request. As I understand it you're proposing that we optionally store the XOR of the sectors of a block so we can reconstruct individual sectors in a block even for single disk configurations. Here are my initial thoughts.

Assuming 4K sectors and 128K blocks storing the XOR parity information will cost us approximately 3% of the total drive capacity. For a single disk configuration this may be worth it since it's far better than the 2x for full replication. This get considerably better for say 512b sectors and 1M blocks at a mere 1/10%. This is also a best case estimate, for small files with say 4k or 8K blocks space usage will suffer.
The parity data is too large to store in the block pointer itself but we could consider using one of the DVAs for this. However, that may end up conflicting with other features like encryption which are which are under development.
We'd only want to apply this data blocks since all meta-data is already replicated.
This assumes the likely failure scenario is a single sector which cannot be read. Do we have any data which backs this up?
This is the kind of feature which would make ZFS more resilient in a desktop (single disk) environment.
There's nothing preventing us from also applying the retry technique described above on a per-sector basis when we don't have this reconstruction functionality.
Depending on how this is implemented it could have a significant impact on write performance because we need to store that XOR's block somewhere and that likely means an additional IO.

RubenKelevra commented 7 years ago

Hey @behlendorf,

It's an interesting idea. I'm happy to reopen the issue for further discussion and mark it as a feature request.

thanks for reconsidering this as feature again. :)

As I understand it you're proposing that we optionally store the XOR of the sectors of a block so we can reconstruct individual sectors in a block even for single disk configurations. Here are my initial thoughts.

Depending on how this is implemented it could have a significant impact on write performance because we need to store that XOR's block somewhere and that likely means an additional IO.

Well, I don't know how the information about each record is saved at the moment in a disk format. But it might be worth to consider a reserved flagbit to be used for this (if there are any left) - if the bit is set, the last two sectors of a record are XOR-data.

This would save a additional IO and a maybe a lot of additional overhead. It would be added as feature flag to the volume itself. On this feature-flag the sectorsize of the XOR-Data should be stored, since this adds the most flexibility. There might be users out their which uses 4k devices but have a minimum IO-Size of 512 Byte - accepting the additional overhead.

This would also allow to just enable this feature for future-data on a storage which is already in use. So backward compatible, as long as the feature is not enabled on a volume (volume flag is not set).

The data-integrity algorithm should not check the XOR-Part for integrity, since this part can be damaged as well. It should be considered as possible to loss without affecting the data-integrity of the data-block itself.

The parity data is too large to store in the block pointer itself but we could consider using one of the DVAs for this. However, that may end up conflicting with other features like encryption which are which are under development.

Maybe this idea also solves the encryption conflicts? The XOR-Data should not be encrypted, since they are simply garbage and it would hinder the decryption of the data-block itself if the XOR-data would be unreadable.

We'd only want to apply this data blocks since all meta-data is already replicated.

Right.

There's nothing preventing us from also applying the retry technique described above on a per-sector basis when we don't have this reconstruction functionality.

I guess we're reading the disk at the moment not sector-wise but in large chunks. So a read-error would be probably returned for the last chunk - I guess.

If this happens, we have to split the read-request to many small read-request sector-wise, to get the erronous sector. If we located the sector in one record, the error should be stored in the log.

After this, the question is, can the error be recovered by the XOR-data or maybe the XOR-data is only effected, if so the complete record should be rewritten. This gives the disk the chance to use a backup-sector for the broken one.

If the error cannot be recovered ZFS should not try to reread any data or change the broken part of the disk by write-requests.

The best solution (for data-integrity) would be to fail the entire disk at this time - after piping a report to ZED and writing the information to the disk itself of course. Else all open write-request should be written to the disk, but no new write-requests.

Then the best solution for recovering data is to boot the computer from a different medium (if the ZFS-storage is the root-device) and open the disk raw via dd-rescue.

Here is to note, that there are two programs out there called dd-rescue, I refer to the GNU one.

dd-rescue copies one disk to another, on each read-error it jumps to a different part of the disk, to reduce the mechanical stress to the disk itself and also reading the data as fast as possible, to get the best out of the disk as long as it's still working.

After reading all "good" sectors from a disk dd-rescue jumps to the read-errors and split the read-request to smaller chunks to get everything around the error part. This goes down to sector-size read-requests.

After all sectors got read which are readable, ddrescue goes to the roof on rereading the broken sectors, to get everything from the disk. This might take several hours for some 10,000 sectors.

The most data can be recovered on this way.

The sectors which are broken can be filled with a given binary pattern, standard is just a bunch of binary zeros.

This gives another task for this feature request, a pattern must be defined for ZFS to search for instead of non-readable sectors. So I guess RLE would be mendantory (at least) for this feature to work, since else it cannot be assured that a zero sectors is not actually data - right?

After this is done, the new disk can be mounted via zfs again and a scrub can check all errors on the disk. The sector recovery can XOR all sectors errors out and the data integrity algorithm can recheck the data for inconsistences.

This assumes the likely failure scenario is a single sector which cannot be read. Do we have any data which backs this up?

I did the ddrescue recovery of disks for friends and at the job very often. Broken parts on disks split in two categorys: several hundered KB up to 1-2 MB broken or just single sectors broken. The only edge-case here seems to be 2 sectors following eachother are broken, that's why I think the XOR-data should be considered to be able to recover two sectors.

So my thoughts on this in short:

The sector size for this recovery-ability should be configurable at the time we enable this feature. If nothing is configured, the minimal IO-size should be assumed.
We should be able to recover two sectors, so the XOR-data should be 2 sectors.
If possible, the XOR-data should be stored directly in the record itself, after the data-block to reduce IO-requests for storing it elsewhere.
If possible a flag should be used on each record, to flag XOR-data or no XOR-data.
RLE should be mendantory for XOR-data enabled devices, to ensure no sector contains only binary zeros.
All sectors which contains only binary zeros are considered as read-errors, to be fixed by XOR-data, if the XOR-data-flag is set on this record.
XOR-data should not be checked by the data-integrity algorithm nor be compressed or encrypted. Else a read-error in the XOR-data destroy the data-integrity of the data-block itself.
If a record is not readable because of checksum-errors or read-errors in general, it should be not overwritten by ZFS in any single disk or sparse-disk-case to give the user the abilty to recover data.
If a read-request results in an read-error, it should be retried in single read-requests by sector-size defined in the XOR-flag.
If a record is readable by XORing 1 or 2 unreadable sectors, the complete record should be rewritten to give the disk the chance to use backup-sectors.

RubenKelevra commented 7 years ago

@behlendorf I know, a heck of a brainstorming and this is really not a high priority feature request, but do you got some more thoughts about it? :)

behlendorf commented 7 years ago

@RubenKelevra I think the idea here has merit for single disk configurations if someone has the time to explore it. Really the ideal place for these sectors from a performance stand point would be at the end of the blocks. But due to the way to I/O pipeline is structured I'm concerned that would significantly complicate some already pretty subtle code. Still a good place to start would be to prototype what you're suggesting and see how bad it really is.

RubenKelevra commented 7 years ago

@behlendorf sounds cool. Sadly I'm not a blessed C(++) programmer. :)

I guess the difficulty would be, to be sure, it's always using the right size of a block at the right part of the existing code, to not create a huge exploitability by adding this optional feature.

behlendorf commented 7 years ago

Yes, exactly. That issue already ends up being complicated due to compression.

openzfs / zfs

Sector recovery ability #5455