openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

RAID parity protection at "inter block" level #11799

Open michaelwu0505 opened 3 years ago

michaelwu0505 commented 3 years ago

Describe the feature you would like to see added to OpenZFS

Background

In the discussion below, I am using the terminology defined here. It should be the same terminology as used in the ZFS community, but since I am fairly new here, I include this reference just to be safe.

Quote from the discussion in the above link:

the high level operation of RAID-Z isn't all that different from regular RAID, but the parity is special in a big way: each block has its own dedicated parity sectors.

My understanding is that in RAID-Z, a block's data and parity sectors are stored on different disks in the array. Therefore, when reading back a block, all of the disks that hold its data sectors need to be involved, which leads to lower read IOPS for RAID-Z.

New Feature

Instead of using parity sectors for data protection, we can use "parity blocks" to protect "data blocks" that are stored on different disks in the RAID array.

For example, the data layout of a 6-disk, double-parity array could look like this when 4 data blocks (0-3) are written:

Disk0: data block0
Disk1: data block1
Disk2: data block2
Disk3: data block3
Disk4: parity block0
Disk5: parity block1

When reading a data block from the array, only a single disk read is needed. This new RAID configuration effectively has the same "read" IOPS as 6 disks combined.

Parity blocks are rotated; the concept is similar to the difference between conventional RAID4 and RAID5.

For data blocks that cannot be paired with data blocks of the same size on other disks, padding data blocks can be used.
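As a rough illustration (this is not ZFS code, and all names are hypothetical), a rotated block-level layout for the 6-disk, double-parity example could be mapped like this:

```python
# Hypothetical sketch (not ZFS code): map each "parity protected group" of
# 4 data blocks + 2 parity blocks onto 6 disks, rotating the parity
# positions per group the way RAID5/6 rotates parity per stripe.

NDISKS = 6
NDATA = 4
NPARITY = 2

def group_layout(group_index):
    """Return (data_disks, parity_disks) for one group of blocks."""
    # Rotate the starting disk by one position per group.
    start = group_index % NDISKS
    disks = [(start + i) % NDISKS for i in range(NDISKS)]
    return disks[:NDATA], disks[NDATA:]

if __name__ == "__main__":
    for g in range(3):
        data, parity = group_layout(g)
        print(f"group {g}: data blocks on disks {data}, parity blocks on disks {parity}")
```

Each group shifts its starting disk by one, so parity work (and read traffic) is spread evenly across all six disks.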

How will this feature improve OpenZFS?

My usage scenario is a read-intensive PostgreSQL database on ZFS. The default "block size" of PostgreSQL is 8kB. With the proposed new RAID protection feature and ZFS recordsize=8kB, read IOPS would be better than with raidz2.

Additional context

In #5455, a feature is proposed to store a block's parity sectors on the same disk as its data sectors. This provides bad-sector recovery in a single-disk pool setup.

ahrens commented 3 years ago

Can you elaborate on how this is different than RAID4/5, and how it addresses the "RAID write hole"? E.g. if data block 1 is freed, and then we want to write to that same location, we need to read and modify the parity blocks 0 and 1. But that can't be all done atomically.

michaelwu0505 commented 3 years ago

I think the behavior is very similar to RAID4/5. But in ZFS, each block will have its own checksum.

"RAID write hole" issue needs to be tackled by "copy on write" for "multiple blocks at the same time". Continue with my previous 6 disk example, let's say that disk sector size is 4kB and ZFS recordsize=8kB. When a file that is 32kB is written to the filesystem, the data layout will be:

Disk0: data block0 (file content 0-8kB, on disk sector 0 & 1)
Disk1: data block1 (file content 8-16kB, on disk sector 0 & 1)
Disk2: data block2 (file content 16-24kB, on disk sector 0 & 1)
Disk3: data block3 (file content 24-32kB, on disk sector 0 & 1)
Disk4: parity block0 (on disk sector 0 & 1)
Disk5: parity block1 (on disk sector 0 & 1)

When file content at 9kB offset is overwritten, data block1 and parity blocks 0 and 1 need to be updated by "copy on write":

Disk0: data block0 (file content 0-8kB, on disk sector 0 & 1)
Disk1: data block1 (file content 8-16kB, on disk sector 0 & 1), data block1v2 (new file content 8-16kB, on disk sector 2 & 3)
Disk2: data block2 (file content 16-24kB, on disk sector 0 & 1)
Disk3: data block3 (file content 24-32kB, on disk sector 0 & 1)
Disk4: parity block0 (on disk sector 0 & 1), parity block0v2 (on disk sector 2 & 3)
Disk5: parity block1 (on disk sector 0 & 1), parity block1v2 (on disk sector 2 & 3)

Then the ZFS metadata needs to point to data block1v2, parity block0v2, and parity block1v2 at the same time. I am not sure whether this step is possible.
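To make the parity update concrete, here is a toy sketch under simplifying assumptions: only simple XOR (P) parity is shown, whereas a real double-parity layout would also recompute a second (Q) parity; block contents and sizes are arbitrary.

```python
# Toy sketch (not ZFS code): recompute parity for a "parity protected group"
# when one data block is replaced copy-on-write. Only XOR (P) parity is shown;
# a real double-parity scheme would also recompute a second (Q) parity.

def xor_parity(blocks):
    parity = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            parity[i] ^= b
    return bytes(parity)

# v1 group: four 8 kB data blocks plus their XOR parity.
data_v1 = [bytes([n]) * 8192 for n in range(4)]
parity_v1 = xor_parity(data_v1)

# Copy-on-write: data block 1 gets new contents at a new location;
# the old block and old parity stay intact until the new tree is committed.
data_v2 = list(data_v1)
data_v2[1] = bytes([0xAB]) * 8192          # "data block1v2"
parity_v2 = xor_parity(data_v2)            # "parity block0v2"

# Reconstruction check: lose data block 2 and rebuild it from the others.
rebuilt = xor_parity([data_v2[0], data_v2[1], data_v2[3], parity_v2])
assert rebuilt == data_v2[2]
```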

Another way of interpreting/implementing the new feature

We can achieve the same read IOPS benefit if we can have a special type of "block" which has the following properties:

  1. Block data can be read back partially. For example, in 8kB chunks.
  2. The component sectors have their own checksums. For example, for every two 4kB sectors (8kB total), there exists a checksum.
  3. Data sectors and parity sectors are allocated to pool disks such that: (a) sectors for the same 8kB chunk are allocated to the same disk. This way, when reading back the 8kB chunk, only 1 disk read is issued. (b) data and parity sectors are allocated to pool disks similar to how raidz does it.

Again using the 6-disk, double-parity example: in this implementation the recordsize would be set to 32kB. When writing a 32kB file to the filesystem, the data layout would look like:

Disk0: data sectors 0 & 1 (file content 0-8kB)
Disk1: data sectors 2 & 3 (file content 8-16kB)
Disk2: data sectors 4 & 5 (file content 16-24kB)
Disk3: data sectors 6 & 7 (file content 24-32kB)
Disk4: parity sectors 0 & 1
Disk5: parity sectors 2 & 3

In this implementation, when file content is modified, the entire block is copied on write as usual.
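A minimal sketch of this second interpretation, assuming hypothetical 8kB chunks with one checksum each (the structures and helper names below are illustrative, not ZFS APIs):

```python
# Hypothetical sketch: a record split into 8 kB chunks, each chunk stored on
# its own disk with its own checksum, so an aligned 8 kB read touches one disk
# and can still be verified. Names and structures are illustrative only.

import zlib

CHUNK = 8192

def split_record(record):
    return [record[i:i + CHUNK] for i in range(0, len(record), CHUNK)]

def write_record(record):
    chunks = split_record(record)
    # One checksum per chunk, kept in the (hypothetical) block pointer.
    checksums = [zlib.crc32(c) for c in chunks]
    return chunks, checksums          # chunks[i] would live on disk i

def read_chunk(chunks, checksums, i):
    data = chunks[i]                  # single-disk read
    if zlib.crc32(data) != checksums[i]:
        raise IOError(f"checksum mismatch on chunk {i}")
    return data

record = bytes(range(256)) * 128      # a 32 kB record
chunks, sums = write_record(record)
assert read_chunk(chunks, sums, 1) == record[CHUNK:2 * CHUNK]
```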

michaelwu0505 commented 3 years ago

A possible way of getting the benefit today with ZFS on top of traditional RAID6

Continuing with the 6-disk, double-parity requirement: if I set up a traditional RAID6 array with a stripe size of 8kB and use it as a vdev for the zpool, then with ZFS ashift=13 (8kB) and recordsize=8kB, I think the 8kB read IOPS will be the same as 6 disks combined.

ahrens commented 3 years ago

It sounds like 3 potential solutions are being proposed here:

  1. When file content at 9kB offset is overwritten, data block1 and parity blocks 0 and 1 need to be updated by "copy on write":

I don't see how you can reuse the space of "data block 1 v1" without doing a partial-stripe write, and thus encountering the "raid 5 write hole"

2.

We can achieve the same read IOPS benefit if we can have a special type of "block" which have the following properties: Block data can be read back partially. For example, in 8kB chunks.

I think this could work. The only obvious problem I see is now you need to store the checksum of each 8KB chunk. Practically speaking the blkptr_t is only so large, so you would have some limits, which might be acceptable. E.g. you can have up to 8 chunks, each of which has a 32-bit checksum. So you can have recordsize=64k but do sub-block reads of 8k chunks (each of which is on a separate disk, so only one disk is involved in a 8k-aligned sub-block read). sub-block writes would require read/modify/write of the whole 32K (to a new location). (and this doesn't work with compression)

3.

setup a traditional RAID6 array

Sure, that totally works. The traditional array needs to solve the "raid 5 write hole". And it may be able to do that better than ZFS if it has additional hardware (e.g. built-in NVRAM). Whereas RAIDZ works with just HDD's (and gets good performance in many use cases, but not small random reads). If we designed a RAID solution for ZFS from scratch today, maybe it would look like RAID6 and we would require an SSD for logging writes to solve the write hole.

michaelwu0505 commented 3 years ago

When file content at 9kB offset is overwritten, data block1 and parity blocks 0 and 1 need to be updated by "copy on write":

I don't see how you can reuse the space of "data block 1 v1" without doing a partial-stripe write, and thus encountering the "raid 5 write hole"

I think my example might have led to some confusion. Here is a revised example:

When file of 32kB is first written:

Disk0: data block0 (file content 0-8kB, on disk sector 0 & 5)
Disk1: data block1 (file content 8-16kB, on disk sector 10 & 15)
Disk2: data block2 (file content 16-24kB, on disk sector 20 & 25)
Disk3: data block3 (file content 24-32kB, on disk sector 30 & 35)
Disk4: parity block0 (on disk sector 40 & 45)
Disk5: parity block1 (on disk sector 50 & 55)

The point here is that the "stripe" is logical to ZFS. The ZFS metadata knows that these data blocks belong together as a "double parity unit"; the block locations on the disks are irrelevant. Also, data blocks 0-3 do not need to belong to the same file. They could be parts of different files that are parity protected together.

After file content at 9kB offset is overwritten:

Disk0: data block0 (file content 0-8kB, on disk sector 0 & 5)
Disk1: data block1v2 (new file content 8-16kB, on disk sector 210 & 215)
Disk2: data block2 (file content 16-24kB, on disk sector 20 & 25)
Disk3: data block3 (file content 24-32kB, on disk sector 30 & 35)
Disk4: parity block0v2 (on disk sector 240 & 245)
Disk5: parity block1v2 (on disk sector 250 & 255)

So after the copy-on-write sequence for "data block1v2", "parity block0v2", and "parity block1v2" has completed, "data block1 v1" is no longer in use. Block-level copy on write is already standard ZFS behavior, but for the new feature proposal we would need a new way to switch from multiple v1 blocks to their v2 versions atomically. I am not sure whether this can be supported.
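To illustrate the "atomic switch" step, here is a toy model loosely patterned on how ZFS makes a transaction group live by updating a single root pointer; the Pool class, locations, and block names are invented for the sketch.

```python
# Toy sketch (not ZFS code) of switching a whole parity group from v1 to v2
# atomically: new blocks are written to free locations first, and only a
# single root-pointer update (analogous to the uberblock) makes them live.

class Pool:
    def __init__(self):
        self.blocks = {}       # location -> payload (simulated disk space)
        self.root = {}         # live block tree: name -> location

    def write_block(self, location, payload):
        self.blocks[location] = payload        # safe: not yet referenced

    def commit(self, new_root):
        self.root = dict(new_root)             # the single atomic switch

pool = Pool()

# v1 group on disk and referenced by the live tree.
for name, loc in [("data0", 0), ("data1", 10), ("parity0", 40), ("parity1", 50)]:
    pool.write_block(loc, f"{name}-v1")
pool.commit({"data0": 0, "data1": 10, "parity0": 40, "parity1": 50})

# Copy-on-write: write data1v2 and both new parity blocks to new locations...
pool.write_block(210, "data1-v2")
pool.write_block(240, "parity0-v2")
pool.write_block(250, "parity1-v2")

# ...then flip all three references in one commit; the v1 blocks become free space.
new_root = dict(pool.root, data1=210, parity0=240, parity1=250)
pool.commit(new_root)
assert pool.blocks[pool.root["data1"]] == "data1-v2"
```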

We can achieve the same read IOPS benefit if we can have a special type of "block" which have the following properties: Block data can be read back partially. For example, in 8kB chunks.

I think this could work. The only obvious problem I see is now you need to store the checksum of each 8KB chunk. Practically speaking the blkptr_t is only so large, so you would have some limits, which might be acceptable. E.g. you can have up to 8 chunks, each of which has a 32-bit checksum. So you can have recordsize=64k but do sub-block reads of 8k chunks (each of which is on a separate disk, so only one disk is involved in a 8k-aligned sub-block read). sub-block writes would require read/modify/write of the whole 32K (to a new location). (and this doesn't work with compression)

Thanks for the explanation.

So having 8 chunks means that the maximum number of disks that can be supported in this kind of RAID will be 8.

setup a traditional RAID6 array

Sure, that totally works. The traditional array needs to solve the "raid 5 write hole". And it may be able to do that better than ZFS if it has additional hardware (e.g. built-in NVRAM). Whereas RAIDZ works with just HDD's (and gets good performance in many use cases, but not small random reads). If we designed a RAID solution for ZFS from scratch today, maybe it would look like RAID6 and we would require an SSD for logging writes to solve the write hole.

Yes. I also think it would be better if ZFS could support the new feature without the need for more expensive/complex RAID5/6 hardware.

michaelwu0505 commented 3 years ago

We can achieve the same read IOPS benefit if we can have a special type of "block" which have the following properties: Block data can be read back partially. For example, in 8kB chunks.

I think this could work. The only obvious problem I see is now you need to store the checksum of each 8KB chunk. Practically speaking the blkptr_t is only so large, so you would have some limits, which might be acceptable. E.g. you can have up to 8 chunks, each of which has a 32-bit checksum. So you can have recordsize=64k but do sub-block reads of 8k chunks (each of which is on a separate disk, so only one disk is involved in a 8k-aligned sub-block read). sub-block writes would require read/modify/write of the whole 32K (to a new location). (and this doesn't work with compression)

Thanks for the explanation.

So having 8 chunks means that the maximum number of disks that can be supported in this kind of RAID will be 8.

I was wrong: there is no limit on the number of disks, just a limit on the recordsize. For example, if the chunk size is 8kB, then the maximum recordsize will be 8x8=64kB.
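For a sense of the arithmetic, assuming the block pointer's existing 256-bit checksum field (4 x 64-bit words) is reused to hold per-chunk checksums as ahrens suggested, a sketch of the packing might look like this; the helper names are hypothetical:

```python
# Sketch of the space constraint: a 256-bit checksum field can hold at most
# eight 32-bit per-chunk checksums. Packing/unpacking below is illustrative.

import struct, zlib

CKSUM_BITS = 256                             # size of the blkptr checksum field
PER_CHUNK_BITS = 32
MAX_CHUNKS = CKSUM_BITS // PER_CHUNK_BITS    # == 8

def pack_chunk_checksums(chunks):
    assert len(chunks) <= MAX_CHUNKS
    sums = [zlib.crc32(c) for c in chunks]
    sums += [0] * (MAX_CHUNKS - len(sums))   # pad unused slots
    return struct.pack("<8I", *sums)         # 32 bytes total

def unpack_chunk_checksums(packed):
    return list(struct.unpack("<8I", packed))

chunks = [bytes([i]) * 8192 for i in range(8)]   # 8 chunks -> recordsize 64k
packed = pack_chunk_checksums(chunks)
assert len(packed) == CKSUM_BITS // 8
assert unpack_chunk_checksums(packed)[3] == zlib.crc32(chunks[3])
```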

In #5455, they also needed to store additional XOR data in blocks. A proposed solution was to save the XOR data in additional sectors inside the block:

Well, I don't know how the information about each record is currently saved in the on-disk format. But it might be worth considering a reserved flag bit for this (if there are any left): if the bit is set, the last two sectors of a record are XOR data.

This would save an additional IO and maybe a lot of additional overhead. It would be added as a feature flag on the volume itself. The sector size of the XOR data should be stored with this feature flag, since that adds the most flexibility. There might be users out there who use 4k devices but have a minimum IO size of 512 bytes, accepting the additional overhead.

This would also allow enabling the feature only for future data on storage that is already in use, so it stays backward compatible as long as the feature is not enabled on a volume (the volume flag is not set).

I think the "checksums" could also be saved in auxiliary sector inside the block. But this will be quite wasteful of space since checksum in much smaller than XOR-Data.

michaelwu0505 commented 3 years ago

Can you elaborate on how this is different than RAID4/5, and how it addresses the "RAID write hole"?

@ahrens, my previous answer was "I think the behavior is very similar to RAID4/5", but in fact there is a very big difference. Due to copy on write, there will be no "RAID write hole":

E.g. if data block 1 is freed, and then we want to write to that same location, we need to read and modify the parity blocks 0 and 1. But that can't be all done atomically.

Once data block 1 is freed and we then write to that same location (the same location in terms of disk sectors), the newly written block is no longer protected by parity blocks 0 and 1.

I gave another example in the reply above with the data and parity blocks all located at different disk sector locations. I hope that example explains more clearly what I meant.

Please let me know if you still think there will be a "RAID write hole". Thanks.

IvanVolosyuk commented 3 years ago

Let's imagine we have DataBlock0, DataBlock1, and ParityBlock. If DataBlock1 is freed and overwritten, you have a window of time during which DataBlock0 cannot be reconstructed from parity if DataBlock0's disk fails. I have thought about this myself; there seems to be a conceptual problem if DataBlock0 and DataBlock1 can be changed independently. The idea of splitting a record into parts that can be read separately seems like an interesting one, though it will lower the compression ratio for the same recordsize if each part is compressed separately. I am not sure whether it can be implemented in ZFS.

michaelwu0505 commented 3 years ago

I think that if all blocks are changed by copy-on-write and committed to the ZFS block tree atomically, then there will be no vulnerable window.

Let's say that DataBlock0, DataBlock1, and ParityBlock form a "parity protected group". This means that: (1) each block should reside on a different disk; (2) when any block is changed, for example freed or overwritten, the affected blocks in the "parity protected group" should be changed by copy-on-write and switched to the new versions atomically.

DataBlock1 being freed

When DataBlock1 is to be freed, a new version of ParityBlock (ParityBlock_v2) that no longer covers DataBlock1 should be written to disk. Then the ZFS block tree is updated to remove DataBlock1, remove ParityBlock, and add ParityBlock_v2. Then the ZFS uberblock is updated.

Content of DataBlock1 being changed

I'm not sure whether DataBlock1 can be "overwritten" in ZFS. When the content of DataBlock1 is to be updated, I imagine that a new copy of DataBlock1 (DataBlock1_v2) should be written to disk. Also, ParityBlock_v2, which protects DataBlock0 and DataBlock1_v2, should be computed and written to disk. Then the "parity protected group" metadata is updated to specify that DataBlock0, DataBlock1_v2, and ParityBlock_v2 form a protected group. Finally, the ZFS block tree is updated and committed by an uberblock change.
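A toy sketch of these two cases, using simple single-parity XOR to keep it short (a double-parity group would recompute both parity blocks the same way); all names and values are hypothetical:

```python
# Illustrative sketch (not ZFS code) of the two cases above for a single-parity
# "parity protected group"; all structures are hypothetical.

def xor_parity(blocks):
    parity = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            parity[i] ^= b
    return bytes(parity)

# Original group: DataBlock0, DataBlock1, ParityBlock.
data0 = b"\x01" * 4096
data1 = b"\x02" * 4096
parity_v1 = xor_parity([data0, data1])
assert xor_parity([data1, parity_v1]) == data0      # DataBlock0 is recoverable

# Case 1: DataBlock1 is freed. ParityBlock_v2 protects only DataBlock0 and is
# written elsewhere; the old ParityBlock stays valid until the tree commit.
parity_v2_free = xor_parity([data0])
assert parity_v2_free == data0          # trivially, parity of a single block

# Case 2: DataBlock1's contents change. DataBlock1_v2 and ParityBlock_v2 are
# written to new locations; the group metadata then switches to the new trio.
data1_v2 = b"\x03" * 4096
parity_v2_update = xor_parity([data0, data1_v2])
assert xor_parity([data1_v2, parity_v2_update]) == data0   # reconstruction still works
```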