kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

feature: delayed writing of `dup` copies #455

Open jowagner opened 2 years ago

jowagner commented 2 years ago

Users are trying to force SSDs to physically store both `dup` copies by adding an encryption layer such as LUKS below btrfs. https://github.com/kdave/btrfs-progs/issues/319#issuecomment-1069286104 points out that this setup may still not yield the expected redundancy because the two copies of a dup block are written within a short time window.

A mitigation would be to write only the first copy at first and to keep a queue of blocks whose second copy still needs to be written. The dup completion should happen only after at least a certain (user-configurable?) amount of other data has been written to the SSD. This could be implemented by keeping track of the total bytes written so far and tagging each queue entry with that counter at the time the first copy is written. Each time the total bytes written increases, the head of the queue is checked to see whether the difference has reached the target amount. If so, the first copy is copied to the location of the 2nd copy, the entry is removed from the queue and the queue is checked again (in case multiple blocks have become eligible for completion).
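A minimal sketch of what such a queue could look like; all structure, variable and function names here are hypothetical, not existing btrfs code:

```c
/*
 * Hypothetical sketch of the delayed-dup queue described above.
 * Each pending 2nd copy is tagged with the total bytes written at the
 * time its 1st copy went out; the head of the queue becomes eligible
 * once enough other data has been written in between.
 */
#include <stdint.h>
#include <stdlib.h>

struct pending_dup {
	uint64_t copy1_bytenr;		/* where the 1st copy was written */
	uint64_t copy2_bytenr;		/* destination of the delayed 2nd copy */
	uint64_t queued_at_bytes;	/* total_bytes_written when queued */
	struct pending_dup *next;
};

static struct pending_dup *queue_head, *queue_tail;
static uint64_t total_bytes_written;
/* User-configurable gap before the 2nd copy is written, e.g. 64 MiB. */
static uint64_t dup_delay_bytes = 64ULL << 20;

/* Placeholder for the real I/O: read copy 1 back and write it at copy 2. */
static void complete_second_copy(uint64_t copy1, uint64_t copy2)
{
	(void)copy1;
	(void)copy2;
}

/* Called after the 1st copy of a dup block has been written. */
static void queue_second_copy(uint64_t copy1, uint64_t copy2)
{
	struct pending_dup *p = calloc(1, sizeof(*p));

	if (!p)
		return;
	p->copy1_bytenr = copy1;
	p->copy2_bytenr = copy2;
	p->queued_at_bytes = total_bytes_written;
	if (queue_tail)
		queue_tail->next = p;
	else
		queue_head = p;
	queue_tail = p;
}

/* Called whenever more data is written to the device. */
static void account_write(uint64_t bytes)
{
	total_bytes_written += bytes;

	/* Oldest entries first; stop at the first one not yet eligible. */
	while (queue_head && total_bytes_written -
	       queue_head->queued_at_bytes >= dup_delay_bytes) {
		struct pending_dup *p = queue_head;

		complete_second_copy(p->copy1_bytenr, p->copy2_bytenr);
		queue_head = p->next;
		if (!queue_head)
			queue_tail = NULL;
		free(p);
	}
}
```

The `dup_delay_bytes` value above would be the user-configurable amount of intervening data mentioned above.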

While the 2nd copy is not yet written, the stale data at its location may cause problems if the queue is lost (e.g. because it is decided not to change the on-disk format and to keep the queue in memory only) or if the queue is not consulted in all code paths. In most cases the checksum or transid should reveal which copy is usable, but spurious errors may be logged. It may be desirable to blank the first 48 bytes (checksum and UUID) to make it clear that the copy is temporarily invalid (and if I understand correctly that LUKS always writes full 512-byte sectors, one could just as well zero the first 512 bytes of the node at no extra cost).
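A sketch of that invalidation step, assuming the metadata header starts with a 32-byte checksum followed by the 16-byte filesystem UUID (48 bytes total); the constants and the function name are illustrative only:

```c
#include <stddef.h>
#include <string.h>

#define HDR_CSUM_SIZE	32	/* checksum at the start of the header */
#define HDR_FSID_SIZE	16	/* filesystem UUID right after it */
#define SECTOR_BLANK	512	/* a whole sector costs nothing extra under LUKS */

/* Zero the start of the buffer destined for copy 2 so it fails validation. */
static void invalidate_copy2(char *node_buf, size_t nodesize)
{
	/*
	 * Clearing csum + fsid (48 bytes) is enough; blank a full 512-byte
	 * sector when the node is large enough (it normally is).
	 */
	size_t len = nodesize < SECTOR_BLANK ?
		     HDR_CSUM_SIZE + HDR_FSID_SIZE : SECTOR_BLANK;

	memset(node_buf, 0, len);
}
```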

If the queue is not persistent and there are still blocks in the queue at umount (or after a configurable commit time limit), btrfs could write some random data to unused areas of the SSD to advance the total-bytes-written counter far enough that the remaining dup blocks become eligible for completion.

Rather than keeping track of total bytes written inside btrfs, one could query this information from the OS for the whole drive, not just the partition(s) used by btrfs. Including writes to other partitions, e.g. swap, would speed up reaching the write total needed to trigger the next dup completion.
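A sketch of reading such a whole-device counter from sysfs: the 7th field of `/sys/block/<dev>/stat` is the number of 512-byte sectors written to the device since boot. The function name is made up, and the device name would have to be derived from the filesystem's underlying block device:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Read total bytes written to a whole block device from sysfs. */
static int read_device_bytes_written(const char *dev, uint64_t *bytes)
{
	char path[256];
	uint64_t sectors_written;
	FILE *f;
	int ret;

	snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);
	f = fopen(path, "r");
	if (!f)
		return -1;

	/* Skip the first six fields; field 7 is sectors written. */
	ret = fscanf(f, "%*u %*u %*u %*u %*u %*u %" SCNu64, &sectors_written);
	fclose(f);
	if (ret != 1)
		return -1;

	/* The stat file always counts 512-byte sectors. */
	*bytes = sectors_written * 512;
	return 0;
}
```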

This feature should of course only be active if specifically selected by the user.

kdave commented 2 years ago

The idea sounds interesting but the actual implementation brings a lot of problems and I'm not sure it makes things better. A delayed write will leave a lot of data unwritten to the second copy for an undefined period of time. Until the 2nd batch of blocks is written it needs to be kept in memory, and the space reservations also need to be kept until the data are safely written. I really don't like relying on tricks with UUID/transaction/partial checksum; it's a kind of probabilistic approach to what is stored, and in turn it becomes harder to tell apart corrupted blocks from valid blocks with incomplete information. Also, some of the blocks need to be rewritten again, so it means increased IO and an unclear on-disk state of the filesystem.

Zygo commented 2 years ago

Note that if the device places both copies of any metadata page in the same failure domain, the filesystem can be destroyed (at least until we have btrfs repair tools that can rebuild interior nodes of the trees). So if we want to guarantee metadata resiliency in dup profile, we not only have to avoid storing mirror pages in the same block at write time, but we also somehow have to prevent the SSD from rearranging them without our knowledge after the fact. We could do that with ZNS because we're implementing the garbage collection and address translation inside btrfs for that case, but there's no way to do that for any normal device-managed SSD.

There are three distinct problems that need to be solved to make this work:

  1. We have to defeat devices that coalesce identical writes after the fact. It doesn't matter if there are 10 hours between writing dup copy 1 and dup copy 2 if the SSD firmware says "oh I've seen that before" and stores the second copy on the same physical media as the first, no matter when it's written. We can try to defeat SSD firmware by creating a profile "dup with obfuscation", which does something like encrypt the page using the filesystem UUID, block, and mirror number as key (a rough sketch of this follows the list). Hopefully nobody creates SSD firmware with "btrfs optimizations" that undo the obfuscation and defeat the resiliency, but it's possible, so we have to assume it will happen eventually...
  2. We have to defeat devices that are vulnerable to loss of consecutive writes (in time or LBA space). This has been discussed before in the context of writeback cache drops, but the SSD media failure case has an important difference: the media could fail long after the fact, so it's not enough to wait until the data is stable on media--we have to somehow force the data to be located on distinct components of the media (different erase blocks or chip banks or whatever).
  3. Whatever we do to solve problem 2, we have to keep maintaining it as the device does GC and similar processes over which we have no control. I don't see how we solve problem 2, and for problem 3 we need to solve a superset of problem 2.
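A rough illustration of the obfuscation idea from problem 1: transform each copy with a keystream derived from the filesystem UUID, block address and mirror number so the two copies never look identical on the bus. The splitmix64-style generator below is only a placeholder for a real keyed cipher, and none of these names exist in btrfs:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tiny keystream generator, stand-in for a proper keyed cipher. */
static uint64_t splitmix64(uint64_t *state)
{
	uint64_t z = (*state += 0x9e3779b97f4a7c15ULL);

	z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
	z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
	return z ^ (z >> 31);
}

/* XOR-obfuscate (or de-obfuscate, the operation is symmetric) one copy. */
static void obfuscate_copy(uint8_t *buf, size_t len, const uint8_t fsid[16],
			   uint64_t bytenr, int mirror)
{
	uint64_t seed;
	size_t i;

	/* Mix filesystem UUID, block address and mirror number into the seed. */
	memcpy(&seed, fsid, sizeof(seed));
	seed ^= bytenr ^ ((uint64_t)mirror << 56);

	for (i = 0; i < len; i += sizeof(uint64_t)) {
		uint64_t ks = splitmix64(&seed);
		size_t n = len - i < sizeof(ks) ? len - i : sizeof(ks);
		size_t j;

		for (j = 0; j < n; j++)
			buf[i + j] ^= (uint8_t)(ks >> (8 * j));
	}
}
```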

dup metadata is only useful because it lets us roll the dice and hope that we land on "something goes wrong, but a specific amount of wrong between zero failures and total loss of the device, and the device still lets us read one of our dup copies so we can run btrfs restore." Recent developments in SSDs make dup metadata less costly on high-end SSDs and more useful on low-end SSDs, so over thousands of randomly selected devices we get a higher rate of good outcomes. btrfs still has no control over whether the resilience works when the drive fails; we just take advantage of it when it does.

Here's a thought experiment to evaluate ideas against: Assume you're writing SSD firmware. Is it possible to detect a proposed btrfs write pattern, and identify redundant blocks in that pattern with high probability? If the answer is yes, the proposed write pattern will not work, because some SSD vendor somewhere will implement the algorithm that defeats it, probably unintentionally (e.g. they'll do a dump of existing software's write sequences, then create firmware that performs those sequences quickly, without regard to impact on resilience). There is a lot of diversity in SSD firmware behavior out there, and a few firmwares are known to be unintentionally bad for btrfs because of this kind of micro-optimization in firmware.

jowagner commented 2 years ago

Thanks for your detailed considerations. Some of the counterarguments do not apply in the encrypted scenario, but I agree that the added complexity of this feature is hard to justify.

Zygo commented 2 years ago

Encryption solves only problem 1. The metadata write pattern is easy to recognize even without knowing the block contents. To use encryption to solve problem 2, you'd have to encrypt the LBAs and write sizes, and do something about information leaked through timing, and then performance would be terrible.