trapexit / mergerfs

a featureful union filesystem
http://spawn.link

writing redundantly #847

Open lpirl opened 3 years ago

lpirl commented 3 years ago

Dear trapexit, dear contributors of mergerfs.

First, thanks for this beautiful piece of FLOSS. I use mergerfs with great success in a few productive scenarios.

This feature request asks if mergerfs could be used for fault tolerance.

Is your feature request related to a problem? Please describe. To prepare for the unavailability of mounts, we want to write data redundantly (no striping, no integrity checking, plain copies). We'd need something like the great policies mergerfs offers to coordinate writes (where to, etc.) and reads (conflict resolution, etc.). So if one would like to implement a "FUSE RAID", one would have to implement a lot of the functionality mergerfs already offers.

Describe the solution you'd like So the question is: could we introduce a write policy/filter which writes data redundantly for fault tolerance?

Describe alternatives you've considered For network-mounted file systems, RAID is not an option. ChironFS seems to be dead.

Additional context Unreliable file systems, such as WebDAV mounts, which we want to pool.

I do understand how mergerfs is different from RAID and friends; I searched the issues and did see the notes in the README, but I didn't find a discussion on this.

You probably want to avoid feature bloat, and that is understandable. However, this could also be a great opportunity for mergerfs. :)

Thanks for considering.

trapexit commented 3 years ago

I've talked about this a number of times if you look through the tickets. Might have mentioned it in the docs too but don't remember.

The problem is that, IMO, it's simply not as simple as people think it is to accomplish safely. The feature is pretty straightforward to add, but explaining error conditions and handling them safely is not. This kind of thing is already a point of confusion for many people in mergerfs. If you unlink two files and one fails... what happens?

So if mergerfs is asked to write 4K to a file and it writes 4K successfully to one and the other fails... what do you do? Return success and ignore the failure? Return the failure? But now you have two files out of sync. What if it doesn't fail but writes 2K? Just keep trying till it succeeds? Easy to do, but that usually indicates something else is going on. Do you do the writes at the same time and return when both are done, or do you write to one and dispatch the others asynchronously on a thread?

There are a lot of things to consider, and making that all configurable and explainable isn't easy. And people won't even notice until something bad happens: a drive dies, the network goes down, or whatever.

Whereas if they just use rsync or rclone like they normally would to clone data, they can catch these errors much more easily and fully manage when and how data is duplicated. So that would be my question: why not just rsync? All the gdrive people basically do that all the time. 2 rclone mounts with the same data on both, put there by an out-of-band rclone call.

lpirl commented 3 years ago

I've talked about this a number of times if you look through the tickets. Might have mentioned it in the docs too but don't remember.

Hm, then apparently I didn't use the right keywords for searching. Pardon. Thanks for taking the time to answer nevertheless.

The problem is that, IMO, it's simply not as simple as people think it is to accomplish safely. The feature is pretty straightforward to add, but explaining error conditions and handling them safely is not. This kind of thing is already a point of confusion for many people in mergerfs. If you unlink two files and one fails... what happens?

I like how it is implemented today (don't fail silently), but this discussion can probably get philosophical quickly. If the fault model is that crash faults are tolerated, mergerfs would report the error but not fix the inconsistency.

So if mergerfs is asked to write 4K to a file and it writes 4K successfully to one and the other fails... what do you do? Return success and ignore the failure? Return the failure?

I do see the problems with writing and would vote for the same approach as for unlinking (asked to write n times → doesn't work → error).

In a distributed file system, I've seen an approach where a desired and minimum number of successful writes can be specified. What a can of worms… :)
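That desired/minimum scheme can be modeled in a few lines. This is a toy sketch of the write-quorum idea from distributed stores, not a real mergerfs or filesystem API; `writers` are hypothetical callables standing in for per-branch writes:

```python
def quorum_write(writers, data, desired, minimum):
    """Try up to `desired` replica writes; succeed iff >= `minimum` do."""
    succeeded = 0
    errors = []
    for write in writers[:desired]:
        try:
            write(data)
            succeeded += 1
        except OSError as e:
            errors.append(e)  # kept so a caller could report partial failure
    if succeeded < minimum:
        raise OSError(f"{succeeded} of {minimum} required writes succeeded")
    return succeeded
```

Even this toy raises the can-of-worms questions: a write that meets `minimum` but not `desired` still leaves replicas out of sync, and someone has to decide who repairs them.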

But now you have two files out of sync. What if it doesn't fail but writes 2k? Just keep trying till it succeeds? Easy to do but usually indicates something else is going on.

Same as for unlinking: writing must fail, and the inconsistency remains. An argument for this point of view could be that with RAID, the file system uses pooled disks, errors are handed up, and the file system must take care of them; with mergerfs, the user uses pooled file systems and must take care of errors.

For reading, mergerfs already has everything in place, no (e.g., read the copy with the youngest mtime, etc.)?

Do you do the writes at the same time and return when both are done or do you write to one and dispatch the others async on a thread?

Hm, an implementation detail, but I'd intuitively go for parallelism.

There are a lot of things to consider, and making that all configurable and explainable isn't easy. And people won't even notice until something bad happens: a drive dies, the network goes down, or whatever.

True… it would definitely add complexity to the code, the docs, and users' understanding.

Whereas if they just use rsync or rclone like they normally would to clone data, they can catch these errors much more easily and fully manage when and how data is duplicated. So that would be my question: why not just rsync? All the gdrive people basically do that all the time. 2 rclone mounts with the same data on both, put there by an out-of-band rclone call.

FWIW, this is a setup for use cases different from "FUSE file system, please pool those mounts". It doesn't work well with mounts of different sizes or with odd numbers of mounts (e.g., 1G, 1G, 1G), and the out-of-band solution is either inefficient (e.g., rsyncing 1G for a 4K change) or less integrated (e.g., hackery with inotify).

What do you think: if one were to test this on a branch, would it end in a maintenance nightmare?

Unfortunately, I don't have time to help implement this. At least this issue can maybe serve as a place to "thumbs up" and subscribe?

trapexit commented 3 years ago

For reading, mergerfs has everything in place, no (e.g., read youngest mtime, etc.)?

Depends on the use case. Some people want read HA where a read failure results in a retry. And the largest mtime doesn't mean that there wasn't an issue. What if the file is opened RW and used that way? If a write fails, the fact that it failed should likely impact the reads (read from a different branch?). And what do you do about write failures? Do you try to unlink it? Do you write to a log somewhere that the particular underlying file is "bad"? What do you do when you open the file? What if multiple files already exist? If you have N=2 for replicas and you find 4 files, do you unlink 2? Open all 4? Does it check to see if the files are the same size? Same data? What if they aren't?

What do you think, if one would test this on branch, would this end in a maintenance nightmare?

That's not a concern. The concern is 1) having a very clear understanding of the intended behavior and accounting for everything I've mentioned prior, and 2) the fact that I know someone will want something similar but different, and implementing the N versions of this will be nontrivial without considering all the different designs upfront.

Not saying it's out of the question... I'm saying that this is always more complicated than people tend to make it out to be and in the past even getting them to understand all the edge cases and deciding how they'd want to see them handled often didn't go well.

lpirl commented 3 years ago

Very true. So what would be the first step towards designing all this? Writing down all the edge cases? Deciding on a fault model? …? I could write "start a list of all the edge cases" on my to-do list. :)

trapexit commented 3 years ago

Very true. So what would be the first step towards designing all this? Writing down all the edge cases? Deciding on a fault model?

I suppose. All those questions have to be answered.

19wolf commented 3 years ago

FWIW, my use case would be to have MergerFS write a "backup" (and I mean that very loosely) file to a different drive. In my case it could write File to driveA and File.backup to driveB (but not actually change the extension). MergerFS would treat the "backup" as read-only and rewrite it when the main file is updated, unless the user specifically tells MergerFS to use the "backup", in the case of a drive failure for example (some option like "treat anything missing from driveA as real").

trapexit commented 3 years ago

I'd prefer a description of the problem and not the solution. There are lots of ways to get redundancy and duplication. Creating a fake raid1 kind of thing really isn't what mergerfs is intended to offer. Why would it need to be in mergerfs vs. doing what others do today, which is rsync files across drives or use the dup tool and let mergerfs' policies manage the rest?