
Reed-Solomon layer over IPFS #196

Open · jackie-scholl opened this issue 7 years ago

jackie-scholl commented 7 years ago

Consider this scenario:

You are some entity with a whole lot of data. You're using IPFS to store and retrieve that data for your servers (either public cloud or on-prem). You're probably using an (as-yet-unimplemented) private IPFS network to keep your files totally safe. To ensure resiliency, you always make sure that each file you care about is stored on IPFS nodes in at least three geographically isolated locations, out of the 10 locations that you have storage servers in. That means that for every GB of storage you need, you need 3 GB of disk space. But when you look over to your peers, they manage to store more data with less disk and higher resiliency using Reed-Solomon erasure coding! In fact, your friends at Facebook are using 10-of-14 encoding, which allows them to survive the failure of any 4 nodes out of the 14 storing their data while only using 1.4 GB of disk per GB of storage. But you're using IPFS, so you can't achieve that efficiency. Or can you?
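
Just to spell the arithmetic out (nothing IPFS-specific here, purely illustrative): with a k-of-n code you spend n/k GB of disk per GB of storage and survive the loss of any n-k shards, versus 3 GB and 2 lost copies for triple replication.

```go
package main

import "fmt"

func main() {
	// Triple replication: 3 full copies, survives losing any 2 of them.
	fmt.Printf("3x replication: %.1f GB per GB, tolerates %d failures\n", 3.0, 2)

	// 10-of-14 Reed-Solomon: 14 shards, any 10 of them recover the file.
	k, n := 10, 14
	fmt.Printf("RS %d-of-%d:    %.1f GB per GB, tolerates %d failures\n",
		k, n, float64(n)/float64(k), n-k)
}
```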

A Possible Architecture for Reed-Solomon on IPFS

(Note: I use the term "availability zone" here to indicate groups of servers that are likely to fail together. I assume failures between availability zones are independent.)

I present here an idea for how one might achieve the efficiency of Reed-Solomon storage and also preserve many of the benefits of IPFS.

There are four components to this architecture:

  1. A coordination service, which decides where data should live and keeps the mapping from source files to their chunk directories
  2. Storage nodes, which simply pin whatever the coordination service tells them to
  3. Encoding nodes, which turn a source file into Reed-Solomon chunks
  4. Decoding nodes, which reassemble a source file from its chunks on demand

Here's how the basic flow works:

  1. Some application puts a file in IPFS that needs to be reliably stored
  2. The application tells the coordination service that the file should be stored
  3. The coordination service chooses three storage nodes in different availability zones and tells them each to pin the file
  4. The file is accessed for a while, but then requests for the file start coming less and less frequently
  5. The coordination service decides that the file is getting accessed rarely enough that it should go into "cold storage"
  6. The coordination service picks an encoding node (at random?) that's not busy, and tells it to RS-encode this file with (just as an example) 7-of-10 resiliency (also known as RS(7, 10))
  7. The encoding node pins the "source" file, runs the Reed-Solomon algorithm, and generates 10 chunks such that any 7 of them can be used to perfectly recreate the file (see the sketch after this list). The chunks are self-describing, in that each one includes all the parameters an algorithm will need to later decode the chunks
  8. The encoding node makes each chunk its own file, and puts them all in a directory
  9. The encoding node puts the directory on IPFS
  10. The encoding node returns the reference (hash, address, whatever you want to call it) of the directory to the coordination service
  11. The coordination service stores this as a key-value pair, where the key is the reference of the source file and the value is the reference to the directory returned by the encoder
  12. The coordination service steps through the directory and, for each file in it, tells a storage node in a different availability zone to pin that file. In this case there are 10 chunk files, one per availability zone
  13. Once the coordination service has confirmed that each storage node has the file it needs, the source file can be considered reliably stored in these nodes, but it is not directly accessible from them
  14. The coordination service chooses at least one (but probably many) decoding node(s), and sends them the key-value pair from before
  15. The decoding nodes store this key-value pair, and also they pin (directly, not recursively) the value part, which if you'll remember was the directory with the list of chunks
  16. The decoding nodes now announce that they are serving the original/source file (the key part of the key-value pair), despite the fact that it is not stored locally
  17. At this point, the source file is highly available through the storage/decoding system, so the coordination service is able to tell those original three storage nodes that they don't need to store the source file. It's also able to tell the encoding node that it doesn't need to store any of that data any more
  18. Whenever a request comes in for the original file, the decoding node looks up the value for that key, and recursively pins that value reference
  19. This causes IPFS to download each chunk, which will end up coming from those 10 storage nodes from before
  20. The decoding node actively watches the directory as the chunks come in
  21. As soon as the decoding node has 7 chunks, it is able to reassemble the source file
  22. As soon as the source file is reassembled, it can be served to the requester, and the decoding node can erase the assembled file and also un-pin the chunks that it used
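
To make step 7 a bit more concrete, here's a rough sketch of what the encode step could look like in Go, using the klauspost/reedsolomon library (my choice, just one option) and shelling out to the regular ipfs CLI for steps 9 and 10. The chunk-NN file names and the directory layout are made up for illustration, and a real encoder would also prefix each chunk with the code parameters so the chunks are self-describing as described above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"

	"github.com/klauspost/reedsolomon"
)

// encodeToDir splits data into 7 data shards plus 3 parity shards (RS(7, 10)
// in the notation above) and writes each shard to its own file in dir, so the
// whole directory can be added to IPFS in one go.
func encodeToDir(data []byte, dir string) error {
	enc, err := reedsolomon.New(7, 3) // 7 data shards, 3 parity shards
	if err != nil {
		return err
	}
	shards, err := enc.Split(data) // slice (and pad) the file into shards
	if err != nil {
		return err
	}
	if err := enc.Encode(shards); err != nil { // fill in the 3 parity shards
		return err
	}
	for i, shard := range shards {
		name := filepath.Join(dir, fmt.Sprintf("chunk-%02d", i))
		if err := os.WriteFile(name, shard, 0o644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	data, err := os.ReadFile("source.bin") // the pinned source file
	if err != nil {
		panic(err)
	}
	if err := os.MkdirAll("shards", 0o755); err != nil {
		panic(err)
	}
	if err := encodeToDir(data, "shards"); err != nil {
		panic(err)
	}
	// Steps 9-10: put the directory on IPFS and report its reference back.
	ref, err := exec.Command("ipfs", "add", "-r", "-Q", "shards").Output()
	if err != nil {
		panic(err)
	}
	fmt.Printf("chunk directory reference: %s", ref)
}
```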

One of the amazing things is that I believe this could be done without touching the IPFS client code at all. Obviously the coordination service, storage nodes, and encoding nodes are all using IPFS in pretty mundane ways, but because of the amazing property that IPFS can be mounted as a filesystem, and stored files are just a directory on that filesystem, I think one could implement the decoder as a program that only touches the filesystem and perhaps doesn't even "know" about IPFS. The way you would do this is to implement a FUSE filesystem that pretends to have each of the files that the decoder node should be broadcasting. You then tell the IPFS client that its files should be stored in some directory on this filesystem, and it will detect all the files the decoder is pretending to have. When a request comes in through IPFS for some object, the IPFS client will turn around and ask the FUSE filesystem for that object. The decoder will, in turn, look at the directory it has stored and ask the IPFS mount for each of the chunks. The IPFS mount will ask the IPFS client, which will go and fetch the chunks (from the storage servers) and pass them back to the mount, which will pass them up to the decoder. The decoder waits until it has enough chunks (7 in this example), recombines them, and passes the resulting file (which will be exactly the source file) up to the IPFS client, which serves it to the original requester. So, yeah, a bit complicated and a bit slow, but I think it would work!
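
For completeness, here is the matching decode sketch for steps 18-22, again assuming the same hypothetical chunk layout and the klauspost/reedsolomon library. The reconstruct name and the origSize parameter are mine; in practice origSize would come out of the self-describing chunk metadata rather than being passed in by hand.

```go
package decoder

import (
	"io"

	"github.com/klauspost/reedsolomon"
)

// reconstruct rebuilds the source file from whichever chunks have arrived so
// far. Chunks that are still missing are passed as nil entries; any 7 of the
// 10 are enough. origSize is needed because the encoder pads the last data
// shard to make all shards the same size.
func reconstruct(shards [][]byte, origSize int, out io.Writer) error {
	dec, err := reedsolomon.New(7, 3) // must match the encoding parameters
	if err != nil {
		return err
	}
	// Recompute any missing shards from the ones that did arrive; this
	// fails if fewer than 7 shards are available.
	if err := dec.Reconstruct(shards); err != nil {
		return err
	}
	// Concatenate the 7 data shards and trim the padding.
	return dec.Join(out, shards, origSize)
}
```

Whether that function runs behind the FUSE layer described above or in a small daemon that talks to the IPFS API directly is an implementation detail; the decode path is the same either way.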

Some advantages to this architecture:

Some disadvantages:

(P.S. I'm interested in any comments/questions/suggestions/whatever that anyone might have!) (P.P.S. I wasn't sure if this was the right place to post this? I'd be happy to move it if that'd be helpful.)

jackie-scholl commented 7 years ago

A couple of updates:

hsanjuan commented 7 years ago

Hi @raptortech-js ! I was reading this and thinking it is an awesome idea for ipfs-cluster (which I am working on), and that it fits the model well! Right now we're just trying to have it do the basic thing (multipin) reliably, but it is clear that in the future it will need to incorporate more elaborate strategies that use storage more efficiently.

> How does that coordination service work? We're asking it to keep a lot of data. Would it end up having to be centralized?

The current approach is to keep a distributed op-log which builds the same shared state on the different nodes, which at least means that coordination tasks can be taken over by another node if the leader goes down. However, scaling needs dictate that this coordination be kept simple and low-overhead, and that anything that can be distributed should be. A system like this probably requires thinking carefully about how much work can be offloaded from the leader. And there are still a few challenges involved in such a proposal, like reliably detecting that a file is eligible for cold storage across the whole cluster (keep a hit counter on every node and add them up? make all requests travel through a centralized master?), or figuring out whether it would be painfully slow to recompose the original files (for a start, right now IPFS cannot pin two things at the same time, but I'm hoping this changes with the new storage system), etc.
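
To give an idea of what keeping the coordination state simple means in practice: ignoring what the current code actually looks like, I'd imagine the entry we keep per pin in the shared state would have to grow into something roughly like this (just a sketch, not real ipfs-cluster types; all names made up):

```go
package cluster

// PinEntry is a rough sketch (not ipfs-cluster's actual state format) of what
// the op-log would need to track per file to support the cold-storage flow
// described in this issue.
type PinEntry struct {
	Cid         string   // reference of the source file
	Allocations []string // peers asked to pin it, one per availability zone

	// Set once the file has been moved to cold storage: the directory of
	// Reed-Solomon chunks that stands in for the source file, plus the
	// code parameters needed to rebuild it.
	ChunkDirCid  string
	DataShards   int // e.g. 7
	ParityShards int // e.g. 3
}
```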

In any case, thank you for this idea; I'm taking it into account so as to design a flexible architecture that can support things like this (ideally as plugins) in the future without much hassle. I think it would also be good if you opened an issue in ipfs-cluster referencing this one, where we can discuss more concretely what an ipfs-cluster implementation would look like and which particular problems would need to be solved first.