
Reed-Solomon layer over IPFS #196

Open · jackie-scholl opened this issue 7 years ago

jackie-scholl commented 7 years ago

Consider this scenario:

You are some entity with a whole lot of data. You're using IPFS to store and retrieve that data for your servers (either public cloud or on-prem). You're probably using an (as-yet-unimplemented) private IPFS network to keep your files totally safe. To ensure resiliency, you always make sure that each file you care about is stored on IPFS nodes in at least three geographically isolated locations, out of the 10 locations that you have storage servers in. That means that for every GB of storage you need, you need 3 GB of disk space. But when you look over to your peers, they manage to store more data with less disk and higher resiliency using Reed-Solomon erasure coding! In fact, your friends at Facebook are using 10-of-14 encoding, which allows them to survive the failure of any 4 nodes out of the 14 storing their data while only using 1.4 GB of disk per GB of storage. But you're using IPFS, so you can't achieve that efficiency. Or can you?
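
Just to spell the arithmetic out (nothing IPFS-specific here, purely illustrative): with a k-of-n code you spend n/k GB of disk per GB of storage and survive the loss of any n-k shards, versus 3 GB and 2 lost copies for triple replication.

```go
package main

import "fmt"

func main() {
	// Triple replication: 3 full copies, survives losing any 2 of them.
	fmt.Printf("3x replication: %.1f GB per GB, tolerates %d failures\n", 3.0, 2)

	// 10-of-14 Reed-Solomon: 14 shards, any 10 of them recover the file.
	k, n := 10, 14
	fmt.Printf("RS %d-of-%d:    %.1f GB per GB, tolerates %d failures\n",
		k, n, float64(n)/float64(k), n-k)
}
```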

A Possible Architecture for Reed-Solomon on IPFS

(Note: I use the term "availability zone" here to indicate groups of servers that are likely to fail together. I assume failures between availability zones are independent.)

I present here an idea for how one might achieve the efficiency of Reed-Solomon storage and also preserve many of the benefits of IPFS.

There are four components to this architecture:

  1. A coordination service, which decides where data should live and keeps the mapping from source files to their chunk directories
  2. Storage nodes, which simply pin whatever the coordination service tells them to
  3. Encoding nodes, which turn a source file into Reed-Solomon chunks
  4. Decoding nodes, which reassemble a source file from its chunks on demand

Here's how the basic flow works:

  1. Some application puts a file in IPFS that needs to be reliably stored
  2. The application tells the coordination service that the file should be stored
  3. The coordination service chooses three storage nodes in different availability zones and tells them each to pin the file
  4. The file is accessed for a while, but then requests for the file start coming less and less frequently
  5. The coordination service decides that the file is getting accessed rarely enough that it should go into "cold storage"
  6. The coordination service picks an encoding node (at random?) that's not busy, and tells it to RS-encode this file with (just as an example) 7-of-10 resiliency (also known as RS(7, 10))
  7. The encoding node pins the "source" file, runs the Reed-Solomon algorithm, and generates 10 chunks such that any 7 of them can be used to perfectly recreate the file (see the sketch after this list). The chunks are self-describing, in that each one includes all the parameters an algorithm will need to later decode the chunks
  8. The encoding node makes each chunk its own file, and puts them all in a directory
  9. The encoding node puts the directory on IPFS
  10. The encoding node returns the reference (hash, address, whatever you want to call it) of the directory to the coordination service
  11. The coordination service stores this as a key-value pair, where the key is the reference of the source file and the value is the reference to the directory returned by the encoder
  12. The coordination service steps through the directory and, for each file in it, tells a storage node in a different availability zone to pin that file. In this case there are 10 chunk files, one per availability zone
  13. Once the coordination service has confirmed that each storage node has the file it needs, the source file can be considered reliably stored in these nodes, but it is not directly accessible from them
  14. The coordination service chooses at least one (but probably many) decoding node(s), and sends them the key-value pair from before
  15. The decoding nodes store this key-value pair, and also they pin (directly, not recursively) the value part, which if you'll remember was the directory with the list of chunks
  16. The decoding nodes now announce that they are serving the original/source file (the key part of the key-value pair), despite the fact that it is not stored locally
  17. At this point, the source file is highly available through the storage/decoding system, so the coordination service is able to tell those original three storage nodes that they don't need to store the source file. It's also able to tell the encoding node that it doesn't need to store any of that data any more
  18. Whenever a request comes in for the original file, the decoding node looks up the value for that key, and recursively pins that value reference
  19. This causes IPFS to download each chunk, which will end up coming from those 10 storage nodes from before
  20. The decoding node actively watches the directory as the chunks come in
  21. As soon as the decoding node has 7 chunks, it is able to reassemble the source file
  22. As soon as the source file is reassembled, it can be served to the requester, and the decoding node can erase the assembled file and also un-pin the chunks that it used
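
To make step 7 a bit more concrete, here's a rough sketch of what the encode step could look like in Go, using the klauspost/reedsolomon library (my choice, just one option) and shelling out to the regular ipfs CLI for steps 9 and 10. The chunk-NN file names and the directory layout are made up for illustration, and a real encoder would also prefix each chunk with the code parameters so the chunks are self-describing as described above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"

	"github.com/klauspost/reedsolomon"
)

// encodeToDir splits data into 7 data shards plus 3 parity shards (RS(7, 10)
// in the notation above) and writes each shard to its own file in dir, so the
// whole directory can be added to IPFS in one go.
func encodeToDir(data []byte, dir string) error {
	enc, err := reedsolomon.New(7, 3) // 7 data shards, 3 parity shards
	if err != nil {
		return err
	}
	shards, err := enc.Split(data) // slice (and pad) the file into shards
	if err != nil {
		return err
	}
	if err := enc.Encode(shards); err != nil { // fill in the 3 parity shards
		return err
	}
	for i, shard := range shards {
		name := filepath.Join(dir, fmt.Sprintf("chunk-%02d", i))
		if err := os.WriteFile(name, shard, 0o644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	data, err := os.ReadFile("source.bin") // the pinned source file
	if err != nil {
		panic(err)
	}
	if err := os.MkdirAll("shards", 0o755); err != nil {
		panic(err)
	}
	if err := encodeToDir(data, "shards"); err != nil {
		panic(err)
	}
	// Steps 9-10: put the directory on IPFS and report its reference back.
	ref, err := exec.Command("ipfs", "add", "-r", "-Q", "shards").Output()
	if err != nil {
		panic(err)
	}
	fmt.Printf("chunk directory reference: %s", ref)
}
```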

One of the amazing things is that I believe this could be done without touching the IPFS client code at all. Obviously the coordination service, storage nodes, and encoding nodes are all using IPFS in pretty mundane ways, but because of the amazing property that IPFS can be mounted as a filesystem, and stored files are just a directory on that filesystem, I think one could implement the decoder as a program that only touches the filesystem and perhaps doesn't even "know" about IPFS. The way you would do this is to implement a FUSE filesystem that pretends to have each of the files that the decoder node should be broadcasting. You then tell the IPFS client that its files should be stored in some directory on this filesystem, and it will detect all the files the decoder is pretending to have. When a request comes in through IPFS for some object, the IPFS client will turn around and ask the FUSE filesystem for that object. The decoder will, in turn, look at the directory it has stored and ask the IPFS mount for each of the chunks. The IPFS mount will ask the IPFS client, which will go and fetch the chunks (from the storage servers) and pass them back to the mount, which will pass them up to the decoder. The decoder waits until it has enough chunks (7 in this example), recombines them, and passes the resulting file (which will be exactly the source file) up to the IPFS client, which serves it to the original requester. So, yeah, a bit complicated and a bit slow, but I think it would work!
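
For completeness, here is the matching decode sketch for steps 18-22, again assuming the same hypothetical chunk layout and the klauspost/reedsolomon library. The reconstruct name and the origSize parameter are mine; in practice origSize would come out of the self-describing chunk metadata rather than being passed in by hand.

```go
package decoder

import (
	"io"

	"github.com/klauspost/reedsolomon"
)

// reconstruct rebuilds the source file from whichever chunks have arrived so
// far. Chunks that are still missing are passed as nil entries; any 7 of the
// 10 are enough. origSize is needed because the encoder pads the last data
// shard to make all shards the same size.
func reconstruct(shards [][]byte, origSize int, out io.Writer) error {
	dec, err := reedsolomon.New(7, 3) // must match the encoding parameters
	if err != nil {
		return err
	}
	// Recompute any missing shards from the ones that did arrive; this
	// fails if fewer than 7 shards are available.
	if err := dec.Reconstruct(shards); err != nil {
		return err
	}
	// Concatenate the 7 data shards and trim the padding.
	return dec.Join(out, shards, origSize)
}
```

Whether that function runs behind the FUSE layer described above or in a small daemon that talks to the IPFS API directly is an implementation detail; the decode path is the same either way.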

Some advantages to this architecture:

Some disadvantages:

(P.S. I'm interested in any comments/questions/suggestions/whatever that anyone might have!) (P.P.S. I wasn't sure if this was the right place to post this? I'd be happy to move it if that'd be helpful.)

jackie-scholl commented 7 years ago

A couple of updates:

hsanjuan commented 7 years ago

Hi @raptortech-js ! I was reading this and thinking it is an awesome idea for ipfs-cluster (which I am working on), and that it fits the model well! Right now we're just trying to have it do the basic thing (multipin) reliably, but it is clear that in the future it will need to incorporate more elaborate strategies that use storage more efficiently.

> How does that coordination service work? We're asking it to keep a lot of data. Would it end up having to be centralized?

The current approach is to keep a distributed op-log which builds the same shared state on the different nodes, which at least means that coordination tasks can be taken over by another node if the leader goes down. However, scaling needs dictate that this coordination be kept simple and low-overhead, and that anything that can be distributed should be. A system like this probably requires thinking carefully about how much work can be offloaded from the leader. And there are still a few challenges involved in such a proposal, like reliably detecting that a file is eligible for cold storage across the whole cluster (keep a hit counter on every node and add them up? make all requests travel through a centralized master?), or figuring out whether it would be painfully slow to recompose the original files (for a start, right now IPFS cannot pin two things at the same time, but I'm hoping this changes with the new storage system), etc.
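
To give an idea of what keeping the coordination state simple means in practice: ignoring what the current code actually looks like, I'd imagine the entry we keep per pin in the shared state would have to grow into something roughly like this (just a sketch, not real ipfs-cluster types; all names made up):

```go
package cluster

// PinEntry is a rough sketch (not ipfs-cluster's actual state format) of what
// the op-log would need to track per file to support the cold-storage flow
// described in this issue.
type PinEntry struct {
	Cid         string   // reference of the source file
	Allocations []string // peers asked to pin it, one per availability zone

	// Set once the file has been moved to cold storage: the directory of
	// Reed-Solomon chunks that stands in for the source file, plus the
	// code parameters needed to rebuild it.
	ChunkDirCid  string
	DataShards   int // e.g. 7
	ParityShards int // e.g. 3
}
```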

In any case, thank you for this idea; I'm taking it into account so as to design a flexible architecture that can support things like this (ideally as plugins) in the future without much hassle. I think it would also be good if you opened an issue in ipfs-cluster referencing this one, where we can discuss more concretely what an ipfs-cluster implementation would look like and which particular problems would need to be solved first.