
IPFS Project && Working Group Roadmaps Repo

[2021 Theme Proposal] probabilistic pinning (Cluster) #84

Closed RubenKelevra closed 1 year ago

RubenKelevra commented 3 years ago

Note, this is part of the 2021 IPFS project planning process - feel free to add other potential 2021 themes for the IPFS project by opening a new issue or discuss this proposed theme in the comments, especially other example workstreams that could fit under this theme for 2021. Please also review others’ proposed themes and leave feedback here!

Theme description

Currently, the IPFS Cluster daemon decides which members of a cluster should hold a specific pin at the moment the pin is added to the cluster pinset.

This process requires the pinset to be updated quite frequently as cluster members leave and join - if you select a fixed number of copies. So many (if not all) clusters currently use the 'raid-1' approach, where every cluster member has to hold the entire pinset.

This is not only inefficient, it also reduces the chance that people are willing to take part, since multiple dozens of gigabytes is still a lot of storage.

Even when someone takes part with a home computer, the new cluster member will have a hard time ever delivering back to the network the same amount of traffic it just downloaded, because of slow upload rates. We can fix this by using probabilistic pinning, described by @hsanjuan and explored a little more by me previously in this ticket.

Hypothesis

In short, a cluster wouldn't define how many copies of the data should be held by the cluster, like now, but instead how likely it is that a cluster member will hold a random pin.

If you select 10% as the probability, a tenth of the pinset will end up on each cluster member on average, without having to specify any details about that in the cluster pin - each cluster member can calculate this on its own, just by reading the cluster pinset:

The cluster member ID and the CID of a pin are hashed together, and when the hash falls within the specified 10% of the hash range, that member pins it.
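A minimal sketch of that selection rule (my illustration, not actual ipfs-cluster code - the hash function, the member-ID/CID concatenation and the threshold comparison are all assumptions):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// shouldPin reports whether this member is responsible for the CID,
// given a pinning probability p in [0, 1]: hash (memberID, CID) and
// pin when the hash falls into the first p-fraction of the hash range.
func shouldPin(memberID, cid string, p float64) bool {
	h := sha256.Sum256([]byte(memberID + "/" + cid))
	// Interpret the first 8 bytes of the hash as a uniform value in [0, 1).
	v := float64(binary.BigEndian.Uint64(h[:8])) / float64(^uint64(0))
	return v < p
}

func main() {
	member := "QmExampleMemberID" // hypothetical cluster peer ID
	cid := "QmExamplePinCID"      // hypothetical pinned CID
	fmt.Println(shouldPin(member, cid, 0.10)) // true for roughly 10% of CIDs
}
```

Every member evaluates the same deterministic rule over the same inputs, so no per-pin allocation has to be written to the pinset.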

There are some more optimizations I've discussed in the ticket: since every cluster member can calculate where the data should be stored, we don't have to ask the DHT or all cluster members - we can ask the right ones directly.
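For example, any node can reuse the same rule to compute the likely holders of a CID from the known member list (again just a sketch, building on the hypothetical shouldPin helper above):

```go
// likelyHolders returns the members expected to hold the given CID,
// so requests can go to them directly instead of querying the DHT or
// broadcasting to the whole cluster.
func likelyHolders(members []string, cid string, p float64) []string {
	var holders []string
	for _, m := range members {
		if shouldPin(m, cid, p) {
			holders = append(holders, m)
		}
	}
	return holders
}
```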

It would also allow each cluster member to specify a maximum amount of data to store on its node, by lowering its probability relative to the cluster default.

So you can, for example, opt to store more if you have enough hard drive space, or less when the cluster gets bigger.
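One way such a per-node override could be derived - purely a sketch, assuming the node knows or estimates the total pinset size; the function and its parameters are hypothetical:

```go
// localProbability caps this node's expected share of the pinset: it
// lowers the node's personal probability below the cluster default when
// the default would exceed the node's storage budget.
func localProbability(clusterDefault float64, pinsetBytes, maxLocalBytes uint64) float64 {
	if pinsetBytes == 0 {
		return clusterDefault
	}
	limit := float64(maxLocalBytes) / float64(pinsetBytes)
	if limit < clusterDefault {
		return limit
	}
	return clusterDefault
}
```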

It also allows us to 'grow' or 'shrink' the share of the cluster pinset each member holds by writing an updated default value to the pinset, for example when the cluster gets bigger but the amount of data stays the same.
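The expected numbers are easy to reason about: with N members and probability p, each pin ends up on about N*p members and each member holds about a p-fraction of the data. A tiny helper (hypothetical, just to illustrate the arithmetic) for keeping the expected copy count constant as the member count changes:

```go
// rescaleProbability returns a new default probability that keeps the
// expected number of copies per pin (oldMembers * p) constant after the
// cluster grows or shrinks, e.g. 100 members at 0.10 -> 200 members at 0.05.
func rescaleProbability(p float64, oldMembers, newMembers int) float64 {
	return p * float64(oldMembers) / float64(newMembers)
}
```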

Vision statement

This feature allows us to spread extremely large quantities of data across many volunteers, without each of them having to provide the full amount of space for the data.

This lets volunteers participate in all the clusters they like, instead of choosing just one or two - much like it has worked for years with BOINC, where you can simply select all the projects you like and contribute some processing time to each of them.

Why focus this year

There's talk of getting the Internet Archive into the IPFS network. Storing it all in a cluster makes sense to me - though I'm not sure whether a regular home computer could handle all the pinset metadata.

Anyway - no single server can hold this amount of data, so we need to split it up. The current methods don't really allow us to store the data safely in a cluster (making sure that not all nodes holding a given piece of data go offline at the same time) while also splitting it and letting anybody with a computer take part.

With probabilistic pinning there's no risk that someone stores all pieces of the data and then goes offline, since it's essentially random who holds what, and every new node directly takes over a random part of the pinset without anything having to be added to it.

This also avoids degradation of a cluster where old data is held by the oldest members with the oldest hard drives, which might fail at the same time or while replicating.

Example workstreams

Tl;dr:

This is a method for allowing a cluster to get very large, by storing a pseudo-random part of the pinset on each node.

This could allow us to store extremely large amounts of data in the cluster without any single person having to use up their entire hard drive.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

RubenKelevra commented 1 year ago

@2color do I now have to bump my roadmap proposals once a month to keep them open, or what's the idea here?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

wenyue commented 6 months ago

What is the status of this feature? I think this is a great idea.

RubenKelevra commented 6 months ago

@hsanjuan can maybe answer that