ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS
https://awesome.ipfs.io/datasets

Call for Participants/Collaborators for data.gov Sprint #107

Open · flyingzumwalt opened this issue 7 years ago

flyingzumwalt commented 7 years ago

What We're Doing

The IPFS team is working on an experiment with the Stanford University Libraries. This work is starting immediately, and we're looking for collaborators to participate in the experiment. The goal is to download all of data.gov (~350TB), add the data to IPFS nodes at Stanford, replicate it onto IPFS nodes at multiple collaborating institutions, and, through IPFS, allow anyone in the world to hold copies of the parts of data.gov they care about.
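As a rough sketch (not our exact tooling), replicating a harvested directory between two nodes with the ipfs CLI looks roughly like this; the ./data.gov-harvest path is a hypothetical placeholder:

```sh
# On the source node (e.g. at Stanford): add the harvested files.
# -r recurses into the directory; the last line printed is the root CID.
ipfs add -r ./data.gov-harvest

# On a collaborating node: pin that root CID so the content is
# fetched from peers and kept locally (substitute the real CID).
ipfs pin add <rootCID>
```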

Our objectives:

For detailed information about the work plan, see the GitHub issues for the work sprint and the main "Epic": Replicate 350 TB of Data Between 3 Peers (and then the World). (Note: these GitHub issues are subject to change.)

Who We're Looking to Collaborate With

Institutional Collaborators

We are looking to collaborate with institutions that can allocate 300+ TB of network-available storage on short notice. Ideal collaborators would be institutions with data archivists on staff, or organizations familiar with the efforts to archive data.gov.

Individual or Private Participants

When we've finished the first round of tightly coordinated tests, we will make the data available on the general IPFS network. That will be a great opportunity for everyone to help replicate the data and to help us improve the experience of using and running an IPFS node.

Our Timeline

We are beginning work on this project immediately. The IPFS team has allocated major resources for a two-week sprint, 16-27 January. After that sprint, community efforts and conversations will continue, but the IPFS engineers will turn their focus to other areas for the remainder of Q1.

How to Get Involved

To get involved, comment on this issue or contact @flyingzumwalt directly at matt at protocol dot ai

Q & A

What does an Institutional Collaborator need to provide?

How much storage do we need to allocate?

UPDATE: Based on initial crawls of the first 3000 datasets, we might need far less storage than we initially estimated. The total corpus of data.gov datasets might be less than 50TB, or even less than 10TB, but the actual numbers are difficult to estimate until we finish crawling all 192,000 datasets. However, we have identified other big datasets to replicate in addition to data.gov.

If the new estimates are true, then collaborators would be able to allocate far less than 300TB in order to participate. Note, however, that you might want to use spare storage to store redundant copies of the data or to store other datasets from other harvesting initiatives.

Original answer: Ideally institutional collaborators should allocate enough storage to hold the entire corpus of datasets. Our current estimate is 350TB.

What if we can't get that much storage right away?

Our first rounds of replication will be 5TB, 10TB, 50TB, 100TB, etc., so you could participate at those volumes even if you don't have 300TB available yet.
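If it helps with planning, one way to cap how much disk a go-ipfs node will use is the Datastore.StorageMax setting; the 50TB figure below is only an illustration of a partial allocation, not a recommendation:

```sh
# Set a soft limit on this node's repo size (illustrative value).
ipfs config Datastore.StorageMax "50TB"

# Read the setting back to confirm it.
ipfs config Datastore.StorageMax
```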

It will also be possible to “pin” specific datasets or subsets of the whole collection.
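As a sketch of what that might look like (the dataset path below is hypothetical), a node could pin a single dataset under the collection's root instead of the whole corpus:

```sh
# Pin one dataset directory beneath the root CID (hypothetical path).
ipfs pin add /ipfs/<rootCID>/noaa/some-dataset

# List everything this node has pinned recursively.
ipfs pin ls --type=recursive
```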

Do our machines need public IP addresses?

No, you don't need a public IP address. IPFS relies on peer-to-peer TCP connections, so as long as your machines can connect to the internet, our engineers will probably be able to help you connect your IPFS nodes to the other nodes. If you want or need help, create an issue in the ipfs/archives repository and we will help you out.
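For reference, here is a minimal sketch of connecting two nodes by hand with the ipfs CLI, assuming at least one side has a reachable address (the multiaddr below is a placeholder):

```sh
# On the reachable node: print its peer ID and listen addresses.
ipfs id

# On the other node: dial that peer using one of the printed
# multiaddrs (placeholder address and peer ID shown).
ipfs swarm connect /ip4/203.0.113.10/tcp/4001/p2p/<PeerID>

# Verify the two nodes are now connected.
ipfs swarm peers
```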

What kind of bandwidth will we need?

The more the better.
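If you want to see what a node is actually using, the ipfs CLI's built-in bandwidth stats are a quick check:

```sh
# Show total bytes in/out and current transfer rates for this node.
ipfs stats bw

# Keep polling and updating the numbers until interrupted.
ipfs stats bw --poll
```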

When do we need to make the storage available?

The bulk of the tests & replication work will happen next week (23-27 January) and will continue after that.

Why IPFS?

There are a number of benefits to creating decentralized archives with IPFS. For example:

Related Discussions:

Exciting Features in the Works

There are a number of work-in-progress IPFS features that apply to this endeavor. This experiment will accelerate work on some of them.

flyingzumwalt commented 7 years ago

UPDATE: Based on initial crawls of the first 3000 datasets, we might need far less storage than we initially estimated. The total corpus of data.gov datasets might be less than 50TB, or even less than 10TB, but the actual numbers are difficult to estimate until we finish crawling all 192,000 datasets.

If the new estimates are true, then collaborators would be able to allocate far less than 300TB in order to participate. Note, however, that you might want to use spare storage to store redundant copies of the data or to store other datasets from other harvesting initiatives.

flyingzumwalt commented 7 years ago

Update on the update: We're identifying other big datasets and adding them to the corpus, like this 30TB NOAA dataset, so we'll definitely have plenty of data to replicate!