ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS
https://awesome.ipfs.io/datasets

Main data.gov Epic: Replicate 350 TB of Data Between 3 Peers (and then the World) #104

Open flyingzumwalt opened 7 years ago

flyingzumwalt commented 7 years ago

The Main Epic: Replicate 350 TB of Data Between 3 Peers (and the World)

People (hypothetical):

- Jack: has all of data.gov downloaded to storage on Stanford's network and adds it to IPFS
- Michelle and Amy: collaborators at other institutions who pin the data on their own IPFS nodes, replicating it

Technical Considerations:

If we can roll out filestore in time (see #95 and #91), we can update this plan to have Jack tell ipfs to "track" the data rather than "add" it to ipfs. This would allow him to serve his original copy of the dataset directly to the network without creating a second copy on his local machines. In the meantime, we can start the experiment using ipfs add with smaller volumes of data (i.e. 5-10 TB). This will allow us to start surfacing and addressing issues around:
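For concreteness, here is a rough sketch of the difference between the two approaches. The path is hypothetical, and the filestore flags shown are the experimental ones go-ipfs ended up shipping; the exact interface was still being settled in #95/#91 at the time.

```sh
# Standard add: data is chunked and the blocks are copied into ~/.ipfs,
# roughly doubling local disk usage for the dataset.
ipfs add -r /mnt/datagov

# Filestore-style "tracking" (experimental): blocks reference the original
# files on disk instead of being copied into the repo.
ipfs config --json Experimental.FilestoreEnabled true
ipfs add -r --nocopy /mnt/datagov
```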

Advance Prep: Downloading the Data & Setting up the Network

  1. #113 Jack downloads all of data.gov (~350 TB) to storage devices on Stanford's network.

  2. #114 Jack, Michelle and Amy install and configure ipfs.
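A minimal sketch of the per-node setup, assuming a three-node private testbed. The peer addresses, PeerIDs, and storage limit are placeholders, and the swarm.key private-network mechanism was still experimental in go-ipfs at the time.

```sh
ipfs init

# Raise the repo's storage limit well above the default (placeholder value):
ipfs config Datastore.StorageMax 400TB

# Keep the testbed isolated: drop the public bootstrap peers and add the
# other two nodes instead (addresses and PeerIDs below are placeholders).
ipfs bootstrap rm --all
ipfs bootstrap add /ip4/10.0.0.2/tcp/4001/ipfs/QmPeerIdOfMichelleXXXXXXXXXXXXXXXXXXXXXXXX
ipfs bootstrap add /ip4/10.0.0.3/tcp/4001/ipfs/QmPeerIdOfAmyXXXXXXXXXXXXXXXXXXXXXXXXXXXX

# Optionally copy a shared swarm.key to ~/.ipfs/swarm.key on all three nodes
# so they form a fully private network, then start the daemon.
ipfs daemon &
```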

Test-run: 5 TB

  1. [awaiting instructions] Everyone sets up the monitoring tools so they can report on performance and provide info in case of errors.
  2. #117 Jack adds the first 5 TB to IPFS. The hashes get published to the testbed network's DHT.
  3. Jack gives the root hash for the dataset to Michelle and Amy.
  4. #118 Michelle and Amy pin the root hash on their ipfs nodes. The nodes replicate all of the data.
  5. #119 Michelle and Amy run tests to confirm that the data were successfully replicated (see the sketch after this list).
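As a rough illustration of steps 4-5, the pin-and-verify pass on Michelle's and Amy's nodes might look like the following. The root hash is a placeholder, and the exact verification tests are still to be defined.

```sh
# Pin the root hash received from Jack (placeholder value):
ROOT=QmExampleRootHashXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ipfs pin add --progress $ROOT               # fetches and pins the entire DAG

# Basic checks that replication completed:
ipfs pin ls --type=recursive | grep $ROOT   # the recursive pin is recorded
ipfs refs -r $ROOT > /dev/null && echo "all blocks resolvable locally"
ipfs repo stat                              # repo size should reflect ~5 TB of blocks

# Bandwidth figures for the performance reports:
ipfs stats bw
```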

Test-runs: 50 TB, 100 TB, 300 TB

Jack gradually adds more of the dataset to ipfs, giving the new root hashes to Michelle and Amy. They replicate the data.
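A sketch of what each incremental round might look like (the directory path is hypothetical). Because IPFS addresses blocks by hash, data replicated in earlier rounds is deduplicated and only the new blocks cross the network.

```sh
# Jack, after more of data.gov has landed in the same directory:
NEW_ROOT=$(ipfs add -r -Q /mnt/datagov)   # -Q prints only the new root hash
echo "new root: $NEW_ROOT"                # hand this to Michelle and Amy

# Michelle and Amy pin the new root; blocks already present from earlier
# test-runs are not fetched again.
ipfs pin add --progress $NEW_ROOT
```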

Move to the Public Network

After testing is complete, switch the nodes to the public/default IPFS network. Provide the blocks on the DHT and publish the root hashes so that anyone in the general public can pin them.
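Assuming the testbed was isolated with a custom bootstrap list (and possibly a swarm.key), the switch on each node might look roughly like this; the IPNS publish at the end is optional, and $ROOT is the placeholder root hash from the earlier sketches.

```sh
# Stop the ipfs daemon on each node first, then:
rm ~/.ipfs/swarm.key                 # drop the private-network key, if one was used
ipfs bootstrap add --default         # restore the default public bootstrap peers
ipfs daemon &                        # rejoin; the node now provides its pinned blocks on the public DHT

# Optionally publish a stable, updatable pointer to the current root hash:
ipfs name publish $ROOT
```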

Follow-up

At the end of the sprint, we will need to follow up on a lot of things. See #103

flyingzumwalt commented 7 years ago

UPDATE: Based on initial crawls of the first 3000 datasets, @mejackreed has revised his estimates of the total size of data.gov. The entire corpus of data.gov might only be between 1 TB and 10 TB. We have identified at least one other large climate dataset that we will try to download in addition to data.gov.

How this impacts the experiment

If it does turn out that the entire data.gov corpus is under 10 TB, it will impact this experiment in a couple of ways:

  1. More people will be able to participate in the network, pinning the entire corpus on their IPFS nodes.
  2. The additional datasets, like this 30 TB NOAA dataset, will be included in the experiment and replicated to institutional collaborators in order to test the system and back up those datasets temporarily, but it will be easy to pin or skip them independently of the main data.gov corpus (see the sketch below). At the very least, it will be much easier to find new homes for those datasets and move them to those homes over IPFS.
  3. The IPFS team will have to find an even bigger dataset to test our systems at loads over 100 TB. 😄
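For illustration, pinning the corpora independently might look like this (the root hashes are placeholders):

```sh
# Each corpus has its own root hash, so collaborators choose what to hold:
DATAGOV_ROOT=QmDataGovRootPlaceholderXXXXXXXXXXXXXXXXXXXXXX
NOAA_ROOT=QmNoaaRootPlaceholderXXXXXXXXXXXXXXXXXXXXXXXXXXXX

ipfs pin add --progress $DATAGOV_ROOT   # everyone pins the main data.gov corpus
ipfs pin add --progress $NOAA_ROOT      # only nodes with spare capacity pin the NOAA data

# Dropping a dataset later is just as granular:
ipfs pin rm $NOAA_ROOT && ipfs repo gc
```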