UPDATE: Based on initial crawls of the first 3000 datasets, @mejackreed has revised his estimates of the total size of data.gov. The entire corpus of data.gov might only be between 1 TB and 10 TB. We have also identified at least one other large climate dataset that we will try to download in addition to data.gov.
If it does turn out that the entire data.gov corpus is under 10 TB, it will impact this experiment in a couple of ways:
## The Main Epic: Replicate 350 TB of Data Between 3 Peers (and the World)
People (hypothetical):

- Jack, who holds the downloaded copy of data.gov on storage attached to Stanford's network
- Michelle and Amy, who run peer nodes that replicate the data
Technical Considerations:
If we can roll out filestore in time (see #95 and #91), we can update this plan to have Jack tell ipfs to "track" the data rather than "add" it. This would allow him to serve his original copy of the dataset directly to the network without creating a second copy on his local machines. In the meantime, we can start the experiment using `ipfs add` with smaller volumes of data (i.e., 5-10 TB). This will allow us to start surfacing and addressing issues around adding and replicating data at this scale.
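To make the difference concrete, here is a rough sketch of the two ingestion styles. The path is a placeholder, and the filestore commands assume the experimental flag discussed in #95/#91 ships in roughly this form.

```bash
# Regular ingestion: chunks the files and copies every block into the
# local ipfs repo, roughly doubling the disk footprint of the dataset.
# The last line of output ("added <root hash> <name>") is the hash to
# hand to the other peers.
ipfs add -r /data/datagov        # /data/datagov is a placeholder path

# Filestore-style "tracking" (assumes #95/#91 land with the experimental
# --nocopy support): the repo stores references to the original files
# instead of copying the blocks, so no second copy is created.
ipfs config --json Experimental.FilestoreEnabled true
ipfs add -r --nocopy /data/datagov
```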
## Advance Prep: Downloading the Data & Setting up the Network
- #113 Jack downloads all of data.gov (~350 TB) to storage devices on Stanford's network
- #114 Jack, Michelle, and Amy install and configure ipfs (see the sketch below)
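A minimal node-setup sketch for #114, assuming the test-runs happen on a private testbed. The storage limit, address, and peer ID below are placeholders, not values from the plan.

```bash
# Run on each of the three nodes.
ipfs init

# Raise the datastore ceiling (the default is around 10GB) so large
# test-runs don't hit the garbage-collection watermark; 400TB is a placeholder.
ipfs config Datastore.StorageMax 400TB

# Keep the test-runs off the public network: drop the default bootstrap
# peers and point the nodes only at each other.
ipfs bootstrap rm --all
ipfs bootstrap add /ip4/10.0.0.1/tcp/4001/ipfs/QmJacksPeerIdPlaceholder

# Start the daemon so the node joins the testbed swarm.
ipfs daemon &
```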
## Test-run: 5 TB
- #117 Jack adds the first 5 TB to IPFS; the hashes get published to the testbed network's DHT (see the sketch after this list)
- #118 Michelle and Amy pin the root hash on their ipfs nodes; the nodes replicate all of the data
- #119 Michelle and Amy run tests to confirm that the data were successfully replicated
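A hedged sketch of what #117-#119 could look like on the command line. The batch path and root hash are placeholders, and the verification commands are spot-checks rather than a full test suite.

```bash
# On Jack's node: add the first 5 TB batch. The final line of output
# ("added <root hash> <name>") is the hash the peers will pin.
ipfs add -r /data/datagov/batch-001

# On Michelle's and Amy's nodes: pinning the root hash recursively
# fetches and replicates every block beneath it.
ipfs pin add -r QmRootHashPlaceholder

# Spot-checks that replication completed on each peer:
ipfs pin ls --type=recursive | grep QmRootHashPlaceholder
ipfs refs -r QmRootHashPlaceholder | wc -l   # block count should match Jack's node
ipfs repo stat                               # RepoSize should have grown by ~5 TB
```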
## Test-runs: 50 TB, 100 TB, 300 TB
Jack gradually adds more of the dataset to ipfs, giving the new root hashes to Michelle and Amy. They replicate the data.
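The scale-up rounds repeat the same add/pin cycle; a rough loop on Jack's side might look like the following (the per-batch directory layout is an assumption).

```bash
# Hypothetical scale-up on Jack's node: each round ingests another slice
# of the corpus and prints a new root hash to hand to Michelle and Amy.
for batch in /data/datagov/batch-*; do
  ipfs add -r "$batch" | tail -n 1   # last line: "added <root hash> <name>"
done
```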
## Move to the Public Network
After testing is complete, switch the nodes to the public/default IPFS network. Provide the blocks on the DHT and publish the root hashes for people in the general public to pin.
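If the testbed was isolated by clearing the bootstrap list (as in the setup sketch above), rejoining the public network is roughly the reverse; the root hash below is a placeholder.

```bash
# On all three nodes: restore the default public bootstrap peers and
# restart the daemon so the nodes join the main IPFS network.
ipfs bootstrap add --default
ipfs daemon &

# Anyone in the general public can then help replicate the corpus by
# pinning the published root hash.
ipfs pin add -r QmPublishedRootHashPlaceholder
```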
## Follow-up
At the end of the sprint, we will need to follow up on a lot of things. See #103.