ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS
https://awesome.ipfs.io/datasets

Download all of data.gov #113

Open · flyingzumwalt opened this issue 7 years ago

flyingzumwalt commented 7 years ago

For more info about this task, what we will do with the data, and how it relates to other archival efforts, see Issue #87

Story

Jack downloads all of the datasets from data.gov (~350 TB) to storage devices on Stanford's network.

What will be Downloaded

The data.gov website is a portal that allows you to find all the "open data" datasets published by US federal agencies. It currently lists over 190,000 datasets.

The goal is to download those datasets, back them up, and use IPFS to replicate the data across a network of participating/collaborating nodes.

@mejackreed has posted all of the metadata from data.gov, which contains pointers to the datasets and basic metadata about them. The metadata are in ckan.json files. You can view the metadata at https://github.com/OpenGeoMetadata/gov.data. That will be the main starting point for running all of the scripts that download the datasets.
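
For illustration, a minimal Python sketch of walking a local clone of that repo and pulling the resource URLs out of each ckan.json. It assumes each ckan.json holds a CKAN package record with a top-level `resources` list; adjust the key lookups if the records are wrapped differently.

```python
# Sketch: enumerate ckan.json files in a local clone of OpenGeoMetadata/gov.data
# and list the resource URLs they point to.
import json
import os

REPO_ROOT = "gov.data"  # hypothetical path to a local clone

def iter_ckan_records(root):
    """Yield (path, record) for every ckan.json found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "ckan.json" in filenames:
            path = os.path.join(dirpath, "ckan.json")
            with open(path) as fh:
                yield path, json.load(fh)

if __name__ == "__main__":
    for path, record in iter_ckan_records(REPO_ROOT):
        for resource in record.get("resources", []):
            print(record.get("name", path), resource.get("format"), resource.get("url"))
```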

jonnycrunch commented 7 years ago

Does this really need to be >300 TB? After looking at the data, there is a lot of redundancy: the same data appears in CSV, HTML, and JSON. Does only one organization have to load the entire 300 TB? Most of the data can be broken up into categories like 'health', 'environment', and 'agriculture', and is composed of heterogeneous files (typically a few hundred MB per file). The metadata describing the data would be most important (Publisher, Identifier, modified date, etc.).

mejackreed commented 7 years ago

We have the ckan metadata already. And yes, I agree some of the data is redundant, based on how ArcGIS OpenData allows for different types of exports. A smarter heuristic for this would be nice, but may take some more analysis time.

flyingzumwalt commented 7 years ago

@mejackreed do you think you will need help writing the download scripts or running them? We can probably find people to help you.

mejackreed commented 7 years ago

Sure thing. Help definitely wanted! I already have a naive downloader here: https://github.com/mejackreed/GovScooper/blob/master/README.md#usage

flyingzumwalt commented 7 years ago

cc @jbenet @gsf @b5

b5 commented 7 years ago

Happy to help!

I think it makes sense to first decide whether to download in passes, using metadata to cut down on data redundancy (as per @jonnycrunch's suggestion), or to just grab the whole thing. I'd personally vote for the "passes" approach, but only after checking to ensure that the data is truly redundant.

mejackreed commented 7 years ago

Yep, I have an idea on how to evaluate whether or not the data is redundant. Resources that come from a server matching /arcgis.com/ and have .geojson + .csv + .kml exports are usually just transformations of the same data. We need a way to recognize these types of datasets/resources and codify the heuristics (a rough sketch follows the example below).

An example: https://github.com/OpenGeoMetadata/gov.data/blob/8f440134f13e7559086e7a07b8081098198c9a18/ad/01/6d/50/3d/38/4b/50/bc/b9/e5/62/2f/d7/c0/1b/ad016d503d384b50bcb9e5622fd7c01b/ckan.json
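
A rough Python sketch of codifying that heuristic. It assumes each resource dict carries "url" and "format" keys as in the ckan.json records; the preference order is an arbitrary choice for illustration.

```python
# When a dataset's resources are all served from an arcgis.com host and only
# differ by export format (GeoJSON / CSV / KML), keep a single preferred export
# instead of downloading all three.
PREFERRED_ORDER = ["geojson", "csv", "kml"]

def dedupe_arcgis_exports(resources):
    """Collapse arcgis.com exports that only differ by format down to one resource."""
    arcgis, others = [], []
    for r in resources:
        (arcgis if "arcgis.com" in r.get("url", "").lower() else others).append(r)
    formats = {r.get("format", "").lower() for r in arcgis}
    if len(formats & set(PREFERRED_ORDER)) > 1:
        # Likely the same data exported several ways: keep the first preferred format found.
        for fmt in PREFERRED_ORDER:
            keep = [r for r in arcgis if r.get("format", "").lower() == fmt]
            if keep:
                return others + keep[:1]
    return resources
```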

jonnycrunch commented 7 years ago

There are 194422 distinct entries in the catalog. The metadata is about 2 GB.

https://catalog.data.gov/api/3/action/package_search?rows=1&start=0
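
A minimal sketch (using the requests library) of paging through that package_search endpoint to enumerate every catalog entry. The rows value of 1000 is an assumption about CKAN's per-request cap, and a real harvester would want retries and a delay between requests.

```python
import requests

API = "https://catalog.data.gov/api/3/action/package_search"

def iter_packages(rows=1000):
    """Yield every package record by paging through package_search."""
    start = 0
    while True:
        resp = requests.get(API, params={"rows": rows, "start": start})
        resp.raise_for_status()
        result = resp.json()["result"]
        if not result["results"]:
            break
        for package in result["results"]:
            yield package
        start += rows
        if start >= result["count"]:
            break

if __name__ == "__main__":
    print(sum(1 for _ in iter_packages()), "entries")
```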

Here is an example of one entry:
https://catalog.data.gov/api/3/action/package_show?id=1e68f387-5f1c-46c0-a0d1-46044ffef5bf

Each entry has a resource list.

A first pass could be to hit all of the URLs in each resource list, grab the Content-Length headers to calculate the exact amount of space needed, and simultaneously gather all of the necessary resource URLs.
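
Something along these lines, using HEAD requests via the requests library; servers that omit Content-Length are counted as unknown rather than treated as zero.

```python
import requests

def estimate_size(resources, timeout=10):
    """Sum Content-Length over a list of resource dicts; count responses without it."""
    total, unknown = 0, 0
    for resource in resources:
        url = resource.get("url")
        if not url:
            continue
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            length = resp.headers.get("Content-Length")
        except requests.RequestException:
            length = None
        if length and length.isdigit():
            total += int(length)
        else:
            unknown += 1
    return total, unknown
```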

There are also some meta schema resources referenced in the 'extras' section that would be important to grab: https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld

jonnycrunch commented 7 years ago

Hmm, a moment ago there were 194422 entries; now there are only 194401. Now I understand the urgency!

b5 commented 7 years ago

+1 for hitting all resources for Content-Length. I'd add grabbing the filetype while we're at it. Quick browsing showed some of the resources listed were .zip archives (ugh).

mejackreed commented 7 years ago

So in my initial tests of downloading these resources, many of them unfortunately do not return a Content-Length header. Hoping to kick off some larger runs this afternoon to get more details.

mejackreed commented 7 years ago

@jonnycrunch 194014 entries here: https://github.com/OpenGeoMetadata/gov.data

Best to grab the archive.zip; the layers.json file is easy to parse.
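
A hedged sketch of that suggestion: download a zip of the repo, read layers.json out of it, and count the entries. The archive URL and the assumption that layers.json maps layer identifiers to directory paths inside the repo are both guesses; verify against the actual files before relying on this.

```python
import io
import json
import zipfile

import requests

# Assumed URL: GitHub's "Download ZIP" archive of the metadata repo.
ARCHIVE_URL = "https://github.com/OpenGeoMetadata/gov.data/archive/master.zip"

def load_layers(archive_url=ARCHIVE_URL):
    """Fetch the repo archive and parse layers.json from inside it."""
    data = requests.get(archive_url).content
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        layers_path = next(n for n in zf.namelist() if n.endswith("layers.json"))
        return json.loads(zf.read(layers_path))

if __name__ == "__main__":
    layers = load_layers()
    print(len(layers), "layers listed")
```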