flyingzumwalt opened this issue 7 years ago
Does this really need to be >300 TB? After looking at the data, there is a lot of redundancy: the same data appears in CSV, HTML, and JSON. Does only one organization have to load the entire 300 TB? Most of the data can be broken up into 'health', 'environment', and 'agriculture' and is composed of heterogeneous files (typically a few hundred MB per file). The metadata describing the data would be most important (Publisher, Identifier, modified date, etc.).
We have the CKAN metadata already. And yes, I agree some of the data is redundant, based on how ArcGIS OpenData allows for different types of exports. A smarter heuristic for this would be nice, but may take some more analysis time.
@mejackreed do you think you will need help writing the download scripts or running them? We can probably find people to help you.
Sure thing. Help definitely wanted! I already have a naive downloader here: https://github.com/mejackreed/GovScooper/blob/master/README.md#usage
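For anyone pitching in, the real logic is in the GovScooper repo linked above; purely as an illustration of the kind of helper these scripts need, here's a minimal streaming fetch in Python (the destination directory and filename scheme are placeholders):

```python
import os
import requests

def download_resource(url, dest_dir="downloads", chunk_size=1 << 20):
    """Stream one resource URL to disk without loading it all into memory."""
    os.makedirs(dest_dir, exist_ok=True)
    # Placeholder filename scheme: last path segment, or a fallback name.
    filename = url.rstrip("/").rsplit("/", 1)[-1] or "resource.bin"
    dest = os.path.join(dest_dir, filename)
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
    return dest
```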
cc @jbenet @gsf @b5
Happy to help!
I think it makes sense to first decide whether or not to download in passes, using metadata to cut down on data redundancy (as per @jonnycrunch's suggestion), or to just grab the whole thing. I'd personally vote for the "passes" approach, but only after checking that the data is truly redundant.
Yep, I have an idea on how to evaluate whether or not the data is redundant. Resources that come from a server matching /arcgis.com/ and have .geojson + .csv + .kml exports are usually just transformations of the same data. A way to recognize these types of datasets/resources and codify the heuristic is needed.
There are 194422 distinct entries in the catalog. The metadata is about 2 GB.
https://catalog.data.gov/api/3/action/package_search?rows=1&start=0
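For reference, paging through that endpoint to pull down every entry could look something like this (the 1000-row page size is an assumption; CKAN typically caps `rows` per request):

```python
import json
import requests

API = "https://catalog.data.gov/api/3/action/package_search"
PAGE = 1000  # assumed safe page size; CKAN limits rows per request

def fetch_all_packages():
    """Page through package_search and yield every dataset entry."""
    start = 0
    while True:
        resp = requests.get(API, params={"rows": PAGE, "start": start})
        resp.raise_for_status()
        result = resp.json()["result"]
        for package in result["results"]:
            yield package
        start += PAGE
        if start >= result["count"]:
            break

if __name__ == "__main__":
    with open("catalog.jsonl", "w") as out:
        for pkg in fetch_all_packages():
            out.write(json.dumps(pkg) + "\n")
```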
Here is an example of one entry:
https://catalog.data.gov/api/3/action/package_show?id=1e68f387-5f1c-46c0-a0d1-46044ffef5bf
Each entry has a resource list. A first pass could be to hit all of the URLs in each resource and grab the 'Content-Length' headers to calculate the exact amount of space needed, while simultaneously gathering all of the necessary resource URLs.
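A rough sketch of that first pass (assumes each resource dict has a `url` field; as noted further down, many servers won't report Content-Length, so those come back as unknown):

```python
import requests

def probe_resource(url, timeout=30):
    """HEAD a resource URL and report its size and type, if the server says."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        size = resp.headers.get("Content-Length")
        return {
            "url": url,
            "status": resp.status_code,
            "bytes": int(size) if size is not None else None,
            "content_type": resp.headers.get("Content-Type"),
        }
    except requests.RequestException as exc:
        return {"url": url, "status": None, "bytes": None, "error": str(exc)}

def total_known_bytes(probes):
    """Sum the sizes we actually got back; the rest stay unknown."""
    return sum(p["bytes"] for p in probes if p.get("bytes"))
```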
There are also some meta schema resources referenced in the 'extras' section that would be important to grab: https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld
Hmm, there were 194422 entries before; now there are only 194401. Now I understand the urgency!
+1 for hitting all resources for content length. I'd add grabbing the filetype while we're at it. Quick browsing showed some of the resources listed were .zip archives (ugh).
So in my initial tests of downloading these resources, many of them do not return a Content-Length header, unfortunately. Hoping to kick off some larger runs this afternoon to get more details.
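One possible workaround where the server honors range requests is a one-byte ranged GET that reads the total out of the Content-Range header. Just a sketch; it still won't help for chunked or dynamically generated responses:

```python
import requests

def size_via_range(url, timeout=30):
    """Try to learn a resource's size from a 1-byte ranged GET.

    Returns the total size in bytes, or None if the server doesn't
    support range requests (or doesn't report Content-Range).
    """
    resp = requests.get(url, headers={"Range": "bytes=0-0"},
                        stream=True, timeout=timeout)
    content_range = resp.headers.get("Content-Range")  # e.g. "bytes 0-0/123456"
    resp.close()
    if content_range and "/" in content_range:
        total = content_range.rsplit("/", 1)[1]
        return int(total) if total.isdigit() else None
    return None
```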
@jonnycrunch 194014 entries here: https://github.com/OpenGeoMetadata/gov.data
Best to grab the archive.zip; the layers.json file is easy to parse.
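For illustration, a sketch of walking that archive, assuming layers.json is a flat JSON object mapping each identifier to a directory in the repo that holds its ckan.json file (that's the layout other OpenGeoMetadata repositories use; worth verifying against this one):

```python
import json
import os

def iter_ckan_records(repo_root):
    """Walk layers.json in an OpenGeoMetadata checkout and yield (id, record).

    Assumes layers.json maps each identifier to a directory containing
    a ckan.json metadata file.
    """
    with open(os.path.join(repo_root, "layers.json")) as fh:
        layers = json.load(fh)
    for identifier, rel_path in layers.items():
        ckan_path = os.path.join(repo_root, rel_path, "ckan.json")
        if os.path.exists(ckan_path):
            with open(ckan_path) as fh:
                yield identifier, json.load(fh)
```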
For more info about this task, what we will do with the data, and how it relates to other archival efforts, see Issue #87
Story
Jack downloads all of the datasets from data.gov (~350TB) to storage devices on Stanford's network.
What will be Downloaded
The data.gov website is a portal that allows you to find all the "open data" datasets published by US federal agencies. It currently lists over 190,000 datasets.
The goal is to download those datasets, back them up, and use IPFS to replicate the data across a network of participating/collaborating nodes.
@mejackreed has posted all of the metadata from data.gov, which contains pointers to the datasets and basic metadata about them. The metadata are in ckan.json files. You can view the metadata at https://github.com/OpenGeoMetadata/gov.data. That will be the main starting point for running all of the scripts that download the datasets.
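As a starting point for those download scripts, something along these lines could pull the resource URLs out of each ckan.json file. This assumes each file mirrors the CKAN package_show output, i.e. a record with a top-level "resources" list whose entries carry "url" fields; adjust if the dump is shaped differently.

```python
import json
import sys

def resource_urls(ckan_json_path):
    """List the downloadable resource URLs in one ckan.json record."""
    with open(ckan_json_path) as fh:
        record = json.load(fh)
    # Some dumps wrap the package under a "result" key; handle both shapes.
    package = record.get("result", record)
    return [r["url"] for r in package.get("resources", []) if r.get("url")]

if __name__ == "__main__":
    for path in sys.argv[1:]:
        for url in resource_urls(path):
            print(url)
```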