Add common data sources

oxinabox commented 6 years ago

[x] https://github.com/fivethirtyeight/data
[x] https://github.com/BuzzFeedNews/
[x] http://datadryad.org/
[ ] https://dataverse.harvard.edu/
[ ] https://datarepository.wolframcloud.com/
[ ] https://ieee-dataport.org/
[ ] https://www.kaggle.com/datasets
[x] data.gov

oxinabox commented 6 years ago

I think the GitHub repos will be fairly easy to scrape.

The others may actually be really easy because they might use a lot of modern HTML practices like filling in #ids.

oxinabox commented 6 years ago

OAI-PMH is a standard API that is exposed by Figshare and DataDryad and probably many others.

pdurbin commented 6 years ago

Dataverse supports OAI-PMH. You can find a list of OAI sets by installation at https://docs.google.com/spreadsheets/d/12cxymvXCqP_kCsLKXQD32go79HBWZ1vU_tdG4kvP5S8/edit?usp=sharing

oxinabox commented 6 years ago

@pdurbin thanks. I've not properly gone through OAI-PMH. Am I right in saying that I should be able to use it to generate a data deps registration line (given some ID, like a URL)? A data deps registration line needs, at minimum, a list of URLs from which to download a local copy. It really wants a bunch of metadata too, like author, website, and papers to cite. Ideally it also has a SHA checksum for each file. I think OAI-PMH exists almost specifically to make it easy to get this kind of information, but I am not sure.
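To make the requirements above concrete, here is a minimal Python sketch of the information a registration line would carry. The `Registration` class, its field names, and the `example.org` URL are all hypothetical illustrations, not the DataDeps.jl or DataDepsGenerators.jl API; SHA-256 is assumed as the checksum only for the sake of the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Registration:
    """Hypothetical container for what a registration line needs."""
    name: str
    urls: list                                      # minimum: where to download from
    author: str = "unknown"
    website: str = "unknown"
    papers: list = field(default_factory=list)      # papers to cite
    checksums: dict = field(default_factory=dict)   # url -> SHA-256 hex digest


def sha256_hex(data: bytes) -> str:
    """Hex digest of the file contents, used to verify a download."""
    return hashlib.sha256(data).hexdigest()


def describe(reg: Registration) -> str:
    """Render the human-readable metadata portion of a registration."""
    lines = [f"Dataset: {reg.name}", f"Author: {reg.author}", f"Website: {reg.website}"]
    lines += [f"Please cite: {paper}" for paper in reg.papers]
    return "\n".join(lines)


# Illustrative values only.
example = Registration(
    name="Example",
    urls=["https://example.org/data.csv"],
    author="Jane Doe",
    checksums={"https://example.org/data.csv": sha256_hex(b"a,b\n1,2\n")},
)
```

Only `urls` is mandatory here; everything else has a default, mirroring the "at minimum" versus "ideally" split described above.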

You can see the current generator prototype (with reference outputs) I have for the UCI ML repo at https://github.com/oxinabox/DataDepsGenerators.jl/pull/1/files

pdurbin commented 6 years ago

@oxinabox I'm sorry but I'm not familiar enough with OAI-PMH to know the answer. Someone on the dataverse-community mailing list might, and you'd be welcome to start a thread about this: https://groups.google.com/forum/#!forum/dataverse-community

pdurbin commented 6 years ago

@oxinabox if that was you over at http://irclog.iq.harvard.edu/dataverse/2018-01-12 I'm sorry I missed you. Yes, you can think of SWORD as being for uploads and OAI-PMH as being for downloading metadata (but not files, generally speaking).

oxinabox commented 6 years ago

Indeed it was me. I'm thinking about this a bit more again. Sometimes that metadata includes file URLs and checksums (I think?). And even if it doesn't, it includes other data I want to harvest, like author and copyright status.
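As a rough sketch of that kind of harvesting, the Python below builds an OAI-PMH `GetRecord` request URL and pulls author and rights fields out of a Dublin Core (`oai_dc`) response. The base URL, the identifier, and the embedded sample record are made up for illustration; what a given repository actually puts in each field is an assumption that would need checking against a real response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard namespaces for OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def get_record_url(base_url: str, identifier: str) -> str:
    """Build a GetRecord request asking for the oai_dc metadata format."""
    query = urlencode(
        {"verb": "GetRecord", "metadataPrefix": "oai_dc", "identifier": identifier}
    )
    return f"{base_url}?{query}"


# Hand-written sample response; not taken from any real repository.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:rights>CC0 1.0</dc:rights>
          <dc:identifier>https://example.org/dataset/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def harvest(xml_text: str) -> dict:
    """Collect selected Dublin Core fields; each may occur zero or more times."""
    root = ET.fromstring(xml_text.strip())
    dc = root.find(".//oai_dc:dc", NS)
    wanted = ("title", "creator", "rights", "identifier")
    return {tag: [el.text for el in dc.findall(f"dc:{tag}", NS)] for tag in wanted}
```

Calling `harvest(SAMPLE_RESPONSE)` returns lists rather than single values, because Dublin Core elements are repeatable.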

pdurbin commented 6 years ago

Right, from DDI you can get names of files and such. For tabular files, you can even get summary stats on variables (columns), like this example from https://dataverse.harvard.edu/api/datasets/export?exporter=ddi&persistentId=doi:10.7910/DVN/TJCLKP

<var ID="v17909793" name="stars" intrvl="discrete">
  <location fileid="f3040230"/>
  <labl level="variable">stars</labl>
  <sumStat type="medn">3.0</sumStat>
  <sumStat type="vald">74.0</sumStat>
  <sumStat type="max">196.0</sumStat>
  <sumStat type="stdev">38.35085209417775</sumStat>
  <sumStat type="min">0.0</sumStat>
  <sumStat type="invd">0.0</sumStat>
  <sumStat type="mean">19.081081081081102</sumStat>
  <sumStat type="mode">.</sumStat>
  <varFormat type="numeric"/>
  <notes subject="Universal Numeric Fingerprint" level="variable" type="Dataverse:UNF">UNF:6:HLicTVd/u3Cwzb/nrk29VA==</notes>
</var>
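For illustration, a fragment like the one above could be read with Python's standard library roughly as follows. This is only a sketch: it parses the fragment exactly as quoted, which carries no namespace declaration, whereas a full DDI export may well declare one, in which case the tag lookups would need adjusting. The `summary_stats` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# The <var> fragment quoted above, copied verbatim.
VAR_XML = """<var ID="v17909793" name="stars" intrvl="discrete">
  <location fileid="f3040230"/>
  <labl level="variable">stars</labl>
  <sumStat type="medn">3.0</sumStat>
  <sumStat type="vald">74.0</sumStat>
  <sumStat type="max">196.0</sumStat>
  <sumStat type="stdev">38.35085209417775</sumStat>
  <sumStat type="min">0.0</sumStat>
  <sumStat type="invd">0.0</sumStat>
  <sumStat type="mean">19.081081081081102</sumStat>
  <sumStat type="mode">.</sumStat>
  <varFormat type="numeric"/>
  <notes subject="Universal Numeric Fingerprint" level="variable" type="Dataverse:UNF">UNF:6:HLicTVd/u3Cwzb/nrk29VA==</notes>
</var>"""


def summary_stats(var_xml: str):
    """Return the variable name and a dict of its summary statistics."""
    var = ET.fromstring(var_xml)
    stats = {}
    for stat in var.findall("sumStat"):
        try:
            stats[stat.get("type")] = float(stat.text)
        except (TypeError, ValueError):
            # Non-numeric placeholders, such as the "." given for the mode.
            stats[stat.get("type")] = None
    return var.get("name"), stats


name, stats = summary_stats(VAR_XML)
```

Non-numeric values are mapped to `None` rather than raising, since the mode in this example is given as a bare `.`.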

I'm not really an expert on all this, but again if you email https://groups.google.com/forum/#!forum/dataverse-community someone with more information could weigh in.

oxinabox commented 6 years ago

The DataOne API is really nice. It does exactly what I want: metadata + actual links to files + checksums. See http://wiki.datadryad.org/DataONE_RESTful_API

Looks like it would add a fair few sites, including DataDryad: https://www.dataone.org/current-member-nodes#uploads

The way to do this would be to implement an abstract dispatch type DataOne, then if required implement DataDryad as a concrete case of it.
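The shape of that design can be sketched in Python with an abstract base class standing in for the abstract dispatch type (the real package is Julia, so this is only an analogue). The class names, the placeholder base URL, and the `/object/` and `/meta/` path layout are assumptions made for illustration and should be checked against the DataOne API documentation.

```python
from abc import ABC, abstractmethod
from urllib.parse import quote


class DataOneNode(ABC):
    """Shared behaviour for any repository exposing a DataOne-style API."""

    @property
    @abstractmethod
    def base_url(self) -> str:
        """Root of the node's REST API; each concrete repository supplies it."""

    def object_url(self, pid: str) -> str:
        # Assumed layout: file contents live under <base>/object/<pid>.
        return f"{self.base_url}/object/{quote(pid, safe='')}"

    def metadata_url(self, pid: str) -> str:
        # Assumed layout: metadata (including checksums) under <base>/meta/<pid>.
        return f"{self.base_url}/meta/{quote(pid, safe='')}"


class DataDryad(DataOneNode):
    """Concrete case; only the base URL differs from the generic node."""

    @property
    def base_url(self) -> str:
        return "https://example.org/dryad/mn"  # placeholder, not the real endpoint


node = DataDryad()
```

With this split, a repository that follows the shared API needs nothing beyond its base URL, and one that deviates can override individual methods.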

BeastyBlacksmith commented 5 years ago

If I may, I would like to point to EDMOND, the open data repository of the Max Planck Society.

oxinabox commented 5 years ago

@BeastyBlacksmith I am not actively adding new data sources at the moment, but I will review PRs.

You also might want to raise an issue with the EDMOND team about following the Google/schema.org guidelines for including JSON-LD structured data fragments on their pages. This will make DataDepsGenerators.jl work with it automatically, and will make it show up in Google Dataset Search.

https://developers.google.com/search/docs/data-types/dataset
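To show why such fragments make a page easy to process, here is a minimal Python sketch that extracts schema.org `Dataset` descriptions from a page's JSON-LD `<script>` blocks. It is not how DataDepsGenerators.jl is implemented; the sample page and helper names are invented, and it assumes each block holds a single JSON object with downloads listed under `distribution` / `contentUrl`.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def find_datasets(html: str) -> list:
    """Return every top-level JSON-LD object whose @type is Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [d for d in parser.documents if isinstance(d, dict) and d.get("@type") == "Dataset"]


def download_urls(dataset: dict) -> list:
    """List the contentUrl of each distribution (a single object or a list)."""
    dist = dataset.get("distribution", [])
    if isinstance(dist, dict):
        dist = [dist]
    return [d["contentUrl"] for d in dist if "contentUrl" in d]


# Invented page for illustration.
SAMPLE_PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Example",
 "creator": {"@type": "Person", "name": "Jane Doe"},
 "distribution": [{"@type": "DataDownload",
                   "contentUrl": "https://example.org/data.csv"}]}
</script>
</head><body></body></html>"""

datasets = find_datasets(SAMPLE_PAGE)
```

Because the metadata is plain JSON at a predictable place in the page, no site-specific scraping rules are needed.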

pdurbin commented 5 years ago

@oxinabox should I help answer questions for pull request #40? I didn't notice it until just now.

oxinabox commented 5 years ago

@pdurbin it is kinda stalled, since that GSOC project is over. But feel free to.

oxinabox / DataDepsGenerators.jl

Add common data sources #2