Open oxinabox opened 6 years ago
I think the GitHub repos will be fairly easy to scrape.
The others may actually be really easy because they might use a lot of modern HTML practices like filling in #ids.
OAI-PMH is a standard API that is exposed by Figshare and DataDryad and probably many others.
Dataverse supports OAI-PMH. You can find a list of OAI sets by installation at https://docs.google.com/spreadsheets/d/12cxymvXCqP_kCsLKXQD32go79HBWZ1vU_tdG4kvP5S8/edit?usp=sharing
@pdurbin thanks. I've not properly gone through OAI-PMH. Am I right in saying that I should be able to use it to generate a data deps registration line (given some ID, like a URL)? A data deps registration line needs at minimum a list of URLs to download a local copy from. It really wants to have a bunch of metadata like author, website, and papers to cite, and ideally a SHA checksum for each file. I think OAI-PMH exists almost specifically to make it easy to get this kind of information, but I am not sure.
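To make the requirements above concrete, here is a minimal Python sketch of the information one registration line would carry. The `Registration` class, its field names, and the `verify` helper are hypothetical illustrations, not part of DataDeps.jl or DataDepsGenerators.jl, and SHA-256 is assumed as the checksum.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Registration:
    """Hypothetical container for what one registration line needs."""
    name: str
    urls: list                 # required: where to download a local copy from
    author: str = ""           # optional metadata
    website: str = ""
    citation: str = ""
    sha256: dict = field(default_factory=dict)  # url -> expected hex digest


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify(reg: Registration, url: str, data: bytes) -> bool:
    """True only if a checksum is recorded for url and the data matches it."""
    expected = reg.sha256.get(url)
    return expected is not None and sha256_hex(data) == expected


# Placeholder URL and payload, made up for this example.
url = "https://example.org/data.csv"
payload = b"example,data\n1,2\n"
reg = Registration(name="Example", urls=[url], sha256={url: sha256_hex(payload)})
```

A harvester for any repository would then only have to fill in these fields from whatever metadata that repository exposes.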
You can see the current generator prototype (with reference outputs) I have for the UCI ML repo at https://github.com/oxinabox/DataDepsGenerators.jl/pull/1/files
@oxinabox I'm sorry but I'm not familiar enough with OAI-PMH to know the answer. Someone on the dataverse-community mailing list might, and you'd be welcome to start a thread about this: https://groups.google.com/forum/#!forum/dataverse-community
@oxinabox if that was you over at http://irclog.iq.harvard.edu/dataverse/2018-01-12 I'm sorry I missed you. Yes, you can think of SWORD as being for uploads and OAI-PMH as being for downloading metadata (but not files, generally speaking).
Indeed it was me. I'm thinking about this a bit more again. Sometimes that metadata includes file URLs and checksums (I think?). And even if it doesn't, it includes other data I want to harvest, like author and copyright status.
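As a rough sketch of that kind of harvesting, the Python below pulls creator and rights fields out of an OAI-PMH response in the `oai_dc` (Dublin Core) format. The namespaces are the standard OAI-PMH and Dublin Core ones; the sample record is made up for illustration and is not taken from any real repository, and the `harvest` function is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH 2.0 responses carrying oai_dc metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up sample record, trimmed to the parts used below.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
          <dc:rights>CC0 1.0</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
""".strip()


def harvest(xml_text: str) -> dict:
    """Collect title, creator and rights values from the first oai_dc record."""
    root = ET.fromstring(xml_text)
    dc = root.find(".//oai_dc:dc", NS)
    if dc is None:
        return {}
    return {
        tag: [el.text for el in dc.findall(f"dc:{tag}", NS)]
        for tag in ("title", "creator", "rights")
    }


record = harvest(SAMPLE)
```

Whether a given repository also puts file URLs or checksums into its records depends on the metadata format it serves, so that part is left out here.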
Right, from DDI you can get names of files and such. For tabular files, you can even get summary stats on variables (columns), like this example from https://dataverse.harvard.edu/api/datasets/export?exporter=ddi&persistentId=doi:10.7910/DVN/TJCLKP
<var ID="v17909793" name="stars" intrvl="discrete">
<location fileid="f3040230"/>
<labl level="variable">stars</labl>
<sumStat type="medn">3.0</sumStat>
<sumStat type="vald">74.0</sumStat>
<sumStat type="max">196.0</sumStat>
<sumStat type="stdev">38.35085209417775</sumStat>
<sumStat type="min">0.0</sumStat>
<sumStat type="invd">0.0</sumStat>
<sumStat type="mean">19.081081081081102</sumStat>
<sumStat type="mode">.</sumStat>
<varFormat type="numeric"/>
<notes subject="Universal Numeric Fingerprint" level="variable" type="Dataverse:UNF">UNF:6:HLicTVd/u3Cwzb/nrk29VA==</notes>
</var>
I'm not really an expert on all this, but again if you email https://groups.google.com/forum/#!forum/dataverse-community someone with more information could weigh in.
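For what it's worth, a fragment like the `<var>` element above can be read with Python's standard library. This is only a sketch: it parses the fragment on its own, whereas a full DDI export will probably declare an XML namespace that the lookups would have to account for. The `summary_stats` function is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# The <var> fragment quoted above, parsed in isolation (no namespace).
VAR_XML = """<var ID="v17909793" name="stars" intrvl="discrete">
<location fileid="f3040230"/>
<labl level="variable">stars</labl>
<sumStat type="medn">3.0</sumStat>
<sumStat type="vald">74.0</sumStat>
<sumStat type="max">196.0</sumStat>
<sumStat type="stdev">38.35085209417775</sumStat>
<sumStat type="min">0.0</sumStat>
<sumStat type="invd">0.0</sumStat>
<sumStat type="mean">19.081081081081102</sumStat>
<sumStat type="mode">.</sumStat>
<varFormat type="numeric"/>
</var>"""


def summary_stats(var_xml: str):
    """Return (variable name, {stat type: float or None}) for one <var>."""
    var = ET.fromstring(var_xml)
    stats = {}
    for el in var.findall("sumStat"):
        try:
            stats[el.get("type")] = float(el.text)
        except (TypeError, ValueError):
            # Non-numeric placeholders such as "." are kept as None.
            stats[el.get("type")] = None
    return var.get("name"), stats


name, stats = summary_stats(VAR_XML)
```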
The DataONE API is really nice. It does exactly what I want: http://wiki.datadryad.org/DataONE_RESTful_API Metadata + actual links to files + checksums.
Looks like it would add a fair few sites (https://www.dataone.org/current-member-nodes#uploads), including DataDryad.
The way to do this would be to implement an abstract dispatch type DataOne, then if required implement DataDryad as a concrete case of it.
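The package itself is Julia, where this would be an abstract type with methods dispatching on it. As a language-neutral illustration of the same layering, here is a Python analogue using an abstract base class. The method names, the `/object/` path layout, and the base URL are assumptions made for the sketch, not confirmed details of the DataONE API or of any member node.

```python
from abc import ABC, abstractmethod
from urllib.parse import quote


class DataOne(ABC):
    """Hypothetical shared behaviour for any DataONE member node."""

    @abstractmethod
    def base_url(self) -> str:
        """Each concrete repository supplies its own node URL."""

    def object_url(self, identifier: str) -> str:
        # Assumed path layout; the identifier is percent-encoded in full.
        return f"{self.base_url()}/object/{quote(identifier, safe='')}"


class DataDryad(DataOne):
    """Concrete case: only the repository-specific part is overridden."""

    def base_url(self) -> str:
        return "https://example.org/mn"  # placeholder, not the real endpoint


link = DataDryad().object_url("doi:10.0000/example")
```

The point of the split is that a new member node would only need its own small subclass, while the generic metadata and download logic lives once in the abstract type.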
If I may, I would like to point to EDMOND, the open data repository of the Max Planck Society.
@BeastyBlacksmith I am not actively adding new data sources at the moment. But I will review PRs
You also might want to raise an issue with the EDMOND team asking them to follow the Google/schema.org guidelines for including JSON-LD structured data fragments on their pages. This will make DataDepsGenerators.jl work with it automatically, and will make it show up in Google Dataset Search.
https://developers.google.com/search/docs/data-types/dataset
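To illustrate why such fragments make a page easy to consume, here is a small Python sketch that extracts schema.org `Dataset` blocks from `<script type="application/ld+json">` tags. The sample page, its values, and the `datasets` helper are made up for the example; this is not how DataDepsGenerators.jl itself is implemented.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD script tag on a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buf)))
            self._in_jsonld = False


def datasets(html: str) -> list:
    """Return the JSON-LD objects on the page whose @type is Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return [b for b in parser.blocks
            if isinstance(b, dict) and b.get("@type") == "Dataset"]


# Made-up page with one Dataset fragment.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example dataset",
 "creator": {"@type": "Person", "name": "Jane Doe"},
 "license": "https://creativecommons.org/publicdomain/zero/1.0/",
 "distribution": [{"@type": "DataDownload",
                   "contentUrl": "https://example.org/data.csv"}]}
</script>
</head><body></body></html>"""

found = datasets(PAGE)
urls = [d["contentUrl"] for d in found[0]["distribution"]]
```

The download URLs, author, and license all come out of one structured block, with no per-site scraping rules.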
@oxinabox should I help answer questions for pull request #40? I didn't notice it until just now.
@pdurbin it is kinda stalled, since that GSOC project is over. But feel free to.