ropensci-archive / doidata

:no_entry: ARCHIVED :no_entry:
MIT License

How should one get a data download URL from a DOI? #1

Open cboettig opened 6 years ago

cboettig commented 6 years ago

@noamross you raise an excellent point here that despite a DOI being the canonical way to refer to / access a dataset, there really is no well-defined / machine-readable way to determine a download URL for the data object: DOIs usually redirect to human-readable pages only. I'm really curious what @mfenner thinks about this; seems to me that this would ideally be addressed by the DOI system itself; but perhaps there's a good argument against that.

To me, the ideal solution would just be some content negotiation directly against the DOI, e.g. including "Accept: text/csv" (or maybe something more generic) in your GET header would get the data resource itself.
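This idea can be sketched as a request builder. Note this is hypothetical: doi.org does not currently honour data media types in `Accept`, so actually sending this request would just resolve to the human-readable landing page. The sketch only shows what the client side of such a negotiation would look like.

```python
from urllib.request import Request

def doi_request(doi, accept="text/csv"):
    """Build a GET request against the DOI proxy asking for the data itself.

    Hypothetical: the DOI system would have to be extended to honour
    data media types in the Accept header for this to return data.
    """
    return Request(f"https://doi.org/{doi}", headers={"Accept": accept})

req = doi_request("10.5061/dryad.2k462")
print(req.full_url)              # https://doi.org/10.5061/dryad.2k462
print(req.get_header("Accept"))  # text/csv
```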

Alternately, it would be nice if the metadata returned by DataCite's existing content negotiation system included the data download URL. E.g., following the http://schema.org/Dataset type which DataCite already uses, it could add (as Google recommends):

 "distribution":[
     {
        "@type":"DataDownload",
        "encodingFormat":"CSV",
        "contentUrl":"http://www.ncdc.noaa.gov/stormevents/ftp.jsp"
     },

from which we could infer the download url(s). A third option would be for this kind of structured information to be provided in a machine-readable way but from the data repository (e.g. after resolving the DOI to its HTML redirect) rather than at DataCite's end.
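If DataCite returned that `distribution` array, inferring the download URL(s) client-side would be trivial. A minimal sketch (the record below is hypothetical example metadata in the Google-recommended shape quoted above):

```python
import json

# Hypothetical schema.org Dataset metadata, as DataCite might return it
# via content negotiation; the "distribution" shape follows Google's
# dataset structured-data recommendation.
record = json.loads("""
{
  "@type": "Dataset",
  "distribution": [
    {"@type": "DataDownload",
     "encodingFormat": "CSV",
     "contentUrl": "http://www.ncdc.noaa.gov/stormevents/ftp.jsp"}
  ]
}
""")

def download_urls(dataset, fmt=None):
    """Pull contentUrl values out of a schema.org distribution list,
    optionally filtered by encodingFormat."""
    return [d["contentUrl"]
            for d in dataset.get("distribution", [])
            if fmt is None or d.get("encodingFormat") == fmt]

print(download_urls(record, fmt="CSV"))
```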

Absent any of these server-side solutions, I agree that there would be immediate value in an R package to provide a way for R users at least to script access to data from the DOI without having to research each repository's API or available packages first.

noamross commented 6 years ago

I'm not sure how this would solve the multiple-file problem. There are lots of repos with multiple files under one DOI. One answer would be to download a zipped directory, either in all cases (simplest) or just in the case of multiple files. But it seems useful to be able to target one file as well.

sckott commented 6 years ago

100% agree that content negotiation would be a great way to sort this out. doesn't seem like that's possible right now.

short of that, i can envision client-side mappings between data providers and how to get URLs for actual data files - those mappings could be in JSON, e.g., so they are language-agnostic
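The mapping idea above might look something like this. The DOI prefixes are real registrants (Dryad, Zenodo), but the URL templates are placeholders for illustration, not the repositories' actual download APIs:

```python
import json

# Sketch of a language-agnostic JSON mapping from DOI prefix to a URL
# template. Prefixes are real registrants; the templates are made-up
# placeholders, not real repository endpoints.
MAPPINGS = json.loads("""
{
  "10.5061": "https://datadryad.example/download/{doi}",
  "10.5281": "https://zenodo.example/record/{suffix}/files"
}
""")

def data_url(doi):
    """Resolve a DOI to a (hypothetical) data download URL via the mapping."""
    prefix, _, suffix = doi.partition("/")
    template = MAPPINGS.get(prefix)
    if template is None:
        raise KeyError(f"no mapping for DOI prefix {prefix}")
    return template.format(doi=doi, suffix=suffix)

print(data_url("10.5061/dryad.2k462"))
```

Because the mapping file is plain JSON, an R client, a Python client, and a shell script could all share the same provider table.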

cboettig commented 6 years ago

Re targeting one file: I agree this would be nice; it's basically up to the repository.

For instance, Dryad gives a DOI to the whole package and individual DOIs to each of the component parts, so the user could use the package DOI if they want the whole package, or the DOI for a particular csv if they want that csv. (Currently they still have to resolve download urls though, as those DOIs all resolve only to HTML landing pages).

In this case, DataCite actually does list each of the parts as related works in the datacite-xml version from CN, e.g. https://data.datacite.org/application/vnd.datacite.datacite+xml/10.5061/dryad.2k462

However, that's not always the case of course. e.g. KNB gives unique IDs to all the parts (e.g. to each csv file), but I believe only the package as a whole gets an actual registered DOI, and thus the DataCite record has no information about these additional parts (e.g. compare, on DataCite, https://search.datacite.org/works/10.5063/f1bz63z8 vs on KNB: https://knb.ecoinformatics.org/#view/doi:10.5063/F1BZ63Z8).

In both cases, there's a metadata file we can parse for identifiers to the components, but the structures differ, and in both cases it's not obvious how to translate those identifiers into downloads. (Actually, all Dryad entries are in DataONE anyway, so we can download any of these using the dataUrl provided by a DataONE solr query of the DOI / other identifier.)
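The DataONE lookup mentioned above can be sketched as a solr query URL against the coordinating node. The endpoint path and the `dataUrl` field name are to the best of my recollection of the DataONE API, so treat them as assumptions and check the DataONE docs before relying on them:

```python
from urllib.parse import urlencode

def dataone_query(identifier):
    """Build a solr query URL against the DataONE coordinating node,
    asking for the dataUrl of a DOI. Endpoint and field names are
    assumptions based on the DataONE CN API; verify against its docs."""
    params = urlencode({
        "q": f'identifier:"doi:{identifier}"',
        "fl": "identifier,dataUrl",
        "wt": "json",
    })
    return f"https://cn.dataone.org/cn/v2/query/solr/?{params}"

print(dataone_query("10.5063/F1BZ63Z8"))
```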

Basically, I'd like to see this same ability DataONE has to return dataUrl given an identifier implemented at the DataCite level, so that it worked on any DataCite DOI and not just those in the DataONE network.

Of course with zenodo / github you're just stuck downloading the whole .zip anyway, which simplifies things but makes it impossible to request just one file.

noamross commented 6 years ago

For a Zenodo-GitHub repo one could in theory get the individual file from GitHub. Of course, there's no guarantee the file will still be there, but the commit hash should at least ensure that if it is, it's the right one. There could be a failsafe otherwise.
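Fetching a single file pinned to a commit hash could use GitHub's raw-content host. The owner, repo, hash, and path below are illustrative placeholders; a real client would extract them from the Zenodo deposit metadata:

```python
# Sketch of the fallback above: build a raw.githubusercontent.com URL
# pinned to a commit hash, so the file (if still present) is exactly
# the archived version. All arguments here are placeholder values.
def raw_github_url(owner, repo, commit, path):
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{commit}/{path}"

url = raw_github_url("ropensci", "doidata", "0123abc", "data/example.csv")
print(url)
```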

mfenner commented 6 years ago

Very interesting discussion and good timing. DataCite will work on this very topic in the coming months thanks to a new grant, and my initial thoughts and feedback from @maxogden and @cameronneylon are at https://github.com/datacite/freya/issues/2. My ideas are around content negotiation using the application/zip content type and using bagit for a basic description and checksums.

mfenner commented 6 years ago

I would treat DOIs for software hosted in code repositories differently, as they have a standard way to get to the content, and we should support that in DOI metadata, e.g. adding the commit hash as a related_identifier.

cboettig commented 6 years ago

@mfenner Hooray, thanks! Content negotiation w/ application/zip type + bagit sounds great to me.

Being able to identify that an object is SoftwareSourceCode and get a related_identifier from which it could be installed is nice; though of course it would also be good to always be able to just get the bagit zip file of the sourcecode directly from the data repository (e.g. for archived software that ceases to be available from those more standard channels).

sckott commented 6 years ago

(If we need to do Bagit creation client side we already have https://github.com/ropensci/datapack)

mfenner commented 6 years ago

@sckott datapack looks great! As @cameronneylon pointed out in the issue I referenced, there is both a data consumer and data producer side to this.

I would make two small modifications to the bagit standard: include the DataCite metadata as an XML file (to avoid extra effort), and zip the bag as application/zip (I think this is not part of the bagit spec). And I like a low-tech approach that doesn't create hurdles, so no schema.org or other JSON-LD. I would use content negotiation with application/zip for backwards compatibility, but would like to explore other ways (e.g. providing a contentUrl in the metadata or using a Content-Location header).
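The modified bag layout described above might look like this. This is a sketch only, covering just the v0.97 bag declaration, a payload manifest, and the proposed DataCite XML tag file; a conforming implementation would need the rest of the BagIt spec (bag-info.txt, tag manifests, etc.):

```python
import hashlib
import os
import tempfile

def make_bag(payload, datacite_xml, bag_dir):
    """Write a minimal BagIt-style bag whose tag files include the
    DataCite XML, per the proposal above. `payload` maps filenames to
    bytes. Sketch only -- not a full BagIt implementation."""
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir)
    # Bag declaration (BagIt v0.97).
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    # Proposed extra tag file: the DataCite metadata as XML.
    with open(os.path.join(bag_dir, "datacite.xml"), "w") as f:
        f.write(datacite_xml)
    # Payload files plus a sha256 manifest with per-file checksums.
    lines = []
    for name, content in payload.items():
        with open(os.path.join(data_dir, name), "wb") as f:
            f.write(content)
        lines.append(f"{hashlib.sha256(content).hexdigest()}  data/{name}")
    with open(os.path.join(bag_dir, "manifest-sha256.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")

with tempfile.TemporaryDirectory() as tmp:
    make_bag({"example.csv": b"a,b\n1,2\n"}, "<resource>...</resource>", tmp)
    print(sorted(os.listdir(tmp)))
    # ['bagit.txt', 'data', 'datacite.xml', 'manifest-sha256.txt']
```

The resulting directory would then be zipped and served as application/zip, as proposed.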

We will have a project kickoff meeting next Wednesday and I can report on any progress. You can also follow along via https://github.com/datacite/freya/issues/2 and related issues.