ropensci / datapack

An R package to handle data packages
https://docs.ropensci.org/datapack

option to output ORE resource maps in JSON-LD ? #84

Open cboettig opened 7 years ago

cboettig commented 7 years ago

As outlined here https://www.openarchives.org/ore/0.9/jsonld ?

Apologies if this doesn't make sense or is out of scope, haven't really wrapped my head around DataONE Data Packaging. Is everything currently always an XML-RDF serialization here?

(The package name, README & vignette can make it a bit ambiguous exactly what standard the 'datapack' refers to, which unfortunately sounds similar to OKFN's json-schema for a "data package": https://specs.frictionlessdata.io/data-package/. It could be made more obvious that it is the DataONE package model that is being implemented, which I gather builds on ORE and possibly PROV, but exactly how / to what extent isn't clear to me.)

gothub commented 7 years ago

@cboettig Yeah, so the engineering doc that the vignette points to maybe isn't the best source for explaining the DataONE package model that is implemented. From the vignette datapack-overview (in github, not CRAN release yet):

It is primarily meant as
a container to bundle together files for transport to or from data repositories
that support the [DataONE Data Package model](https://releases.dataone.org/online/api-documentation-v2.0.1/design/DataPackage.html), including the member repositories in the [DataONE Federation](https://dataone.org).

The DataONE package model builds on the ORE OAI package model, and the serialization supported by DataONE is RDF/XML. Relationships from the ProvONE data model are delivered to DataONE via this serialization (the resource map). Including these relationships is consistent with the ORE model as described [here](http://www.openarchives.org/ore/1.0/datamodel#GlobalRels).

So, all this needs to be explained clearly in the vignette for the typical user.

cboettig commented 7 years ago

@gothub that's great, thanks. I guess what's not clear still is the extent to which the package would be useful to someone interested in creating ORE-OAI or Prov serializations outside of the DataONE context. Which I guess is why I bring up this question in reference to the title issue, e.g. I think it would be good for any ORE-OAI toolkit to support that JSON-LD serialization they describe https://www.openarchives.org/ore/0.9/jsonld , but if the goal isn't so general and DataONE isn't consuming that format then obviously it would be out of scope.

gothub commented 7 years ago

@cboettig It's certainly possible to create a JSON-LD serialization from a DataPackage, but how do we determine if that is useful?

The dataone R package delivers the content from a DataPackage along with the RDF/XML resource map via routines such as uploadDataPackage. Is it useful to have the JSON-LD serialization without a delivery/packaging mechanism? Would having the JSON-LD serialization available give rise to someone developing a delivery/packaging mechanism? If the latter point is true, then I think it is worthwhile to implement.

cboettig commented 7 years ago

@gothub I was thinking these descriptions might have some value in and of themselves, or for a generic delivery mechanism (e.g. particularly with the provenance annotation of the files), but you're probably right that the use case is pretty limited. So such a stand-alone serialization probably doesn't make sense. Will close, at least I understand the picture a bit better though!

mbjones commented 7 years ago

@cboettig and @gothub We never envisioned this as exclusively the DataONE packaging model, but rather as a generic data packaging model that could support multiple implementations. Our plan was to first support the ORE serialization which was widely implemented for over a decade and semantically rich, and then later add others like OKFN style data packages as described in issue #40. This package got started at OS Codefest and both @sckott and others were involved in the discussions to create a data package model that could both act as an intermediary between client tools and repositories, but also could be used within R itself as a first class data loading mechanism, given the shortcomings of R's current data handling (lack of support for metadata, multiple formats, etc). So, there's a lot more that could be done.

So, I see great value in a JSON-LD export from these packages, although currently the OKFN data-package spec is not rich enough to support the relationships that are currently expressed in the DataONE model, so it would be a bit lossy -- mainly for provenance info. But it would still be useful.

cboettig commented 7 years ago

@mbjones Thanks for clarifying, that background is very helpful, and renews my conviction that these abilities would be valuable generally.

Re a JSON-LD export, right, I wouldn't involve the OKFN spec; note that my link goes to the ORE standard's own page on how to serialize ORE in JSON-LD, which I assume should make it straightforward and lossless(?) I think this would make the metadata both easier to visualize than RDF/XML and perhaps more appealing to other developers who might consume this data and/or combine / extend it with other formats (including PROV as you already do).
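To make the idea concrete, here is a minimal sketch of what an ORE resource map could look like as JSON-LD. It is written in Python with only the standard library, it does not use the datapack API, and it does not reproduce the exact context or key names from the ORE JSON-LD guidelines. The function name `resource_map_jsonld` and the example.org identifiers are hypothetical. The `ore:` and `cito:` prefixes map to the namespaces commonly used for ORE terms and for the documents / isDocumentedBy relationships.

```python
import json


def resource_map_jsonld(map_id, metadata_id, data_ids):
    """Build a minimal ORE resource map as a JSON-LD dict (illustrative only)."""
    # One aggregated entry for the metadata object, pointing at the data it documents.
    metadata_entry = {
        "@id": metadata_id,
        "cito:documents": [{"@id": d} for d in data_ids],
    }
    # One aggregated entry per data object, pointing back at the metadata.
    data_entries = [
        {"@id": d, "cito:isDocumentedBy": {"@id": metadata_id}} for d in data_ids
    ]
    return {
        "@context": {
            "ore": "http://www.openarchives.org/ore/terms/",
            "cito": "http://purl.org/spar/cito/",
        },
        "@id": map_id,
        "@type": "ore:ResourceMap",
        "ore:describes": {
            "@id": map_id + "#aggregation",
            "@type": "ore:Aggregation",
            "ore:aggregates": [metadata_entry] + data_entries,
        },
    }


# Hypothetical identifiers, used only to exercise the sketch.
doc = resource_map_jsonld(
    "https://example.org/resource-map",
    "https://example.org/metadata.xml",
    ["https://example.org/data1.csv", "https://example.org/data2.csv"],
)
print(json.dumps(doc, indent=2))
```

Because the result is plain JSON, any JSON tool can inspect it, and a JSON-LD processor could expand it back to the same kind of triples the RDF/XML resource map carries.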

I do still think the datapack name isn't ideal, as it's not clear from the name that this focuses on an implementation of the ORE standard, the PROV standard, and the BagIt standard (or how it relates to the existing OAIHarvester, which I gather is just for consuming but not serializing ORE-OAI data, just as the name implies), let alone the confusion with the OKFN "data package" standard (also an unfortunately vague name, particularly since I'm told it is not technically an "OKFN" product and thus should be referred to as simply the "data package" standard).

mbjones commented 7 years ago

@cboettig Yes, well, we have been calling these collections of data, EML, and a manifest data packages since around 2000 to distinguish them from the much more ambiguous concept of a data set. It's even baked into EML 2 itself as the packageId field, which represents the globally unique identifier associated with a data package. So the terminology is embedded in our community practice, and so it's a natural choice, as corroborated by the OKFN folks choosing to use the term later as well.

Right now, our major issue isn't with the use of data package per se, but rather that the qualifier 'data' is often too limiting for what goes into these. They really are packages of research products, and can include data, code, metadata, graphics, text, multimedia, and other products of the research cycle. Other terms for these packages have been used by various members of the field, starting with Carl Lagoze's seminal work on Active Digital Objects in the 1990s, and then the work in the UK on Research Objects around 2010 (see ref below), and more recent work by Victoria Stodden et al. on the concept of Research Compendia as envisioned by Gentleman (see ref below). See also the entire Open Archival Information System (OAIS) Reference Model which is built around package concepts such as an Archival Information Package that is used extensively in libraries and national repositories such as NCEI. NIST is even looking at utilizing ORE and the DataONE model of incorporating ORE in BagIt to handle their next generation packaging recommendations. There is a huge literature on this stuff, and an equally large number of overlapping names and concepts for data packages. The whole field is richly interwoven, with equal contributions from the fields of data, workflow, and provenance, each with their own overlapping naming preferences and history. I'm not sure where that leaves us on naming. We've considered research package as an alternative, but have stuck with data package for historical continuity.

Lagoze, C., Lynch, C. A., & Daniel, R. (1996). "The Warwick Framework: a container architecture for aggregating sets of metadata." Cornell University Technical Report TR96-1593, Ithaca, N.Y.: Cornell University Library, 28 June. Retrieved January 30, 2006, from http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR96-1593

Bechhofer, S.; De Roure, D.; Gamble, M.; Goble, C.; Buchan, I. (2010). "Research Objects: Towards Exchange and Reuse of Digital Knowledge". Nature Precedings. doi:10.1038/npre.2010.4626.1

Gentleman, Robert, and Duncan Temple Lang. 2007. “Statistical Analyses and Reproducible Research.” Journal of Computational and Graphical Statistics 16 (1): 1–23. doi:10.1198/106186007X178663. http://www.tandfonline.com/doi/abs/10.1198/106186007X178663.

P.S. Our R package was originally named datapackage, but the CRAN maintainers forced us to change it to datapack. I never really understood why.