ropensci-archive / datapkg

:no_entry: ARCHIVED :no_entry: Read and Write Data Packages
https://docs.ropensci.org/datapkg

Implementing the R Data Package Library #12

Open danfowler opened 8 years ago

danfowler commented 8 years ago

Just a general note to flag that the specifications on http://specs.frictionlessdata.io/ have now been cleaned up to reflect only those relevant to Frictionless Data.

Data Package

@jobarratt @karthik @jeroenooms

ezwelty commented 7 years ago

@danfowler @jeroen @christophergandrud

I needed a tool that could build a data package (both data and metadata) interactively or programmatically within R, and read and write the package unchanged. Finding the current packages lacking in that regard (https://github.com/ropenscilabs/datapkg, https://github.com/christophergandrud/dpmr), I decided to roll my own...

This maybe wasn't the most efficient approach, but I think I have something that's now quite pleasant to use: https://github.com/ezwelty/dpkg

I'm open to feedback or suggestions for moving forward while minimizing duplication of effort. I'm also curious as to the state of Data Packages as a standard. I recently noticed that the published schemas don't match the specs, and that some of the "core" datasets have errors or haven't updated to the latest version of the specs.

cboettig commented 7 years ago

@ezwelty Very cool, thanks for sharing. Looks like a nice implementation, though I'll leave the datapkg maintainers to comment on details. Your comment on the "state of Data Packages" caught my attention.

Seems like the mismatch between the schemas and core datasets could be minimized by validation, though perhaps they are simply following older versions of the schema. Though as far as I can tell, the datapackage.json standard doesn't actually include any line for stating what version of the datapackage schema it's using, or indeed anything at all to let you know the file is following a schema, which seems like a crazy oversight to me.

If you haven't already, you might want to look at some of the R packages that implement JSON schema validation (e.g. our https://github.com/ropensci/jsonvalidate), which would be an easy utility to add to your implementation.
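As a minimal sketch of what that could look like (the schema URL here is an assumption and would need to point at whichever published profile you target):

```r
# Sketch: validate a datapackage.json against a published Data Package schema.
# The schema URL is illustrative, not authoritative.
library(jsonvalidate)

schema_url <- "https://specs.frictionlessdata.io/schemas/data-package.json"  # assumed location
schema <- paste(readLines(schema_url, warn = FALSE), collapse = "\n")

# json_validate() accepts either a file path or a JSON string for both arguments
json_validate("datapackage.json", schema, verbose = TRUE)
```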

I'd be curious to hear more about your use cases for creating these files. Depending on your needs, you might want to look at some other standards as well; e.g. the Ecological Metadata Language (EML, an XML-based format) is still going strong some 20 years after its inception, and provides a similar ability to annotate tables, units, and other metadata (we have the R package implementation at https://github.com/ropensci/EML). Certainly on the heavyweight side, but used by quite a lot of scientific repositories in the ecological / environmental data space. At somewhat the opposite end of the spectrum are Google's recommendations for Dataset metadata: https://developers.google.com/search/docs/data-types/datasets (which is basically the fields from http://schema.org/Dataset), which gives you a lightweight but very flexible approach using JSON-LD.

rufuspollock commented 7 years ago

First, @ezwelty, this is great news, and thank you for contributing this work. I'll leave @cboettig to comment more on how this could link with ropensci/datapkg.

I'm open to feedback or suggestions for moving forward while minimizing duplication of effort. I'm also curious as to the state of Data Packages as a standard. I recently noticed that the published schemas don't match the specs, and that some of the "core" datasets have errors or haven't updated to the latest version of the specs.

Re the schema: the schemas should be up to date with the live specs - can you let us know what is out of sync so we can correct it? 😄

Re core datasets: We've done a major push in the last few months to finalize a v1.0 of the specs. This has involved some changes to the specs which, though relatively minor, are breaking. Most of the core datasets (which I'm responsible for maintaining) are still at pre-v1, as we have been intentionally holding off updating until v1.0 is finalized. I'm working right now to get them updated.

If you have more questions, we have a chat channel here: http://gitter.im/frictionlessdata/chat

Though as far as I can tell, the datapackage.json standard doesn't actually include any line for stating what version of the datapackage schema it's using, or indeed anything at all to let you know the file is following a schema, which seems like a crazy oversight to me.

@cboettig that is planned for v1.1, since at that point it becomes useful (to indicate v1.1 vs v1.0). Prior to that point we did not think it would prove that useful, as hand publishers usually don't add it and tools did not support it.

@cboettig we also have extensive validation libs in all the main libraries (it is fairly trivial given the JSON Schema). The reason for the core datasets being out of sync is as above 😄

Thanks for the feedback, and we'd love to have more comments and suggestions - our channel is here: http://gitter.im/frictionlessdata/chat, or just comment in this thread.

ezwelty commented 7 years ago

Thanks @cboettig and @rufuspollock for the thoughtful and informative replies!

Schemas vs. specs: As the tip of the iceberg, here are the differences I see between the data-package schema and the data-package specs. Am I looking at the wrong schema files?

| Specs | Schema |
| --- | --- |
| name (recommended) | name (required) |
| id | --- |
| profile | --- |
| created | --- |
| licenses | license |
| --- | dataDependencies |
| --- | author |

Schema versioning: My suggestion would be to simply have profile (or something like it) be the path to the schema in use. It seems fragile and unnecessarily complicated to have one field identifying the type of data package and another identifying the version, and from those having to construct a path to a schema against which to validate.

Validation: Once the dust settles, it will be trivial to use something like jsonvalidate to validate data package metadata. I reserve the right to not support pre-1.0 specs ;), but can try to keep up with future versions if my contribution gets some use.

Compression: I sometimes deal with very large tabular datasets (meteorological data, for instance) and find myself chopping up data into multiple CSV files to fit into git repositories. I'm tempted to add support for reading and writing resources with paths like "data/data.csv.gz" to/from compressed files, since that often reduces file size by 10x. I've found threads on the subject (mostly about zipping an entire data package, not individual resources) but no official specs. What's the latest on this?
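For concreteness, here is a small sketch of the sort of thing I mean, using only base R (the file names are illustrative):

```r
# Write one resource as a gzip-compressed CSV (toy data, illustrative paths)
dir.create("data", showWarnings = FALSE)
df <- data.frame(time = 1:3, air_temperature = c(1.2, 1.4, 1.1))

con <- gzfile("data/data.csv.gz", "w")
write.csv(df, con, row.names = FALSE)
close(con)

# Reading back: read.csv() decompresses .gz files transparently
df2 <- read.csv("data/data.csv.gz")
```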

Use cases: @cboettig That's a great question. I work on the Columbia Glacier in Alaska, a locus of glaciological research since the 1970s and, accordingly, a hot spot for data loss. Having put a lot of work into finding and restoring lost datasets, and depressed by the data loss I've witnessed, I decided to adopt a simple standard to document data for future generations of scientists. For now, I wanted a way to write datasets and documentation, and read the data back in for my own analysis. All this will likely culminate in a comprehensive community data archive for the Columbia Glacier, but since that requires terabytes of storage, I'll need to secure agency support and use something other than pure GitHub (GitHub Large File Storage perhaps?).

Data Packages are not the only or "right" choice, but I wanted something simple, human-readable, and data-agnostic. In my experience, all that really matters is that some context is provided and that the variables (formats, units, datums, timezones) are unambiguously defined for a human interpreter, all of which can be accomplished in the datapackage.json and README.md without being burdened by elaborate syntax or fancy tools. That the metadata follows a standard is a bonus, but not critical to the longevity of the data.

cboettig commented 7 years ago

@rufuspollock Thanks, very cool. Yup, I think a simple line like "schema": "http://schemas.datapackages.org/v1.1/data-package.json" or so would go a long way (e.g. analogous to how XML declares a schema, or JSON-LD declares a context).
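To make that concrete, here is a sketch (using jsonlite) of a descriptor carrying such a declaration; the schema field and its URL are hypothetical, following the suggestion above rather than anything in the current spec:

```r
# Hypothetical: a descriptor that declares which schema version it follows.
library(jsonlite)

descriptor <- list(
  name = "example-package",                                          # illustrative
  schema = "http://schemas.datapackages.org/v1.1/data-package.json", # not (yet) part of the spec
  resources = list(
    list(name = "data", path = "data/data.csv")
  )
)

write_json(descriptor, "datapackage.json", auto_unbox = TRUE, pretty = TRUE)
```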

@ezwelty Thanks for the context, very interesting! Re compression, have you considered HDF5 for this? It includes metadata specification and compression. There's always a tradeoff between compression and plain-text archivability/readability. Also note that sharding and compressing CSVs is explicitly supported in EML annotations; see https://knb.ecoinformatics.org/#external//emlparser/docs/eml-2.1.1/./eml-physical.html.

Re data archiving, I largely agree with your points, but respectfully disagree on the standard. To make this concrete, consider the example of Arctic and glacier data from https://arcticdata.io/catalog/#data. That very nice (IMHO) search tool works by indexing EML metadata files for all data in the repo. The Arctic Data repository is designed to archive heterogeneous data without being super normative about the formats; it just requires users to submit metadata. However, if all users archived data in their own intuitive, simple, but non-standard form, building this kind of search tool that enables discovery by letting you filter on author, geographic range, etc. would be far harder.

A somewhat separate issue from the metadata: I'd also respectfully note that depositing data in a formal scientific repository gives you the benefit of a DOI and a much more transparent commitment to long-term scientific availability of the data (which is backed up and mirrored by a large network of scientific repositories). I love GitHub as much as anyone, but a single commercial entity will never be an ideal long-term archive for future generations of scientists.

okay, end soapbox, sorry. Good questions and I'll be curious to see what approach you take in any event.

cboettig commented 7 years ago

@ezwelty hehe, searching by author I see you already have 9 datasets in Arctic Data, e.g. https://arcticdata.io/catalog/#view/doi:10.5065/D6HT2MFW. Very cool. But I'm confused: why then replicate this in a data package + GitHub? Surely discoverability and archivability are easier from arcticdata.io?

ezwelty commented 7 years ago

@danfowler Apologies for completely derailing your issue. @cboettig Soapboxes welcome, I have plenty to learn about the scientific data landscape. Thanks for this informative conversation.

I've avoided HDF5 because it is opaque: one relies on the HDF5 library just to know what is inside. I could pair HDF5 files with external metadata, but would rather use "layman" data formats (e.g. CSV, JSON, TIFF), compressed as necessary, in a sensible file structure.

Sorry, I wasn't considering discoverability, just the interpretation that follows discovery. Of course, a standard metadata schema is necessary, at least for searching on dimensions (time, position) which can't be done by a pure text search. But no amount of package metadata makes up for a missing data schema (whether freeform or following a standard). To pick on a colleague, consider "MassBalance.dat" in: https://arcticdata.io/catalog/#view/doi:10.18739/A2BK9J. Even if one knows that "b" means mass balance, since no unit is provided, the data is unusable as-is. I appreciate how the (Tabular) Data Package focuses its attention on the structure of the data while being more flexible about the rest. (That said, I'm sure temporal and spatial ranges would be welcome additions to Data Package and Data Resource).
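To make that concrete, here is a sketch of the kind of field descriptor I have in mind for that "b" column. Note that "unit" is not a core Table Schema property; I'm adding it as a custom property (which the spec allows), and the unit itself is only a guess:

```r
# Sketch: documenting a column's meaning and unit in a resource schema.
library(jsonlite)

field_b <- list(
  name = "b",
  type = "number",
  description = "Surface mass balance",  # what "b" presumably means
  unit = "m w.e. yr-1"                   # custom property, not defined by the spec
)

resource_schema <- list(fields = list(field_b))
toJSON(resource_schema, auto_unbox = TRUE, pretty = TRUE)
```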

The Columbia Glacier is a landmark study site in glaciology, and there is support for it becoming one of the first benchmarks for glacier modeling, but I'm the first person interested in actually coordinating the effort. The data definitely needs to end up on a formal scientific repository and benefit from a DOI. It's also important that it be able to evolve over time with future revisions and contributions from many different authors. Can this be accomplished with an existing repository? (I notice that Zenodo now supports version DOIs). One of my ideas was to use GitHub as a community front-end, with pointers to the data hosted on a formal repository, but I'm really unsure how best to proceed.

To complicate matters:

I've got my eye on the agency-agnostic repositories like https://figshare.com/, https://zenodo.org, http://datadryad.org/, https://www.pangaea.de/, and https://datahub.io/, but hoping that I can find salvation in one of the cryosphere or arctic-focused repositories. I'm grateful for any and all suggestions!

Why Data Package + GitHub? For staging, basically. It's unclear where the data will ultimately be hosted, with what metadata standard, with what formatting, with what license, ... so I picked a publishing platform that makes updates very easy and a simple metadata standard which I can port to a beefier standard when the time comes.

cboettig commented 7 years ago

@ezwelty Thanks for this excellent response, I think it's both informative and well reasoned.

But no amount of package metadata makes up for a missing data schema (whether freeform or following a standard). To pick on a colleague, consider "MassBalance.dat" in: https://arcticdata.io/catalog/#view/doi:10.18739/A2BK9J. Even if one knows that "b" means mass balance, since no unit is provided, the data is unusable as-is. I appreciate how the (Tabular) Data Package focuses its attention on the structure of the data while being more flexible about the rest.

Excellent example, though it's worth noting this is only because the EML standard doesn't require the schema, not because it can't express it. Defining what column labels mean and what the units are is one of the more common uses of EML, and an EML document can be very minimal elsewhere while still expressing the schema very precisely, including units following a standard vocabulary for unit names, definitions of categorical abbreviations, precision, error and missing codes, and way more than you'd ever actually use. (I was actually not able to spot how you define units for a column in the data package schema, but it must be there, right?) So avoiding examples like the one you point to above is probably more about users and incentives than about the choice of technical standard. (Though I'm convinced clunky tools and clunky submission procedures, often with middlemen between the researcher and the repo, make this worse, which underscores my own interest in tools like the one you wrote that can serialize metadata programmatically.) Looks like they have focused on providing metadata which aids discoverability (temporal/geographic coverage etc.) but overlooked the schema metadata which, like you say, is all-important for reuse.
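For comparison, a rough sketch of how that same column could be documented with our EML package; the attribute-table columns follow the package vignette as best I recall, and the unit here is illustrative rather than taken from the dataset:

```r
# Sketch: an EML attributeList for a single numeric column with a defined unit.
library(EML)

attributes <- data.frame(
  attributeName = "b",
  attributeDefinition = "Surface mass balance",
  unit = "meter",       # must be an EML standard unit or a declared custom unit
  numberType = "real",
  stringsAsFactors = FALSE
)

attributeList <- set_attributes(attributes, col_classes = "numeric")
```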

Re your issues of heterogeneous data support and large file sizes, these are tough ones. I don't know the policy of that Arctic Data repo on this front; hope you don't mind that I take the liberty of cc'ing @mbjones, who can probably clarify. They may be able to work with you to address these issues (e.g. while it looks like certain NSF-funded projects are required to submit data there, it's not clear that this excludes contributions from those who aren't funded?)

I do think some of the general-purpose repositories like figshare/zenodo have such lightweight metadata for the diversity of objects they archive that discoverability there can be a real problem down the road. (Also worth noting that while Zenodo is built on CERN's database and probably not going anywhere soon, figshare is another commercial model that unfortunately no longer has the benefit of CLOCKSS backing.) Like GitHub, I think those archives are awesome, but not an ideal replacement for more discipline-specific repositories that have richer metadata models and focused areas for discovery. Of course who knows, time might prove otherwise. As you probably know, Dryad focuses only on data tied to a specific publication, and presumably is based on a discovery model where you find the paper first and then go look up the particular dataset; again the metadata they collect is too vague to search for "data on glaciers in this time / space range". They go for a minimal barrier to entry, which makes some sense in that context, but it means that much of the tabular data are arbitrary Excel files with no definitions of columns, units or the rest. (Again this isn't a knock on the repository per se, they all have reasons for being how they are; but these do create real challenges for doing research that tries to synthesize across data in the repo.) I'm not as familiar with PANGAEA.

amoeba commented 7 years ago

Hey @ezwelty and @cboettig. I snooped into this conversation and wanted to comment.

The linked arcticdata.io dataset is actually ISO 19139 metadata from EOL converted 1:1 to EML, not an EML document crafted using the full ability of EML. So @cboettig's comments are right on: this record can be way better! Long story short, the record you link was generated years ago, and newer datasets coming into arcticdata.io are much, much better in terms of metadata. See our latest published dataset: https://arcticdata.io/catalog/#view/doi:10.18739/A22Z26. Note that the attributes (variables) are documented (hopefully satisfactorily).

This conversation feels a bit off-topic for this Issue though. I think your data needs can be adequately met by arcticdata.io. Would you be down to talk outside this Issue about this? Feel free to email me (see my profile) or pass me your email.

rufuspollock commented 7 years ago

@ezwelty your feedback is awesome

Schemas vs. specs: As the tip of the iceberg, here are the differences I see between the data-package schema and the data-package specs. Am I looking at the wrong schema files?

Noted here: https://github.com/frictionlessdata/specs/issues/490 and we'll triple-check these items ASAP. /cc @pwalsh

Schema versioning: My suggestion would be to simply have profile (or something like it) be the path to the schema in use. It seems fragile and unnecessarily complicated to have one field identifying the type of data package and another identifying the version, and from those having to construct a path to a schema against which to validate.

Agreed. Proposal here: https://github.com/frictionlessdata/specs/issues/491

Validation: Once the dust settles, it will be trivial to use something like jsonvalidate to validate data package metadata. I reserve the right to not support pre-1.0 specs ;), but can try to keep up with future versions if my contribution gets some use.

Absolutely. Don't worry about pre-v1.0 at this point.

Compression: I sometimes deal with very large tabular datasets (meteorological data, for instance) and find myself chopping up data into multiple CSV files to fit into git repositories. I'm tempted to add support for reading and writing resources with paths like "data/data.csv.gz" to/from compressed files, since that often reduces file size by 10x. I've found threads on the subject (mostly about zipping an entire data package, not individual resources) but no official specs. What's the latest on this?

We'll have a pattern on this that will be on track for incorporation. Key discussion and proposals are:

Use cases: @cboettig ...

Data Packages are not the only or "right" choice, but I wanted something simple, human-readable, and data-agnostic. In my experience, all that really matters is that some context is provided and that the variables (formats, units, datums, timezones) are unambiguously defined for a human interpreter, all of which can be accomplished in the datapackage.json and README.md without being burdened by elaborate syntax or fancy tools. That the metadata follows a standard is a bonus, but not critical to the longevity of the data.

I've spent the last 15 years getting to Data Packages - I've done dozens of data projects and built multiple data platforms, including CKAN and others. The design of Data Packages aims to get to something that is the simplest possible - but no more. The ideas in Data Packages are all inspired by other existing and well-tried approaches, from CSV to Node's package.json.

The essence is the spirit of zen: simplicity and power.

rufuspollock commented 7 years ago

@amoeba arcticdata.io looks like an amazing project 👍 - would you be interested in finding out more about Data Packages? We've had several recent collaborations with scientific groups who find the spec + tooling really easy to use and integrate. The spec is here: http://specs.frictionlessdata.io/ and I and other folks like @danfowler would be happy to talk more -- probably easier outside of a GitHub thread!

@cboettig I'd welcome the chance to talk more about EML, Data Packages and other specs (outside of this thread, which will otherwise be completely hijacked!). It would be good to hear more of your thoughts. I'll ping you by email!

ezwelty commented 7 years ago

Right, email it is. I've added my email to my profile (Open Email®). @amoeba I'll email you about https://arcticdata.io. @cboettig I've learned tons about the landscape of scientific data, thank you. Happy to embrace EML or whatever the future home of my data demands of me :)

amoeba commented 7 years ago

@rufuspollock Yeah I think we should talk. I couldn't make the Contact widget on your personal website work. Could you email me at the email in my GH profile?

mbjones commented 7 years ago

@rufuspollock thanks! arcticdata.io is one of many repositories in the DataONE federation using ORE as our data packaging mechanism. We've been aware of your data packaging approach for several years, and you and I have talked in other github issues about the pros/cons of ORE vs JSON. We use numerous features within the ORE model to allow an expansive and semantically rich characterization of data packages. For example, in addition to the basic aggregation function of the package, we also use the PROV ontology for providing semantically-precise provenance relationships within and among complex packages (see example here: https://search.dataone.org/#view/urn:uuid:3249ada0-afe3-4dd6-875e-0f7928a4c171 , and its associated ORE model here: https://cn.dataone.org/cn/v2/meta/urn:uuid:1d23e155-3ef5-47c6-9612-027c80855e8d ).

I think supporting data-package in JSON format would be a good thing, but at this point conversion from ORE to OKFN format would be a bit lossy, which may be ok. JSON-LD would be even better so that we don't detach the semantic relationships. The huge advantage of JSON-LD over RDF is its readability; it would be a huge gain, and with JSON-LD you seem to get that without the loss of semantics associated with plain JSON. Certainly worth further discussion on how to move these specs closer together.

mbjones commented 7 years ago

Oh, and the R implementation of the ORE data packaging spec is in rOpenSci as well: https://github.com/ropensci/datapack. It is very similar in spirit to the tools you are building here. I was just commenting on issue ropensci/datapack#84, in which @cboettig was requesting a JSON-LD serialization of datapack ORE packages.

rufuspollock commented 7 years ago

@mbjones great to connect again 😄

We've been aware of your data packaging approach for several years, and you and I have talked in other github issues about the pros/cons of ORE vs JSON.

We should distinguish apples and oranges here. XML/JSON are low-level serialization formats. ORE or Data Packages are another level up and define a schema of some kind for storing metadata about datasets (e.g. you have a title field, a name field, a provenance field). The next level above that is specific structures relevant to particular types of data, e.g. you must include the location property, you must use this taxonomy and have these additional properties.

As you rightly point out you could have an ORE serialization to JSON (or JSON-LD). And you could serialize data packages to XML (if you really wanted to!).

The comparison of ORE to Data Packages would be around what they can and can't express, their ease of use and simplicity, their tooling etc. I note that Data Packages are completely extensible (in the way I imagine ORE is) in that you can add new properties, or extend.

Thus Tabular Data Package extends basic Data Package to provide a schema for tabular data, and Fiscal Data Package extends Tabular Data Package to cover financial information.
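As a sketch of that extensibility (the spatial property below is hypothetical, not part of any current profile): a Tabular Data Package descriptor is just the basic descriptor with a table schema attached to each resource, and extra properties can ride alongside the standard ones:

```r
# Sketch: a Tabular Data Package descriptor with one custom, non-spec property.
descriptor <- list(
  name = "example-tabular-package",
  profile = "tabular-data-package",
  spatial = list(bbox = c(-147.4, 60.9, -146.9, 61.3)),  # custom extension, not in the spec
  resources = list(list(
    name = "mass-balance",
    path = "data/mass-balance.csv",
    schema = list(fields = list(
      list(name = "year", type = "integer"),
      list(name = "b", type = "number")
    ))
  ))
)
```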

You can also, if you want, add RDF-style info to Data Packages (via JSON-LD). It is true, though, that Data Packages have largely eschewed the RDF route in favour of simplicity (my experience with many data wranglers over the years was that RDF was quite tough for most people to grok).

I'd be happy to have a chat - I've pinged @mbjones and @cboettig, and maybe we could organize a joint call -- I know I'd like to learn more about people's experiences and learnings here. /cc @danfowler @pwalsh