jmlondon / kotzeb0912

telemetry data from 7 adult bearded seals captured in Kotzebue, Alaska

Is an R Package for data sharing a 'best practice' we want to promote? #2

Open jmlondon opened 7 years ago

jmlondon commented 7 years ago

There was some discussion about how we might be able to archive R packages (and maintain the R package structure) on DataONE. I was a proponent of this b/c I (and others) have adopted a general practice of trying to do every research project as an R package. If I could just keep that workflow/practice and then add on the functionality of archiving with DataONE, that would be great.

But, before exploring that (and asking DataONE and others to undertake significant effort), I think it is worth discussing whether the R package is really a good 'best practice' for our reproducible science efforts.

background on R packages

R packages exist mainly for easily sharing and installing added functionality within the R programming language. There are a wide variety of approaches, examples, use cases, and implementations.

One of these is what is commonly referred to as an "R data package". Such a package rarely has R functions in it but, instead, has data products that can be easily loaded into an R session.

library(my_cooldata_library)  # attach the data package
data(my_cooldata)             # load the bundled dataset into the session

For a general discussion of data in R packages, see the "Data" chapter of R Packages.

It is also worth reading the official R documentation on data in packages.

Snippet of interest to this discussion:

Data files can have one of three types as indicated by their extension: plain R code (.R or .r), tables (.tab, .txt, or .csv, see ?data for the file formats, and note that .csv is not the standard CSV format), or save() images (.RData or .rda). The files should not be hidden (have names starting with a dot). Note that R code should be “self-sufficient” and not make use of extra functionality provided by the package, so that the data file can also be used without having to load the package or its namespace.
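As a rough illustration of the save() image type mentioned in that snippet, here is a minimal base R sketch that writes an .rda file and reads it back with no package attached. The object name seal_locs, its columns, and its values are made up for the example and are not taken from this repository.

```r
# Hypothetical illustration data; not real telemetry records.
seal_locs <- data.frame(
  deploy_id = c("A", "A", "B"),
  latitude  = c(66.9, 67.0, 66.8),
  longitude = c(-162.6, -162.9, -163.1)
)

# Write a save() image, the same file type a package keeps in data/
rda_path <- file.path(tempdir(), "seal_locs.rda")
save(seal_locs, file = rda_path)

# load() takes a file path and restores the saved objects by name;
# loading into a fresh environment avoids overwriting anything in the workspace
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)

print(loaded_names)
str(restored$seal_locs)
```

The point for this discussion is that an .rda file is readable only from R, which feeds into the language-agnostic concern raised below.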

Why R Data Packages

  1. a familiar structure for R users.
  2. easy install and use of data among colleagues and collaborators
  3. extendable and flexible structure is already used to support reproducible research (e.g. rrtools)
  4. metadata associated with package (e.g. authors, version, citation, dependencies) are standardized in the DESCRIPTION file
  5. keeps the code and data in one container
  6. package vignette and roxygen allow for documentation
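To make item 4 concrete, the sketch below writes a small DESCRIPTION file to a temporary directory and reads the standardized fields back with base R's read.dcf(). Every field value here (package name, author, license, version) is hypothetical and chosen only for illustration.

```r
# Hypothetical DESCRIPTION contents; none of these values describe a real package.
desc_lines <- c(
  "Package: mycooldata",
  "Title: Example Telemetry Data Package",
  "Version: 0.1.0",
  'Authors@R: person("Jane", "Doe", role = c("aut", "cre"))',
  "Description: Illustrative data-only package.",
  "License: CC0",
  "Depends: R (>= 3.5.0)"
)
desc_path <- file.path(tempdir(), "DESCRIPTION")
writeLines(desc_lines, desc_path)

# read.dcf() parses the "Field: value" format used by DESCRIPTION files
# and returns a character matrix with one column per requested field
meta <- read.dcf(desc_path, fields = c("Package", "Version", "License", "Depends"))
print(meta)

# Versions can be compared once converted to a package_version object
pkg_version <- package_version(meta[[1, "Version"]])
```

Because the fields are machine-readable, an archiving tool could in principle map some of them onto repository metadata, though how well that mapping works for DataONE is exactly what is under discussion here.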

Why Not R Data Packages

  1. language specific (data archive should be language agnostic)
  2. general practice in R community is for /data to only contain RData files
  3. raw files would be in inst/extdata
  4. items 2 and 3 together mean the data (which we want to highlight) are buried in the R package structure
  5. DataONE does not currently support directory structure within their platform
  6. communicating provenance and attribution might be difficult (e.g. does a DOI reference the software package or the data?)
  7. metadata would be more complicated: we would need to separate out the package/software from the actual data
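Items 2 through 4 can be seen in how a user reaches a raw file once a package is installed. This is only a sketch: the package name mycooldata and the file deployments.csv are hypothetical, so on most machines the lookup will simply come back empty.

```r
# "mycooldata" and "deployments.csv" are hypothetical names for illustration.
pkg <- "mycooldata"

# Files placed in inst/extdata end up under extdata/ in the installed package
# and have to be located by path with system.file()
raw_path <- system.file("extdata", "deployments.csv", package = pkg)

if (nzchar(raw_path)) {
  deployments <- read.csv(raw_path)
} else {
  # system.file() returns "" when the package or file cannot be found
  message("No installed package '", pkg, "' with extdata/deployments.csv")
}
```

Someone browsing the archive without R would need to know to look under inst/extdata to find the same file.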

If not, what then?

This is open for discussion and I encourage others to chime in.

datapack directory approach

The first thought I have, and the initial approach I might take with this package, is to carve out a datapack directory in the root level of the repository. This directory will contain all of my metadata files, any files that make or clean data, and the data themselves. This directory is what will be archived at DataONE. The provenance for everything in this directory should point to something either within this directory or on DataONE (or an external link that DataONE will support).

This approach should be available with existing tools and existing infrastructure.

It does, however, raise the question: What's the purpose of the R Package?

My approach would be to use the R package as a way to easily distribute data to other R users. In some cases, one of the scripts in the datapack directory might create a derived .RData file from each of the data files and copy it over to the package's /data directory. Another approach would be to include an install_data or update_data function that would allow the user to install the data directly from the DataONE repository.
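A minimal sketch of the first option, assuming the raw files in the datapack directory are CSVs and the package root has a data directory. The project root, the datapack/data subdirectory, the file name deployments.csv, and the helper csv_to_rda are all hypothetical; the stand-in file exists only so the script runs end to end.

```r
# Hypothetical layout built in a temp directory: <root>/datapack/data and <root>/data
project_root <- file.path(tempdir(), "kotzeb_demo")
raw_dir      <- file.path(project_root, "datapack", "data")
pkg_data_dir <- file.path(project_root, "data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(pkg_data_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in raw file (made-up values) so the sketch is runnable
write.csv(
  data.frame(deploy_id = c("A", "B"), n_locs = c(10L, 12L)),
  file.path(raw_dir, "deployments.csv"),
  row.names = FALSE
)

# Read one CSV and save it as <name>.rda, with the object named after the file
csv_to_rda <- function(csv_path, out_dir) {
  obj_name <- tools::file_path_sans_ext(basename(csv_path))
  env <- new.env()
  assign(obj_name, read.csv(csv_path, stringsAsFactors = FALSE), envir = env)
  out_path <- file.path(out_dir, paste0(obj_name, ".rda"))
  save(list = obj_name, envir = env, file = out_path)
  out_path
}

# Convert every CSV in the datapack directory into the package's data directory
csv_files <- list.files(raw_dir, pattern = "\\.csv$", full.names = TRUE)
rda_files <- vapply(csv_files, csv_to_rda, character(1), out_dir = pkg_data_dir)
print(basename(rda_files))
```

With this split, the CSVs under datapack stay the archived, language-agnostic copy, and the .rda files are a derived convenience for R users that can be regenerated at any time.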