ropensci / auunconf

repository for the Australian rOpenSci unconference 2016!

R package to store/access metadata associated with data/functions #18

Open jonocarroll opened 8 years ago

jonocarroll commented 8 years ago

First off, I see that there is already ropensci/EML and the associated idea, but I'm not a fan of S4, and I'm thinking bigger.

I've brought this up in discussions elsewhere in the past, and I know that Hadley hasn't made attributes a priority in his workflows (e.g. in relation to assertr, https://twitter.com/hadleywickham/status/559183346144522241) -- in fact, it was only recently that attributes were preserved in dplyr pipelines. They're certainly not preserved in plyr functions.

I'd love to be able to attach a python-esque docstring to data and functions that can be printed without invoking the full help menu (?library), which might contain the last time the object was updated (either automated or manually stated), source, attribution, etc... It's certainly possible to use comment() on a data.frame but I'm thinking perhaps these can be stored similarly to .Rmd files (with full markdown capability?) in a cache and searched/loaded independently to ensure they survive processing. This could include a checksum on the object to enforce reproducibility and perhaps even a trigger system if an object is declared immutable but is altered (override <- ... does one dare?). Needless to say, these would have to be transparent to existing structures, so that would need some careful consideration and balance.
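As a rough illustration of the two storage options mentioned above, here is a minimal base R sketch. `comment()` is a real base function; the side cache and the names `doc_cache`, `set_doc()` and `get_doc()` are hypothetical, invented only for this example, and the data values are made up.

```r
# Sketch only: doc_cache, set_doc() and get_doc() are hypothetical names.
df <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# Option 1: base R's comment() attaches a character vector to the object.
# It is not shown when the object is printed.
comment(df) <- c("Source: illustrative example",
                 "Attribution: none (made-up values)")
comment(df)

# Option 2: a side cache keyed by object name, so the docstring does not
# depend on attributes surviving whatever processing is applied later.
doc_cache <- new.env()

set_doc <- function(name, text) {
  assign(name, text, envir = doc_cache)
  invisible(text)
}

get_doc <- function(name) {
  if (!exists(name, envir = doc_cache, inherits = FALSE)) {
    return(NULL)
  }
  get(name, envir = doc_cache, inherits = FALSE)
}

set_doc("df", comment(df))
df2 <- df[df$value > 3, ]  # whatever happens to df's attributes here...
get_doc("df")              # ...the cached docstring can still be looked up
```

Keying the cache by name is the simplest possible choice; it does not follow an object through renaming, which is part of the "careful consideration" a real design would need.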

Just thoughts at this stage.

jonocarroll commented 8 years ago

Roxygen would be the natural method of doing this, which should make it transparent to anything existing (#' docstring)
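One way this might look, as a sketch under assumptions: roxygen2 normally expects `#'` comments above a function, but here they are placed inside the body, Python-style, so that they are part of the function's stored source. `docstring()` is a hypothetical helper, not a roxygen2 function, and it relies on the function having been parsed with source references kept.

```r
# Hypothetical example function, defined from text so that source
# references are kept regardless of the session's keep.source option.
code <- c(
  "area <- function(r) {",
  "  #' Area of a circle.",
  "  #' Assumes r is a non-negative radius.",
  "  pi * r^2",
  "}"
)
eval(parse(text = code, keep.source = TRUE))

# docstring() is a made-up helper: it pulls the #' lines out of the
# function's srcref and strips the comment marker.
docstring <- function(f) {
  src <- attr(f, "srcref")
  if (is.null(src)) {
    return(character(0))  # no source kept, so nothing to show
  }
  lines <- as.character(src)
  doc <- grep("^\\s*#'", lines, value = TRUE)
  sub("^\\s*#'\\s?", "", doc)
}

docstring(area)
```

Because the comments live in the function's own source, nothing about how the function is called changes, which is the "transparent to anything existing" property.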

ivanhanigan commented 8 years ago

Great idea. My proposal issue title was perhaps too specific to ropensci/EML. I think the generic issue you describe is better, because it starts by not assuming the extant technology / solutions are a fait accompli. Especially if there are issues with depending on S4. The bonus of EML in my eyes is the convenience of leveraging international standards and schemas, and the tools that exist to work within that standard (also see https://github.com/DataONEorg/rdataone to interface with the EML-based metacat data repositories).

I am not sure from what you wrote whether your idea builds on existing international standards or intends to develop new ones (i.e. is this python-esque docstring considered a 'standard'? If not, will this development generate a schema for attributes on data/functions that then becomes an internationally agreed standard, or another R-flavoured dialect?).

My suggestion was based on a pressing unmet need I face when ingesting, synthesising and disseminating data and code, especially while working at the coal-face of data analysis (i.e. generating metadata while working with data/code rather than creating metadata prior to or after data analysis). The act of doing metadata at the same time as doing data munging is appealing to me, especially if it is automated to the hilt.

In terms of choosing the standard, EML seems to be the most generically applicable standard I have used across environmental, social, health and geographic data types (others I tested were ANZLIC, DDI and RIF-CS).

I also like that this topic may cut across several other proposals. With https://github.com/ropensci/auunconf/issues/9, the automagic ingestion of data/metadata into R would facilitate validation analyses. With https://github.com/ropensci/auunconf/issues/8, such metadata may make reproducible workflows easier, quicker and more re-usable/re-configurable. And I imagine https://github.com/ropensci/auunconf/issues/13 also has to deal with communicating uncertainty that stems from how the underlying data were constructed or collected, such as measurement error, modelling error, or complicated algorithmic processing of the data prior to generating the uncertain results they wish to communicate.

Good stuff mate, let me know if I am off track with where you saw this thread going?

On Wed, Mar 30, 2016 at 1:35 PM, Jonathan Carroll notifications@github.com wrote:

Roxygen would be the natural method of doing this, which should make it transparent to anything existing (#' docstring)

jonocarroll commented 8 years ago

Python docstrings are a standard in that they are strongly encouraged and are handled as official attributes, but I'm only using those as an example to launch from.

From a structural point of view, the EML standard would be perfect, but I was thinking more in terms of Roxygen defined attributes than an XML structure. The attributes would be retrievable as first-class objects via some method, or printable with a context.print() method. A known set of expandable attributes would be a good start. Some thought would need to go into the object structure, whether it's better to define a new OOP construct, an extension of data.frame, or some auxiliary structure.

I have in mind (and remember, this is all purely brainstorming at this point) the case where you load some data from a trusted source, validate that it is indeed unchanged (validate_checksum(data)), print out the context (context(data)$owner; context(data)$last_modified), etc... ditto for functions that do what one thinks they do (context(my_function)$assumptions). The context travels with the data/function and can be tested against, e.g.

## ensure that the function is at least as up-to-date as the data
stopifnot(context(data)$last_modified <= context(my_function)$last_modified)
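To make the brainstorm a little more concrete, here is one possible base R sketch of `context()` and `validate_checksum()`, storing the context as a plain attribute. Everything here is an assumption for illustration: `add_context()` and `object_checksum()` are hypothetical helpers, the owner and dates are made up, and the checksum is an MD5 of the serialized object, which is just one convenient choice.

```r
# Hypothetical helpers; none of these come from an existing package.

# MD5 of the serialized object, ignoring the context itself so that
# editing the metadata does not change the checksum.
object_checksum <- function(x) {
  attr(x, "context") <- NULL
  path <- tempfile()
  on.exit(unlink(path))
  saveRDS(x, path, compress = FALSE)
  unname(tools::md5sum(path))
}

context <- function(x) attr(x, "context")

add_context <- function(x, owner, last_modified = Sys.time()) {
  attr(x, "context") <- list(
    owner = owner,
    last_modified = last_modified,
    checksum = object_checksum(x)
  )
  x
}

validate_checksum <- function(x) {
  identical(context(x)$checksum, object_checksum(x))
}

# Made-up owner and dates, purely for illustration.
data <- add_context(data.frame(a = 1:3), owner = "someone",
                    last_modified = as.POSIXct("2016-03-01", tz = "UTC"))
my_function <- add_context(function(x) mean(x$a), owner = "someone",
                           last_modified = as.POSIXct("2016-03-30", tz = "UTC"))

validate_checksum(data)   # unchanged since the context was attached
context(data)$owner

## ensure that the function is at least as up-to-date as the data
stopifnot(context(data)$last_modified <= context(my_function)$last_modified)

# Simulate an altered object that still carries the original context.
tampered <- data.frame(a = c(99L, 2L, 3L))
attr(tampered, "context") <- context(data)
validate_checksum(tampered)
```

A plain attribute is the lightest option but inherits the problem raised at the top of the thread: any function that drops attributes also drops the context. Checksumming functions is also more delicate than checksumming data, since the serialized form of a function may change (for example after byte-compilation), so the sketch only validates the data frame.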

A somewhat complicated extension of this would be to overload <- when this package is loaded so that data/functions with an immutable flag can't be overwritten. Some automation could be included there to update the last_modified or owner attribute.
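Short of overloading `<-`, base R already offers a simpler mechanism that gives part of this behaviour: `lockBinding()` makes any later assignment to a name fail with an error. The sketch below uses that instead of the overloading idea; `reference_data` is a made-up object, and no automatic updating of `last_modified` or `owner` is attempted.

```r
# Sketch using base R's lockBinding() rather than overloading `<-`.
reference_data <- data.frame(a = 1:3)
lockBinding("reference_data", environment())

# Any attempt to overwrite the locked name now raises an error.
result <- tryCatch({
  reference_data <- data.frame(a = 4:6)
  "overwritten"
}, error = function(e) conditionMessage(e))
result

# The object is still intact.
reference_data$a

# Deliberately unlocking is the explicit "override" step.
unlockBinding("reference_data", environment())
reference_data <- data.frame(a = 4:6)
```

This protects the binding (the name), not the value: a copy assigned to another name can still be modified freely, so it would complement a checksum rather than replace one.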

Some related reading: http://simplystatistics.org/2015/11/06/how-i-decide-when-to-trust-an-r-package/