jonocarroll opened this issue 8 years ago
Roxygen would be the natural method of doing this, which should make it transparent to anything existing (#' docstring).
Great idea. My proposal issue title was perhaps too specific to
ropensci/EML. I think the generic issue you describe is better, because it
starts by not assuming the extant technology / solutions are a fait
accompli.
Especially if there are issues with depending on S4. The bonus of EML in my eyes is the convenience of leveraging international standards and schemas, and the tools that exist to work within that standard (also see https://github.com/DataONEorg/rdataone to interface with the EML-based metacat data repositories).
I am not sure from what you wrote whether your idea builds on existing international standards or intends to develop new ones (ie is this python-esque docstring considered a 'standard'? If not, will this development generate a schema for attributes on data/functions that will then become an internationally agreed standard, or another R-flavoured dialect?).
My suggestion was based on a pressing unmet need I face when ingesting, synthesising and disseminating data and code, especially while working at the coal-face of data analysis (ie generating metadata while working with data/code rather than creating metadata before or after the analysis). The act of doing metadata at the same time as doing data munging is appealing to me, especially if it is automated to the hilt.
In terms of choosing the standard, EML seems to be the most generically applicable standard I have used across environmental, social, health and geographic data types (others I tested were ANZLIC, DDI and RIF-CS).
I also like the idea that this topic may have cross-cutting potential with https://github.com/ropensci/auunconf/issues/9, as the automagic ingestion of data/metadata into R will facilitate validation analyses. It may also cut across https://github.com/ropensci/auunconf/issues/8, where such metadata may make reproducible workflows easier, quicker and more re-usable/re-configurable. I imagine it is also relevant to https://github.com/ropensci/auunconf/issues/13, which has to deal with communicating uncertainty arising from the underlying construction/collection of data, such as measurement error, modelling error, or similarly complicated algorithmic processing of data prior to generating the uncertain results they wish to communicate.
Good stuff mate, let me know if I am off track with where you saw this thread going.
On Wed, Mar 30, 2016 at 1:35 PM, Jonathan Carroll notifications@github.com wrote:
Roxygen would be the natural method of doing this, which should make it transparent to anything existing (#' docstring)
Python docstrings are a standard in that they are strongly encouraged and are handled as official attributes, but I'm only using those as an example to launch from.
From a structural point of view, the EML standard would be perfect, but I was thinking more in terms of Roxygen-defined attributes than an XML structure. The attributes would be retrievable as first-class objects via some method, or printable with a context.print() method. A known set of expandable attributes would be a good start. Some thought would need to go into the object structure: whether it's better to define a new OOP construct, an extension of data.frame, or some auxiliary structure.
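To make the attribute idea concrete, here is a minimal sketch assuming the context is stored as an ordinary base R attribute. context(), the replacement form context<-, and print_context() are illustrative names from this thread, not existing functions.

```r
# Hypothetical context() accessors built on base R attributes.
# Field names like 'owner' and 'last_modified' are illustrative.
context <- function(x) attr(x, "context")

`context<-` <- function(x, value) {
  stopifnot(is.list(value))
  attr(x, "context") <- value
  x
}

print_context <- function(x) {
  ctx <- context(x)
  for (nm in names(ctx)) {
    cat(nm, ": ", format(ctx[[nm]]), "\n", sep = "")
  }
  invisible(ctx)
}

d <- data.frame(x = 1:3)
context(d) <- list(owner = "jonocarroll", last_modified = Sys.Date())
context(d)$owner
```

Because the context rides along as a plain attribute, it survives anywhere attributes survive, which is exactly the preservation problem raised later in this thread.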
I have in mind (and remember, this is all purely brainstorming at this point) the case where you load some data from a trusted source, validate that it is indeed unchanged (validate_checksum(data)), print out the context (context(data)$owner; context(data)$last_modified), etc... ditto for functions that do what one thinks they do (context(my_function)$assumptions). The context travels with the data/function and can be tested against, e.g.
## ensure that the function is at least as up-to-date as the data
stopifnot(context(data)$last_modified < context(my_function)$last_modified)
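A sketch of how seal()/validate_checksum() could work, under the assumption that the checksum is stored as an ordinary attribute. Neither function exists; the byte-sum "hash" here is a toy stand-in for a real cryptographic digest (a real implementation would use something like the CRAN digest package).

```r
# Toy stand-in for a real hash: sum the serialized bytes of the object.
toy_hash <- function(x) sum(as.double(as.integer(serialize(x, NULL))))

# Attach a checksum computed over the object *without* any previous seal.
seal <- function(x) {
  attr(x, "checksum") <- NULL
  h <- toy_hash(x)
  attr(x, "checksum") <- h
  x
}

# Strip the stored checksum, recompute, and compare.
validate_checksum <- function(x) {
  stored <- attr(x, "checksum")
  attr(x, "checksum") <- NULL
  identical(toy_hash(x), stored)
}

d <- seal(data.frame(x = 1:3))
validate_checksum(d)   # TRUE while the data is untouched
d$x[1] <- 99L
validate_checksum(d)   # FALSE once the data no longer matches its seal
```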
A somewhat complicated extension of this would be to overload <- when this package is loaded so that data/functions with an immutable flag can't be overwritten. Some automation could be included there to update the last_modified or owner attribute.
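One wrinkle: R doesn't allow overloading <- itself for ordinary assignments, but base R's lockBinding() gives much the same "can't be overwritten" behaviour without touching the assignment operator. A sketch (make_immutable() is an illustrative name):

```r
# Immutability via base R's lockBinding() rather than overloading `<-`.
make_immutable <- function(name, env = parent.frame()) {
  lockBinding(name, env)
  invisible(name)
}

d <- data.frame(x = 1:3)
make_immutable("d")
res <- try(d <- data.frame(x = 4), silent = TRUE)
inherits(res, "try-error")  # the locked binding rejected the overwrite
```

This protects the binding, not the value: an unlocked copy can still be made and modified, so it covers the accidental-overwrite case rather than true immutability.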
Some related reading: http://simplystatistics.org/2015/11/06/how-i-decide-when-to-trust-an-r-package/
First off, I see that there is already ropensci/EML and the associated idea, but I'm not a fan of S4, and I'm thinking bigger.
I've brought this up in discussions elsewhere in the past and I know that hadley hasn't made attributes a priority in his workflows (e.g. in relation to assertr(), https://twitter.com/hadleywickham/status/559183346144522241) -- in fact, it was only recently that attributes were preserved in dplyr pipelines. They're certainly not preserved in plyr functions.
I'd love to be able to attach a python-esque docstring to data and functions that can be printed without invoking the full help menu (?library), which might contain the last time the object was updated (either automated or manually stated), source, attribution, etc... It's certainly possible to use comment() on a data.frame but I'm thinking perhaps these can be stored similarly to .Rmd files (with full markdown capability?) in a cache and searched/loaded independently to ensure they survive processing. This could include a checksum on the object to enforce reproducibility and perhaps even a trigger system if an object is declared immutable but is altered (override <- ... does one dare?). Needless to say, these would have to be transparent to existing structures, so that would need some careful consideration and balance. Just thoughts at this stage.