ropensci / unconf14

Repo to brainstorm ideas (unconference style) for the rOpenSci hackathon.

Caching data #21

Open karthik opened 10 years ago

karthik commented 10 years ago

While retrieving data from APIs is fantastic (and the core functionality behind most of rOpenSci's packages), APIs can disappear and data can change, both of which can affect reproducibility. Similar to RStudio's packrat, it would be great to consider ideas to cache/snapshot timestamped API calls along with code and narrative.

Moving from #18
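
As a rough illustration of what that could look like, here is a minimal sketch in httr plus base R; the `cache_api_call()` helper and the cache directory are hypothetical, not part of any existing package:

```r
library(httr)

# Hypothetical helper: fetch a URL and snapshot the raw payload with a timestamp
cache_api_call <- function(url, cache_dir = "api-cache") {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  stamp <- format(Sys.time(), "%Y-%m-%dT%H-%M-%SZ", tz = "UTC")
  resp  <- GET(url)
  stop_for_status(resp)
  snapshot <- list(
    url       = url,
    retrieved = stamp,
    payload   = content(resp, as = "text", encoding = "UTF-8")
  )
  path <- file.path(cache_dir, paste0(stamp, ".rds"))
  saveRDS(snapshot, path)
  path
}
```

The resulting snapshot file could then be committed alongside the code and narrative, packrat-style.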

AmeliaMN commented 10 years ago

Cool! This is something I've been thinking about, too. Really would be nice to have a standardized format for documenting data in R. Something a little more than the man page in the help, and a little different than a code book. Having it be very integrated with the data itself would be ideal, so that you almost couldn't grab the data without the associated documentation (or, it would be so good that you wouldn't want to).
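
One very lightweight way the "documentation travels with the data" idea could look in R, purely illustrative and using nothing but base attributes:

```r
# Attach a codebook to the data itself, so it is hard to pass the data
# around without its documentation (column names and text are made up)
surveys <- data.frame(dep_delay = c(2, -1, 15), carrier = c("AA", "UA", "DL"))
attr(surveys, "codebook") <- list(
  dep_delay = "Departure delay in minutes; negative means an early departure",
  carrier   = "Two-letter IATA carrier code"
)
str(attr(surveys, "codebook"))
```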

For the API problem, maybe there needs to be some code written, but for flat data files I think it's more of a section for #25.

sckott commented 10 years ago

Seems like this can be thought about for various use cases:

  1. Exploration where you want temporary caching just to speed up analyses, etc.
  2. Permanent caching associated with research outputs to provide reproducibility.

The 1st could have solutions both on- and offline, while the 2nd is online only. The 1st could be solved simply via `{r cache=TRUE}` with knitr, and maybe we just stop there. But do we want to build tools into our own packages so that a single parameter setting in each function caches data (without the need for knitr)? Most users would probably just want to write to a cache via R.cache and friends, but some may want the flexibility to write to a database.
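
A minimal sketch of that one-parameter idea using R.cache; `get_occurrences()` below is just a hypothetical stand-in for any rOpenSci data-retrieval function:

```r
library(R.cache)

# Hypothetical wrapper: cache = TRUE is the single switch being discussed
cached_call <- function(fun, ..., cache = TRUE) {
  if (!cache) return(fun(...))
  key <- list(deparse(substitute(fun)), list(...))
  hit <- loadCache(key)
  if (!is.null(hit)) return(hit)
  res <- fun(...)
  saveCache(res, key = key)
  res
}

# e.g. cached_call(get_occurrences, species = "Accipiter striatus")
```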

Permanent caching associated with published work seems like a much harder problem, perhaps beyond scope here.

cboettig commented 10 years ago

I think 2 need not be out of scope since it is something we can address with publishing to figshare, knb, dataone, etc.

This is another thing to discuss with Matt Jones (and probably others like Mark): how to archive raw/query data as well as processed data, do it in real time rather than only at publication, and keep the two linked.

Having decent draft data online, perhaps securely, is both a good way to boost overall data pub rates (by reducing the burden later) and best practice.



sckott commented 10 years ago

You're right that we do have publishing tools, but it seems like these are likely to be separate in practice: data from API calls to the web that can be cached, and data that is published to dryad/figshare/dataone. The latter is likely cleaned up and in tabular format, while the former is likely a json/xml payload (and not curated at all by the user).

Definitely should get Matt's thoughts on this.

Good idea: if users are caching data over the course of their analyses, perhaps they are more likely to share that data with their paper.
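
A rough sketch of what keeping the two side by side might look like; the `parse_to_df` cleaning step, function name, and file names are placeholders:

```r
# Store the uncurated payload and the cleaned table together, so both are
# ready to deposit alongside the paper; parse_to_df represents whatever
# cleaning step the user already applies
archive_both <- function(raw_json, parse_to_df, dir = "deposit") {
  dir.create(dir, showWarnings = FALSE)
  writeLines(raw_json, file.path(dir, "raw-payload.json"))
  cleaned <- parse_to_df(raw_json)
  write.csv(cleaned, file.path(dir, "cleaned.csv"), row.names = FALSE)
  invisible(cleaned)
}
```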

cboettig commented 10 years ago

Yeah, exactly. I think there is increasing momentum to encourage archiving the uncleaned, raw data as well, be it JSON or whatever. Being from an online source is no guarantee it won't change or go dark, as NBII did. @mbjones thoughts?



mbjones commented 10 years ago

I totally agree - any data source could go dark. There are no guarantees. But we could also get into a bit of infinite regress here, so you don't want to automatically cache/archive everything. At some point you have to decide if any given source represents a viable long-term archive for the data. This requires evaluating the mission of the data service and its sustainability, and is ultimately subjective.

For example, data services that replicate data geographically and across institutions, have a long-term plan for sustainability, have a commitment to API persistence, and are on solid financial and technical footing should be trusted. In comparison, a well-designed but new data service run on some departmental server as part of a 3 year science grant is likely to be transient -- just as are the data stored on web sites at those same departments. And we should beware of cloud providers that make a service available, but might pull it at any time if the service turns out not to be profitable (like Google did with their initial data service).

We should trust archives that have a mission to preserve data for decades and a reasonable plan for doing so. It should also not depend on a researcher continuing to pay to archive an object in perpetuity, as that will obviously fail at some point. GitHub, for example, does not seem to have such a mission, and could very well end up like SourceForge in a decade. I'm not sure about FigShare, but I suspect they probably do have such a mission, as do the KNB and Dryad. I really like the FigShare failsafe of being part of CLOCKSS.

When it does make sense, and the user has the right to do so, I am generally of the opinion that one should archive the exact data stream received, and then deal with transformations downstream to create derived views via processing. When the objects retrieved are already archive ready (e.g., a NetCDF with embedded metadata, or a CSV file with attached EML), that seems like the best approach. When archiving these, it would be best to use the same persistent identifier as the original provider to make it clear that it is the same data, and provide linkages in the metadata to the original source data set. This may be difficult in some archives that don't allow you to choose your ID, or that structure their archives differently. Within repos that are part of DataONE, creating a data or metadata object using the same ID will automatically mark that as a replica copy as long as the checksums for the objects all match. If the checksums do not match, the repo will get an error saying the ID is already in use and the repo has to choose another.
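
A small sketch of recording the provider's identifier and a checksum next to the archived bytes, so a later deposit can be matched as a replica; the file name and DOI below are illustrative only:

```r
# Keep the original identifier and a checksum with the archived payload so a
# replica can later be recognized by matching checksums
raw_file <- "raw-payload.json"
manifest <- list(
  identifier = "doi:10.xxxx/original-dataset",  # reuse the provider's ID where the archive allows it
  retrieved  = format(Sys.time(), tz = "UTC"),
  md5        = unname(tools::md5sum(raw_file))
)
jsonlite::write_json(manifest, "manifest.json", auto_unbox = TRUE)
```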

However, some APIs, especially those returning JSON, provide exceedingly little machine-readable information about the data being sent. This is great for lightweight data exchange between two apps that have an implicit knowledge of the data, but they are often minimalistic, one-off data structures that are specific to that single service, with no linked schema and no mechanism for schema validation or metadata provision. Thus, not archive ready. For example, if we were to archive a JSON return such as {"r1":[1,2,3],"r2":[4,5,6],"r3":[7,8,9]}, would we know it is the same data table as {"c1":[1,4,7],"c2":[2,5,8],"c3":[3,6,9]} but transposed? Are the arrays representing rows and columns of a data table, or are they independent of one another? And how would we attach attribute and unit metadata, assuming we could somehow extract it from the service (which we often can't)? Maybe this situation will change as more standard ways to link JSON to schemas and metadata are developed; JSON is easy to parse, just not easy to document at the current time. Maybe dat will provide some of these conventions. I'm also following JSON-LD in this regard. In the meantime, I think it could make sense to first transform the JSON to an archive-friendly format, create metadata describing the data, and then archive that package. JSON and NetCDF seem to be compatible data models, and there's some work on a transform tool (see ncdump-json).
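
To make that ambiguity concrete, both payloads parse cleanly with jsonlite, but nothing in the JSON itself says whether the arrays are rows or columns:

```r
library(jsonlite)

by_row <- fromJSON('{"r1":[1,2,3],"r2":[4,5,6],"r3":[7,8,9]}')
by_col <- fromJSON('{"c1":[1,4,7],"c2":[2,5,8],"c3":[3,6,9]}')

# Only with outside knowledge of the orientation can we say these encode the
# same table: treat the first as rows and the second as columns
m_rows <- do.call(rbind, by_row)
m_cols <- do.call(cbind, by_col)
identical(unname(m_rows), unname(m_cols))  # TRUE, but only under that interpretation
```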

I think you've hit on a valuable subject, and there are many reasonable paths.