ropensci / unconf14

Repo to brainstorm ideas (unconference style) for the rOpenSci hackathon.

Sustainable software and how to provide a stable ecosystem of R packages #19

Open karthik opened 10 years ago

karthik commented 10 years ago

Problem: The issue with dependencies. CRAN only provides the latest version of each package; older archives are unreliable and not guaranteed to be installable, and there is no guarantee that updates won't break existing code. Would be great to have thoughts or ideas, especially from folks like @hadley / @jjallaire, on how to deal with this.

Moving from #18

jeroen commented 10 years ago

See http://arxiv.org/abs/1303.2140 for two possible solutions.

TLDR: I think the only viable solution is to convince the CRAN maintainers to extend the R devel/release cycle to all of CRAN. This would be relatively straightforward and would make their lives easier as well.

sckott commented 10 years ago

@jeroenooms And how likely is it that they'll extend the R devel/release cycle to all of CRAN?

jeroen commented 10 years ago

It won't be easy, but I think if we push really hard it can happen :) It is in their interest as well. I've been planning to make a case for it on the r-devel mailing list. The Rcpp 0.11 drama makes a good argument that pushing bleeding edge updates straight to users is probably not a good idea.

hadley commented 10 years ago

I think @jeroenooms is way too optimistic about getting CRAN maintainers to do anything. I would recommend not putting CRAN maintainers on any critical paths.

jeroen commented 10 years ago

An optimist is simply an uninformed pessimist :)

I posted a little rant about this on r-devel.

cboettig commented 10 years ago

@hadley does this mean you're volunteering RStudio to do regularly versioned releases of all of CRAN? ;)

hadley commented 10 years ago

@cboettig that's only slightly more likely than cran volunteering to do it :wink:

gavinsimpson commented 10 years ago

I'm with Hadley on this one; whatever the merits (or otherwise) of Jeroen's suggestion re freezing CRAN on development cycles, i) CRAN will never do this, and ii) it bends CRAN to serve a particular community or use case that it was never intended for and which is but one part of the whole R community at large.

Progress will have to come despite CRAN, perhaps via more automated means of documenting the R package versions used (i.e. not a sessionInfo dump) that can also be retrieved automatically and acted upon, in order to set up a "sandbox" within which a work may be made reproducible.

Re "pushing bleeding edge updates straight to users"; that is only true for users that maintain a single package library. (Ok that's perhaps the vast majority, but...) There is nothing stopping you have multiple package libraries containing different snapshots of packages required for particular projects. Well, except disk space and effort setting this up.

If we want better reproducibility we need to take the rolling CRAN release out of the equation.

G

cboettig commented 10 years ago

@gavinsimpson Just a note that, as @jeroenooms points out, simply recording the version of the package you use does not address this problem, which arises because the package itself doesn't state the versions of its dependencies. Without knowing which version of a dependency the package author had installed when developing the package, reproducibility may be lost from the first release to CRAN (and while it would help, even 100% unit-test coverage doesn't guarantee this).

Likewise, I don't think the multiple-package-library solution is so simple -- the R environment simply does not support loading package A, which depends on version 0.1 of package B, and then also loading package C, which depends on version 0.2 of package B. Addressing these issues requires more than superficial workarounds.
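
To illustrate (package names A, B and C here are hypothetical):

```r
# Once B 0.1 is loaded as a dependency of A, attaching C (which needs
# B 0.2) fails: R keeps at most one version of a namespace per session.
library(A)                               # attaches B 0.1
library(C, lib.loc = "~/lib-with-B-0.2")
# Error: namespace 'B' 0.1 is already loaded, but >= 0.2 is required
```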

karthik commented 10 years ago

Does @rstudio's Packrat address some of these issues? Similar to a virtualenv of sorts?
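
(For reference, the basic Packrat workflow from its docs looks roughly like this:)

```r
install.packages("packrat")
packrat::init("~/myproject")   # give the project its own private library
# ...install and use packages as usual; they go into that library...
packrat::snapshot()            # record the exact versions in use
packrat::restore()             # later/elsewhere: reinstall those versions
```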

eduardszoecs commented 10 years ago

http://rstudio.github.io/packrat/ looks promising

karthik commented 10 years ago

jinx. You owe me a coke @EDiLD

gavinsimpson commented 10 years ago

@cboettig I wasn't suggesting that 1-line comment as an entire working solution :-)

The scenario in your 2nd paragraph is not really relevant if you "sandbox" for reproducibility. Without going through some rather convoluted hoops, I don't see how under that scenario you'd produce a "work" in the first place {assuming that by a work we mean something produced by setting up an environment and running a make script}. As far as R and CRAN are concerned, what was used to create the work can be reproduced: we can download past versions of the R sources and compile them, and likewise for packages (as long as some overlooked licensing issue doesn't force CRAN to remove a package).

Beyond R and CRAN you're into the realm of snapshotting the actual software used: preserving the R version used and a package library with all the packages required to produce the work. For packages on GitHub or elsewhere, where we may not be able to go back and fetch the source tarball, preserving the local setup or something more extreme would be needed.

How far do we want to go with this? Freezing CRAN won't stop the introduction of changes that could affect results; what about the libraries supplied by the underlying OS, which R and some packages use quite heavily? There is some middle ground here where we don't solve 100% of the problems but cover an important chunk of them without requiring interventions from CRAN etc. or being too unwieldy.

jeroen commented 10 years ago

I also believe that keeping it simple is essential for solutions to be practical. If every script has to be run inside a sandbox with custom libraries, it takes away much of its power. Running a bash or python script in Linux is so easy and reliable that entire distributions are based on it. I don't understand why we make our lives so difficult in R.

Sure, you can create a meta-system that automatically records packages, installs them, sets up a sandbox, etc. You can even go one step further: Karim Chine has been advocating for years that we should facilitate reproducibility using cloud images, i.e. publish a snapshot of your entire machine with R, all libraries, and everything. To reproduce results, anyone can just load the image on EC2 or wherever and press play. It's really not a bad idea, but it's so heavy that it doesn't scale very well.

I think that in practice most progress can be made with lightweight solutions. As was pointed out in other topics, things like knitr are already pretty niche and not widely used by applied researchers. If, on top of that, researchers need to execute their knitr scripts in a special sandbox system that records/replays the libraries, it might become even more obscure. Therefore I am trying to push for a solution that would naturally improve the reliability/reproducibility of R code without any effort by the end-user.

karthik commented 10 years ago

Agree 100% with @jeroenooms which is why I never raised the issue of cloud VMs. As mentioned in the reproducibility thread, there are other solutions (we're not the first community to deal with this), but most are also too impractical for researchers to ever use.

cboettig commented 10 years ago

@gavinsimpson The scenario in the second paragraph arises all the time. Consider any two packages on CRAN that share a dependency and were submitted far enough apart that the dependency was updated on CRAN in the interim. The scenario is described in @jeroenooms's paper, http://arxiv.org/abs/1303.2140, which describes how NPM deals with this issue in comparison. I may be mistaken, but I don't believe packrat deals with this scenario.

I think we are discussing two separate things: reproducibility of whatever the end-user did, which we may capture with a sandbox, vs reproducibility of what the developer intended (maybe call this "software sustainability"). R does not require the developer to state the versions of the dependencies, so it's not reproducible. The problem is identical to users not declaring version dependencies, but exists at the level of CRAN and the R package system, rather than at the level of the R script. I don't see how sandboxing addresses this latter issue.

karthik commented 10 years ago

Just read that section on NPM from @jeroenooms's paper. God damn, that sounds exactly like what we need for R. @cboettig Packrat is nowhere near that sophisticated. Sandboxing definitely won't address the problem.

sckott commented 10 years ago

from @jeroenooms

I am trying to push for a solution that would naturally improve reliability/reproducibility of R code without any effort by the end-user.

I think this is key. If we want people to use tools that make their work reproducible, we should make it as easy as possible for them -- especially since, at least in science, there are no immediate payoffs (getting tenure, getting a grant, etc.) for making your work reproducible (at the moment, anyway), even if there are great payoffs for science in the larger context.

gavinsimpson commented 10 years ago

@cboettig With all due respect to the people discussing this here, sorting out the software sustainability issue isn't something we can solve... unless someone is prepared to offer up a package repository that would supplement/replace CRAN and place additional constraints on what developers are expected to do re documenting dependencies. Given that plenty of people have gripes with CRAN's policies but thus far no one has stepped up to provide a competing service, I think we have to work with what we have. I don't think this is too problematic, though.

R's DESCRIPTION allows for dependency ranges in the style of NPM's, although this is not enforced by CRAN other than through R CMD check errors on CRAN's tests (which would necessitate updating a package to meet new dependency requirements, or removal from [and archival on] CRAN). If a developer doesn't supply a known range for a package, we could assume that it only works with the current release version on CRAN. Enforcing that on CRAN might be useful, but would doubtless cause chaos. Snapshotting what a user has installed & used for a given project/work in the form of a text file, as illustrated for NPM, would solve the issue as far as the end-user is concerned, as long as two things can be arranged for the user:

  1. a utility which, given a packages-used-and-dependencies file, can recreate a package library of the required package versions as used by the user, and
  2. a way to arrange for a stated version of R (or range of R versions), also documented in the dependency file, to be installed/compiled if not already available.

A third requirement would be the easy generation of that dependency file -- which can be done if some assumptions are made, e.g. the upper range of a package requirement is the currently installed version, absent information to the contrary on CRAN -- i.e. you are using outdated packages A and C, with C depending upon A, but the latest version of package C is still known to work with the installed version of A through the dependency range specified for newer A on CRAN (or something like that...).
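
For illustration, a rough sketch of requirements 1 and 3 (write_deps()/restore_deps() and the file format are hypothetical; it assumes devtools::install_version() can pull the stated versions from the CRAN Archive):

```r
# Write a dependency file from the current session...
write_deps <- function(file = "DEPENDENCIES") {
  pkgs <- sessionInfo()$otherPkgs            # attached non-base packages
  lines <- vapply(pkgs, function(p) paste(p$Package, p$Version), character(1))
  writeLines(c(paste("R", getRversion()), lines), file)
}

# ...then rebuild a fresh library from it elsewhere.
restore_deps <- function(file = "DEPENDENCIES", lib = "./sandbox-lib") {
  dir.create(lib, showWarnings = FALSE)
  .libPaths(c(lib, .libPaths()))             # install into the sandbox
  deps <- read.table(file, skip = 1, col.names = c("pkg", "version"),
                     stringsAsFactors = FALSE)
  for (i in seq_len(nrow(deps)))
    devtools::install_version(deps$pkg[i], version = deps$version[i])
}
```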

There is nothing stopping someone implementing a system like NPM's (in so far as I understand how it works, which may be imperfect) except for the requirement for developers to document dependencies more stringently, and the need for a tool that works alongside CRAN's Archive, plus tools/packages that allow setting up the software required to compile the versions of R and packages needed to meet a generated dependency file. Solving the developer problem is likely beyond our ken, other than providing best-practice guidelines.

I don't see how we can ever stop code breaking other code whilst maintaining a decentralised, free-for-all system of development and the current repositories available to us. Beyond this, the best we can do is try to look after our own little areas of the R ecosystem, and provide tools for the user to automate the generation of detailed dependency files and, given a dependency file, to install the relevant packages into a new library along with a suitable version of R.

Ps: What is "unreliable" about CRAN's archive? If a package was on CRAN, they archive it, unless the maintainers retrospectively realise they aren't actually allowed to redistribute the code. Changes on CRAN have made this less likely to happen of late.

cboettig commented 10 years ago

Perhaps if something like devtools automatically added the version of each dependency the package was built against into a text file (if not into DESCRIPTION with >=), it would be a step in the right direction. Currently we lack this info, and thus it is often impossible to install packages from the CRAN archive.
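
A rough sketch of such a helper (pin_deps() is hypothetical, not part of devtools; it pins each Imports entry to the version installed locally):

```r
# Pin each Imports entry in DESCRIPTION to the locally installed version
pin_deps <- function(pkg = ".") {
  path <- file.path(pkg, "DESCRIPTION")
  d <- read.dcf(path)
  imports <- strsplit(d[1, "Imports"], ",[[:space:]]*")[[1]]
  pinned <- vapply(imports, function(p) {
    name <- gsub("^[[:space:]]+|[[:space:]]*\\(.*$", "", p)  # bare name
    sprintf("%s (>= %s)", name, utils::packageVersion(name))
  }, character(1))
  d[1, "Imports"] <- paste(pinned, collapse = ", ")
  write.dcf(d, path)
}
```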



gavinsimpson commented 10 years ago

What I think could be done, if rOpenSci had some way to resource this (and was willing to), would be to set up an rOpenSci repository or testing server/instance. This could provide some form of staged release cycle (stable, testing, devel branches/trees), synced to R's release cycle if so desired, and/or provide testing against the set/range of packages required to run the rOpenSci package ecosystem. The project could even provide modified versions of install.packages() to do versioned installs, and an automated test suite, implemented at the R level (like a modified R CMD check), that tests/checks across the stated range of dependencies.
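
To make that concrete, a sketch of the user-facing side (the repository URLs are entirely made up):

```r
# Resolve packages from a hypothetical staged rOpenSci tree before CRAN
options(repos = c(
  rOpenSciStable = "http://packages.ropensci.org/stable",  # fictional URL
  CRAN           = "http://cran.r-project.org"
))
install.packages("rgbif")   # taken from the stable tree when available
```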

As several people have pointed out on R-devel in response to @jeroenooms's posting, the suggested change to CRAN is not widely seen as desirable given the effort required. That's not to say that for certain projects or sets of packages a better "CRAN" and release cycle aren't needed. Instead, we could look at what Bioconductor has done, independently of and alongside CRAN, to address the release-cycle and interdependency issues, and build upon that.

gavinsimpson commented 10 years ago

@cboettig Unless compiled code comes into it, R packages aren't exactly "built" against other R packages; building an R package can entail as little as pulling together the tarball in many cases. We could deduce which versions of package B are needed to allow package A to "work" (pass R CMD check and unit tests), essentially via brute-force checking of each version of A against a range of versions of B (possibly constrained in some way to avoid this ballooning into a vast set of combinations...).
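
Something along these lines (check_versions() is a hypothetical helper, assuming devtools::install_version() and a source tarball of A):

```r
# Check one source tarball of A against archived versions of dependency B
check_versions <- function(tarball, dep, versions) {
  ok <- logical(length(versions))
  for (i in seq_along(versions)) {
    devtools::install_version(dep, version = versions[i])  # CRAN Archive
    status <- system2("R", c("CMD", "check", "--no-manual", tarball))
    ok[i] <- status == 0                     # 0 = R CMD check passed
  }
  setNames(ok, versions)
}
# e.g. check_versions("A_1.0.tar.gz", "B", c("0.1", "0.2", "0.3"))
```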

One thing I didn't notice about NPM when I commented last night is that it allows OR statements in the dependencies. R's DESCRIPTION doesn't allow for this, so barring a petition to R Core to improve the allowed operators in DESCRIPTION, any third-party repository / interop testing suite would need to work within the current DESCRIPTION requirements, and any improvements would need to be recorded in a separate file carried along in the package sources ($pkg_source_root/inst/DEPENDENCIES, say).
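
Such a file might look something like this (format invented purely for illustration, with NPM-style OR ranges):

```
Rcpp: (>= 0.10.0, < 0.11.0) || (>= 0.11.1)
httr: (>= 0.2)
```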

cboettig commented 10 years ago

@gavinsimpson Great points all round. Your suggestion regarding the ropensci packages is definitely something for us to think about.

As for CRAN, I would still suggest the more modest proposal that R packages include a record of the versions of their dependencies. We all know how important this is for end users, so how is it any different for developers?

I have myself had the experience of not being able to install my own package after it was archived on CRAN, because I could not find the correct older versions of the dependencies (I did eventually, but that information was not in the package). Several of the dependencies had been updated multiple times since, in ways that broke my package (one of which was caught by my unit tests and caused my package to be archived; other changes would not have been caught).

Binaries have nothing to do with it, and passing check and unit tests is no guarantee that we have a compatible version. If we expect users to care about exactly which versions of software they used, surely it is an understandable expectation of developers as well?

gavinsimpson commented 10 years ago

@cboettig I completely agree. I don't think this is something CRAN can do much about (unless they enforce some version modifier instead of simply stating a package name). A mechanism already exists to do this through DESCRIPTION and the plethora of Depends: and Imports: fields therein. With the exception of proposing to R Core more nuanced recording of dependencies in DESCRIPTION (e.g. the OR options in NPM), I don't see this as something R Core or CRAN can do anything about. It is an issue of getting developers to document known-working versions of the packages they depend upon, using the existing mechanisms.
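
For example (package names illustrative; Writing R Extensions notes that R, or a package, may appear more than once in Depends to give both lower and upper bounds):

```
Package: examplepkg
Version: 0.1
Depends: R (>= 3.0.0), R (< 3.2.0)
Imports: plyr (>= 1.8), httr (>= 0.2)
```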

If I am anything to go by, R developers in general tend to be pretty poor at documenting such things. I have, following this discussion, vowed to go back and put ranged versions on all the dependencies of my packages, including R, although whether you can even do this and satisfy CRAN's desire to have current packages work with both R-release and R-devel versions is something I have yet to discover...

cboettig commented 10 years ago

Curious what folks think of this nascent approach to the CRAN archive issues: https://github.com/metacran/tools

Slightly more automatic installation of old versions, though still likely to have issues. For instance, my archived package can be installed with

> install_github("cran/pmc@R-2.15.3")

but it won't grab the correct archived version of the geiger package needed to actually pass its checks. Still, it seems like a work in progress, with in principle the ability to resolve such things based on timestamps, etc.
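
Presumably the dependency could be pinned to the same snapshot tag by hand, though I haven't tested this:

```r
install_github("cran/geiger@R-2.15.3")
```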

The general idea of a CRAN snapshot at a particular point in time sounds very promising.

jeroen commented 10 years ago

The @cran GitHub repo is useful, but a workaround at best. It does not really solve the core problem of dependency versioning, i.e. which versions of each CRAN package are required to make a particular script/package/application work.

cboettig commented 10 years ago

It seems we should at least have a chat about how best to address these issues in the context of our own packages, which are in some ways particularly susceptible to dependency problems. I've added this as a project on the page: https://github.com/ropensci/hackathon/wiki/Projects, hoping @jeroenooms, @hadley, @gavinsimpson or someone might take the lead organizing this. If there isn't critical mass, even a lunch-time discussion that scribbles down best-practice recommendations for us to follow would be good.