Beyond CRAN: modern dependency management; including older/archived versions & alternative repositories

cboettig commented 9 years ago

CRAN is a favorite topic at any R gathering, so perhaps we can channel it into something productive here.

Perhaps I'm grouping different issues under the same tent here, so feel free to propose breaking these into separate issues or pursuing only certain aspects. I'm really not sure how best to describe these issues either, so if this sparks any interest, feel free to hijack this issue and re-frame the discussion however you see fit.

- There's a handful of very interesting packages that take a stricter approach to dependency management, including RStudio's packrat, @gmbecker's gRAN & switchr, and some of the related tools to which these connect (MRAN, @gaborcsardi's crandb, etc.). It might be nice to take stock of what issues these address and what challenges remain in managing dependencies.
- The above approaches attempt to deal with the reality of installing packages whose dependencies come from locations other than CRAN (or CRAN/Bioconductor/Omegahat, though the latter two are clearly rather different from CRAN, and in rather different ways). Does this suggest that the future will be more distributed, or that a different approach to coordinating package dependencies is needed?
- EDIT: @eddelbuettel's new drat takes the approach of making it easier to create and use alternative repositories, building around base R install functions. How does this approach (and the general ability to add arbitrary repositories into dependencies) interact with the approaches in the first bullet? (There's also a question here about the relative roles of install_github vs. an install.packages() + drat style approach, also seen in @yihui's xran, etc.)
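
As a rough illustration of the "add an alternative repository" idea, the sketch below registers a second repository next to CRAN using only base R. The drat URL is an assumption (it follows the usual GitHub Pages layout for a drat repository), and `somepkg` is a hypothetical package name.

```r
# Register an extra repository next to CRAN (URLs are illustrative assumptions).
repos <- c(
  CRAN = "https://cloud.r-project.org",
  drat = "https://eddelbuettel.github.io/drat"  # assumed location of a drat repo
)
options(repos = repos)

# install.packages() and available.packages() consult every entry in
# getOption("repos"), so a hypothetical package hosted only in the extra
# repository could then be installed the usual way:
# install.packages("somepkg")
print(getOption("repos"))
```

Because this only changes an option, it leaves R's own dependency resolution untouched, which is the appeal of the repository-based approach.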

jennybc commented 9 years ago

@eddelbuettel Isn't drat also relevant here?

cboettig commented 9 years ago

drat! How could I forget that! Thanks, editing original post now...

gaborcsardi commented 9 years ago

Not very surprisingly :), I would very much like to participate in this. There is also https://github.com/rpkg which I am planning to work into a usable state before the unconf. The plan is to have some sort of database/repository and minimal package manager for packages (initially) on Github, to facilitate 1) package discovery and 2) dependencies. The DB will be here: https://github.com/rpkg/rpkgdb.app, and the package manager is not yet in the works.

Alternatively/additionally, we can convince @hadley to put support for Github dependencies into devtools::install_github. :)

eddelbuettel commented 9 years ago

For R source packages, I prefer relying on R tools. Hence drat, which relies on R's (existing and working) tools to build a package index and resolve R dependencies, and simply permits working repositories now. I see no point in replicating this / reinventing wheels here. It works for me.
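
For readers unfamiliar with the "existing and working" tooling, here is a minimal sketch of building a package index with base R, assuming the tool in question is `tools::write_PACKAGES()`. The package name `mypkg` and its DESCRIPTION contents are hypothetical, and `unpacked = TRUE` is used only so the example does not need a built tarball.

```r
# Build a tiny source "repository" from an unpacked package directory and
# index it with base R's tools::write_PACKAGES().
repo <- file.path(tempdir(), "demo-repo")
pkg_dir <- file.path(repo, "mypkg")          # hypothetical package name
dir.create(pkg_dir, recursive = TRUE, showWarnings = FALSE)

# A minimal DESCRIPTION; a real package needs more fields.
writeLines(c(
  "Package: mypkg",
  "Version: 0.1.0",
  "Title: Demo Package",
  "Depends: R (>= 3.0.0)",
  "Imports: httr"
), file.path(pkg_dir, "DESCRIPTION"))

# unpacked = TRUE reads DESCRIPTION files from directories instead of tarballs.
n <- tools::write_PACKAGES(repo, type = "source", unpacked = TRUE)

# The resulting PACKAGES file is the index that install.packages() reads.
index <- read.dcf(file.path(repo, "PACKAGES"))
print(index[, c("Package", "Version")])
```

A real repository would hold built `.tar.gz` files under `src/contrib`; the index format is the same.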

For R binary packages, though, I was wondering if I could hash things out with @gaborcsardi about revisiting what we did with cran2deb (and which lives on in Don's debian-r). That would e.g. feed into Rocker, and could be used for other binary builds on other OSs etc.

eddelbuettel commented 9 years ago

@cboettig: Thanks for the edit above. I think this ticket may need to get split into two or three related ones. Read your headline v e r y s l o w l y: that is really several problems at once, no?

So looking forward to these two days...

gmbecker commented 9 years ago

@cboettig A subject close to my heart :). Excited to talk to other interested parties about it!

@gaborcsardi This is one of the things that switchr provides (github-to-github dependencies), given a manifest of all necessary github packages and the ability to build the packages from source. We do this by building a just-in-time repository by recursively traversing the stated dependencies.

Given a complete-enough manifest of packages within github repositories, this allows us to treat github as an untested "CRAN-devel", as well as a supplementary archive for older pkg versions.
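
The recursive traversal described above can be sketched in a few lines of base R. This is not switchr's implementation; the manifest structure, the package names, and the `resolve_deps()` helper are all hypothetical and only illustrate the idea of collecting just the packages a given install needs.

```r
# Hypothetical manifest: package name -> GitHub location and stated dependencies.
manifest <- list(
  pkgA = list(repo = "someuser/pkgA", deps = c("pkgB", "pkgC")),
  pkgB = list(repo = "someuser/pkgB", deps = "pkgC"),
  pkgC = list(repo = "otheruser/pkgC", deps = character(0)),
  pkgD = list(repo = "otheruser/pkgD", deps = character(0))
)

# Depth-first traversal that returns packages in an installable order
# (dependencies before dependents). Cyclic dependencies are not handled.
resolve_deps <- function(pkg, manifest, seen = character(0)) {
  if (pkg %in% seen) return(seen)
  if (is.null(manifest[[pkg]])) stop("package not in manifest: ", pkg)
  for (dep in manifest[[pkg]]$deps) {
    seen <- resolve_deps(dep, manifest, seen)
  }
  c(seen, pkg)
}

needed <- resolve_deps("pkgA", manifest)
print(needed)
```

Note that `pkgD` is in the manifest but never visited, so the work grows with the packages being installed rather than with the size of the manifest.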

@eddelbuettel Re multiple issues. They can be separate, and that approach is not without benefits, but they don't need to be, I think...

This is going to be fun. ~G

Gabriel Becker, PhD Computational Biologist Bioinformatics and Computational Biology Genentech, Inc.

gaborcsardi commented 9 years ago

@eddelbuettel Well, existing and working is a good point. However, I would really like to have a repository that provides an API to query, submit, etc., plus a modern package manager that speaks this API (and handles CRAN as well, maybe through crandb).

That does not mean that we cannot start with the existing tools, and just provide an API and maybe a package manager on top of them. In fact, using the existing and working crandb with the existing and working drat is probably something we can put together in two days.

However, I also think that sometimes it is better to start from scratch. :) We would need the machinery to submit, publish, query, etc. packages, anyway.

sckott commented 9 years ago

This may seem crazy, but the conda package manager from Continuum does or soon will support R http://continuum.io/blog/preliminary-support-R-conda - and has virtual environments, and is cross platform - A possible place to start on the pkg manager front?

hafen commented 9 years ago

I was going to mention conda-R as well. Interesting idea with Binstar and all.

gaborcsardi commented 9 years ago

@gmbecker Nice, I didn't know switchr did that. That's a great first step! It would also be nice to have 1) proper versions, and 2) some DB of packages, for discovery. I would leave out virtual environments from this, I think they are great, but they can be done independently.

Re conda, I am not sure what it does exactly with R packages. Isn't conda itself written in Python? That's somewhat suboptimal, to have to install Python to manage R packages..... but conda and binstar are definitely projects to learn from, just like other package managers, e.g. the ones listed at https://github.com/showcases/package-managers

@eddelbuettel As for binary builds, I guess you realize that for that you need a farm, or SaaS? Or maybe I don't get what you mean here.

Btw. I also think that we should break up this issue.

gmbecker commented 9 years ago

@gaborcsardi Depending on what you mean by true versions, switchr does support that as well. You can tell it to install an exact (even non-current) version of a package from a github repository. This becomes murkier when there are dependencies, as it is difficult for it to know what versions of the dependencies to grab.
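
To make the "which version of a dependency?" problem concrete, here is a small sketch using base R's `package_version()`. The release history, the bounds, and the `pick_version()` helper are hypothetical; this is one possible selection rule, not what switchr does.

```r
# Hypothetical release history for one dependency.
available <- package_version(c("0.4.0", "0.5.1", "0.6.0", "1.0.0"))

# Pick the newest version within illustrative lower and upper bounds.
pick_version <- function(available, lower, upper) {
  ok <- available[available >= package_version(lower) &
                  available <= package_version(upper)]
  if (length(ok) == 0) stop("no version satisfies the constraint")
  max(ok)
}

chosen <- pick_version(available, lower = "0.5", upper = "0.9")
print(chosen)
```

The hard part in practice is knowing the bounds: a DESCRIPTION file usually states only a minimum version, so the upper bound has to come from somewhere else, such as release dates.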

Re: a database, I agree. The way I am currently envisioning it is a large manifest hosted on github with the ability for the community to make pull requests to add-to or update it. This could be wrapped in a convenience R function.
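
A minimal sketch of what such a manifest and convenience function might look like, assuming a plain data frame with one row per package. The column names, package names, and the `upsert_entry()` helper are all hypothetical.

```r
# Hypothetical manifest: one row per package, pointing at a GitHub repository.
manifest <- data.frame(
  package = c("pkgA", "pkgB"),
  repo    = c("someuser/pkgA", "someuser/pkgB"),
  branch  = c("master", "master"),
  stringsAsFactors = FALSE
)

# Add a new entry, or replace the existing row if the package is already listed.
upsert_entry <- function(manifest, package, repo, branch = "master") {
  entry <- data.frame(package = package, repo = repo, branch = branch,
                      stringsAsFactors = FALSE)
  manifest <- manifest[manifest$package != package, , drop = FALSE]
  out <- rbind(manifest, entry)
  rownames(out) <- NULL
  out
}

manifest <- upsert_entry(manifest, "pkgC", "otheruser/pkgC")  # new package
manifest <- upsert_entry(manifest, "pkgA", "newowner/pkgA")   # moved package
print(manifest)
```

Stored as a CSV file in a repository, a table like this could be updated through ordinary pull requests.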

For the versions-of-dependencies issue I agree that some sort of database is probably the only way to manage that. switchr can talk to your crandb service to solve this problem for packages that lived on CRAN. As a side-note I'd like to talk to you about a) making that particular query easier in the crandb API, and b) how hard it would be to have a similar service for Bioconductor packages.

~G

gaborcsardi commented 9 years ago

@gmbecker

Re versions, that's great, again, I did not know you did that. For the DB, I want people to be able to make releases, i.e. to decide which versions they want to include in the DB. And then the package manager would handle this, e.g. update packages. That's all I mean. Versioned dependencies would be great, but too difficult right now imho. I investigated this a bit here: https://github.com/metacran/camo but I don't think it is worth doing it, before we actually have a package manager.

I was thinking about the Github solution, too. However, I really don't want to handle pull requests by hand, and it is also really hard to automate them, so that people can't mess up the DB by chance. That's why I am thinking about a noSQL DB with an API, and authentication (which can be provided by Github, actually). Something very similar to crandb. All automated, with human intervention only in exceptional cases.

As for your crandb query, and BioC, why don't you open an issue in the crandb repo, and then we can discuss it.

eddelbuettel commented 9 years ago

FWIW I started to cobble together a package (for Debian-based systems including of course Ubuntu and what is used at Travis) to interface the package manager backend: RcppAPT. This may (or may not) help with gathering information about what has been built (in the "take source from CRAN and build a binary" sense) and which build-dependencies are or are not available.

It may be useful for all the cloud-based things implemented with a Debian/Ubuntu backing such as Docker (where at least @cboettig and I use a Debian backing), Travis, ... but it won't buy you lunch on Windoze, OS X, Fedora, and lots of other lovely places. Which I rarely visit :)

hafen commented 9 years ago

A late +1 for this - I'm particularly interested in the managed github database of packages / github dependency management, and would like to participate.

eddelbuettel commented 9 years ago

Have you looked at drat?

One view is that we don't actually have to reinvent anything but rather create more repositories.

hafen commented 9 years ago

Yes - it looks great! Sorry I didn't mean to imply that these aren't solved - I have a lot to catch up on in this area. Just interested in helping push it forward.

gaborcsardi commented 9 years ago

I think the DB has different goals than drat. The main goals are:

- Discoverability. There are thousands of R packages on Github, and it is not easy to find them, even if Github is searchable.
- Easy installation and dependency management. R's built-in package management (used by drat, too, afaik) does not really scale to hundreds of repos, imo. Specifically, every time you want to install something or even just want to list packages, it downloads PACKAGES files from all repos. This is not feasible for hundreds of repos, and as this is hardwired into R (utils), it is unlikely to change. (OK, the packages are cached within a session, that's actually good.)
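
The scaling concern can be seen from how base R maps repositories to index locations. The sketch below uses `contrib.url()`; the two example.org URLs are placeholders, and nothing is downloaded.

```r
# Three repositories (the two example.org URLs are placeholders).
repos <- c(
  CRAN  = "https://cloud.r-project.org",
  repoA = "https://repo-a.example.org",
  repoB = "https://repo-b.example.org"
)

# Each repository exposes its own index under a contrib directory, so
# available.packages() has to fetch one index (PACKAGES or a compressed
# variant) per entry.
index_dirs <- contrib.url(repos, type = "source")
index_files <- paste0(index_dirs, "/PACKAGES")
print(index_files)
```

With hundreds of entries in `repos`, that is hundreds of index requests before any package is installed.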

eddelbuettel commented 9 years ago

I am not aware that anybody has ever tried "hundreds of repos" but do agree that scaling to that size would be an issue.

I foresee drat as being useful for the range of, say, two to five repos as I don't really see people keeping track of that many more.

And I also see the MetaCRAN DB as both extremely useful and also complementary to what drat tries to do here and now.

This is not either or, and I hope I didn't portray it as such. One can use drat today to solve actual problems -- and I and some other early adopters do.

gaborcsardi commented 9 years ago

I agree completely. I think that drat is great for your own local repo management, and maybe a couple of other repos. (But we don't know how many until we try it, actually. Maybe it scales up to 50, maybe not.) It probably does not scale up to hundreds. This is not drat's fault, R's package management functions were simply not written with having many repos in mind.

Btw. I started writing the package manager: https://github.com/metacran/rpkg (Pre-alpha, so beware! :) ) It uses ./r_pkgs by default for installing packages, and you need to supply global = TRUE to most functions to use the standard R library directories in .libPaths(). Things like

pkg_list()
pkg_list(global = TRUE)
pkg_outdated()
pkg_outdated(global = TRUE)
pkg_tree("devtools")
pkg_install("httr")
pkg_upgrade()
pkg_info("httr")
pkg_bug("httr")
pkg_browse("Rcpp")

work, at least on OSX. There are a lot of corner cases to handle, and a lot to do in general. The good thing is, with crandb, every operation is pretty simple.

It uses CRAN packages now, the idea is to add Github packages later.
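
The project-local library idea (the ./r_pkgs default versus the global .libPaths() directories) can be approximated in base R. This is only a sketch and not how rpkg is implemented; a temporary directory stands in for a real project.

```r
# Create a project-local library directory (a temp dir stands in for ./r_pkgs).
project <- file.path(tempdir(), "demo-project")
local_lib <- file.path(project, "r_pkgs")
dir.create(local_lib, recursive = TRUE, showWarnings = FALSE)

# Put the local library first so installs and library() prefer it,
# while the standard libraries stay available as a fallback.
old_paths <- .libPaths()
.libPaths(c(local_lib, old_paths))
first <- .libPaths()[1]
print(first)

# Restore the previous library search path when done.
.libPaths(old_paths)
```

With the local library first, `install.packages()` installs there by default, which keeps a project's packages separate from the global library.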

gmbecker commented 9 years ago

@gaborcsardi @eddelbuettel - switchr approaches this via non-centralized manifests. When installing from a manifest, it creates a local repository containing only the necessary packages, and installs from that by calling down to R's built-in installation mechanisms. AFAICS, this should scale indefinitely, as the complexity goes up with the number of packages that will be installed, not the number of packages available.

At that point, all we need is a manifest of where R packages live on github, and we are good to go. For example, it is easy to generate a manifest of all ROpenSci packages, so that installing any single or combination of those packages just works.

gaborcsardi commented 9 years ago

@gmbecker Yes, the DB and the accompanying web-service have essentially the same role as your Github manifest.

viking commented 6 years ago

@gaborcsardi Sorry to resurrect a dead thread, but did you end up getting anywhere in your efforts to create a package database?

eddelbuettel commented 6 years ago

@viking : Yes, it underlies R Hub and is used. But I tend to lose myself among the different repos that @gaborcsardi has and cannot immediately point you to one.

jeroen commented 6 years ago

@viking the r-hub dependency db is https://sysreqs.r-hub.io which is backed by this data: https://github.com/r-hub/sysreqsdb

viking commented 6 years ago

Hmm, so what is the primary function of R Hub? Will it be a CRAN replacement eventually or is it meant to supplement CRAN in some way?

gaborcsardi commented 6 years ago

It is a multi-platform package check service, in its current form.

eddelbuettel commented 6 years ago

> Will it be a CRAN replacement eventually

That was never planned. It is independent of CRAN.

viking commented 6 years ago

Aha, I see. I was just curious about the state of some of the ideas discussed in this thread. Thanks for the info.

ropensci / unconf15

Beyond CRAN: modern dependency management; including older/archived versions & alternative repositories #7