rstudio / renv

renv: Project environments for R.
https://rstudio.github.io/renv/
MIT License
1.02k stars 155 forks source link

consider including LinkingTo entries when building hash keys #884

Open kevinushey opened 2 years ago

kevinushey commented 2 years ago

For example, if using Rcpp 1.0.7, one would want to ensure any packages pulled from the global cache were built against that same version of Rcpp.

hadley commented 1 year ago

It looks like it is already included in the hash?

https://github.com/rstudio/renv/blob/6c0e8be69ff683c0d2b7c62c371783663f293b5d/R/hash.R#L23-L27

I had expected that this issue would apply to a hash that was created recursively from all the dependencies of a package, but I can't find any code in renv that appears to do that. Maybe that's what this issue is about?

kevinushey commented 1 year ago

I had expected that this issue would apply to a hash that was created recursively from all the dependencies of a package, but I can't find any code in renv that appears to do that. Maybe that's what this issue is about?

I think that was the prior motivation when I filed this issue. My main concern here is that, in general, we don't know what packages were actually available when a package was built; e.g. package A might have been built against package B (1.0.0), but the user has package B (1.0.1) installed into their library paths.

In other words, recursive hashing requires us to make a number of somewhat tenuous assumptions about a package's recursive dependencies. Packrat did / does it, and I think it's mostly okay in the end, but I believe there have been some scenarios where seemingly unexplainable "cache misses" have occurred that have caused trouble.

tl;dr: I'm leaning towards closing this unless we feel very strongly we should change how packages are hashed in renv.

hadley commented 1 year ago

If the current system is working, I don't think there's any reason to change it. FWIW we're also going to work to switch Rcpp to cpp11 for shiny and dependencies so ABI problems mostly go away for posit packages.

aronatkins commented 1 year ago

Packrat originally did not recursively create a hash, and that ended up with us trying to use the same compiled httpuv-1.0 against both Rcpp-0.11.2 and Rcpp-0.11.2.1.

We solved that by hashing the "linked-to" package description, and using that dependency hash when computing the "linking-to" hash.

We decided that we wanted to recurse through the whole linking tree because we were worried about ABI breakages rippling through A->B->C. In truth, however, we probably only ever needed one level of recursion. If changes to C could invalidate A, then A should be declaring C as a LinkingTo.

All of the ABI breakages we have seen have been caused by direct relationships (e.g. A->B, not A->B->C).

Aside: rsconnect includes the full set of DESCRIPTION files because really old versions of Packrat computed hash without LinkingTo recursion. Connect wires in a custom lookup function to we can re-hash with that full information. It has also let us correct the hashes new Remote* fields have shown up (usually the cause of the cache misses that @kevinushey references).

We should make sure that the Renv hash computation includes the VERSION of the LinkingTo dependency, at the very least. That would create unique httpuv entries for each (httpuv, Rcpp) version combination.

This particular issue was filed coincident with the Rcpp_precious_remove ABI change.

aronatkins commented 1 year ago

The renv can link cached, incompatible packages into a project library. Here's the setup:

Create a directory structure containing an isolated renv cache directory and two project directories:

renv-caching/
    cache/
    httpuv-rcpp-1.0.6/
    httpuv-rcpp-1.0.7/

Inside each of those project directories, create:

# bootstrap.R
renv::init(
  bare = TRUE,
  settings = list(
  ),
  repos = c(CRAN = "https://cran.rstudio.com/")
)

# Make sure we are using our isolated cache.
# Only use source packages.
writeLines(
  c(
    "parentDir <- dirname(getwd())",
    "cacheDir <- file.path(parentDir, \"cache\")",
    "Sys.setenv(RENV_PATHS_CACHE = cacheDir)",
    "Sys.setenv(RENV_RSPM_ENABLED = \"false\")",
    "options(pkgType = \"source\")",
    "source(\"renv/activate.R\")"
  ),
  ".Rprofile"
)
# install.R
renv::install("Rcpp@1.0.7") # Use 1.0.6 or 1.0.7 as appropriate.
renv::install("httpuv@1.6.2")
renv::snapshot(prompt = FALSE)
# server.R
library(httpuv)
s <- startServer(host = "0.0.0.0", port = 8080, app = list())

Now, to populate the cache:

# run in httpuv-rcpp-1.0.7
R -s -f bootstrap.R
R -s -f install.R

# run in httpuv-rcpp-1.0.6
R -s -f install.R
R -s -f server.R

Now, to err!

# run in httpuv-rcpp-1.0.6
R -s -f server.R
#> Error in makeTcpServer(host, port, private$appWrapper$onHeaders, private$appWrapper$onBodyData,  :
#>   function 'Rcpp_precious_remove' not provided by package 'Rcpp'
#> Calls: startServer -> <Anonymous> -> initialize -> makeTcpServer
#> Execution halted

Some things about these projects:

  1. No binary packages; everything is compiled from source to avoid problems with repository-hosted binaries.
  2. No RSPM binary rewriting (in the end, used CRAN and source packages).
  3. No default renv cache; avoids what might already be in the cache.
kevinushey commented 1 year ago

I believe my concern remains, though: we don't know what LinkingTo dependency versions were actually used when a package was built + installed; right now, the best we can do is guess and hope that the version in the library path matches the version that the package was built against.

kevinushey commented 1 year ago

Independent of LinkingTo dependencies, there are also other build-time dependencies that can cause issues like this -- e.g. old versions of packages imported magrittr::%>% by copying the build-time definition of that function into the package, but that caused problems when the internals of magrittr were changed. I believe this is resolved by instead dynamically loading %>% at runtime in the packages where magrittr is used, but other package authors might not be aware of these sorts of problems.

aronatkins commented 1 year ago

When renv performs the package compilation, it does know all the package versions. Agree that if we are copying files from an existing R library or obtaining binaries from a repository, we do not know what version it wants to link against.

Packages that record a build-time reference are package bugs, and not a normal, tracked dependency. Are you trying to suggest that the cache entry needs to be package-bug aware?

Two examples we "fix" in Connect:

These examples feel like issues that the packages need to fix, rather than a situation that renv needs to handle.

Returning to LinkingTo, renv could incorporate linking information when building the cache key. What would need to change for renv to have enough information to address the problem exposed by my example?

kevinushey commented 1 year ago

When renv performs the package compilation, it does know all the package versions. Agree that if we are copying files from an existing R library or obtaining binaries from a repository, we do not know what version it wants to link against.

But it's not necessarily renv performing compilation or install; e.g. users might install packages without using renv. We could solve the problem if everything went through renv (and we required users to do so) but one of the selling points of renv is that it tries to work with existing workflows / tools as much as possible.

Perhaps the story is different for deployment scenarios, where we're performing a sort of clean-room install entirely through renv though.

Packages that record a build-time reference are package bugs, and not a normal, tracked dependency. Are you trying to suggest that the cache entry needs to be package-bug aware?

I think I'm saying that these LinkingTo bugs are the same in spirit to "build-time reference" bugs. That is, if a package is built against version x, but runs against version y, bad things might happen.

Taking a build-time dependency on another package's R functions implicitly ties you to the internals of that package, so I think I agree it's an issue to be fixed by the package author (it's somehow more egregious that the LinkingTo issues). But I think the mitigation for such issues, if we couldn't rely on the package author to resolve them, would be the same -- include the build-time dependencies in the hash.

Returning to LinkingTo, renv could incorporate linking information when building the cache key. What would need to change for renv to have enough information to address the problem exposed by my example?

IMHO, packages should include a mini-lockfile of the packages that were available at install time, and include those in the package installation somewhere. So you could do something like:

getPackageBuildInfo("dplyr")

and learn about what packages were actually available and used at build time for each of Depends / Imports / Suggests / LinkingTo.

If renv is the only one responsible for installing packages (as it might be in deployment scenarios) then we could possibly handle this; e.g. by writing out a "lockfile" after installing a package, declaring what packages + what versions were available at build time. But we'd have to make sure that whatever renv does is congruent with what happens in PPM, and note that users installing binaries from CRAN won't benefit from these changes.

aronatkins commented 1 year ago

On the LinkingTo question: base-R should eventually record "built-with" information - maybe in the DESCRIPTION, maybe elsewhere. I agree.

In the meantime, we could have most tools (renv, pak, PM, etc) produce this information, yes? Maybe the presence of this additional metadata means that package installers (renv, pak, etc) can make more informed decisions when choosing to use package binaries.

kevinushey commented 1 year ago

After discussion with @hadley I think we landed here:

In this world, renv would continue (by default) only doing its plain single-package DESCRIPTION hashing, but could be opted-in to an extended hash using other information provided by the installed package. For Connect, this would make sense since Connect has control over how packages are installed and so could require that this metadata be produced when packages are built.

Does this sound like a reasonable way forward @hadley / @aronatkins?

aronatkins commented 1 year ago

There are different problems dependent on how renv obtains the package installation.

When renv installs from source, as in https://github.com/rstudio/renv/issues/884#issuecomment-1545949051, renv should do all it can to make sure that the cache it constructs is consistent and usable by multiple projects. This probably means adding LinkingTo dependency versions to the hash. The resulting cache would contain two installations of httpuv, each linked against a different Rcpp version.

When renv installs from binaries -- either because they are copied from the user library or because they are obtained from a repository as binary -- renv needs to make use of whatever metadata is made available. I agree that this path is dangerous today, as we have no knowledge of build-time dependencies. Tools like renv / pak / remotes / PM, and eventually base R need to document the requirements used to compile package binaries.

I think that we need both strategies.

I do not see this as a Connect problem, but one that addresses the ability of the renv cache to be used by multiple projects with disparate dependency versions.