rstudio / renv

renv: Project environments for R.
https://rstudio.github.io/renv/
MIT License

What part of renv infrastructure needs to be backed up to ensure reproducibility? #1154

Closed PeteHaitch closed 1 year ago

PeteHaitch commented 1 year ago

Hi Kevin,

Our HPC administrators have advised me that the renv/ subdirectories we have for our projects are causing the HPC backup system some grief (it basically doesn't like lots of directories with lots of small files). Each project is its own R project with an renv/ subdirectory, and we have hundreds of such projects.

I'm looking for a solution so that I continue to happily use renv and make life easier for our HPC admin.

One idea is to tar the renv directory once a project is no longer being actively analysed. But that's a manual process to tar/untar that I can imagine might get forgotten and cause some grief to people interacting with the project or managing its backup.

I then wondered if it was sufficient to back up just the renv.lock file or some other small set of files? Could you please advise the minimal set of renv-related files required to ensure reproducibility of an analysis?

Thanks, Pete

kevinushey commented 1 year ago

> Our HPC administrators have advised me that the renv/ subdirectories we have for our projects are causing the HPC backup system some grief (it basically doesn't like lots of directories with lots of small files).

Presumably, the abundance of small files comes from the renv/library folder, where installed packages live? (Although those are by default symlinks to a global cache location; I'm not sure whether that changes things.)

The project library path can be controlled via the RENV_PATHS_LIBRARY environment variable, documented at https://rstudio.github.io/renv/reference/paths.html. I mention this in case re-organizing the project layout on the HPC might be helpful.
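As a rough illustration of that suggestion, the sketch below points the project library at a location outside the project tree. The `/scratch/renv-libraries/my-project` path is a made-up example, not something from this thread, and it assumes the variable is set before renv is loaded for the project (for instance in `.Renviron`, or in `.Rprofile` ahead of the line that sources `renv/activate.R`).

```r
# Hypothetical path on a filesystem that is excluded from backups.
library_path <- file.path("/scratch", "renv-libraries", "my-project")

# Assumption: this runs before renv is loaded for the project, e.g. from
# .Rprofile above source("renv/activate.R"), or via an equivalent .Renviron entry.
Sys.setenv(RENV_PATHS_LIBRARY = library_path)

# If renv is installed, ask it which library path it resolves.
if (requireNamespace("renv", quietly = TRUE)) {
  print(renv::paths$library())
}
```

With this in place the many small package files would live under the scratch path, while the project directory itself keeps only the lockfile and a handful of small renv files.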

> I then wondered if it was sufficient to back up just the renv.lock file or some other small set of files? Could you please advise the minimal set of renv-related files required to ensure reproducibility of an analysis?

The answer to this depends on what you can make externally available to the project. At a minimum, you only need renv.lock -- you could then re-initialize a project using renv::init(), or renv::restore(). However, this assumes that the packages recorded in the lockfile are available at their declared sources (e.g. CRAN; GitHub; other possible sources).

See also the snippet at https://rstudio.github.io/renv/articles/renv.html#reproducibility.
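To make the "minimal set" concrete, here is a small sketch that lists the lightweight renv files present in a project. `renv_backup_files()` is a hypothetical helper, not part of renv. Only renv.lock is named as essential in the answer above; the other candidates are the files the renv documentation suggests keeping under version control, and the two settings file names are assumptions that depend on the renv version.

```r
# Hypothetical helper: return the small renv-related files found in a project.
# renv/library (the source of the many small files) is deliberately left out.
renv_backup_files <- function(project = ".") {
  candidates <- c(
    "renv.lock",          # the lockfile: the one file that is strictly needed
    ".Rprofile",          # usually sources renv/activate.R
    "renv/activate.R",    # bootstraps renv when the project is opened
    "renv/settings.json", # project settings (assumed name in newer renv versions)
    "renv/settings.dcf"   # project settings (assumed name in older renv versions)
  )
  # Keep only the candidates that actually exist in this project
  candidates[file.exists(file.path(project, candidates))]
}

# List the files to back up for the project in the current directory
renv_backup_files(".")
```

After copying those files back into place, running `renv::restore()` in the project would reinstall the recorded packages, provided they are still available at their declared sources.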

Let me know if this helps, or if I can provide further guidance.

PeteHaitch commented 1 year ago

Thanks, Kevin, that's really helpful.

> Presumably, the abundance of small files comes from the renv/library folder, where installed packages live? (Although those are by default symlinks to a global cache location; I'm not sure whether that changes things.)

Yes, I believe so. I've asked our admin the same question about whether symlinks change things, but haven't yet received a response.

> The answer to this depends on what you can make externally available to the project. At a minimum, you only need renv.lock -- you could then re-initialize a project using renv::init(), or renv::restore(). However, this assumes that the packages recorded in the lockfile are available at their declared sources (e.g. CRAN; GitHub; other possible sources).

(Almost) all packages we use for analysis are from CRAN or Bioconductor (we occasionally use a GitHub-only dependency, but try to avoid it).

> The project library path can be controlled via the RENV_PATHS_LIBRARY environment variable, documented at https://rstudio.github.io/renv/reference/paths.html. I mention this in case re-organizing the project layout on the HPC might be helpful.

Ah that could be useful to me! I knew I could move the cache location (although currently I use the default) but I didn't know that I could specify a path for the project library via RENV_PATHS_LIBRARY.

I will look into setting up path customisation for new projects. If I want to then post-hoc modify existing projects, is that feasible? Are there any gotchas to be aware of?

Thanks, Pete

kevinushey commented 1 year ago

> (Almost) all packages we use for analysis are from CRAN or Bioconductor (we occasionally use a GitHub-only dependency, but try to avoid it).

Do you need to be robust against (1) CRAN / Bioconductor going away (or being inaccessible in your computation environment), or (2) packages you use being removed / archived from CRAN? (It's rare, but it does happen.)

You can often still find these packages in other locations; e.g. via MRAN snapshots (https://mran.microsoft.com/) or Posit Package Manager (https://packagemanager.rstudio.com/client/#/), but raising this in case it's a concern.
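One way this fallback might look in practice is to restore against a dated repository snapshot instead of the live CRAN mirror. The sketch below is only illustrative: `snapshot_repo()` is a hypothetical helper, and both the base URL and the date are assumptions to replace with whatever snapshot service and date match your lockfile. It covers CRAN-style repositories only, not Bioconductor or GitHub sources.

```r
# Hypothetical helper: build a dated snapshot URL.
# The default base URL is an assumption; adjust it to your snapshot service.
snapshot_repo <- function(date, base = "https://packagemanager.posit.co/cran") {
  # Fail early on anything that is not a valid YYYY-MM-DD date
  parsed <- as.Date(date, format = "%Y-%m-%d")
  if (is.na(parsed)) stop("`date` must look like 'YYYY-MM-DD'")
  paste(base, format(parsed, "%Y-%m-%d"), sep = "/")
}

# Example date only; pick one close to when the lockfile was written.
repos <- c(CRAN = snapshot_repo("2023-01-15"))

# Restore from the snapshot rather than the repositories recorded in the
# lockfile. Guarded so this is a no-op outside an renv project.
if (requireNamespace("renv", quietly = TRUE) && file.exists("renv.lock")) {
  renv::restore(repos = repos, prompt = FALSE)
}
```

Passing `repos` explicitly leaves the lockfile itself untouched, so the same renv.lock can still be restored from the original sources where they remain available.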

> If I want to then post-hoc modify existing projects, is that feasible? Are there any gotchas to be aware of?

You definitely can. I don't anticipate any problems, but it's worth testing on some smaller example projects first to be certain.

philibe commented 1 year ago

About MRAN:

> The Microsoft R Application Network website will be shut down on July 1st, 2023

https://techcommunity.microsoft.com/t5/azure-sql-blog/microsoft-r-application-network-retirement/ba-p/3707161