rstudio / renv

renv: Project environments for R.
https://rstudio.github.io/renv/
MIT License

support binary + source packages in Nexus repositories #1074

Closed kevinushey closed 1 year ago

kevinushey commented 2 years ago

Many thanks @kevinushey for your very quick and positive answer !

To make it simple, Nexus (aka Nexus Repository Manager, by Sonatype) is deployed by my IT teams on our company's servers. Nexus aims at achieving two main goals:

In the Rprofile.site we deploy on our machines, the repository is set to the Nexus root URL for the R repository (a mirror of CRAN). Thanks to this configuration:

The architecture of Nexus storage is the same as the original CRAN repository. For example :

The main difference with CRAN is that older versions are kept, and still available thanks to Nexus internal storage functionality.

Binary packages

Here is an example for binary package rlang in my Nexus architecture:

___ bin
    |___ windows
         |___ contrib
              |___ 4.1
                   |___ rlang_1.0.3.zip
                   |___ rlang_1.0.4.zip
                   |___ rlang_1.0.5.zip
                   |___ PACKAGES
                   |___ PACKAGES.gz      
                   |___ PACKAGES.rds

As said above, older versions remain available for download even though they are not listed in PACKAGES.gz (which lists only the latest version, since it is the proxied copy of CRAN's PACKAGES.gz file):

Package: rlang
Version: 1.0.5
Depends: R (>= 3.4.0)
Imports: utils
Suggests: cli (>= 3.1.0), covr, crayon, fs, glue, knitr, magrittr,
        methods, pillar, rmarkdown, stats, testthat (>= 3.0.0), tibble,
        usethis, vctrs (>= 0.2.3), withr
License: MIT + file LICENSE

The idea would be to compute the expected URL and try it when restoring the project, if the requested version is not listed in PACKAGES.gz. In our example, if the renv.lock file contains the record rlang@1.0.4, renv should try the URL bin/windows/contrib/4.1/rlang_1.0.4.zip.
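As an illustration of the URL construction described above, here is a minimal sketch. The helper name nexus_binary_url() and the example repository URL are assumptions made for this example; this is not renv's implementation.

```r
# Hypothetical helper (not part of renv): build the URL at which a specific
# binary version would live in a CRAN-like repository layout.
nexus_binary_url <- function(repo, package, version,
                             r_version = "4.1",
                             platform = "windows",
                             ext = ".zip") {
  # Strip trailing slashes so the pieces join cleanly.
  repo <- sub("/+$", "", repo)
  file <- paste0(package, "_", version, ext)
  paste(repo, "bin", platform, "contrib", r_version, file, sep = "/")
}

# Example repository URL is a placeholder.
nexus_binary_url("https://nexus.example.com/repository/r", "rlang", "1.0.4")
```

The R version and platform are passed in explicitly here so the sketch stays self-contained; a real implementation would presumably derive them from the running R session.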

Source packages

The same idea could also be used for source packages, in order to take advantage of the Nexus storage functionality. Here is an example for the source package rlang in my Nexus architecture:

___ src
    |___ contrib
         |___ rlang_1.0.3.tar.gz
         |___ rlang_1.0.4.tar.gz
         |___ rlang_1.0.5.tar.gz
         |___ PACKAGES
         |___ PACKAGES.gz        
         |___ PACKAGES.rds

Again, older versions remain available for download even though they are not listed in PACKAGES.gz (which lists only the latest version, since it is the proxied copy of CRAN's PACKAGES.gz file):

Package: rlang
Version: 1.0.5
Depends: R (>= 3.4.0)
Imports: utils
Suggests: cli (>= 3.1.0), covr, crayon, fs, glue, knitr, magrittr,
        methods, pillar, rmarkdown, stats, testthat (>= 3.0.0), tibble,
        usethis, vctrs (>= 0.2.3), withr
Enhances: winch
License: MIT + file LICENSE
MD5sum: 0419b9b94b400f3ec1f2792ab7f228e2
NeedsCompilation: yes

I looked at the code you wrote in the retrieve.R file for the renv_retrieve_repos_archive_path() function. I understand (though I am not 100% sure) that when the requested package is in neither the binary PACKAGES.gz file nor the source PACKAGES.gz file, this function:

(I am deliberately ignoring the step related to issue #602, to keep this post as simple as possible!)

If the package has been moved to Archive subfolder in CRAN, Nexus will download it and store it with the same architecture, in folder "src/contrib/Archive//".

At the end of the day, the same source package will be duplicated in Nexus storage : both in "src/contrib/" (initial download) and in "src/contrib/Archive//" (second download, when package is moved to Archive in CRAN) :

___ src
    |___ contrib
         |___ Archive
              |___ rlang
                   |___ rlang_1.0.3.tar.gz          # redundant storage since already available
                   |___ rlang_1.0.4.tar.gz          # in Nexus outside of Archive folder
         |___ rlang_1.0.3.tar.gz
         |___ rlang_1.0.4.tar.gz
         |___ rlang_1.0.5.tar.gz
         |___ PACKAGES
         |___ PACKAGES.gz        
         |___ PACKAGES.rds

Avoiding this would prevent an unnecessary increase in storage volume. For Nexus-like configurations, I would suggest trying "src/contrib/" first, and then "src/contrib/Archive//".
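The lookup order suggested above could be sketched as follows. The helper name source_candidates() and the example repository URL are assumptions for illustration only; the Archive path follows the layout shown in the tree above (Archive/rlang/rlang_1.0.3.tar.gz).

```r
# Hypothetical helper (not part of renv): candidate source URLs, ordered so
# that the copy Nexus already keeps in src/contrib/ is tried first.
source_candidates <- function(repo, package, version) {
  repo <- sub("/+$", "", repo)
  file <- paste0(package, "_", version, ".tar.gz")
  c(
    paste(repo, "src/contrib", file, sep = "/"),
    paste(repo, "src/contrib/Archive", package, file, sep = "/")
  )
}

# Example repository URL is a placeholder.
source_candidates("https://nexus.example.com/repository/r", "rlang", "1.0.4")
```

A caller would try each URL in turn and stop at the first one that downloads successfully, so the Archive copy is only requested (and therefore only cached by Nexus) when the first location fails.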

Conclusion

In a nutshell, my suggested sequence would be:

  1. renv_retrieve_repos_binary (when binary explicitly requested by user)
  2. renv_retrieve_repos_binary_older (when binary explicitly requested by user) – new step
  3. renv_retrieve_repos_mran (when binary explicitly requested by user and MRAN enabled)
  4. renv_retrieve_repos_source
  5. renv_retrieve_repos_source_older – new step
  6. renv_retrieve_repos_archive

I don't know whether repository managers like Nexus are widely used by R users. If you don't want to systematically try older versions, a user-level setting to trigger steps 2 and 5 could make sense: for instance, a new option renv.config.retrieve.try.older (or RENV_CONFIG_RETRIEVE_TRY_OLDER as an environment variable), defaulting to FALSE.

Should you have any questions, please contact me. And if you prefer that I create a new issue on GitHub, please tell me. Kind regards, Arnaud
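To make the proposed opt-in concrete, here is a minimal sketch of how such a setting could be read. Both names (renv.config.retrieve.try.older and RENV_CONFIG_RETRIEVE_TRY_OLDER) are only the ones proposed in this comment, and the helper try_older_enabled() is hypothetical; none of them is known to exist in renv.

```r
# Hypothetical helper: read the proposed setting, preferring the R option
# over the environment variable, and defaulting to FALSE.
try_older_enabled <- function() {
  opt <- getOption("renv.config.retrieve.try.older")
  if (!is.null(opt))
    return(isTRUE(as.logical(opt)))

  env <- Sys.getenv("RENV_CONFIG_RETRIEVE_TRY_OLDER", unset = NA)
  if (!is.na(env))
    return(isTRUE(as.logical(env)))

  FALSE
}
```

Steps 2 and 5 of the sequence would then only run when try_older_enabled() returns TRUE.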

Originally posted by @arnauddeblic in https://github.com/rstudio/renv/issues/595#issuecomment-1239255666

kevinushey commented 2 years ago

@arnauddeblic: do you know if there's a way for me to determine whether a repository URL is associated with a Nexus repository? E.g. is there some file or header I can query at the repository URL to determine that?

kevinushey commented 2 years ago

I've tried making some changes to support this in https://github.com/rstudio/renv/commit/a5822317dd0438c69752e24565bc0cdf7d88aa19; if you want to test you can try something like:

options(renv.nexus.enabled = TRUE)
renv::install(<package>)

and see if renv is able to find a binary package at the Nexus "fallback" location.

If there's a way for me to query whether a repository is a Nexus repository, then I could eliminate the need to set an R option to opt-in to this behavior.

arnauddeblic commented 2 years ago

Dear @kevinushey,

Many thanks for addressing this issue so quickly!

To answer both your questions: 1) Concerning your implementation:

The fallback function is called and a URL is requested, which is a good start. However, the requested URL is not correct. Based on the original example, with rlang@1.0.4, instead of requesting <repo>/bin/windows/contrib/4.1/rlang_1.0.4.zip, the code requests <repo>/rlang_1.0.4.zip.

Same behavior for source packages: instead of requesting <repo>/src/contrib/rlang_1.0.4.tar.gz, the code requests <repo>/rlang_1.0.4.tar.gz.

2) Concerning the way of querying whether a repository is a Nexus repository:

Since Nexus is a proxy (and cache) system, I'm afraid there is no special file that could help. The HTTP response headers could be a solution (at least from what I can observe using my company's installation of Nexus). When I request the repo using Postman, the response includes a Server header with the value Nexus/3.25.1-04 (OSS). In this configuration, checking whether the Server header contains nexus (case-insensitively) would give you the information.

Please note:

Since we cannot be sure Nexus will always send this header (companies sometimes change their names or their product names, as you know better than I do ;) ), maybe you could back up this header check with your renv.nexus.enabled option: if the Server header contains nexus, or if the option renv.nexus.enabled is TRUE, then...
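The combined check described above (header match or explicit opt-in) could look roughly like this. The helper is_nexus() is hypothetical and works on header lines that have already been fetched; it is a sketch, not renv's implementation.

```r
# Hypothetical helper: decide whether to use the Nexus fallback, given the
# raw response header lines of a request to the repository URL.
is_nexus <- function(headers) {
  # An explicit opt-in always wins.
  if (isTRUE(getOption("renv.nexus.enabled")))
    return(TRUE)

  # Keep only "Server: ..." lines; header names are case-insensitive.
  server <- grep("^server:", headers, ignore.case = TRUE, value = TRUE)
  any(grepl("nexus", server, ignore.case = TRUE))
}

is_nexus(c("HTTP/1.1 200 OK", "Server: Nexus/3.25.1-04 (OSS)"))
```

Restricting the match to the Server line avoids false positives from other headers that happen to contain the word.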

Other question

I have a last remark / question concerning some of the code I have just seen in the latest version of your retrieve.R file. In the CRAN version renv@0.15.5, retrieving from source was always added to the methods list - unless pkgType option was not source. With such an algorithm, when the pkgType option was set to binary, retrieving from source was tried if no binary was found. I was quite comfortable with that implementation. In this new version, I understand that retrieving from source is no longer added to the methods list if the pkgType option is set to binary: srcok <- pkgtype %in% c("both", "source"). I'm not sure I understand the reason for this change. From what I understand, the pkgType option is supposed to set the preferred installation method. Have you considered using the option install.packages.check.source? Maybe renv should add "retrieve from source" to the methods list unless install.packages.check.source is explicitly set to no.

Should you have any questions, please contact me. Kind regards, Arnaud

kevinushey commented 2 years ago

Thanks! I've made the changes required (I think) to support the Nexus URLs properly. It might take a bit more iteration to refine but I think we're getting there.

Re: your question on srcok <- pkgtype %in% c("both", "source"); in R, the pkgType option defaults to "both":

> getOption("pkgType")
[1] "both"

and renv tries to respect that choice. In this situation, R (and renv) prefer installing binaries if available, but will fall back to source packages if not.

From what I can see in the R sources:

https://github.com/wch/r-source/blob/18d16095f36e28862d125d88659bda28d93d0269/src/library/utils/R/packages2.R#L547-L548

R uses the install.packages.check.source option to allow a fallback to the source repository even if a binary repository was explicitly requested.
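For illustration only, the rule being discussed (source is allowed when pkgType permits it, or when install.packages.check.source has not been set to "no") could be written as below. The helper source_allowed() is hypothetical, and this sketch reflects the suggestion made in this thread rather than what renv or R actually does.

```r
# Hypothetical helper: should "retrieve from source" be among the methods?
source_allowed <- function(pkgtype = getOption("pkgType"),
                           check_source = getOption("install.packages.check.source", "yes")) {
  # pkgType "both" or "source" permits source packages outright.
  if (pkgtype %in% c("both", "source"))
    return(TRUE)

  # Otherwise a binary type was requested; allow the source fallback
  # unless the user explicitly disabled it.
  !identical(check_source, "no")
}
```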

arnauddeblic commented 2 years ago

Many thanks @kevinushey for your support.

I tried your new implementation:

Diagnostics:

I spent some time debugging and found the problem: the Nexus server returns a 404 status code when curl is run with the HEAD parameter (see the renv-headers temp file):

HTTP/1.1 404 Not Found
Date: Fri, 16 Sep 2022 19:54:00 GMT
Server: Nexus/3.25.1-04 (OSS)
X-Content-Type-Options: nosniff
Content-Security-Policy: sandbox allow-forms allow-modals allow-popups allow-presentation allow-scripts allow-top-navigation
X-XSS-Protection: 1; mode=block
Pragma: no-cache
Cache-Control: no-cache, no-store, max-age=0, must-revalidate, post-check=0, pre-check=0
Expires: 0
X-Frame-Options: DENY
Content-Type: text/html
Content-Length: 2071
Set-Cookie: e1c2a849e31cf572844da4b9bd2d0f31=bd4f901c2a2c2c04b50478be164e36fa; path=/; HttpOnly

When the HEAD parameter is removed from the curl configuration file, the Nexus server returns a 200 status code. The full page is served; this is very little data when the repo is Nexus, since the page basically tells you:

This r group repository is not directly browseable at this URL.

Please use the [browse] or [HTML index] views to inspect the contents of this repository.

HTTP/1.1 200 OK
Date: Fri, 16 Sep 2022 20:10:58 GMT
Server: Nexus/3.25.1-04 (OSS)
X-Content-Type-Options: nosniff
Content-Security-Policy: sandbox allow-forms allow-modals allow-popups allow-presentation allow-scripts allow-top-navigation
X-XSS-Protection: 1; mode=block
Content-Type: text/html
Content-Length: 2403
Set-Cookie: e1c2a849e31cf572844da4b9bd2d0f31=bd4f901c2a2c2c04b50478be164e36fa; path=/; HttpOnly
Cache-control: private

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Repository - Nexus Repository Manager</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

  <!--[if lt IE 9]>
  <script>(new Image).src="https://**********************.fr/favicon.ico?3.25.1-04"</script>
  <![endif]-->
  <link rel="icon" type="image/png" href="https://**********************.fr/favicon-32x32.png?3.25.1-04" sizes="32x32">
  <link rel="mask-icon" href="https://**********************.fr/safari-pinned-tab.svg?3.25.1-04" color="#5bbad5">
  <link rel="icon" type="image/png" href="https://**********************.fr/favicon-16x16.png?3.25.1-04" sizes="16x16">
  <link rel="shortcut icon" href="https://**********************.fr/favicon.ico?3.25.1-04">
  <meta name="msapplication-TileImage" content="https://**********************.fr/mstile-144x144.png?3.25.1-04">
  <meta name="msapplication-TileColor" content="#00a300">

  <link rel="stylesheet" type="text/css" href="https://**********************.fr/static/css/nexus-content.css?3.25.1-04"/>
</head>
<body>
<div class="nexus-header">
  <a href="https://**********************.fr">
    <div class="product-logo">
      <img src="https://**********************.fr/static/images/nexus.png?3.25.1-04" alt="Product logo"/>
    </div>
    <div class="product-id">
      <div class="product-id__line-1">
        <span class="product-name">Nexus Repository Manager</span>
      </div>
      <div class="product-id__line-2">
        <span class="product-spec">OSS 3.25.1-04</span>
      </div>
    </div>
  </a>
</div>

<div class="nexus-body">
  <div class="content-header">
    <img src="https://**********************.fr/static/rapture/resources/icons/x32/database.png?3.25.1-04" alt="Repository image"/>
    <span class="title">Repository</span>
    <span class="description">r</span>
  </div>

  <div class="content-body">
    <div class="content-section">
      <p>
        This r group repository is not directly browseable at this URL.
      </p>

      <p>
        Please use the <a href="https://**********************.fr/#browse/browse:r">browse</a>
        or <a href="https://**********************.fr/service/rest/repository/browse/r/">HTML index</a>
        views to inspect the contents of this repository.
      </p>
    </div>
  </div>
</body>
</html>

I don't know whether there is another option than HEAD to limit the amount of data served when requesting a URL. If there is none, maybe we could store the result of the renv_nexus_enabled() function for a given repo, so that the check is not triggered for every package to be restored.
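The caching idea above can be sketched with a small per-session cache keyed by repository URL. The names nexus_cache and nexus_enabled_cached() are hypothetical, and the detection function is passed in as an argument so the sketch does not depend on renv internals.

```r
# Hypothetical sketch: remember the detection result per repository URL so
# the check runs at most once per repository in a session.
nexus_cache <- new.env(parent = emptyenv())

nexus_enabled_cached <- function(repo, detect) {
  # `detect` is any function(repo) returning TRUE or FALSE.
  if (is.null(nexus_cache[[repo]]))
    nexus_cache[[repo]] <- isTRUE(detect(repo))
  nexus_cache[[repo]]
}
```

With this shape, restoring many packages from the same repository triggers a single web request for detection, whatever download method is used.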

Additional question concerning download method:

In your implementation, you seem to prefer the curl method for downloads (see the function renv_repos_info_impl()). On the Windows servers of my intensive computing grid, I had to set RENV_DOWNLOAD_FILE_METHOD = wininet in Renviron.site, since I did not manage to renv::install packages with the default settings. Do you think this could be a problem for future renv users using wininet like me? As far as I am concerned, I will set the renv.nexus.enabled option to TRUE on these machines, so it will be OK ;)

Kind regards

Arnaud

kevinushey commented 2 years ago

Thanks! I think it should be okay to just perform a regular web request at that endpoint; it's unlikely the returned data would be large from any typical CRAN mirror. It would also allow us to use arbitrary downloaders as well (so no need to force the use of curl).

Loose ends should be tied up on the main branch now. Thanks for the feedback; fingers crossed that this gets us over the finish line!

arnauddeblic commented 2 years ago

I will give you feedback as soon as I test. Kind regards

arnauddeblic commented 2 years ago

Dear @kevinushey,

Thanks for your reply and for the new, improved implementation.

I tested renv@0.15.5-58 with 4 configurations:

on several Windows environments:

using 2 different methods:

Diagnostics

Results are OK everywhere, with all 4 configurations, apart from one strange behavior (see below):

Environment | Method                                                 | Status
----------- | ------------------------------------------------------ | ------------------------------------
Laptop      | RStudio, within an RStudio project                     | OK
Laptop      | R.exe run in command line, within a project directory  | OK
VDI         | RStudio, within an RStudio project                     | OK
VDI         | R.exe run in command line, within a project directory  | OK
Server      | R.exe run in command line, within a project directory  | OK - but strange behavior, see below

Strange behavior observed on Windows Server

When restoring a renv project on Windows Server (using R.exe run from command line):

To make sure, I tested with the CRAN version renv@0.15.5, and I confirm this strange behavior does not occur with the released version: no empty NULL directory is created when restoring a renv project using renv@0.15.5. I don't think it is related to the recent developments dealing with the Nexus issue. Probably another development made between renv@0.15.5 and renv@0.15.5-58?

Should you have any questions, please contact me. Kind regards, Arnaud

P.S. :

kevinushey commented 2 years ago

Great news -- thanks for taking the time to test.

Do you already have a rough idea concerning next CRAN release date, including this new Nexus feature ?

I'm hoping to prepare a new release in the coming weeks.

In the meantime, can I consider that "renv.nexus.enabled" is the definitive option name? (I'm currently preparing the Renviron.site and Rprofile.site configuration files for deployment in production in my company)

Yes, we can consider the option here stable.

I don't think it is related to the recent developments dealing with the Nexus issue. Probably another development made between renv@0.15.5 and renv@0.15.5-58?

Thanks for the heads up here -- I'll see if I can figure out where this is coming from.

kevinushey commented 2 years ago

Regarding the NULL directory, it might be helpful if you could also test with code of the following form:

trace(dir.create, quote({
  if (grepl("NULL", path)) { print(rlang::trace_back()) }
}))

(please also make sure rlang is also installed)

That might give a hint as to where that directory is coming from.

kevinushey commented 2 years ago

My only other guess is that this could be related to us setting R_LIBS_USER and R_LIBS_SITE here:

https://github.com/rstudio/renv/blob/630d5effa65f4dc9ce8b523f365f115a05886e87/R/r.R#L10-L13

Maybe something is auto-creating those directories?

arnauddeblic commented 2 years ago

Dear @kevinushey, Your last guess is the right one: I had indeed deployed such code on my server:

To be sure this auto-creation is responsible for the NULL directory, I've just added the same kind of parameter in my VDI environment:

This configuration was not yet set on the VDI when I performed the tests this morning. I planned to do it since otherwise, from what I understand, R does not take R_LIBS_USER into account when the corresponding directory does not exist. And I really need to specify a user library. This is even more necessary on my VDI configuration, since I deploy hundreds of VDIs and I need to store user data on a shared network location dedicated to each user (mapped to the U: drive), rather than in C:/Users/... on the VDI.

Do you know if there is another way to auto-create those directories? Otherwise, do you think you can adjust renv's behavior so that it does not create the NULL directory?

Kind regards Arnaud

kevinushey commented 2 years ago

The R documentation suggests that R_LIBS_USER and R_LIBS_SITE can be set to NULL if you'd like them to be ignored or set as empty; e.g.

https://github.com/wch/r-source/blob/18d16095f36e28862d125d88659bda28d93d0269/src/library/base/man/libPaths.Rd#L58-L66

And those NULL values get handled by R's built-in base Rprofile, e.g. for Unix:

https://github.com/wch/r-source/blob/18d16095f36e28862d125d88659bda28d93d0269/src/library/profile/Rprofile.unix#L5-L15

In this case, I believe renv is doing the right thing; I think you need to validate that R_LIBS_USER and R_LIBS_SITE are not equivalent to NULL before choosing to create them.

arnauddeblic commented 2 years ago

Dear @kevinushey, Thanks to your advice, I adjusted my Rprofile.site files as below. It's now OK : the undue directory creation no longer occurs. I think we can consider this issue #1074 as ready to be closed. Many thanks for your help - I really appreciated our collaboration ! Kind regards Arnaud


On server:

local({
  # The string "NULL" means "ignore this library": do not create a directory for it.
  R_LIBS_SITE <- Sys.getenv("R_LIBS_SITE", unset = "NULL")
  if (R_LIBS_SITE != "NULL" && !dir.exists(R_LIBS_SITE)) {
    dir.create(R_LIBS_SITE, recursive = TRUE)
  }
})

On VDI:

local({
  # The string "NULL" means "ignore this library": do not create a directory for it.
  R_LIBS_USER <- Sys.getenv("R_LIBS_USER", unset = "NULL")
  if (R_LIBS_USER != "NULL" && !dir.exists(R_LIBS_USER)) {
    dir.create(R_LIBS_USER, recursive = TRUE)
  }
})

kevinushey commented 2 years ago

Great, I'm glad to hear it! Thanks for taking the time to report back.