sonatype-nexus-community / nexus-repository-r

R, v data science, much functional programming, doge
Eclipse Public License 1.0

Multiple versions of package - archive #21

Open kenahoo opened 6 years ago

kenahoo commented 6 years ago

When I upload a new version of an R package to Nexus, it properly puts it in the (dynamically generated) PACKAGES.gz index. However, it's not clear what happens to the previously uploaded versions. They still show as assets, of course, and one can download them directly from their URL. But I'm not sure how an R client can "discover" those old versions.

In a "regular" CRAN repository, there's an archive.rds file in the src/contrib/Meta/ folder that clients (e.g. the install_version function in the remotes package from the RStudio folks) can use to discover what previous versions are available. It looks like Nexus doesn't supply a route for that, though.

Has this been considered before? Is there a different mechanism a client can use to discover all versions of packages? Either way, is the archive.rds file standard enough that Nexus should generate it? I might be able to help make that happen if development tuits are short.

Thanks.

kenahoo commented 6 years ago

I also opened this topic at https://community.rstudio.com/t/discovering-archived-packages/3449 , and got a helpful reply that a Path entry in the PACKAGES.gz index might be sufficient to get multiple version discovery working correctly.

DarthHater commented 6 years ago

Thanks for filing this! Been a bit busy with the holidays, etc... I'll pop back in after the NY and give it a better look.

You are correct there is no route for archive.rds as of yet. When @fjmilens3 and I initially wrote this, our understanding of R was fairly basic.

On that note, send us a PR :)

DarthHater commented 6 years ago

Also on the RDS note (saw in your post), yeah, I had thought that would be a PAIN in Java too. There are a few options though:

You might be able to use one of the three of those to either call R at some level, or to embed it and use it in Nexus. This has been kinda problematic in the past (we tend to write just plain ole Java code that emulates things as often as we can), but I figured I'd put these out here for you to gander at, and for me to look at later as well.

cderv commented 6 years ago

After looking closely at how package installation works in R, it seems there is no need for archive.rds.

I wanted to share my findings about this topic:

How does R work?

Base R assumes you want to install a package from CRAN. It therefore implements all the rules for that specific repository, but leaves some customization possible for other repositories.

To install a package with install.packages, everything relies on available.packages, which creates the database that install.packages searches. The database is used to build the download URL from the package name, the package version (the latest one), and the package type (source or binary). In fact, some filters are applied so that only those packages are kept (see ?available.packages).

available.packages creates the database by parsing the PACKAGES files generated by write_PACKAGES. write_PACKAGES parses the DESCRIPTION of each package and generates three files: PACKAGES.rds, PACKAGES.gz, and PACKAGES. Only one of them is needed for available.packages to work. Two fields could affect the behavior of install.packages: File and Path (both come up again in the hosted repository discussion further down).

Regarding old versions, install.packages does not support installing an old package version. You need to download the tar.gz file of the old version manually and install from that local file using install.packages("pkg_file.tar.gz", repos = NULL). This means you don't need to provide an archive.rds to install an old package. You need nothing really, but it helps to have a database in which to look up the URL.

Put simply, you can provide the package name and version directly, build the URL, and try to download it. In fact, devtools::install_version and remotes::install_version just parse archive.rds to check, before downloading, that the package exists; the URL is built by default as <repo>/src/contrib/Archive/<package.path>. Packrat, on the other hand, just builds the URL, tries to download, and throws an error if unsuccessful.
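As a rough illustration of the archive.rds lookup described above, here is a minimal sketch in R. It assumes archive.rds deserializes to a named list with one data frame per package, whose row names are paths such as mypkg/mypkg_0.1.0.tar.gz. That shape, the archive_url helper, and the nexus.example.org URL are assumptions made for illustration; none of them is taken from remotes, devtools, or Nexus.

```r
# Toy stand-in for the list stored in src/contrib/Meta/archive.rds
# (assumed shape: one data frame per package, row names = archive paths).
archive <- list(
  mypkg = data.frame(
    size = c(1200, 1350),
    row.names = c("mypkg/mypkg_0.1.0.tar.gz", "mypkg/mypkg_0.2.0.tar.gz")
  )
)

# Return the Archive URL for one version of a package, or stop if the
# version is not listed. `archive_url` is a hypothetical helper name.
archive_url <- function(archive, repo, pkg, version) {
  info <- archive[[pkg]]
  if (is.null(info)) stop("no archive entry for ", pkg, call. = FALSE)
  wanted <- sprintf("%s/%s_%s.tar.gz", pkg, pkg, version)
  if (!wanted %in% rownames(info)) {
    stop(pkg, " ", version, " is not listed in the archive", call. = FALSE)
  }
  paste(repo, "src/contrib/Archive", wanted, sep = "/")
}

# Hypothetical repository URL, used only to show the resulting path.
url <- archive_url(
  archive, "https://nexus.example.org/repository/r-hosted", "mypkg", "0.1.0"
)
print(url)
```

Against a real CRAN-style repository, the list would presumably be loaded from <repo>/src/contrib/Meta/archive.rds with readRDS over a gzcon(url(...)) connection rather than built by hand.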

If you know how packages are organized in the repo, and the filename convention, it is easy to provide a wrapper (see below).

In every case, the challenge is the dependency chain. Basically, when installing a specific version, it is better to install all the dependencies manually, because I think they are not resolved correctly otherwise. This is what packrat does, using a packrat.lock file. install_version gets the latest version of each dependency, which is not always wanted.

How does Nexus currently work, and what is the impact?

Currently, Nexus advises storing each version in the same repository, at the root of src/contrib. That is fine. Note that one can publish a package in a subdirectory of /src/contrib without getting any error message. However, when that is done, the package does not seem to be listed in the PACKAGES.gz file, so it can't be installed. Also, I am not sure how it is handled when you try to push the same file to another path; things do not go so well. (But that is another issue.) Let's say everything is at the root of /src/contrib.

With this organization, you can install an old package using

install_packages_old <- function(pkgs, version = NULL, repos, ...) {
  # Build the source tarball name, e.g. "pkg_1.0.0.tar.gz"
  pkg_name <- paste0(pkgs, "_", version, ".tar.gz")
  # Build the URL, knowing the file should be at the root of /src/contrib
  url <- paste(repos, "src/contrib", pkg_name, sep = "/")
  path <- file.path(tempdir(), pkg_name)
  # Remove the downloaded file when the function exits, whatever happens
  on.exit(unlink(path), add = TRUE)
  # Try to download; download.file() returns 0 on success
  status <- tryCatch(
    suppressWarnings(download.file(url, path, mode = "wb")),
    # Map any download error to a non-zero status
    error = function(e) 1L
  )
  # A non-zero status means this specific version is not available
  if (status != 0L) stop(pkgs, " not available in version ", version, call. = FALSE)
  # Install from the local tar.gz, so repos = NULL (no dependency resolution)
  install.packages(path, repos = NULL, ...)
}

If you try this function, it will work as expected for installing an old package, without any need for PACKAGES files or archive.rds. (This function is inspired by packrat's behavior.)

If we don't want to rely on tryCatch, we need a way for R to know whether a package is in the repo or not. So what is currently missing is a listing of all package versions in the PACKAGES.gz file. That way, install.packages would have all the information and would still get the latest available version, because the "duplicates" filter is set by default. With all the info in PACKAGES.gz, it is then easy to create a custom function to get a specific version, just by filtering the info from PACKAGES.gz correctly.
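The filtering idea above could look roughly like the following sketch. It assumes the index lists every version as a separate DCF record; the toy index text and the find_version helper are hypothetical and are not generated by Nexus or provided by any package.

```r
# A PACKAGES-style index listing every version (toy data for illustration).
index_text <- c(
  "Package: mypkg", "Version: 0.9.0", "",
  "Package: mypkg", "Version: 0.10.0", "",
  "Package: other", "Version: 1.0.0"
)
con <- textConnection(index_text)
index <- read.dcf(con)
close(con)

# Pick the index record for `pkg`: the latest version when `version` is NULL,
# otherwise the exact version requested. `find_version` is a hypothetical name.
find_version <- function(index, pkg, version = NULL) {
  rows <- index[index[, "Package"] == pkg, , drop = FALSE]
  if (nrow(rows) == 0L) stop("package not in index: ", pkg, call. = FALSE)
  # package_version() compares component-wise, so 0.10.0 > 0.9.0
  versions <- package_version(rows[, "Version"])
  if (is.null(version)) {
    keep <- which(versions == max(versions))[1L]
  } else {
    keep <- which(versions == package_version(version))[1L]
    if (is.na(keep)) stop(pkg, " ", version, " not in index", call. = FALSE)
  }
  rows[keep, ]
}

latest <- find_version(index, "mypkg")
pinned <- find_version(index, "mypkg", "0.9.0")
```

Comparing with package_version rather than as strings matters here, since a plain string sort would rank 0.9.0 above 0.10.0.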

In addition, for hosted repositories, the File field could also be added to handle someone publishing a file whose name is not of the form <pkg_names>_<pkg_vers>.<ext>. It would then work no matter the name; without the field, it does not work. The Path field would be required if the Nexus R plugin is to handle subdirectories in /src/contrib.
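To make the role of the two fields concrete, here is a small sketch of how a client might turn one index record into a download URL. It reflects my understanding of the convention (Path is a subdirectory under the contrib URL, File overrides the default file name) and is not the actual code in utils; record_url and the example URL are hypothetical.

```r
# Build the download URL for one PACKAGES record (a named character vector).
# Optional fields may be absent or NA, as read.dcf() would return them.
# `record_url` is a hypothetical helper, not part of utils or tools.
record_url <- function(record, contrib_url, ext = ".tar.gz") {
  field <- function(name) {
    value <- if (name %in% names(record)) record[[name]] else NA_character_
    if (is.na(value)) NULL else value
  }
  file <- field("File")
  # Fall back to the conventional <package>_<version><ext> file name
  if (is.null(file)) {
    file <- paste0(record[["Package"]], "_", record[["Version"]], ext)
  }
  # A NULL Path is dropped by c(), so the file lands at the contrib root
  paste(c(contrib_url, field("Path"), file), collapse = "/")
}

contrib <- "https://nexus.example.org/repository/r-hosted/src/contrib"
plain  <- record_url(c(Package = "mypkg", Version = "0.1.0"), contrib)
nested <- record_url(c(Package = "mypkg", Version = "0.1.0", Path = "old"), contrib)
custom <- record_url(c(Package = "mypkg", Version = "0.1.0", File = "mypkg-legacy.tgz"), contrib)
```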

About devtools or remotes

These two 📦 are often used to install a specific version with install_version. Currently, this function uses the Meta/archive.rds file, but it would be pretty easy to add support for PACKAGES.gz.

Also, a nexus 📦 could be worth developing for use with this plugin. It could offer a version of install.packages that works correctly. I am willing to do that if needed. In fact, with this kind of solution, we could leverage the Nexus API to get the database of what is available, and use that information to get the URL of what to install.

In any case, dependency resolution is not done automatically. But this is another issue, as one needs to know which packages were available when another was published.

What can be done?

Unfortunately, I do not know Java, so it is hard for me to make a PR. However, I will try to see where things need to change for any of these scenarios.

Basically, the plugin could reproduce write_PACKAGES(".", latestOnly = FALSE, addFiles = TRUE, subdirs = TRUE). It parses the DESCRIPTION files to get all the information and writes it in the DCF format. I think this could be done without needing R, so it could stay Java only. It could also stay as it is, with the specifics handled on the R side by a custom function.
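For reference, the DESCRIPTION-to-index step can be sketched in a few lines of R with read.dcf and write.dcf. This is only an illustration of the transformation the plugin would need to reproduce in Java; the toy DESCRIPTION, the chosen field subset, and the index_record helper are assumptions, and the real write_PACKAGES copies more fields than shown here.

```r
# Toy DESCRIPTION contents; in practice these come from inside each tarball.
description <- c(
  "Package: mypkg",
  "Version: 0.2.0",
  "Title: Example Package",
  "Depends: R (>= 3.4.0)",
  "Imports: utils",
  "License: EPL"
)

# Fields to copy into the index (an assumed subset, chosen for illustration).
index_fields <- c("Package", "Version", "Depends", "Imports", "Suggests", "License")

# Turn one DESCRIPTION into a one-row index record, dropping fields the
# package does not define and optionally adding a Path for subdirectories.
# `index_record` is a hypothetical helper name.
index_record <- function(description_lines, path = NULL) {
  con <- textConnection(description_lines)
  on.exit(close(con))
  desc <- read.dcf(con, fields = index_fields)
  record <- desc[1L, !is.na(desc[1L, ]), drop = FALSE]
  if (!is.null(path)) record <- cbind(record, Path = path)
  record
}

record <- index_record(description, path = "sub")
# Print the record in DCF form, as it would appear in a PACKAGES file
write.dcf(record)
```

An index listing all versions would simply be one such record per uploaded tarball, separated by blank lines.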

I hope this investigation helps with adding features and improving the plugin.

kenahoo commented 5 years ago

I revisited this thread the other day, and decided I'd try making an RDS serializer capable of writing the data formats present in the archive.rds file. I got a prototype working in Perl. I'll try to clean it up & make it available soon, shouldn't be too crazy to convert to Java.

kenahoo commented 5 years ago

I pushed my code to a new project on GitHub: https://github.com/focusenergy/JavaRDS/tree/master .

I don't know what the best way would be to package it for inclusion into this repository plugin - to me it would make sense to bundle it up into a Maven artifact and add that as a dependency, but I don't have much experience doing that.

kenahoo commented 5 years ago

Hi @DarthHater & @fjmilens3 , any thoughts on incorporating this?

DarthHater commented 5 years ago

@kenahoo ideally you'd put the library up on Maven Central so we can import it that way. Looks cool too! I'll have some time to look at this next week, or @bhamail can too!

kenahoo commented 5 years ago

Hi @DarthHater , I'm following up on this - after publishing to Maven (see focusenergy/JavaRDS#1), has anyone had a chance to see how incorporating it might work?

DarthHater commented 5 years ago

I have not yet, but I'll set some time aside to do that next week!

kenahoo commented 5 years ago

Hi @DarthHater , did you find time for this? I don't think I mentioned, there are some unit tests that correspond closely with the data structures that would need to be created for an archive.rds file.

aornatovskyy commented 4 years ago

Hi, @kenahoo would you please clarify. Are you talking about R hosted or proxy repository?

kenahoo commented 4 years ago

@aornatovskyy This is for R hosted.

aornatovskyy commented 4 years ago

> I'm following up on this - after publishing to Maven (see focusenergy/JavaRDS#1), has anyone had a chance to see how incorporating it might work?

@kenahoo I will try to check your RDS implementation this year. Thanks for your implementation, BTW! Please ping me if I have not answered by the end of 2019. =)

mlukaretkyi commented 4 years ago

Hi, we are moving the R source code to nexus public. This GitHub page will be archived. Your issue: https://issues.sonatype.org/browse/NEXUS-25130