r-lib / pak

A fresh approach to package installation
https://pak.r-lib.org

`available_packages_filters` disregarded during install #645

Open dgkf-roche opened 2 weeks ago

dgkf-roche commented 2 weeks ago

Not sure if this is necessarily a bug - maybe it's an intentional omission.

We were hoping to leverage this as a universal mechanism for applying selection criteria to a repository of packages based on quality measures over in pharmaR/pharmapkgs.

Using a simple example, I tried to make a function that ideally would only permit (at least without some intentional side-stepping of common install tools) installation of packages that start with "c".

options(
  available_packages_filters = list(
    add = TRUE,
    # drop = FALSE keeps the matrix shape even when only one row matches
    starts_with_c = function(ap) ap[startsWith(ap[, "Package"], "c"), , drop = FALSE]
  )
)

head(available.packages(ignore_repo_cache = TRUE), 3)
#      Package Version ...
# c060 "c060"  "0.3-0" ...
# c212 "c212"  "0.98"  ...
# c2c  "c2c"   "0.1.0" ...

pak::cache_clean()
remove.packages("pkgconfig")
pak::pkg_install("pkgconfig")
# succeeds

I would expect this to fail, since the filter should make every package that doesn't start with "c", including pkgconfig, unavailable.

Substituting function(ap) browser() as the filter also never triggers a debug session, so my impression is that available.packages is either used internally with some default filters, or an alternative mechanism is used that doesn't implement this behavior.
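
The browser() probe can also be run non-interactively. The following is a minimal sketch, not pak or pharmapkgs code: it builds a throwaway local repository (the directory and the two PACKAGES entries are made up for illustration), registers a filter that counts its own invocations, and checks that base R's available.packages() calls it. Wrapping a pak::pkg_install() call with the same counter is one way to confirm whether the filter is ever reached.

```r
# Throwaway local repository with two made-up entries (no network access needed).
contrib <- file.path(tempfile("probe-repo"), "src", "contrib")
dir.create(contrib, recursive = TRUE)
writeLines(
  c("Package: cli", "Version: 1.0.0", "", "Package: pkgconfig", "Version: 2.0.3"),
  file.path(contrib, "PACKAGES")
)
contrib_url <- paste0(
  "file://",
  if (.Platform$OS.type == "windows") "/",
  normalizePath(contrib, winslash = "/")
)

# Filter that records each invocation instead of opening a debugger.
state <- new.env()
state$calls <- 0L
probe <- function(ap) {
  state$calls <- state$calls + 1L
  ap[startsWith(ap[, "Package"], "c"), , drop = FALSE]
}

old <- options(available_packages_filters = list(add = TRUE, probe))
ap <- available.packages(contriburl = contrib_url, ignore_repo_cache = TRUE)
options(old)  # restore the previous setting

state$calls          # number of times the filter ran
rownames(ap)         # packages that survived the filter
```

A counter in an environment is used here rather than browser() so the check also works in scripts and CI jobs.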

I'm curious to hear your thoughts. It would be a tremendously valuable feature for us.

dgkf-roche commented 1 week ago

Hey @gaborcsardi, is this something that you'd be interested in supporting? On our end, the filtering feature of available.packages, and its ubiquity across most mechanisms of interfacing with repositories, is a core feature of our repo tools.

gaborcsardi commented 1 week ago

Yes, I would like to have a way to prioritize repositories, but it would be another way, as we don't use available.packages().

gaborcsardi commented 1 week ago

What kind of filters do you use in available_packages_filters?

dgkf-roche commented 6 days ago

This is related to our work in our (currently private) fork of r-lib/rhub re-purposed for regulated industries. As packages are updated, we calculate a number of quantifiable indicators of the package's quality. We embed these indicators inside the PACKAGES file with the hope of allowing the end-user to specify some quality selection criteria. We've piloted using the available_packages_filters option as a universal mechanism of applying a policy.
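
To make the idea of embedded indicators concrete, here is a small sketch of what such a PACKAGES stanza could look like and how base R reads it. The package name and all values are invented for illustration; only the Quality* field names come from the example in this thread.

```r
# A made-up PACKAGES stanza carrying extra quality fields.
packages_file <- tempfile("PACKAGES")
writeLines(c(
  "Package: examplepkg",
  "Version: 1.2.0",
  "QualityLineCoverage: 0.83",
  "QualityExportCoverage: 0.95",
  "QualityExportDocumentationCoverage: 1.00"
), packages_file)

# read.dcf() returns a character matrix with one column per field.
meta <- read.dcf(packages_file)
colnames(meta)

# Values arrive as strings, so convert before comparing against thresholds.
line_cov <- as.numeric(meta[, "QualityLineCoverage"])
line_cov >= 0.5
```

As far as I know, available.packages() only returns non-default columns that are named in its fields argument, so a filter that relies on these fields needs them requested explicitly.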

There's a brief demo in the README of this package

We use some helper functions in the demo to simplify the syntax, but it amounts to doing something like:

options(available_packages_filters = list(add = TRUE, function(ap) {
  # Keep only the rows (packages) that meet every quality threshold
  dplyr::as_tibble(ap) |>
    dplyr::filter(
      QualityLineCoverage >= 0.5,
      QualityExportCoverage >= 0.9,
      QualityExportDocumentationCoverage >= 0.9
    )
}))
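
For comparison, a base R sketch of the same thresholds that returns a character matrix of the same shape it received, which is what downstream base tooling generally expects. It assumes the three Quality* columns are present in the matrix; quality_filter and the sample rows are hypothetical, not part of pharmapkgs.

```r
# Hypothetical filter: keep rows whose quality fields meet all thresholds.
quality_filter <- function(ap) {
  meets <- function(field, min) {
    # Columns are character; non-numeric or missing values fail the check.
    x <- suppressWarnings(as.numeric(ap[, field]))
    !is.na(x) & x >= min
  }
  keep <- meets("QualityLineCoverage", 0.5) &
    meets("QualityExportCoverage", 0.9) &
    meets("QualityExportDocumentationCoverage", 0.9)
  ap[keep, , drop = FALSE]
}

# Made-up sample in the shape available.packages() returns (character matrix).
ap <- rbind(
  c("pkgA", "0.80", "0.95", "0.92"),
  c("pkgB", "0.40", "0.99", "0.99"),  # fails line coverage
  c("pkgC", NA,     "0.95", "0.95")   # missing line coverage
)
colnames(ap) <- c(
  "Package", "QualityLineCoverage",
  "QualityExportCoverage", "QualityExportDocumentationCoverage"
)
rownames(ap) <- ap[, "Package"]

filtered <- quality_filter(ap)
rownames(filtered)
```

Treating missing indicators as failures is a policy choice made for this sketch; an administrator could just as well let unscored packages through.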

Here the logic is just a series of conditions, but we'd like to keep it arbitrary - it could be a decision tree or some aggregation of different qualities.

The ability to provide a function that can arbitrarily filter the available packages pulled from repos in options(repos) is pretty core to our design and our hope is that this can be applied by an administrator, ensuring that all well-intentioned user-facing mechanisms of installing packages apply the filtering criteria.

Speaking only for my company, we also use this behavior to force R to prioritize repositories by their order in options(repos). I've informally chatted with folks from other companies that mentioned they had to enforce this policy as well, so I think it's a rather frequent pitfall that needs to be addressed when locking down systems.
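
The repository-priority policy can be sketched as a filter too. This is an illustration under assumptions, not our production code: prefer_earlier_repos and repo_rank are hypothetical names, the URLs are placeholders, and the sketch assumes the Repository column of the matrix begins with the corresponding URL from options(repos).

```r
# Rank each row by the position of its repository in `repos`;
# rows from unknown repositories sort last.
repo_rank <- function(repository, repos) {
  repos <- unname(repos)
  vapply(repository, function(r) {
    hit <- which(startsWith(r, repos))
    if (length(hit)) hit[1L] else length(repos) + 1L
  }, integer(1), USE.NAMES = FALSE)
}

# For each package, keep only the entry from the earliest-listed repository.
prefer_earlier_repos <- function(ap, repos = getOption("repos")) {
  ap <- ap[order(repo_rank(ap[, "Repository"], repos)), , drop = FALSE]
  ap[!duplicated(ap[, "Package"]), , drop = FALSE]
}

# Placeholder repositories and a made-up package index.
repos <- c(internal = "https://internal.example/repo", CRAN = "https://cran.example")
ap <- rbind(
  c("pkgconfig", "2.0.3", "https://cran.example/src/contrib"),
  c("pkgconfig", "2.0.0", "https://internal.example/repo/src/contrib"),
  c("cli",       "3.6.0", "https://cran.example/src/contrib")
)
colnames(ap) <- c("Package", "Version", "Repository")

chosen <- prefer_earlier_repos(ap, repos)
chosen[, c("Package", "Version")]

# Possible registration (as I understand the documentation, add = TRUE would run
# the default "duplicates" filter first, so the defaults are listed explicitly):
# options(available_packages_filters =
#   list("R_version", "OS_type", "subarch", prefer_earlier_repos))
```

Note that the internal 2.0.0 entry wins over the newer CRAN 2.0.3, which is the point of the policy: repository order takes precedence over version.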

gaborcsardi commented 6 days ago

So you basically want to be able to specify arbitrary conditions on arbitrary fields from your package metadata. This is certainly possible, but needs quite a lot of changes, as currently we don't even read in all metadata from PACKAGES* files.

dgkf-roche commented 5 days ago

> So you basically want to be able to specify arbitrary conditions on arbitrary fields from your package metadata

Yes, exactly. Glad to hear you're open to supporting it - please let us know if there's anything we can take on to help support it.

From what I saw in the PACKAGES parsing, it looked like it supported up to ~1000 fields which should be plenty for our needs. Are there constraints on the field names? We haven't set any standard yet, so we can definitely consider a convention that makes your life easier.