pharmaR / pharmapkgs

Work in progress - demo of integrations with a pharma repository
Other
5 stars 0 forks source link

Update generated `repo_filter` function to avoid requiring special upstream `fields` #4

Closed dgkf closed 4 months ago

dgkf commented 5 months ago

We use options("available_packages_filters" = repo_filter(...)) to set a special function to filter the repository package down to only a subset with acceptable risk.

This function operates on the matrix returned by available.packages to filter packages. By default, available.packages will only return a set of canonical fields from the PACKAGES file. However, as we've injected our own risk fields, we want to include those as well so that we can use them for filtering.

Currently, we use a small hack to change the default value of the available.packages() fields parameter. This means that any time available.packages() is used (for example, internally within pak), these fields are included and our risk filters can be applied.

However, it is not good R etiquette to modify base functions like this. Instead, risk_filter may generate a function like this (in pseudocode):

function(avail_pkgs_mat) {
  # 1) when passed to this function `avail_pkgs_mat` is already filtered by upstream filters

  # 2) get a set of available packages
  #      - Q: Should this just be the default filters?
  #      - Q: Is it worth trying to derive and provide filters up to this point?
  risk_ap <- available_packages(fields = risk_fields(), filters = <filters applied up to this point>)

  # 3) apply risk filter

  # 4) Maybe merge back in filtered set of packages, if step 2 only filtered the default set
}

It's worth doing some exploration, because I'm not totally sure which path would be easier here. We might need to do a little re-creating of the available.packages filter scheme, which is found in these lines.

llrs commented 5 months ago

I don't understand your proposal, why it is better to provide a function that generates the filter and the call to available.packages vs just wrapping available.packages and do it once?

Wouldn't just be better wrapping available.packages to provide the fields of interest for this repo?

Note that there is the special filter add = TRUE to add the new filters to existing ones (the default or others).

Simplifying the code from https://github.com/pharmaR/regulatory-r-repo-wg/issues/88:

available_packages <- function(pharmapkgs = <folder/URL>, custom_filters = <custom_filters>, ...) {
    available.packages(
        repos = c(getOption("repo"), pharmapkgs),
        fields = risk_fields(), 
        filters = list(add = TRUE, custom_filters),
        ...)

}
dgkf commented 5 months ago

The benefit of an available.packages solution isn't strictly to produce the listing of available packages. The primary benefit is the integration with other tools (install.packages, remotes::install_version, pak::pkg_install, renv::install). For all of these, we need to hook into how available.packages works to limit availability during install.

I don't understand your proposal, why it is better to provide a function that generates the filter and the call to available.packages vs just wrapping available.packages and do it once?

Yeah - admittedly it's a weird design. In order to both hook into installation tools and also provide a filter that makes use of risk scores we somehow would like to both use the available_packages_filters option, but also have access to our custom risk fields, even if they're not provided to the call to available.packages. This is where all the complexity of this proposal stems from.

I was hoping to do a little proof-of-concept today. Would you mind if I tagged you for review? I think seeing it in code might communicate the goal better than I can in an issue.

Note that there is the special filter add = TRUE to add the new filters to existing ones (the default or others).

Thanks, yes! I just saw this yesterday when reading through these docs. We should definitely make use of this.

llrs commented 5 months ago

The benefit of an available.packages solution isn't strictly to produce the listing of available packages. The primary benefit is the integration with other tools (install.packages, remotes::install_version, pak::pkg_install, renv::install). For all of these, we need to hook into how available.packages works to limit availability during install.

I thought the whole point of this package was to interact with the pharma repository, I wouldn't attempt to keep up with a third party tool (at least not initially for a proof of concept). A similar approach was taken by Bioconductor with BiocManager::available. It has its own package and function to check which packages are available and to install them. Even if it uses other packages, it is all wrapped by their repository package manager BiocManager (and it doesn't work well with renv, and renv doesn't work well with Bioconductor).

I don't understand your proposal, why it is better to provide a function that generates the filter and the call to available.packages vs just wrapping available.packages and do it once?

Yeah - admittedly it's a weird design. In order to both hook into installation tools and also provide a filter that makes use of risk scores we somehow would like to both use the available_packages_filters option, but also have access to our custom risk fields, even if they're not provided to the call to available.packages. This is where all the complexity of this proposal stems from.

I was hoping to do a little proof-of-concept today. Would you mind if I tagged you for review? I think seeing it in code might communicate the goal better than I can in an issue.

Yes, please tag me.

dgkf commented 5 months ago

I thought the whole point of this package was to interact with the pharma repository ... A similar approach was taken by Bioconductor with BiocManager::available. It has its own package and function to check which packages are available and to install them.

Ah, I see where you're coming from now. Yeah, I think I had something different in mind - which isn't to say I was right, just that I was thinking about the problem differently.

I hadn't considered a workflow where the intention is to work solely with this repository, as one might with BioConductor - and maybe this is a blindspot of mine. One requirement we definitely have is the ability to intermix a public regulatory package repository with privately hosted enterprise package repositories (proprietary in-house packages). With that in mind, having BioConductor-style tooling might pose a barrier.

That said, it would definitely give us more control over how package installations may happen.

I'll share a POC of what this might look like shortly.

llrs commented 5 months ago

I still haven't reviewed the PR, but I was reading renv documentation and I think I found something relevant for this issue and wanted to document that for renv compatibility there is the dependencies argument of renv::install:

A vector of DESCRIPTION field names that should be used for package dependency resolution. When NULL (the default), the value of renv::settings$package.dependency.fields is used. The aliases "strong", "most", and "all" are also supported