:pie: :cloud: Repo Filters

Discussed in https://github.com/pharmaR/regulatory-r-repo-wg/discussions/3

Migrated following decision in #20

^{Originally posted by **dgkf** October 28, 2022}

This is an idea that I pitched on behalf of Roche to the [R Consortium's Repositories Working Group](https://github.com/RConsortium/r-repositories-wg/blob/main/Documents/RValHub-Wishlist.md#repo-endpoint-behaviors)

### Overview

This idea stems from the realization that any repository for regulatory purposes is going to need some gating criteria that restrict which packages can be used. After seeing how difficult it is to settle on a single algorithm for [`riskmetric`](https://github.com/pharmaR/riskmetric), I'm under the impression that any gating factor will inevitably need to be customized for specific use cases.

### Details

Let's assume that whatever solution we settle on has some sort of metadata attributed to each package.

| package | version | has_tests | perc_export_examples |
|---|---|---|---|
| `pkgA` | `0.1.2` | `TRUE` | 76% |
| `pkgB` | `1.2.3` | `FALSE` | 83% |
| `pkgC` | `2.3.4` | `TRUE` | 98% |
| `pkgD` | `3.4.5` | `FALSE` | 23% |

Then it would make this repository more flexible if packages weren't outright excluded from submission when they fail some gating criterion (let's say, `perc_export_examples >= 80%`), but rather the criterion was applied only when R requests an index of available packages. That is to say, all the packages are available and a user might select a repository such as:

```r
options("repos" = "repo.org/latest?filters=has_tests,perc_export_examples>=80")
```

Provided this URL, `available.packages()` would return just the subset of packages that satisfy these criteria.

### Challenges

As I understand it, all repos are assumed to be static and a `src/contrib` URL suffix is appended. In that case, I don't think you can tack a URL suffix onto the end of a URL with query parameters, so this might cause issues. If so, these filters might need to be communicated through non-query-parameter parts of the URL, or a short URL might need to be generated from the filter preferences.
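As a rough illustration of the index-time filtering described under Details, here is a minimal sketch in R. It assumes the metadata table from the post and a simple filter syntax (comma-separated terms, each either a bare logical column or `column>=number`). The helper `apply_repo_filters()` is invented for this example and is not part of any existing package or server.

```r
# Package metadata from the table above (percentages stored as numbers).
pkg_metadata <- data.frame(
  package = c("pkgA", "pkgB", "pkgC", "pkgD"),
  version = c("0.1.2", "1.2.3", "2.3.4", "3.4.5"),
  has_tests = c(TRUE, FALSE, TRUE, FALSE),
  perc_export_examples = c(76, 83, 98, 23),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply a filter string such as
# "has_tests,perc_export_examples>=80" to the metadata table.
apply_repo_filters <- function(metadata, filters) {
  keep <- rep(TRUE, nrow(metadata))
  for (term in strsplit(filters, ",", fixed = TRUE)[[1]]) {
    term <- trimws(term)
    if (grepl(">=", term, fixed = TRUE)) {
      # Threshold term: "<column>>=<number>"
      parts <- strsplit(term, ">=", fixed = TRUE)[[1]]
      column <- trimws(parts[1])
      threshold <- as.numeric(parts[2])
      stopifnot(column %in% names(metadata), !is.na(threshold))
      keep <- keep & metadata[[column]] >= threshold
    } else {
      # Bare term: a logical column that must be TRUE
      stopifnot(term %in% names(metadata))
      keep <- keep & as.logical(metadata[[term]])
    }
  }
  metadata[keep, , drop = FALSE]
}

filtered <- apply_repo_filters(pkg_metadata, "has_tests,perc_export_examples>=80")
print(filtered$package)
```

A server could run the same logic whether the filter string arrives as a query parameter or as a path segment, which is one possible way around the `src/contrib` suffix issue raised under Challenges.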

^{Originally posted by @kkmann in https://github.com/pharmaR/regulatory-r-repo-wg/discussions/3#discussioncomment-4275812}

Agreed that it is probably not realistic to have "the" regulatory R repo ;) However, the challenges run deeper. To me, a repository only makes sense if there is some form of integration testing. This means that package interdependencies need to be respected. The expectation would be that I can co-install packages from the repository and that all their dependencies are available as well. This might require a slower speed of change, since breaking the "validated" state of a downstream dependency might block updating the packages that depend on it. Did you look into how Posit Package Manager handles updates of packages to snapshotted repos? Maybe a system like that with biannual releases could work?

^{Originally posted by @dgkf in https://github.com/pharmaR/regulatory-r-repo-wg/discussions/3#discussioncomment-4276993}

> This means that package interdependencies need to be respected.

Totally agree - that's a detail I didn't hit on, but I think it could be accommodated within this concept. For example, if `available.packages()` were filtered based on some quality condition, the server could return a set of available packages where the entire dependency stack of every listed package meets that condition. If a top-level package were to meet the criteria but rely on a hard dependency that fails, that could cause it to be struck from the listing.

> This might require a slower speed of change since breaking the "validated" state of a downstream dependency might block updating the packages that depend on it.

There's already a process around this, even on CRAN, which encourages package authors to run reverse dependency checks before publishing package updates and asks that reverse dependency maintainers are contacted about any breaking changes before they are released. This effectively checks any affected packages with each package update.

From [CRAN submission guidance](https://cran.r-project.org/web/packages/policies.html):

> _If an update will change the package’s API and hence affect packages depending on it, it is expected that you will contact the maintainers of affected packages and suggest changes, and give them time (at least 2 weeks, ideally more) to prepare updates before submitting your updated package. Do mention in the submission email which packages are affected and that their maintainers have been informed._
>
> _In order to derive the reverse dependencies of a package including the addresses of maintainers who have to be notified upon changes, the function [reverse_dependencies_with_maintainers](https://developer.r-project.org/CRAN/Scripts/depends.R) is available from the developer website._

> Did you look into how Posit Package Manager handles updates of packages to snapshotted repos?

Totally agree. Snapshotting is a really nice feature for reproducibility.
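The dependency-aware listing described in this reply could look roughly like the sketch below. The `passes` vector, the `hard_deps` list, and the function name `filter_with_dependencies()` are all made up for illustration; a real server would derive the dependency information from each package's `Depends`, `Imports`, and `LinkingTo` fields.

```r
# Hypothetical inputs: which packages pass the quality condition on their own,
# and the hard dependencies of each package.
passes <- c(pkgA = TRUE, pkgB = FALSE, pkgC = TRUE, pkgD = TRUE)
hard_deps <- list(
  pkgA = character(0),
  pkgB = character(0),
  pkgC = c("pkgA", "pkgB"),  # relies on pkgB, which fails the condition
  pkgD = "pkgA"
)

# Keep only packages whose entire hard-dependency stack passes.
# The loop repeats until no further package is struck, so a failure
# propagates through indirect dependencies as well.
filter_with_dependencies <- function(passes, hard_deps) {
  available <- names(passes)[passes]
  repeat {
    ok <- vapply(
      available,
      function(pkg) all(hard_deps[[pkg]] %in% available),
      logical(1)
    )
    if (all(ok)) break
    available <- available[ok]
  }
  available
}

listing <- filter_with_dependencies(passes, hard_deps)
print(listing)
```

Here `pkgC` passes on its own merits but is struck because it depends on `pkgB`, which matches the behavior described above.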

^{Originally posted by @pedrobtz in https://github.com/pharmaR/regulatory-r-repo-wg/discussions/3#discussioncomment-4548079}

Why would one need a new CRAN-like package repository? What is the motivation for having a new repository?

It should be possible to do:

```r
#' Retrieve pre-computed package metrics/stats
#' for CRAN, BIOC packages
packages_metrics <- function() {
  # TO BE implemented
}
```

```r
# available.packages() returns a character matrix; convert it before merging
pkgs <- as.data.frame(available.packages(), stringsAsFactors = FALSE)
metrics <- packages_metrics()
pkgs <- merge(pkgs, metrics, by = "Package")

# filter
pkgs[pkgs$has_tests & pkgs$perc_export_examples >= 80, ]
```

^{Originally posted by @dgkf in https://github.com/pharmaR/regulatory-r-repo-wg/discussions/3#discussioncomment-4584384}

Cool idea, especially as a proof-of-concept. I think that having the metrics computed server-side is a nice feature for transparency and reproducibility, but it comes with an enormous technical overhead. I definitely agree that we should only resort to that if we feel the benefits outweigh the technical needs.

Building on this idea, you can also use `options(available_packages_filters = ...)` (see `?available.packages`) to embed filters into the `available.packages` query. Generally this is used for filtering on fields already in the packages matrix, but it does provide a mechanism for arbitrary filtering - even using additional metrics computed outside the provided fields.
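A minimal sketch of that mechanism follows. The `metrics` table and the `metric_filter()` function are hypothetical stand-ins for externally computed metrics; the only real API used is the `available_packages_filters` option, which accepts user-defined functions that take the available-packages matrix and return the rows to keep.

```r
# Hypothetical metrics, computed outside the repository index.
metrics <- data.frame(
  Package = c("pkgA", "pkgB", "pkgC", "pkgD"),
  has_tests = c(TRUE, FALSE, TRUE, FALSE),
  perc_export_examples = c(76, 83, 98, 23),
  stringsAsFactors = FALSE
)

# A user-defined filter receives the available-packages matrix and
# returns the subset of rows to keep.
metric_filter <- function(db) {
  passing <- metrics$Package[metrics$has_tests & metrics$perc_export_examples >= 80]
  db[db[, "Package"] %in% passing, , drop = FALSE]
}

# Register the custom filter in addition to the default filters.
options(available_packages_filters = list(add = TRUE, metric_filter))

# Demonstrate the filter on a hand-built matrix shaped like the
# output of available.packages() (no network access needed).
db <- cbind(
  Package = c("pkgA", "pkgB", "pkgC", "pkgD"),
  Version = c("0.1.2", "1.2.3", "2.3.4", "3.4.5")
)
rownames(db) <- db[, "Package"]
print(metric_filter(db)[, "Package"])
```

With the option set, later calls to `available.packages()` and `install.packages()` in the same session would only see packages that pass the filter, at the cost of each user having to obtain the metrics table themselves.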

^{Originally posted by @pedrobtz in https://github.com/pharmaR/regulatory-r-repo-wg/discussions/3#discussioncomment-4599153}

A use case, from my perspective, would be: given an R project, be able to assess the overall "quality" of the packages used in that project (its dependencies), and then decide to remove or replace some of those packages to increase project quality.

```r
# these numbers are just for illustration
> metric_deps_tree(path = ".", metric = "rank")
my_project ✨ ⬇ (rank = 0.85) # overall aggregated rank
├─R6 2.5.1 ✨ ⬇ (rank = 0.83)
├─Rcpp 1.0.9 ✨ ⬇ (rank = 0.95)
├─later 1.3.0 ✨ ⬇ (rank = 0.60)
│ ├─Rcpp
│ └─rlang 1.0.6 ✨
├─rlang
└─magrittr 2.0.3 ✨ ⬇ (rank = 0.43)
```

Perhaps also enable investigation of individual package metrics:

```r
# show details about rank = 0.43
metric_pkg_explain(package = "magrittr")
> details ...
> ...
> ...
```

The developer can then make a risk-based decision to keep or remove this dependency.

^{Originally posted by @kkmann in https://github.com/pharmaR/regulatory-r-repo-wg/discussions/3#discussioncomment-4601446}

I think this would be a cool feature to add to the riskmetric package (i.e. project risk and dependency pruning). However, in the reg repo context as I understand it, we would want to move away from the need to assess on a project basis in the first place. Instead, we would provide a means of installing, or at least a collection of, packages that underwent some sort of established QC process aligned with relevant regulators in the field (to the extent that that is even possible), to be used directly after minimal in-house QC.

^{Originally posted by @pedrobtz in https://github.com/pharmaR/regulatory-r-repo-wg/discussions/3#discussioncomment-4609771}

How is package installation handled in your companies?

From what I know, many (sensitive) companies use some type of package repository manager (e.g. [NEXUS](https://www.sonatype.com/products/nexus-repository)) which creates internal replicas or mirrors of external package repositories, such as Maven, npm, PyPI, CRAN, ... In this case, developers can install any package available in those external repositories. In some instances, packages can be blocked or quarantined, for example if they are flagged with CVE vulnerabilities.

Is it the case that the regulatory-r-repo would be a solution for companies that want to apply some QC (restriction) to the universe of packages available for installation?

^{Originally posted by @kkmann in https://github.com/pharmaR/regulatory-r-repo-wg/discussions/3#discussioncomment-4610334}

Nexus is certainly a possibility. What I feel is still unclear in the scoping is whether this WG wants to build a prototype for an actual repository (which should then probably integrate with Nexus), or open-source QC metadata on a subset of packages (like peer-reviewed QC reports and a transparent catalogue of criteria for being included in the collection), or both. I still doubt that a fully automated way of determining quality is sufficient, hence getting the transparent QC process right seems the natural first step to me. The resulting collection of packages can then either be installed from other sources or be hosted in a separate repo for the sake of convenience.

^{Originally posted by @pedrobtz in https://github.com/pharmaR/regulatory-r-repo-wg/discussions/3#discussioncomment-4612813}

It seems one has (at least) two problems to solve:

- First, non-technical: how to define a transparent QC process that can be implemented as `is_certified()` and that has consensus across the industry (Pharma, ...). And what is the governance around the QC process, i.e. who owns this process?

  ```r
  # is_cran(), is_bioc(), test_coverage(), has_vignettes() and `threshold`
  # are placeholders for criteria still to be defined
  is_certified <- function(pkg) {
    score <- is_cran(pkg) || is_bioc(pkg)
    score <- score && test_coverage(pkg) > threshold
    score <- score && has_vignettes(pkg)
    # ...
    isTRUE(score)
  }
  ```

- Second, more technical:

  [Hard approach] Do you want to restrict developers a priori (up front) to using/installing only packages that are certified, for any project?

  ```r
  options(repos = c(EXTERNAL_CERF = "http://reg-r-repo.com", INTERNAL_REPO = "http://internal-cran.com"))
  ```

  [Soft approach] Or do you allow developers to use/install all packages (CRAN + BIOC), but for certain projects require (by company policy) that all packages are certified?

  ```r
  library(riskmetric)
  # proposed function: checks if all project dependencies are certified
  check_project_deps(path = ".")
  ```
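To make the soft approach a little more concrete, here is a minimal sketch using only base R. It assumes the project has a `DESCRIPTION` file and that a list of certified packages is available from somewhere; the `certified` vector and the helpers `parse_dep_field()` and `uncertified_deps()` are invented for this example and do not exist in riskmetric. It only looks at direct hard dependencies, not the full dependency tree.

```r
# Hypothetical set of certified packages (in practice, the outcome of a QC process).
certified <- c("R6", "Rcpp", "rlang")

# Extract package names from a DESCRIPTION dependency field,
# dropping version constraints and the "R" pseudo-dependency.
parse_dep_field <- function(field) {
  if (is.na(field)) return(character(0))
  deps <- trimws(strsplit(field, ",", fixed = TRUE)[[1]])
  deps <- trimws(sub("\\(.*\\)", "", deps))
  setdiff(deps[nzchar(deps)], "R")
}

# Hypothetical helper: list the direct hard dependencies of the project at
# `path` that are not in the certified set.
uncertified_deps <- function(path, certified) {
  desc <- read.dcf(
    file.path(path, "DESCRIPTION"),
    fields = c("Depends", "Imports", "LinkingTo")
  )
  deps <- unique(unlist(lapply(desc[1, ], parse_dep_field)))
  setdiff(deps, certified)
}

# Demonstrate with a throwaway project directory.
project <- tempfile("my_project")
dir.create(project)
writeLines(c(
  "Package: myproject",
  "Version: 0.0.1",
  "Depends: R (>= 4.0.0)",
  "Imports: R6, Rcpp (>= 1.0.9), later, magrittr"
), file.path(project, "DESCRIPTION"))

print(uncertified_deps(project, certified))
```

A company policy check could then fail a build whenever this list is non-empty, while developers remain free to install anything during exploration.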
