r-multiverse / help

Discussions, issues, and feedback for R-multiverse
https://r-multiverse.org

An alternative to #6: Gabe Becker's proposed 2-repo solution #10

Closed wlandau closed 1 week ago

wlandau commented 4 months ago

Suppose r-releases.r-universe.dev is a repo with all the releases, and there is a downstream universe with just the ones that pass R CMD check and revdep checks, just as @gmbecker originally proposed in https://github.com/r-universe-org/help/issues/363. It should be simple to scrape the check results from https://github.com/r-universe/r-releases/actions, select a subset of https://github.com/r-releases/r-releases.r-universe.dev/blob/main/packages.json with non-broken packages, and then create a different universe downstream.

As part of that selection process, maybe we could impose version number etiquette too. Suppose we get the version numbers and their commit hashes when we scrape https://github.com/r-universe/r-releases/actions. (@jeroen, this may rely on the nice titles you give the jobs, such as r.releases.utils 0.0.5 and sys 3.3.) If we detect that the commit hashes are different but the latest version is not strictly greater than the previous one, then we can omit the package from the production repo.
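
A minimal sketch of that version etiquette rule, assuming each scraped record carries a version string and a commit hash. The record shape (`version`, `hash`) and the function names are hypothetical, and the version parser only handles purely numeric components separated by `.` or `-`.

```python
import re


def parse_version(version):
    """Split a version such as '0.0.5' or '3.3-1' into a tuple of integers."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def keep_in_production(previous, latest):
    """Decide whether the latest release may stay in the production repo.

    `previous` and `latest` are dicts with 'version' and 'hash' keys
    (a hypothetical record shape). `previous` is None for a first release.
    """
    if previous is None:
        return True  # nothing to compare against
    if previous["hash"] == latest["hash"]:
        return True  # same commit, so no new release to police
    # Different commits: the version must be strictly greater than before.
    return parse_version(latest["version"]) > parse_version(previous["version"])


old = {"version": "0.0.5", "hash": "abc123"}
bumped = {"version": "0.0.6", "hash": "def456"}
not_bumped = {"version": "0.0.5", "hash": "def456"}

print(keep_in_production(old, bumped))      # True
print(keep_in_production(old, not_bumped))  # False
```

Comparing integer tuples avoids the string-comparison trap where "0.0.10" sorts before "0.0.9".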

The advantages over #6 are:

  1. The ability to use install.packages() normally.
  2. Faster installation and no risk of hitting rate limits because there would be no GitHub API calls at installation time.

To me (2) is more important than (1).

The challenges relative to #6 are:

  1. Learning how to scrape https://github.com/r-universe/r-releases/actions.
  2. Figuring out where to put that downstream universe.

I was hoping to have all repos part of https://github.com/r-releases, but I think the creation of a new universe would mean the creation of a new special repo, e.g. https://github.com/r-prd/r-prd.r-universe.dev. I would be open to a better name than this.

wlandau commented 4 months ago

What would be a good GitHub owner name for this new downstream production-level R universe? r-prd? r-releases-prod? r-valid?

shikokuchuo commented 4 months ago

I might be missing something, but whether a package is 'broken' or not depends on the cohort of packages the user actually has installed, doesn't it? If only 2 repos, then a package can only be 'broken' or not. There may be many valid dependency chains, with only one broken.

A -> B -> C ....... B -> D

Where A is upstream. An update to A causes B's tests to fail. It is put in the broken repo along with C and D.

However, in actuality C's tests all pass, only D's fail. This is as C and D use different subsets of functions from B. That means that A -> B -> C is a valid dependency chain that would be broken by this 2 repo arrangement.

Then just using a 'normal' install.packages() won't find any of B, C or D any more.

wlandau commented 4 months ago

Yeah, the whole revdep chain would need to go down too. It’s a little extra work up front, but then we could skip scraping those revdeps altogether. Not impossible for this way of doing things.

wlandau commented 4 months ago

As you say, maybe that’s heavy-handed. However, I don’t see a generic way to find out which subset of a package is failing, just from information in logs.

shikokuchuo commented 4 months ago

If the test suite is adequate, then a package only needs to pass its own tests, right? It doesn't need to know if an upstream dependency passes all of its tests, or even further removed whether that package's 100 revdeps pass theirs.

So I'm quite in favour of the checks dashboard type thing, or a function that returns this. You only need to know for the package you are installing. Then on an ongoing basis, the checker function can come in handy.

It's the power of decentralisation. Let each individual community decide what it wants to use.

shikokuchuo commented 4 months ago

Sorry, maybe this belongs in #6. Discussion continued at https://github.com/r-releases/help/issues/6#issuecomment-1974669760

wlandau commented 4 months ago

From https://github.com/r-releases/help/issues/6#issuecomment-1974901062

  1. After https://github.com/r-universe-org/help/issues/370, implementation can begin.
  2. After https://github.com/r-universe-org/help/issues/369, user-side package correctness/compatibility guarantees will exceed those of CRAN.

These points also support #10. With https://github.com/r-universe-org/help/issues/369, it will only be necessary to scrape the existing check results (no need for revdep checks).

wlandau commented 3 months ago

For a downstream production-level repo, it would be ideal to leverage R-universe as much as possible. My only concern is that we may get a duplicated (and possibly conflicting) set of health checks.

wlandau commented 3 months ago

Actually, it could be important to pass health checks in both production and QA. So we would want to pull from both https://r-releases.r-universe.dev and "https://r-production.r-universe.dev" to decide whether to keep a package on "https://r-production.r-universe.dev".

wlandau commented 3 months ago

On second thought: to have the right user-side guarantees, I think we would need to remove reverse dependencies from "https://r-production.r-universe.dev" if something goes wrong with a package. If that is the case, then https://r-releases.r-universe.dev/ and "https://r-production.r-universe.dev" will have the exact same dependency graphs for every hosted package. Which means that any test failure in "https://r-production.r-universe.dev" is random and probably a false positive.

So my current preference is to:

  1. If a package's checks fail in https://r-releases.r-universe.dev, remove both the package and all its strong reverse dependencies from "https://r-production.r-universe.dev".
  2. Ignore checks from "https://r-production.r-universe.dev" when deciding (1).
  3. In fact, consider suppressing R CMD check in "https://r-production.r-universe.dev" to avoid confusion and duplication.
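
Step (1) could be sketched roughly as follows, assuming the set of failing packages and a map of strong dependencies are already available. The input shapes and the function name are hypothetical, and the example graph is the A, B, C, D chain discussed earlier in this thread.

```python
from collections import deque


def packages_to_remove(failing, depends_on):
    """Return the failing packages plus all their transitive reverse dependencies.

    `failing` is a set of package names whose checks failed.
    `depends_on` maps each package to the set of packages it strongly
    depends on (a hypothetical input shape).
    """
    # Invert the dependency map: package -> packages that depend on it.
    reverse = {}
    for package, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(package)

    # Breadth-first walk down the reverse dependency graph.
    removed = set(failing)
    queue = deque(failing)
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            if dependent not in removed:
                removed.add(dependent)
                queue.append(dependent)
    return removed


# B depends on A; C and D both depend on B.
depends_on = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"B"}}
print(sorted(packages_to_remove({"B"}, depends_on)))  # ['B', 'C', 'D']
```

In this example A stays in production, because only B's checks failed and A does not depend on B.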

shikokuchuo commented 3 months ago

Yes, I think 3 is the logical conclusion, you'd be able to rely on the checks from R-releases.

wlandau commented 3 months ago

To recap recent discussions: we decided to put #6 on hold as we pursue #10. If the dual-repo option works well, then we will close #6 as "not planned".

gmbecker commented 3 months ago

> I might be missing something, but whether a package is 'broken' or not depends on the cohort of packages the user actually has installed, doesn't it? If only 2 repos, then a package can only be 'broken' or not. There may be many valid dependency chains, with only one broken.
>
> A -> B -> C ....... B -> D
>
> Where A is upstream. An update to A causes B's tests to fail. It is put in the broken repo along with C and D.
>
> However, in actuality C's tests all pass, only D's fail. This is as C and D use different subsets of functions from B. That means that A -> B -> C is a valid dependency chain that would be broken by this 2 repo arrangement.
>
> Then just using a 'normal' install.packages() won't find any of B, C or D any more.

If B isn't passing its own tests, then B is broken, meaning it should only be offered in a "use at your own risk" capacity. That risk may sometimes be quite small, e.g., the notorious "1 test breaks on M1 Macs" case, but without an evolution of how tests are treated in R packages, similar to what @HenrikBengtsson brought up in the latest working group call, install.packages doesn't have the ability to differentiate or quantify risk.

Given then that there is some risk, my argument is that that risk should be opt-in rather than opt-out. Users can opt into that risk by adding the unsafe repo (or whatever we end up calling it if that is too pejorative) to their repos, either via option or via the argument to install.packages. If they did that, they would be able to get all of {A, B, C, D}.

I think making risk like this opt-out would be detrimental to end users, particularly novice ones, since the tooling is insufficient to even tell them that the risk exists, much less to help them assess it. Furthermore, it would be antithetical to the concept of production: while you might need to do this, in my experience it would have to be a manual intervention by the admin, and it may (reasonably) not be allowed at all in a validated context, regardless of how unbroken we might expect C's functionality to be.

The other thing to keep in mind is that just because someone does install.packages("C") does not mean that they won't also sometimes directly use functionality from B in their scripts, including parts of B that aren't the bits that C uses. B could still be broken for some of their intended purposes, even if C itself "works fine", which would mean that the repo is still serving the user a package that is broken for its intended purpose.

shikokuchuo commented 3 months ago

Thank you @gmbecker, we are taking all of these considerations into account. For these and other reasons, i.e. prior expectations for novice users using install.packages(), we are actually looking at your 2-repo proposal as a priority. The 'production' repo could then be the default as you describe above, with the choice of opting out to the wider 'community' or 'QA' repo or whatever you want to call it.

wlandau commented 1 month ago

We now have space to host the two repos:

| repo | QA | production |
|---|---|---|
| install.packages(repos = "...") | https://multiverse.r-multiverse.org | https://production.r-multiverse.org |
| packages.json | https://github.com/r-multiverse/multiverse | https://github.com/r-multiverse/production |
| R-universe | https://github.com/r-universe/r-multiverse | https://github.com/r-universe/r-production |

I am about to start working on:

  1. Migrating existing infrastructure to the new location for the QA universe.
  2. Building the production packages.json based on the results of automated checks.
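
As a rough illustration of item (2), the production packages.json could be derived by filtering the QA listing against a set of excluded package names. This sketch assumes the listing is a JSON array of objects with a `package` field and a `url` field. The function name, the example URLs, and the exclusion set are hypothetical.

```python
import json


def build_production_listing(qa_listing_json, excluded):
    """Filter a QA packages.json string down to a production packages.json string.

    `qa_listing_json` is assumed to be a JSON array of objects, each with
    at least a 'package' key. `excluded` is a set of package names that
    failed checks (or depend on a package that did).
    """
    entries = json.loads(qa_listing_json)
    kept = [entry for entry in entries if entry["package"] not in excluded]
    return json.dumps(kept, indent=2)


# Hypothetical QA listing with placeholder URLs.
qa_listing = json.dumps([
    {"package": "A", "url": "https://github.com/example/A"},
    {"package": "B", "url": "https://github.com/example/B"},
    {"package": "C", "url": "https://github.com/example/C"},
])

production = build_production_listing(qa_listing, excluded={"B", "C"})
print(production)
```

Only the entry for A survives in this example, with all of its fields carried over unchanged.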

wlandau commented 1 week ago

The two-repo strategy is well underway, and given https://github.com/r-multiverse/help/issues/57, I think we can close the thread above.