r-universe-org / help

Support and bug tracker for R-universe
https://docs.r-universe.dev/

Include RemoteSha in the PACKAGES field of each universe? #377

Closed wlandau closed 3 months ago

wlandau commented 4 months ago

Motivation

@gmbecker mentioned how important it is for users to be able to trust the version numbers of packages. For R-releases, we will not impose any pre-release gatekeeping, but @shikokuchuo and I are working on a service that checks all the versions and hashes and reports which packages are not in compliance. We are having trouble building this service given what we currently know about R-universe. Cf. https://github.com/r-releases/help/issues/21.

Implementation in R-releases

In https://github.com/r-releases/r.releases.utils/pull/9 and https://github.com/r-releases/r-releases.r-universe.dev/pull/6, I wrote a service that runs once a day and gets the version and hash of every package in the universe. Every time the service runs, it keeps track of the highest version number ever released, as well as the hash of that release. We want it to flag a package for non-compliance if:

  1. The current version number is less than the highest version ever released, or
  2. The current and highest-ever versions agree, but their hashes disagree (i.e., the latest release has the highest version, but it was redeployed without changing the version number).

These non-compliant packages are written to a small file version_issues.json, which either Gabe's "safe" repo or "install_safe()" could leverage for choosing which packages are safe to install.
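
The two rules above can be sketched in R as follows. This is a minimal illustration, not the code from the linked pull requests: the function name `flag_version_issues()` and the data frame layout (columns `package`, `version`, `hash`) are assumptions made for the example, and the sample data is invented.

```r
# Hypothetical helper, not the r.releases.utils implementation.
# `current` holds today's snapshot; `highest` holds the highest version
# ever recorded per package, with the hash seen at that version.
flag_version_issues <- function(current, highest) {
  merged <- merge(
    current, highest,
    by = "package",
    suffixes = c("_current", "_highest")
  )
  v_current <- package_version(merged$version_current)
  v_highest <- package_version(merged$version_highest)
  # Rule 1: the current version is lower than the highest ever released.
  decreased <- v_current < v_highest
  # Rule 2: the versions agree but the hashes do not.
  rehashed <- v_current == v_highest &
    merged$hash_current != merged$hash_highest
  merged[decreased | rehashed, , drop = FALSE]
}

# Made-up example data: "b" went down in version, "c" changed hash only.
current <- data.frame(
  package = c("a", "b", "c"),
  version = c("1.0.0", "0.9.0", "2.0.0"),
  hash = c("aaa", "bbb", "ccc2"),
  stringsAsFactors = FALSE
)
highest <- data.frame(
  package = c("a", "b", "c"),
  version = c("1.0.0", "1.0.0", "2.0.0"),
  hash = c("aaa", "bbb0", "ccc1"),
  stringsAsFactors = FALSE
)
issues <- flag_version_issues(current, highest)
print(issues$package)
```

The flagged rows could then be serialized to a file such as version_issues.json, for example with jsonlite::write_json().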

Challenge

We are having trouble getting reliable hashes. utils::available.packages(repos = "https://r-releases.r-universe.dev", fields = "RemoteSha") is fast, but it returns NAs for RemoteSha. And as @shikokuchuo mentioned, MD5s are brittle because R-universe rebuilds the current version periodically with potentially different metadata.

The API for https://r-releases.r-universe.dev/api/packages/ returns information for multiple packages, but the payload is large, and not all packages may be returned. (https://cran.r-universe.dev/api/packages/ shows only a few hundred.) Hitting the API for each package individually is slow, and I am concerned it may overburden R-universe.

Proposal

Would it be possible to include the GitHub SHA in the RemoteSha field of the PACKAGES file of each universe, such as https://r-releases.r-universe.dev/src/contrib/PACKAGES? That way, unless I am missing something, available.packages() should work with https://github.com/r-releases/help/issues/21, and it may even make the end product of #149 more trustworthy.

(I'm not sure whether https://r-releases.r-universe.dev/src/contrib would have to include that field too.)

jeroen commented 4 months ago

Maybe it is in the description but not included in the PACKAGES.tar.gz index of the repository.

I'll have a look tonight.

On Tue, 5 Mar 2024 at 14:10, Will Landau @.***> wrote:

Actually, I do see the hashes are already in the DESCRIPTION:

install.packages("gh", repos = "https://r-releases.r-universe.dev")
packageDescription("gh")$RemoteSha
#> [1] "ab056d6322064295432d4e9c08143c2c99c028e4"

So it's odd that available.packages() does not show it.


jeroen commented 4 months ago

So these are fields that are included in the individual DESCRIPTION: https://jeroen.r-universe.dev/jsonlite/DESCRIPTION

But only a few of them are included in the index (to save space): https://jeroen.r-universe.dev/src/contrib/PACKAGES
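
This also helps explain the `NA` values seen earlier with `available.packages()`: as far as I understand, when a field is requested from a DCF index that does not carry it, `read.dcf()` fills it with `NA`. A small sketch with invented records (the package name, version, and hash below are placeholders, not real R-universe data, and `read_fields()` is a hypothetical helper):

```r
# Invented DCF records for illustration only.
full_description <- c(
  "Package: examplepkg",
  "Version: 1.2.3",
  "RemoteSha: 0123456789abcdef0123456789abcdef01234567"
)
trimmed_index <- c(
  "Package: examplepkg",
  "Version: 1.2.3"
)

# Hypothetical helper: write the lines to a temporary file and read
# the requested fields back with read.dcf().
read_fields <- function(lines, fields) {
  path <- tempfile()
  on.exit(unlink(path))
  writeLines(lines, path)
  read.dcf(path, fields = fields)
}

fields <- c("Package", "Version", "RemoteSha")
from_description <- read_fields(full_description, fields)
from_index <- read_fields(trimmed_index, fields)

print(from_description[1, "RemoteSha"])
print(from_index[1, "RemoteSha"])  # the field is absent from the trimmed index
```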

wlandau commented 4 months ago

I see. Would it be feasible to add RemoteSha to PACKAGES to support https://github.com/r-releases/help/issues/21 and #149, or is a light PACKAGES file more of a priority for R-universe? In the latter case, what would you recommend for https://github.com/r-releases/help/issues/21?

jeroen commented 4 months ago

Perhaps we can make it opt-in via a parameter. Does it really need to work with base R available.packages() or are you more flexible? We could also include it only in the NDJSON index, e.g.: https://jeroen.r-universe.dev/src/contrib/

So then instead of base available.packages() you would need to use e.g.

df <- jsonlite::stream_in(url("https://jeroen.r-universe.dev/src/contrib/"), verbose = FALSE)

wlandau commented 4 months ago

Does it really need to work with base R available.packages() or are you more flexible?

I am flexible. I am good with anything that pulls the package names, version numbers, and RemoteShas of all packages quickly.

We could also include it only in the NDJSON index, e.g.: https://jeroen.r-universe.dev/src/contrib/. So then instead of base available.packages() you would need to use e.g...

Perfect!

jeroen commented 4 months ago

Not sure if it's a good idea to deploy from my flight, but here is something you can test now:

https://jeroen.r-universe.dev/src/contrib/PACKAGES?fields=RemoteSha,RemoteUrl

https://jeroen.r-universe.dev/src/contrib/PACKAGES.json?fields=RemoteSha,RemoteUrl

So using this fields parameter you can request any additional fields (comma-separated and case-sensitive) from the DESCRIPTION files in the PACKAGES index.
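
A small helper can make the fields parameter easier to reuse across universes. The sketch below only builds the URL and makes no request; the function name `universe_index_url()` and its arguments are hypothetical, and the URL pattern is simply copied from the two links above.

```r
# Hypothetical helper: build the index URL for a universe, optionally
# requesting extra DESCRIPTION fields via the `fields` query parameter.
universe_index_url <- function(universe, fields = character(0), json = FALSE) {
  base <- sprintf("https://%s.r-universe.dev/src/contrib/PACKAGES", universe)
  if (json) {
    base <- paste0(base, ".json")
  }
  if (length(fields) == 0) {
    return(base)
  }
  # Field names are joined with commas and passed through unchanged,
  # since the parameter is case sensitive.
  paste0(base, "?fields=", paste(fields, collapse = ","))
}

url_dcf <- universe_index_url("jeroen", c("RemoteSha", "RemoteUrl"))
url_json <- universe_index_url("jeroen", c("RemoteSha", "RemoteUrl"), json = TRUE)
print(url_dcf)
print(url_json)
```

The JSON variant can be read with jsonlite::stream_in(url(url_json), verbose = FALSE), in the same way as the plain /src/contrib/ index shown earlier in the thread.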

wlandau commented 4 months ago

Cool! Your query parameter idea looks like an elegant way to handle this, and it works for me in both cases:

system.time(
  packages_file <- utils::available.packages(
    contriburl = paste0(
      contrib.url("https://jeroen.r-universe.dev", type = "source"),
      "/PACKAGES?fields=RemoteSha,RemoteUrl"
    ),
    fields = "RemoteSha"
  )
)
#>    user  system elapsed 
#>   0.033   0.010   2.086
head(packages_file[, "RemoteSha"])
#>                                  RAppArmor 
#> "f437c1a926e7f5c225003738bca46584ee1a1f51" 
#>                                         V8 
#> "8adfc4c5ffc1f2da45206a53927d14046dfaa141" 
#>                                     badgen 
#> "57af6a1eab06369730a9ca520375ed6b78a0e5d6" 
#>                                     base64 
#> "0b8294d5d2ea1f1d1d069ef5ff681d90bdbc38ab" 
#>                                     bcrypt 
#> "49eb9da001cc6d3f118521d6e5221fb8909cfa6e" 
#>                                     brotli 
#> "00a9aa6a84cfcf2da6184a32a0ce7a7f1b9a8211"

system.time(
  json <- jsonlite::stream_in(
    url("https://jeroen.r-universe.dev/src/contrib/PACKAGES.json?fields=RemoteSha,RemoteUrl"),
    verbose = FALSE
  )
)
#>    user  system elapsed 
#>   0.036   0.003   0.858
head(json$RemoteSha)
#> [1] "f437c1a926e7f5c225003738bca46584ee1a1f51"
#> [2] "8adfc4c5ffc1f2da45206a53927d14046dfaa141"
#> [3] "57af6a1eab06369730a9ca520375ed6b78a0e5d6"
#> [4] "0b8294d5d2ea1f1d1d069ef5ff681d90bdbc38ab"
#> [5] "49eb9da001cc6d3f118521d6e5221fb8909cfa6e"
#> [6] "00a9aa6a84cfcf2da6184a32a0ce7a7f1b9a8211"

Created on 2024-03-05 with reprex v2.1.0

wlandau commented 3 months ago

I noticed the query also works in the R-releases universe: https://r-releases.r-universe.dev/src/contrib/PACKAGES?fields=RemoteSha,RemoteUrl. Okay if I use it in R-releases? Would you still rather I use the JSON route, or is PACKAGES/available.packages() okay too?

jeroen commented 3 months ago

Yes, go for it. I was only mentioning mine as an example. The API is the same for any universe, of course.

jeroen commented 3 months ago

Can I close this as solved?

wlandau commented 3 months ago

Certainly! Thank you for your help.