r-lib / pkgcache

Cache CRAN-like metadata and package files
https://r-lib.github.io/pkgcache/

`meta_cache_list()` - empty `published` column for recent packages #109

Open pawelru opened 5 months ago

pawelru commented 5 months ago
Sys.Date()
#> [1] "2024-04-19"

library(pkgcache)
meta_cache_update()
#> 
#> ℹ Updating metadata database
#> ✔ Updating metadata database ... done
#> 
max(meta_cache_list()$published, na.rm = TRUE) # note the diff from today!
#> [1] "2024-04-09 16:50:05 GMT"

# an example - package `tensorflow` released on 15th of Apr
library(rvest)
read_html("https://cran.r-project.org/web/packages/tensorflow/index.html") |> 
    html_element("table") |> 
    html_table() |> 
    head(x = _, 5)
#> # A tibble: 5 × 2
#>   X1         X2                                                                 
#>   <chr>      <chr>                                                              
#> 1 Version:   "2.16.0"                                                           
#> 2 Depends:   "R (≥ 3.6)"                                                        
#> 3 Imports:   "config, processx, reticulate (≥ 1.32), tfruns (≥ 1.0), utils, yam…
#> 4 Suggests:  "testthat (≥ 2.1.0), keras3, pillar, withr, callr"                 
#> 5 Published: "2024-04-15"

meta_cache_list(packages = "tensorflow")[, c("package", "version", "published")]
#> # A data frame: 2 × 3
#>   package    version published
#> * <chr>      <chr>   <dttm>   
#> 1 tensorflow 2.16.0  NA       
#> 2 tensorflow 2.16.0  NA

Created on 2024-04-19 with reprex v2.1.0

Is this a bug? What can I do to force an update of the cache? I am analysing CRAN data and the release / publish date is one of my inputs.

gaborcsardi commented 5 months ago

That column is from metadata that is not on CRAN and we need to collect it separately. Unfortunately I had to shut down the infrastructure that collects it, so it hasn't been updated for a couple of days.

The metadata itself is now here: https://github.com/r-hub/cran-metadata/tree/gh-pages but until I write the code that updates it, it won't be updated. The old update code used a local CRAN mirror, which I don't have any more, so we need a completely new way of updating.

The published field is actually easy, so maybe I'll do that first. The hard ones are the hashes: for those I need to download the package files, and Windows binaries are rebuilt all the time, so that's potentially a lot of downloads.

Anyway, I wasn't aware of any use for that metadata, apart from pak printing the file sizes, so opening this issue was a good idea.

pawelru commented 5 months ago

Thanks @gaborcsardi for the prompt reply. I'll have a look at what you linked and consider it as an alternative to rvest-ing this from the CRAN webpage. I'm definitely looking forward to this being brought back; the pkgcache API is so convenient for my use case. As far as I'm concerned, I don't use the hashes at all, so if they are the biggest piece of work, they can definitely be postponed.

gaborcsardi commented 5 months ago

No need to scrape this field, you can also do something like this:

db <- tools::CRAN_package_db()
db$`Date/Publication`
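
To get something closer to the `package` / `version` / `published` view from `meta_cache_list()`, the snippet above could be wrapped roughly as follows. This is only a sketch: `published_dates()` is a hypothetical helper, not part of pkgcache or `tools`, and it assumes the `Date/Publication` values look like `YYYY-MM-DD HH:MM:SS UTC`; values in any other format come back as `NA`. The `example_db` data frame is a hand-made stand-in with illustrative values so that the code runs without network access.

```r
# Hypothetical helper (not part of pkgcache or tools): extract publication
# timestamps from a data frame shaped like the output of tools::CRAN_package_db().
published_dates <- function(db, packages = NULL) {
  stopifnot(all(c("Package", "Version", "Date/Publication") %in% names(db)))
  # Optionally restrict the result to the requested packages.
  if (!is.null(packages)) {
    db <- db[db$Package %in% packages, , drop = FALSE]
  }
  data.frame(
    package = db$Package,
    version = db$Version,
    # Assumption: timestamps look like "2024-04-15 05:20:02 UTC".
    # Missing or differently formatted values become NA.
    published = as.POSIXct(
      db[["Date/Publication"]],
      format = "%Y-%m-%d %H:%M:%S",
      tz = "UTC"
    ),
    stringsAsFactors = FALSE
  )
}

# Hand-made stand-in for the CRAN database; package names and values are
# illustrative only.
example_db <- data.frame(
  Package = c("pkgA", "pkgB"),
  Version = c("1.0.0", "0.2.1"),
  `Date/Publication` = c("2024-04-15 05:20:02 UTC", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

res <- published_dates(example_db, packages = "pkgA")
print(res)
```

With network access, the same helper can be fed the real database, e.g. `published_dates(tools::CRAN_package_db(), "tensorflow")`.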