r-hub / pkgsearch

Search R packages on CRAN
https://r-hub.github.io/pkgsearch

`cran_package_history("-")` fails #126

Open mccarthy-m-g opened 2 weeks ago

mccarthy-m-g commented 2 weeks ago

Problem

https://crandb.r-pkg.org/-/<redacted> is a valid API call that gets history for all packages, but cran_package_history("-") results in an error. I did some debugging and the error happens here:

https://github.com/r-hub/pkgsearch/blob/d613795b68e7408cdeb9038c8534d5c3fede72a8/R/crandb-public-api.R#L313

The problem is that resp$versions ends up querying the list for the {versions} package, instead of the versions index in each package.
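To illustrate the mix-up described above, here is a minimal sketch using made-up, heavily simplified response shapes. The `single` and `all_pkgs` objects and their version numbers are assumptions for illustration only, not real crandb output. It assumes the all-packages response is a list keyed by package name, so `$versions` selects the record for the {versions} package rather than a versions index.

```r
# Hypothetical single-package response: `versions` is a field of the record.
single <- list(
  name = "pkgsearch",
  versions = list("1.0.0" = list(), "1.1.0" = list())
)

# Hypothetical all-packages response: one record per package, keyed by name.
all_pkgs <- list(
  pkgsearch = list(
    name = "pkgsearch",
    versions = list("1.0.0" = list(), "1.1.0" = list())
  ),
  versions = list(
    name = "versions",
    versions = list("0.1" = list())
  )
)

# Works as intended for one package: the release names of that package.
single_versions <- names(single$versions)

# On the keyed list, `$versions` picks the entry for the {versions} package.
wrong <- all_pkgs$versions$name

# Reaching the versions index of every package requires iterating instead.
per_package <- lapply(all_pkgs, function(pkg) names(pkg$versions))
```

Under these assumptions, supporting `"-"` would mean iterating over the package records before reading each `versions` field, which is the kind of change the proposal below would involve.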

Would you be open to expanding the function to support the cran_package_history("-") call? Happy to start a PR for it!

reprex

library(pkgsearch)

cran_package_history("-")
#> Warning in description_list$releases <- NULL: Coercing LHS to a list

#> Warning in description_list$releases <- NULL: Coercing LHS to a list

#> Warning in description_list$releases <- NULL: Coercing LHS to a list
#> Error: Inputs can't be recycled to a common size.

Created on 2024-08-26 with reprex v2.0.2

gaborcsardi commented 2 weeks ago

cran_package_history("-") was never supposed to do anything meaningful.

mccarthy-m-g commented 2 weeks ago

Does that mean you aren't interested in supporting "-" within {pkgsearch}?

I was hoping to get the version history for every CRAN package---and the https://crandb.r-pkg.org/-/<redacted> endpoint does this---but it would be nice to be able to do it from {pkgsearch}. Otherwise I'd have to map over cran_package_history() for each package, which is a lot of API calls.

gaborcsardi commented 2 weeks ago

That endpoint is pretty heavy on the DB, so I definitely don't want to support it in pkgsearch. In fact, I might need to remove it completely, or heavily cache it in Cloudflare.

In fact, I'll remove the link from your comments, because people and/or crawlers clicking on it will kill the server.

mccarthy-m-g commented 2 weeks ago

Ah, fair enough. Is there a responsible way to get the history for every package? I'm assuming calling cran_package_history() for every package is also heavy on the DB?

gaborcsardi commented 2 weeks ago

No, that's not heavy at all, but it also takes a very long time to make thousands of HTTP queries. I don't know of any good way currently.
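For reference, the per-package approach discussed here might look roughly like the sketch below. `fetch_histories()` and its `fetch` and `pause` arguments are hypothetical names introduced for illustration and are not part of {pkgsearch}. Only `cran_package_history()` comes from the package. The pause between requests is an assumption about polite usage, not a documented requirement.

```r
# Hypothetical helper: query package histories one at a time.
# `fetch` defaults to pkgsearch::cran_package_history(); the default is only
# evaluated when used, so a stub can be supplied for offline testing.
fetch_histories <- function(pkgs,
                            fetch = pkgsearch::cran_package_history,
                            pause = 1) {
  results <- vector("list", length(pkgs))
  names(results) <- pkgs
  for (i in seq_along(pkgs)) {
    # A failed request yields NULL instead of aborting the whole loop.
    value <- tryCatch(fetch(pkgs[[i]]), error = function(e) NULL)
    # Assign via list() so that a NULL result keeps its slot in the list.
    results[i] <- list(value)
    # Wait between requests (assumed courtesy delay, in seconds).
    if (i < length(pkgs)) Sys.sleep(pause)
  }
  results
}

# Example call (needs network access and {pkgsearch} installed):
# histories <- fetch_histories(c("pkgsearch", "cli"))
```

With thousands of packages this loop would still take a very long time, as noted above, so it is a sketch of the mechanics rather than a recommended way to build a full dataset.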

mccarthy-m-g commented 2 weeks ago

Thanks, that's good to know. Maybe a regularly updated duckdb database would be a good way to share the history for every package?

Just for context, the reason I wanted this data was for a {shinylive} dashboard that would provide download analytics for every CRAN package, and my original plan was to make said database with GitHub Actions (so I didn't want something that would take forever to run). I'm probably going to pivot from my original plan now though, so feel free to close this.

gaborcsardi commented 2 weeks ago

I like the idea of having a daily Parquet file available with all the data.