Closed sckott closed 6 years ago
All the data for the check results are stored in rds files available from CRAN. There are two main files: one with just the summary of the checks (https://cran.r-project.org/web/checks/check_results.rds), and another with the details of all the check results (https://cran.r-project.org/web/checks/check_details.rds). I have (unexported) functions in foghorn to download and read them: https://github.com/fmichonneau/foghorn/blob/master/R/cran_files.R
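For reference, a minimal sketch of what downloading and reading one of those files looks like in plain R (this assumes the files are ordinary serialized R objects readable with `readRDS()`; the structure of the resulting object is whatever CRAN publishes):

```r
# Sketch: fetch and read the CRAN check summary without foghorn.
# Assumes the file deserializes with readRDS(); structure may change.
url <- "https://cran.r-project.org/web/checks/check_results.rds"
tmp <- tempfile(fileext = ".rds")
download.file(url, tmp, mode = "wb", quiet = TRUE)
checks <- readRDS(tmp)
str(checks, max.level = 1)  # inspect whatever structure CRAN ships
```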
So if you can set up a cron job that downloads these files regularly, and a way to make the data available to the API, then there is no need to rely on scraping the website. One of the reasons I had opted for plumber (https://github.com/fmichonneau/cranstatus/) was to be able to serve this data directly.
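Serving the cached file from plumber could be as small as a single annotated route. A sketch only: the `/checks` endpoint name and the `/data/check_results.rds` path are made up for illustration, not what cranstatus actually does:

```r
# plumber.R - hypothetical route serving a locally cached check summary
library(plumber)

#* Return the latest cached CRAN check summary
#* @get /checks
function() {
  # assumes a cron job keeps this file fresh (hypothetical path)
  readRDS("/data/check_results.rds")
}
```

You would then run it with `plumber::pr("plumber.R") |> plumber::pr_run()`.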
Thanks @fmichonneau for letting me know. I didn't know about the rds files. Though I guess that means I would have to include R in the stack, which there isn't right now. Hmmm. The scraping isn't ideal, but I only do it once a day. How often are the rds files updated? I had assumed they only ran checks once a day?
My understanding is that it's updated each time a package check completes, so probably many times a day. Yeah, you'd have to include R (unless there is a way to extract data from the rds files outside of R?)
With the new async HTTP requests and parallel HTML parsing, I think it's fast enough now: package scraping is down to about 4 minutes, and maintainer scraping to about 10 minutes.
Another option, instead of scraping via web requests (though maybe too much complexity), would be running a CRAN mirror: https://cran.r-project.org/mirror-howto.html
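The mirror howto keeps mirrors in sync with rsync against the `cran.r-project.org::CRAN` module, and the module path can be narrowed rather than mirroring all of CRAN. A crontab sketch along those lines (the `/srv/cran-checks` destination is made up; the exact module path for the checks directory is an assumption):

```shell
# crontab fragment: refresh the two check files hourly via rsync
# (destination directory is hypothetical)
0 * * * * rsync -rtlzv cran.r-project.org::CRAN/web/checks/check_results.rds /srv/cran-checks/
5 * * * * rsync -rtlzv cran.r-project.org::CRAN/web/checks/check_details.rds /srv/cran-checks/
```

That would sidestep both scraping and full mirroring, at the cost of depending on rsync access.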